The shift to cloud-native delivers unheard-of application resilience and flexibility. As we've opined before, it also mandates a new approach to troubleshooting and understanding system failures.
Today's architectures are far more complex. The question to answer is no longer "what's wrong with my source code?" Instead, teams need to address a series of questions when issues arise:
Which component of the workload is having a problem?
How do we trace the relevant requests through the entire workload?
How do we find all diagnostic information from the components that processed that request?
And how do we do all of this as soon as possible?
The situation is compounded further when different teams own different pieces of the workload. Why? Only a select few have an end-to-end understanding of the entire workload.
But just as the industry has rallied around microservices, so too have we rallied to simplify the troubleshooting of these modern, distributed systems.
The answer: distributed tracing. Distributed tracing tools help engineers understand the scale of interactions between system components. Which brings us to PCF Metrics, the integrated metrics module for Pivotal Cloud Foundry.
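To make the idea concrete, here is a minimal sketch of the core trick behind distributed tracing: every request carries a trace ID, and each component stamps that ID on whatever it logs. The record fields (`ts`, `app`, `trace_id`, `msg`) and the sample data are assumptions made for illustration. They are not the PCF Metrics data model.

```python
from collections import defaultdict

# Hypothetical log records. Field names and values are illustrative only,
# not a PCF Metrics schema.
LOGS = [
    {"ts": 3, "app": "payment", "trace_id": "t-1", "msg": "charging card"},
    {"ts": 1, "app": "checkout", "trace_id": "t-1", "msg": "checkout started"},
    {"ts": 2, "app": "order", "trace_id": "t-1", "msg": "order created"},
    {"ts": 2, "app": "checkout", "trace_id": "t-2", "msg": "checkout started"},
]


def logs_by_trace(records):
    """Group log records by trace ID, each group ordered by timestamp."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["trace_id"]].append(record)
    for group in grouped.values():
        group.sort(key=lambda r: r["ts"])
    return dict(grouped)


def apps_in_trace(records, trace_id):
    """Return the apps that handled a request, in the order they first logged."""
    seen = []
    for record in logs_by_trace(records).get(trace_id, []):
        if record["app"] not in seen:
            seen.append(record["app"])
    return seen


if __name__ == "__main__":
    print(apps_in_trace(LOGS, "t-1"))
```

With the trace ID as the join key, "which components processed this request?" becomes a lookup instead of a hunt across time windows.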
PCF Metrics: A Single Set of Facts, and Now a Full Picture of Your System
PCF Metrics gives your engineering organization a single repository of application telemetry. Dev and ops teams use the data therein to kick-start issue mitigation. Events, metrics, and logs are shown on an intuitive timeline.
But these features don't sufficiently answer the questions posed earlier. To get a complete picture of your workload, engineers need to understand the scale of interactions between components.
That's what PCF Metrics 1.3 delivers with Trace Explorer! Use Trace Explorer to:
Examine distributed tracing across microservices - with correlated logs in the same view
Perform log filtering on specific HTTP requests within a trace
View a dependency tree that shows parent-child relationships for microservices within the trace
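How might a dependency tree fall out of trace data? A rough sketch, assuming each span records its own ID and the ID of the span that called it. The `span_id` / `parent_id` field names and the sample spans are hypothetical, and this is not how Trace Explorer is necessarily implemented.

```python
from collections import defaultdict

# Hypothetical spans for a single trace. The span_id / parent_id fields are
# an assumed convention, not a PCF Metrics schema.
SPANS = [
    {"span_id": "a", "parent_id": None, "name": "checkout"},
    {"span_id": "b", "parent_id": "a", "name": "order"},
    {"span_id": "c", "parent_id": "b", "name": "payment"},
    {"span_id": "d", "parent_id": "b", "name": "notification"},
]


def render_tree(spans):
    """Return indented lines showing parent-child relationships between spans."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit each child of parent_id, then recurse into its own children.
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(SPANS)))
```

Each level of indentation is one hop further from the request that started the trace.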
PCF Metrics is tightly integrated with UAA, Pivotal Cloud Foundry's identity management service. That means Metrics automatically respects the permissions of the user. Engineers only see the apps they are authorized to view.
SRE Life Without Trace Explorer
How does Trace Explorer work in the real world? Consider a scenario where an e-commerce site experiences latency in user checkout.
Our hypothetical system is composed of these elements:
User-facing properties, the UI and API that power the shopper's experience. Let's call this collection of services
Stock inventory management (
Payment processing (
Order processing (
Order notifications (
Suppose latency exists in the completion and processing of an order. How do on-call engineers approach this problem without Trace Explorer? This is a common flow:
The SRE gets paged by the monitoring tool, because of the latency in user checkouts.
She logs into the monitoring tool and finds that the checkout HTTP request is slower than usual. She drills down further and discovers the slowness is really from
She opens a new tab in the monitoring tool for the
She zooms into the relevant time window inside the
Further analysis shows that payment processing was slow.
She follows the same troubleshooting steps for payment as those for order. She finds that
While this investigation in the metrics tool unfolds, the SRE also examines the logging tool. She reviews logs for the desired time window from all apps -
After searching and filtering for the right time window in the log tool, she correlates the metrics from payments/charge-card to the application logs from payments. She sees that charge-card verification with the external bank was very slow. This introduced the latency.
This grunt work is the problem solved by Trace Explorer and PCF Metrics. No more alt-tabbing between tools. No more wading through logs after you've correlated the time window.
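The drill-down in that flow boils down to one question: which hop spent the most time on its own work, rather than waiting on something downstream? Here is a minimal sketch of that calculation over trace data. The span fields and every duration are invented for illustration, and the arithmetic assumes a span's children run one after another without overlapping.

```python
# Hypothetical spans; names echo the scenario, durations (ms) are invented.
SPANS = [
    {"span_id": "a", "parent_id": None, "name": "checkout", "duration_ms": 2300},
    {"span_id": "b", "parent_id": "a", "name": "order", "duration_ms": 2200},
    {"span_id": "c", "parent_id": "b", "name": "payments/charge-card", "duration_ms": 2000},
]


def self_times(spans):
    """Map span name to time spent in the span itself, excluding its children.

    Assumes child spans of one parent do not overlap in time.
    """
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["name"]: span["duration_ms"] - child_total.get(span["span_id"], 0)
        for span in spans
    }


def slowest_span(spans):
    """Return the name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


if __name__ == "__main__":
    print(self_times(SPANS))
    print(slowest_span(SPANS))
```

In this made-up trace, nearly all of the checkout time sits in the card-charging call, which is the conclusion the SRE reached by hand.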
...And With Trace Explorer
Let's see how the scenario plays out for our SRE with Trace Explorer. An animated GIF is worth a thousand words as they say:
With Trace Explorer, troubleshooting time goes from hours to minutes!
What's more, you don't need intimate knowledge of the holistic system to find issues. Trace Explorer puts it all in context for you.
Trace Explorer makes every developer a capable troubleshooter for all your organization's microservices!
Getting Started with PCF Metrics 1.3
Trace Explorer is the flagship feature in this release, but there are other new features too. The module captures more app events and improves how logs and metrics are retained during tile upgrades.
Got feedback on Trace Explorer or PCF Metrics? We want to hear from you! Simply click on the Feedback icon inside the product and tell us what you think. Want to read about other new features in Pivotal Cloud Foundry 1.10? Check out the overview.
About the Author
Mukesh Gadiya is a product manager at Pivotal, helping transform how the world monitors software.