The 3 Stages to Observability for Modern Apps

June 13, 2018 Alexis Richardson

Editor's note: This is a guest post written by Alexis Richardson, CEO at Weaveworks

Developers are writing applications for the Pivotal Cloud Foundry® (PCF) platform which today means:

  • 12-factor applications in Pivotal Application Service 
  • Distributed applications and data services in Pivotal Container Service
  • A growing set of “serverless” functions and other new capabilities 

The objective is to have a single, joined-up strategy for monitoring all of these, as well as the platform they run on. In addition, you will need to think about which metrics to pay attention to and which you can leave under the control of the platform.

In this post we provide a step-by-step strategy for adding monitoring and observability to your platform team in a simple and cohesive way.

Observability

The goal of a platform is to enable developers to focus on user happiness and business logic - so you will need to prioritize user and business metrics.  To relate these back to overall system operations you will need to care about Observability.  

Observability is a property of your system. In a nutshell - if you can’t observe your system then you can’t understand it, operate it properly or fix it when it goes wrong.   Advanced automated platforms like Pivotal Cloud Foundry and Kubernetes run hundreds or even thousands of applications as part of a distributed system, so developers need to cultivate a shared understanding of how their apps work on such platforms.  

If you become aware that user experience is degrading you will want to be able to visualize your whole system, locate potential areas of failure, and interact with components, log stores and other services. The ability for developers to observe any part of a system, ask questions and find answers quickly is a precondition for successful operations.


  
Three stages to success

We believe that everyone on the cloud native journey will proceed at their own speed.  Our recommendation is to attend to each of these stages in turn. At each step success can be measured based on more productive users.

  • Collection – Enterprises using dynamic platforms like Kubernetes need Prometheus to monitor their apps and infrastructure, collect the right metrics and, most importantly, create alerts on those metrics so that developers can react appropriately.
  • Correlation – Once metrics are collected and alerts generated, developers need to understand an application through visualization, logging and interactive debugging tools. A unified dashboard across PAS and PKS helps you understand your system. By practicing GitOps you can also maintain the state of your system and recover more easily from system disaster.
  • Causation – With the right tools available, developers can monitor applications with the goal of gaining complete observability and determining the root cause of application problems.

Step One - Collection

The first objective is to collect any metrics that you will need.  We recommend instrumenting your services using Prometheus - the cloud native monitoring and alerting tool.

There are two basic requirements. First, you must be able to move seamlessly between business, user, app, cluster and host metrics. Second, you are dealing with a highly automated and dynamic environment, so you cannot use “old” monitoring tools that are hard bound to machines, hosts, and relatively fixed IP networks.

In a container-native environment using microservices, the velocity of change is much greater than in a monolithic setup running in a virtualized machine environment. Containers for an app or service get created and destroyed every second. Along with that, container orchestration software like Kubernetes dynamically creates and destroys nodes, pods, and replicas to scale with the needs of your service or app, or to “self-heal” when any of these components has failed.

This is why using Prometheus for monitoring and alerting is gaining popularity.  Prometheus is designed for the high volume of fine grained metrics you will need when running containers and microservices as part of your cloud native platform.  Prometheus has a vast range of application level data collectors including of course Spring, Docker, Kubernetes, etc.  
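To make the collection step a little more concrete, here is a minimal sketch of what a scraped metric looks like on the wire. It renders one metric family in the Prometheus text exposition format using only the Python standard library. The metric name, label names and values are hypothetical, and a real service would normally use an official Prometheus client library rather than formatting lines by hand.

```python
from typing import Dict, List, Tuple

# One sample: (label dict, value). Metric and label names are hypothetical.
Sample = Tuple[Dict[str, str], float]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, help_text: str, metric_type: str,
                  samples: List[Sample]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total", "Total HTTP requests.", "counter",
        [({"app": "checkout", "code": "500"}, 3.0)],
    ))
```

Because each sample carries its own labels (app, status code and so on), the same metric can be sliced by business, app or host dimension later, without being bound to a fixed machine or IP address.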

A Collection Solution for PCF

Weaveworks provide an enterprise-grade managed Prometheus which can be added to PCF (PAS, PKS) in a few clicks so users can start collecting metrics from all components of PCF including the apps and services running on it. You can use Weave Cloud to visualize all of your services in a graphical context-sensitive map, and also use it to monitor, troubleshoot and send you alerts about your applications. Check out the Weave Cloud for PCF available on Pivotal Services Marketplace.


Step Two - Correlation

Congratulations! You solved the first challenge, and you are now tracking hundreds (if not thousands) of ephemeral containers and app components. You will be able to adapt to any microservices as you evolve. Your business is “as dynamic as the platform”.

Your new objective is to understand and explain the system. Operations teams are responsible for user happiness expressed as service health - SLIs like error rate, request latency, or queries per second, and SLOs like minimal downtime.  
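As a rough illustration of the difference between an SLI and an SLO, the sketch below computes an error-rate SLI from request counts and checks it against an objective. The class name, function names and the 1% threshold are assumptions made for this example, not values from any particular platform.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one measurement window."""
    total: int
    errors: int


def error_rate(window: RequestWindow) -> float:
    """SLI: fraction of requests that failed (0.0 when there was no traffic)."""
    if window.total == 0:
        return 0.0
    return window.errors / window.total


def meets_slo(window: RequestWindow, max_error_rate: float) -> bool:
    """SLO check: is the measured error rate within the agreed objective?"""
    return error_rate(window) <= max_error_rate


if __name__ == "__main__":
    last_five_minutes = RequestWindow(total=1000, errors=20)
    # Hypothetical objective: at most 1% of requests may fail.
    print(meets_slo(last_five_minutes, max_error_rate=0.01))
```

The indicator is the measurement; the objective is the threshold you agree to hold it to. An alert is typically raised when the check fails.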

Monitoring can tell you if you are not meeting Service Level Indicators (SLI) and Service Level Objectives (SLO), but it has little explanatory power to help you diagnose and fix issues. A richer overall approach is needed that accounts for Observability. Here are some use cases for delivering Observability:

  • Validate that an application or component is in the correct state, by comparing it with a description of the desired state (e.g. a config file, or an alerting threshold).
  • Correlate deployment events and histories with application metrics. This becomes especially important when you have a high-velocity team that deploys multiple times a day. Tracing, logging, and visualization of the services are other techniques for collecting data that indicate the operational wellness of the service.
  • Correlate between multiple components to describe a high-level behaviour impacting business users. For example, the orchestrator repeatedly starts and stops a buggy container, leading to unhappy users.
  • Troubleshoot serious application issues as they arise. How do you quickly diagnose a problem in an ever-changing environment?
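The first use case above, validating actual state against a description of desired state, can be sketched in a few lines. This is only an illustration under simple assumptions: both states are flat dictionaries, and the keys (`replicas`, `image`) are hypothetical.

```python
from typing import Any, Dict, Tuple


def state_drift(desired: Dict[str, Any],
                actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, actual_value)} for every key that differs.

    A key missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    desired = {"replicas": 3, "image": "shop:1.4"}   # e.g. from a config file
    actual = {"replicas": 2, "image": "shop:1.4"}    # e.g. reported by the platform
    print(state_drift(desired, actual))
```

An empty result means the component matches its description; anything else is a concrete, named difference that a person or an automated process can act on.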

A Correlation Solution for PCF

To help with these use cases you will need to observe and understand many components of your system at once.  Weaveworks provide support for this in PCF:

  • Aggregation of full stack monitoring - host, cluster and app metrics - with visualization and interactive management, e.g. ssh/debugging and log viewing
  • A unified dashboard across the full stack, with filtering and auto-generation to save human users from mental overload
  • Integration of different exploration tools for team-level diagnostics, e.g. incident handling notebooks and support for developer-created Prometheus tools

One of the obstacles with aggregating metrics from tracing, logging and visualization is managing and making sense of the volume of data. It is a huge cognitive challenge to determine the relative importance of each data point and what it means in context with any other source of metrics or logs. Hence the aggregation of metrics and visualization of trending data in a focused and actionable dashboard becomes key. Ideally you want to focus on key performance indicators that can alert you to any potential bottlenecks so you can proactively put an improvement in place or identify the root cause.

The tool or dashboard should be application-centric. This means that the dashboard should be able to generate service metrics and correlate them with deployment events and histories, so that you can analyze and compare current vs. past performance to make more informed decisions.
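A minimal sketch of what correlating a metric with a deployment event might look like: compare the mean of a time series just before and just after the deployment timestamp. The function names, the window length and the sample data are assumptions for illustration; a real dashboard would query these series from its metrics store.

```python
from statistics import mean
from typing import List, Optional, Tuple

# (unix_timestamp_seconds, value) pairs; all data here is made up for illustration.
Series = List[Tuple[float, float]]


def mean_in_range(series: Series, start: float, end: float) -> Optional[float]:
    """Mean of samples with start <= timestamp < end, or None if there are none."""
    values = [value for ts, value in series if start <= ts < end]
    return mean(values) if values else None


def change_around_deploy(series: Series, deploy_ts: float,
                         window_s: float = 300.0) -> Optional[float]:
    """Difference between the mean after and the mean before a deployment."""
    before = mean_in_range(series, deploy_ts - window_s, deploy_ts)
    after = mean_in_range(series, deploy_ts, deploy_ts + window_s)
    if before is None or after is None:
        return None
    return after - before


if __name__ == "__main__":
    latency_s = [(0, 0.1), (60, 0.1), (120, 0.5), (180, 0.5)]
    # A positive result means the metric rose after the deployment at t=120.
    print(change_around_deploy(latency_s, deploy_ts=120, window_s=120))
```

A change like this is a correlation, not proof of cause, but it tells you which deployment to look at first.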

With that in mind, we want to introduce you to an observability solution for PCF - Weave Cloud. It is a developer-centric tool that allows you to gather observability metrics as well as real-time views of your entire PCF platform and the apps running on it.

 

Weave Cloud allows you to gather and push time-series metrics about the health of the PCF platform itself, including metrics about e.g. CPU and memory usage, how many apps are running, and other metrics that are valuable for a PCF operator.

For example, here we can group the CPU usage in the system by BOSH jobs:
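As a rough illustration of that kind of grouping, the sketch below sums CPU samples by a label, which is similar in spirit to a `sum by (...)` aggregation in a Prometheus query. The `job` label, the job names and the values are hypothetical, and this is not how Weave Cloud itself computes the view.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (labels, cpu_percent) samples; the "job" label and the values are hypothetical.
CpuSample = Tuple[Dict[str, str], float]


def cpu_by_label(samples: List[CpuSample], label: str) -> Dict[str, float]:
    """Sum CPU usage per value of the given label ("unknown" if it is missing)."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, cpu in samples:
        totals[labels.get(label, "unknown")] += cpu
    return dict(totals)


if __name__ == "__main__":
    samples = [
        ({"job": "router", "index": "0"}, 10.0),
        ({"job": "router", "index": "1"}, 20.0),
        ({"job": "diego_cell", "index": "0"}, 50.0),
    ]
    print(cpu_by_label(samples, "job"))
```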


Step Three - Causation

Let’s summarize the steps thus far.

First, collect metrics and generate alerts.  For Kubernetes, Docker and Cloud Foundry we highly recommend Prometheus and some enterprise features.
Second, observe and correlate across components, to make the system more understandable.  Filter out noise using dashboarding to focus the UX.

There’s still one final objective: establishing Causation. This can be very hard. And so, at the risk of disappointing readers who have come with us so far, we shall be brief. We’ll describe some of the issues, and hope one day to write a survey of solution techniques and new products in the space.

Microservices form complex networks of behaviour and involve many layers of technology - routing, discovery, etc.  Given an apparently localized fault, how do we establish a root cause for the problem?  How do we dig into systems that keep on misbehaving?  How do we understand the causes of complex distributed systems failures?

This is where Observability is so important. A system is observable if developers can understand its current state from the outside - and therefore have even half a chance of figuring out what may be wrong. Please do explore further details on Observability and fixing things that go wrong on the Weaveworks blog.

What are we looking for?  

  1. Bad user experiences 
  2. Failures that get fixed by the scheduler before we can register negative user impact
  3. The dreaded “grey failures” (read this)
  4. Patterns in high cardinality data (read this)
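The second item is easy to miss precisely because the platform hides it. One way to surface it, sketched below, is to watch restart counters: a container that the scheduler keeps restarting looks healthy at any single instant but not over a window. The function names, window and threshold are assumptions for illustration, and each series is assumed to be sorted by timestamp.

```python
from typing import Dict, List, Tuple

# (unix_timestamp_seconds, cumulative_restart_count) per container; hypothetical data.
RestartSeries = List[Tuple[float, int]]


def restarts_in_window(series: RestartSeries, now: float, window_s: float) -> int:
    """Restarts seen in the last window_s seconds, from a cumulative counter.

    Assumes the series is sorted by timestamp.
    """
    recent = [count for ts, count in series if now - window_s <= ts <= now]
    if len(recent) < 2:
        return 0
    return max(0, recent[-1] - recent[0])


def flapping_containers(by_container: Dict[str, RestartSeries], now: float,
                        window_s: float = 600.0, threshold: int = 3) -> List[str]:
    """Names of containers restarted at least `threshold` times in the window."""
    return sorted(
        name for name, series in by_container.items()
        if restarts_in_window(series, now, window_s) >= threshold
    )


if __name__ == "__main__":
    data = {
        "cart": [(0, 0), (200, 2), (400, 5)],
        "web": [(0, 1), (400, 1)],
    }
    print(flapping_containers(data, now=400))
```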

Making an application or service observable means developers can be in charge of not only monitoring an app’s behavior, but also the impact it will have on their app’s users. You can solve some of these issues using platform tools like PCF and Weave Cloud, but ultimately this is where we hit the bleeding edge of technology and analytics. For now, let’s wrap up by saying “it is a good idea to think about how to observe an application, as you develop it”.

Conclusion

The era of cloud native applications is upon us and developer velocity is everyone’s goal.  Does this mean that operations is no longer needed?  Certainly not.  Your ops team is more valuable than ever before, but their role has changed and they must learn a new language. That is the language of Observability - the latest evolution of monitoring. 

Start understanding your system and download the Weave Cloud for PCF tile from the Pivotal Network - your 30 day free trial can start now. 

About the Author

Alexis Richardson

Alexis is the co-founder and CEO of Weaveworks. He is also the chairman of the TOC for the Cloud Native Computing Foundation (CNCF), and the co-founder of the Coed:Code meetups. Previously he was at Pivotal, as head of products for Spring, RabbitMQ, Redis, Apache Tomcat and vFabric.
