The Stoplights Metrics Dashboard for Cloud Foundry: Go or No Go

August 25, 2015 Vick Kelkar

Pivotal’s Cloud Ops team provides production operations support for Pivotal Web Services (PWS), which offers Pivotal Cloud Foundry as a hosted solution. Our responsibilities include deploying releases and providing break-fix support. Monitoring the health of the system is a critical part of this process. We rely on our “stoplights” dashboard to give us a quick go/no-go look at the operational status of PWS.

Cloud Foundry is an open source project composed of 90+ components. Using PWS, customers can easily deploy and rapidly scale their applications without coordinating infrastructure resources. The stoplights dashboard summarizes each metric as a colored light: if the metric is green, the component is healthy; a yellow metric is a warning; a red metric deserves immediate attention.
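The green/yellow/red mapping can be sketched in a few lines of Python. This is only an illustration, not the dashboard's actual configuration: the `classify` function and the threshold values are hypothetical, and it assumes a metric where lower values are worse.

```python
from enum import Enum


class Light(Enum):
    GREEN = "green"    # component is healthy
    YELLOW = "yellow"  # warning
    RED = "red"        # needs immediate attention


def classify(value: float, warning: float, critical: float) -> Light:
    """Map a metric value to a stoplight color.

    Assumes lower is worse and critical <= warning. Both thresholds
    are hypothetical and should be tuned per metric.
    """
    if value < critical:
        return Light.RED
    if value < warning:
        return Light.YELLOW
    return Light.GREEN
```

For example, with a warning threshold of 0.9 and a critical threshold of 0.7, a reading of 0.8 would show as yellow.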


As operators, the first thing we want to monitor is the availability of the main components. The BOSH HealthMonitor component reports Cloud Controller and gorouter availability by emitting the `bosh.healthmonitor.system.healthy` metric. We filter this metric for API and router to check for any sudden spikes or changes in the system. We also watch for significant changes in the number of routes and application instances in the Cloud Foundry system, as they might demand operator attention and investigation.

Other metrics to keep an eye on are `cf.collector.router.requests` and `cf.collector.router.requests_per_second`. The former is a counter, and the latter is derived from it. We filter these by the ‘component’ tag, with possible values of nil (router itself), ‘app’ (DEA), and ‘cloudcontroller’ (CC). We expect some variability in these figures, but they need immediate attention when the numbers drop below a certain range.
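To make the relationship between the counter and the per-second metric concrete, here is a minimal sketch of deriving a rate from successive counter samples. It is not the collector's implementation: the `requests_per_second` function and the `(timestamp, counter)` sample shape are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical sample shape: (unix timestamp in seconds, counter value)
Sample = Tuple[float, int]


def requests_per_second(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Derive a per-second rate from a monotonically increasing counter."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        # Skip out-of-order samples and counter resets (e.g. a restart).
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates
```

Skipping intervals where the counter goes backwards avoids reporting a large negative rate when a component restarts and its counter starts over.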

Monitoring capacity or usage of the system can help you take proactive steps as an operator. The cf.collector also emits metrics for `reservable_stagers` and available memory in the Cloud Foundry system. Another use case is monitoring Diego, the beta replacement for the Droplet Execution Agent (DEA), which has shipped with Cloud Foundry since July 2014. You can monitor Diego’s cell capacity by capturing the `MetronAgent.forwarder.rep.CapacityRemainingMemory` metric.

Another useful marker for an operator is the utilization of the system. You can plot a virtual machine’s CPU utilization by capturing the `bosh.healthmonitor.system.cpu.user` metric on your dashboard.

If all these metrics seem like a lot to track down in your already-deployed Cloud Foundry system, we have published our stoplights dashboard in the hope that it will be useful to other Cloud Foundry operators. We use Datadog to aggregate our metrics. You can find our dashboard configuration and the production stoplights dashboard in this GitHub repo.

The stoplights dashboard and the metrics mentioned in this blog post work on the understanding that when a metric falls below a certain threshold, a Cloud Foundry operator should take action or investigate. Creating dashboards with trend and/or threshold metrics will help you identify operational patterns in your Cloud Foundry deployment. By monitoring the distributed Cloud Foundry system and the behavior of each component through metrics, the operator can establish a baseline and automate corrective actions.
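One simple way to turn a baseline into a threshold is to flag readings that fall well below the recent average. The sketch below is one possible approach, not the method used on PWS: the `below_baseline` function and the default of three standard deviations are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def below_baseline(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Return True when `current` is more than k standard deviations
    below the mean of `history` (the baseline window)."""
    if len(history) < 2:
        raise ValueError("need at least two historical readings")
    floor = mean(history) - k * stdev(history)
    return current < floor
```

A check like this could feed an alert or an automated corrective action, with the window length and `k` tuned to how much variability each metric normally shows.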
