Monitoring Pivotal Cloud Foundry Health and Status: Hybrid Models, KPIs, and More

January 27, 2015 Jamie O'Meara

2014 was a very busy year for many customers adopting Platform as a Service (PaaS). I spent many weeks conducting “Operator Workshops” on Pivotal Cloud Foundry, helping customers make decisions on how to organize their environments and preparing them for PaaS. During our discussions, a recurring theme formed around the ability to monitor the health and status of Pivotal Cloud Foundry.

This post highlights many of the best practices for creating custom dashboards to monitor the health and status of Pivotal Cloud Foundry. We’ll also discuss the key performance indicators (KPIs) and which KPIs give us the best insight into Pivotal Cloud Foundry performance. Many of the technologies used in this post were based on customer installs, and Pivotal Cloud Foundry integrates with a number of dashboarding solutions.

The Setup

A very common enterprise use case is to leverage hybrid cloud computing, combining public and private datacenters into a single operational environment. Pivotal Cloud Foundry is an ideal platform for hybrid computing as it supports a number of cloud providers. To accelerate my delivery, I too leveraged the public cloud in the form of vCloud Air. I installed Pivotal Cloud Foundry, Pivotal Ops Metrics and vRealize Hyperic in my vCloud Air virtual datacenter to gather the required Pivotal Cloud Foundry metrics.

Pivotal Ops Metrics exposes the metrics via a JMX endpoint, giving any dashboard solution a single access point from which to read and display them. In my case, I used a Hyperic plugin that makes JMX requests and stores the results in a local Hyperic database. As you can see in the configuration properties section of Hyperic, the only requirements are a URL pointing to the Pivotal Ops Metrics endpoint and the username and password configured during installation.

[Figure: Hyperic configuration properties for the Pivotal Ops Metrics endpoint]

In my private datacenter, I had a previous version of vRealize Operations Manager monitoring my on-premises production version of Pivotal Cloud Foundry. I configured vRealize Operations Manager to periodically connect to the Hyperic server (in the vCloud Air virtual datacenter) to extract the Pivotal Ops Metrics data for analysis and display in vRealize Operations Manager.

[Figure: High-level architecture of the hybrid cloud]

Dashboards and KPIs

To monitor the health and status of Pivotal Cloud Foundry, I created three main dashboards: Pivotal Cloud Foundry Health, DEA Health and Application Health. Let’s review each of the dashboards to understand the key performance indicators collected by Pivotal Ops Metrics.

Pivotal Cloud Foundry Health Dashboard

The Pivotal Cloud Foundry Health Dashboard provides an operational view of the health and status of Pivotal Cloud Foundry. It provides insight into the collection pipeline, router health and status, the Pivotal Cloud Foundry architecture components, apps and users, and the Cloud Controller interactions generated by developers and administrative tasks.

[Figure: Pivotal Cloud Foundry Health Dashboard]

Collection Panel

The Collection Health Panel displays the health of the Hyperic server, Hyperic agent and Hyperic database that collect the metrics. Any concerns with the collection pipeline will be identified early by the nine predictive analytics algorithms in vRealize Operations Manager. Here, you can see the Hyperic server has a health index of 85, and operators can select it to gain deeper insight into its rating.

Router Health and Router Panels

The Router Health Panel displays the total routes, total requests and bad gateways. You’ll also notice its sister panel, Router, which provides an additional metric regarding the registry update time. The two metrics I watch the most are:

  • Total Routes: In Pivotal Cloud Foundry, applications are responsible for broadcasting their routes, and a significant decline in routes could indicate network connectivity or message bus (NATS) issues. If your installation has multiple routers, which most do for high availability, it is best to monitor the total routes for each router to identify any misbehaving routers.
  • Registry Update in Milliseconds: This metric captures prolonged updates (greater than 5 seconds) that can affect the platform’s performance.
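
The two router checks above can be sketched as a small comparison between consecutive samples. This is only an illustration, not part of Pivotal Ops Metrics or Hyperic: the `RouterSample` fields, the router names and the 20% route-drop limit are assumptions made for the example, while the 5-second registry update limit comes from the text.

```python
from dataclasses import dataclass

# The 5-second limit comes from the text; the 20% route-drop limit is an
# assumption chosen purely for illustration.
REGISTRY_UPDATE_LIMIT_MS = 5000
ROUTE_DROP_LIMIT = 0.20


@dataclass
class RouterSample:
    router: str                # hypothetical router identifier
    total_routes: int          # routes registered on this router
    registry_update_ms: float  # duration of the last registry update


def router_alerts(previous, current):
    """Compare two consecutive samples per router and return alert strings."""
    before = {s.router: s for s in previous}
    alerts = []
    for sample in current:
        # Flag prolonged registry updates.
        if sample.registry_update_ms > REGISTRY_UPDATE_LIMIT_MS:
            alerts.append(f"{sample.router}: slow registry update "
                          f"({sample.registry_update_ms:.0f} ms)")
        # Flag a significant decline in routes on this individual router.
        old = before.get(sample.router)
        if old is not None and old.total_routes > 0:
            drop = (old.total_routes - sample.total_routes) / old.total_routes
            if drop > ROUTE_DROP_LIMIT:
                alerts.append(f"{sample.router}: routes fell {drop:.0%}")
    return alerts


previous = [RouterSample("router/0", 200, 40), RouterSample("router/1", 200, 35)]
current = [RouterSample("router/0", 198, 45), RouterSample("router/1", 120, 6200)]
for alert in router_alerts(previous, current):
    print(alert)
```

Tracking each router separately, rather than a platform-wide total, is what lets a single misbehaving router stand out.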

Ops Manager Director Panel

The Ops Manager Director Panel represents the health and status of MicroBOSH. This panel is busy during cloud operations or infrastructure failures, and it also displays disk consumption, CPU Average Usage %, Writes Per Second, and Memory Usage %.

Pivotal Cloud Foundry Architecture Panel

The Pivotal Cloud Foundry Architecture Panel displays the health of each of the Pivotal Cloud Foundry components.

Apps and Users Health Panel

The Apps and Users Health Panel displays the total number of users as well as the total number of application instances.

Cloud Controller Health Panel

The Cloud Controller Health Panel provides visibility of developer and administrative interaction with Pivotal Cloud Foundry. This gives operators an understanding of the level of service provided by Pivotal Cloud Foundry.

DEA Health Dashboard

The DEA Health Dashboard provides insight into the platform’s ability to stage applications and create new application instances, and helps determine the best time to scale the platform for further application deployments.

[Figure: DEA Health Dashboard]

DEA Free Memory Health Panel

The DEA Free Memory Health Panel measures the amount of free memory in each DEA. Free memory represents future capacity for additional application instances. As a DEA nears the 1 GB threshold of free RAM, it can no longer stage applications.

Available Stagers Panel

The Available Stagers Panel represents an estimate of the remaining stagers in a given DEA. When Available Stagers is zero, the DEA can no longer stage. If none of the DEAs can stage, capacity is full and future application deployments or application scale requests will receive a “no available stagers” error. Operators should be aware that, to stage, an application requires free memory on the DEA of 1 GB or the requested memory size for the application, whichever is greater.
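
The staging rule above (1 GB or the requested memory size, whichever is greater) can be written as a short check. This is a minimal sketch under that stated rule only; the function names and the DEA names in the sample data are hypothetical, and the platform’s own stager estimate may take other factors into account.

```python
STAGING_FLOOR_MB = 1024  # the 1 GB minimum described in the text


def memory_needed_to_stage(requested_mb):
    """Free memory a DEA needs before it can stage the application."""
    return max(STAGING_FLOOR_MB, requested_mb)


def deas_able_to_stage(free_memory_mb, requested_mb):
    """Return the DEAs (hypothetical names) with enough free memory to stage."""
    needed = memory_needed_to_stage(requested_mb)
    return [dea for dea, free in free_memory_mb.items() if free >= needed]


# Hypothetical free-memory readings per DEA, in MB.
free_memory_mb = {"dea/0": 900, "dea/1": 1500, "dea/2": 3000}

print(deas_able_to_stage(free_memory_mb, 512))   # small app still needs 1 GB
print(deas_able_to_stage(free_memory_mb, 2048))  # large app needs its own size
print(deas_able_to_stage(free_memory_mb, 4096))  # empty: "no available stagers"
```

Note that even a 512 MB application cannot stage on a DEA with 900 MB free, which is why free memory near the 1 GB mark is worth alerting on.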

Disk and Memory Ratio Panel

The Disk and Memory Ratio Panel shows the disk and memory ratio metrics for each DEA. For details on disk and memory ratios, please see the Pivotal Cloud Foundry documentation on Ops Metrics.

Application Health Dashboard

The Application Health Dashboard provides an operational view of the applications running in the platform. Its primary purpose is to display the health of application instances, identify missing application instances, highlight any crashed application instances and evaluate the number of messages processed by the Health Manager.

[Figure: Application Health Dashboard]

Application Health Panel

The Application Health Panel displays three important metrics that measure application health and help operators determine whether there are any issues providing capacity and service to developers.

  • The number of desired apps is plotted in red. This metric shows operators how many applications are desired to run on the platform.
  • The number of applications with all instances reporting is plotted in blue; it is not visible here because it is hidden by the desired apps line. This metric tells operators whether an application has all of its application instances reporting, regardless of state (e.g. Starting, Running, Crashed).
  • The number of running application instances is plotted in green. This metric shows the number of application instances with a state of “starting” or “running.”

Missing Application Panel

The Missing Application Panel helps operators identify applications with missing instances. The metrics used are:

  • The number of apps with missing instances. This metric shows the number of desired applications for which an instance is missing (i.e. the instance is simply not heartbeating at all).
  • The number of missing indices, which represents the number of application instances missing. These are instances that are desired but are simply not heartbeating at all.
  • The number of undesired running apps. This metric measures the number of undesired applications with at least one instance reporting as STARTING or RUNNING. Undesired applications are applications the developer has decided to stop but whose application instances are still active (Starting/Running).
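
To make the difference between these three counts concrete, here is a minimal sketch that derives them from a desired-state table and a set of heartbeats. It follows the definitions in the list above as I read them; the data layout (`desired`, `heartbeats`) and the application names are hypothetical and do not reflect how the Health Manager stores its state.

```python
ACTIVE_STATES = ("STARTING", "RUNNING")


def missing_metrics(desired, heartbeats):
    """
    desired:    app name -> desired instance count (0 = stopped by the developer)
    heartbeats: app name -> {instance index: reported state}
    Both structures are hypothetical and exist only for this illustration.
    """
    apps_with_missing = 0
    missing_indices = 0
    for app, count in desired.items():
        reporting = heartbeats.get(app, {})
        # An index is missing when it is desired but not heartbeating at all.
        missing = [i for i in range(count) if i not in reporting]
        if missing:
            apps_with_missing += 1
            missing_indices += len(missing)

    # Undesired apps: no instances desired, yet at least one is still active.
    undesired_running = sum(
        1
        for app, reporting in heartbeats.items()
        if desired.get(app, 0) == 0
        and any(state in ACTIVE_STATES for state in reporting.values())
    )
    return {
        "apps_with_missing_instances": apps_with_missing,
        "missing_indices": missing_indices,
        "undesired_running_apps": undesired_running,
    }


desired = {"web": 3, "api": 2, "batch": 0}
heartbeats = {
    "web": {0: "RUNNING"},                   # indices 1 and 2 are missing
    "api": {0: "RUNNING", 1: "STARTING"},    # fully reporting
    "batch": {0: "RUNNING"},                 # stopped, but still running
}
print(missing_metrics(desired, heartbeats))
```

In this example one application accounts for two missing indices, which is why the two counts are worth plotting separately.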

Crashed Application Panel

This panel identifies application crashes using these metrics:

  • The number of crashed application instances, which measures the number of instances reporting as crashed. This number represents the total number of crashed containers that remain on the Droplet Execution Agents (DEAs). Crashed containers remain on the DEAs for up to 60 minutes after a crash is detected, giving developers an opportunity to inspect the crashed application instance via the API.
  • The number of crashed indices, which shows the number of indices reporting as crashed. Because of the restart policy, an individual index may have many crashes associated with it. This metric helps identify crashed applications in the platform that require attention.
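
The distinction between crashed instances and crashed indices can be illustrated with a few lines of code. This is a sketch based on the descriptions above, assuming each crashed container still held on a DEA is represented by a hypothetical `(app, index)` tuple; it is not how the platform computes the metrics internally.

```python
def crashed_metrics(crash_reports):
    """
    crash_reports: list of (app, index) tuples, one per crashed container
    still held on a DEA (a hypothetical representation).
    Returns (crashed_instances, crashed_indices).
    """
    crashed_instances = len(crash_reports)     # every crashed container counts
    crashed_indices = len(set(crash_reports))  # each index counts only once
    return crashed_instances, crashed_indices


# Index 0 of "web" was restarted and crashed three times; "api" crashed once.
reports = [("web", 0), ("web", 0), ("web", 0), ("api", 1)]
print(crashed_metrics(reports))  # (4, 2)
```

A large gap between the two numbers suggests a small set of indices that keep crashing on restart, rather than many distinct failures.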

Health Manager Messages Sent Panel

This panel measures the number of events per minute processed by the Health Manager. The events included are Start Crashed, Start Evacuating, Start Missing, Stop Duplicate, Stop Evacuation Complete and Stop Extra.

As you plan, install and operate your Pivotal Cloud Foundry service, it is important to provide the correct level of service for your developers and end users. The best way to gain operational insight and avoid service disruption is to build and operate a predictive analytics dashboard using Pivotal Ops Metrics. Happy dashboarding!
