PCF Healthwatch: “Out of the Box” Monitoring for Your Platform

January 12, 2018 Amber Alston

“Distributed systems are hard ... but wicked cool.”

That quote, from Cornelia Davis of Pivotal, sums up the state of software delivery.

If you’re an operator at a big company, the phrase is especially apt. You help your developers harness the “cool,” while taming the hard stuff.

PCF operators - the engineers who look after the platform - achieve remarkable efficiencies. It’s common to see an ops team of 8 supporting hundreds of developers. (SpringOne Platform attendees recently heard many such examples.)

Of course, there are always more jobs to be done. Especially in the world of distributed systems, where everything changes all the time.

One pressing job to be done: a better way to keep tabs on PCF itself. Operators told us they wanted a better monitoring solution for the platform. Sure, there are lots of tools to track VMs and infrastructure. But there isn’t one optimized for PCF platform metrics. Until now!

Say hello to PCF Healthwatch, announced earlier this month. The product is now GA, and part of Pivotal Cloud Foundry 2.0. The tile is available on the Pivotal Network.

PCF Healthwatch helps operators monitor and understand the current health of the platform. The service tracks the recommended performance and scaling indicators for a given version of PCF.

Here’s the cool part: the product renders operational data in colorful dashboards. Is everything OK? What needs my attention? Is anything on fire? PCF Healthwatch shows you instantly.

You need new tools to wrangle distributed systems. That’s why we offer BBR (for backup and restore) and PCF Metrics (to troubleshoot microservices). Think of PCF Healthwatch in this same vein: a new product, designed for the era of distributed systems.

Let’s take a deeper look at why PCF Healthwatch is so handy for platform operators.

PCF Healthwatch is an Operational Dashboard for the Platform

First up: the dashboard. PCF Healthwatch covers the recommended Key Performance Indicators and Key Scaling Indicators. Data is grouped into three sensible categories:

  • End-User Impact shows you how your apps are doing in production. (“Is latency a problem for me right now?”)

  • Developer Impact conveys the health of the Cloud Foundry CLI and useful details about available capacity. (“Can my devs push code as expected? Is there sufficient memory for them to push and scale apps?”)

  • Platform Impact displays the status of Ops Manager, BOSH, and the underlying VMs. (“Is BOSH healthy and managing my VM resiliency as expected? Can I proceed with a platform upgrade now?”)

PCF Healthwatch shows you essential data about the health of your Pivotal Cloud Foundry installation.

PCF Healthwatch is helpfully updated to track the most important indicators for a given release of PCF. Previously, operators would have to tweak their bespoke platform monitoring setup. Say goodbye to that toil - PCF Healthwatch keeps you updated automatically!

So how does the product work? Where does the data come from? Simple. Healthwatch constantly runs validation tests in four areas:

  1. Cloud Foundry CLI Health. The CLI is how developers push apps to the platform. PCF Healthwatch executes a continuous test suite that validates the core functions of the CLI. With this approach, you don’t need to wait for your devs to report an issue. Healthwatch will flip these metrics from green to red immediately if there’s a problem.

  2. Ops Manager Health. You use Ops Manager to do upgrades and scale PCF. If an issue crops up, your ability to perform these tasks could be compromised. Healthwatch monitors Ops Manager availability for you. Once again, the dashboard refreshes to show you when an undesirable condition pops up.

  3. Apps Manager Health. For Healthwatch, Apps Manager performs a unique function: it’s a canary app. The product checks on Apps Manager health as a leading indicator for availability and responsiveness. You’ll know about any hiccups in Apps Manager right away. This way, it’s easier for you to get in front of issues related to apps running on the platform.

  4. BOSH Director Health. The BOSH Director is buried deep in the guts of the platform. Issues with BOSH Director rarely impact the end-users of your apps running on PCF. But they can mean the loss of resiliency in BOSH-managed VMs. That’s why Healthwatch checks on the BOSH Director as a part of its test suite.

Capacity and logging performance loss rates are tracked too. The documentation goes into much more detail about each metric and why it matters.
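To make the idea of continuous validation tests concrete, here is a minimal Python sketch of a check runner that turns pass/fail results into green or red statuses. It is an illustration only, not how Healthwatch is implemented: the names `CheckResult`, `run_checks`, and `overall_status` are hypothetical, and the checks are stand-in functions rather than real calls to the CLI, Ops Manager, Apps Manager, or the BOSH Director.

```python
from dataclasses import dataclass
from typing import Callable, Dict

GREEN = "green"
RED = "red"


@dataclass
class CheckResult:
    """Outcome of one validation test (hypothetical structure)."""
    name: str
    status: str
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, CheckResult]:
    """Run every check once; a False result or an exception marks it red."""
    results: Dict[str, CheckResult] = {}
    for name, check in checks.items():
        try:
            passed = check()
            results[name] = CheckResult(name, GREEN if passed else RED)
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = CheckResult(name, RED, detail=str(exc))
    return results


def overall_status(results: Dict[str, CheckResult]) -> str:
    """The platform is green only if every individual check is green."""
    return GREEN if all(r.status == GREEN for r in results.values()) else RED


if __name__ == "__main__":
    # Stand-in checks; a real suite would exercise the actual components.
    demo_checks = {
        "cli": lambda: True,
        "ops_manager": lambda: True,
        "apps_manager": lambda: False,
        "bosh_director": lambda: True,
    }
    demo_results = run_checks(demo_checks)
    for result in demo_results.values():
        print(result.name, result.status)
    print("overall:", overall_status(demo_results))
```

Running a loop like this on a schedule is what lets a dashboard flip from green to red without waiting for a developer to report a problem.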

It gets better. PCF Healthwatch works with the Loggregator Firehose. It’s easy to publish the results of validation tests to your favorite monitoring tools. Hurray for extensibility!
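As a rough sketch of what publishing test results to another tool could look like, the Python below converts pass/fail results into simple metric points and hands them to a sink function. Everything here is an assumption for illustration: the `healthwatch.example` prefix, the 1.0/0.0 encoding, and the `to_metric_points` and `publish` helpers are hypothetical, and the sketch does not use the real Firehose API or real Healthwatch metric names.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# (metric name, value, unix timestamp) - a hypothetical, simplified shape
MetricPoint = Tuple[str, float, float]


def to_metric_points(
    results: Dict[str, bool],
    prefix: str = "healthwatch.example",  # hypothetical prefix, not a real metric namespace
    now: Optional[float] = None,
) -> List[MetricPoint]:
    """Encode each test result as 1.0 (pass) or 0.0 (fail)."""
    timestamp = time.time() if now is None else now
    return [
        (f"{prefix}.{name}", 1.0 if passed else 0.0, timestamp)
        for name, passed in sorted(results.items())
    ]


def publish(points: List[MetricPoint], sink: Callable[[MetricPoint], None]) -> int:
    """Send every point to the sink (e.g. a monitoring client) and return the count."""
    for point in points:
        sink(point)
    return len(points)


if __name__ == "__main__":
    # Here the "monitoring tool" is just print; swap in a real client as needed.
    sample = {"cli_push": True, "ops_manager_up": False}
    publish(to_metric_points(sample), sink=print)
```

Keeping the sink as a plain callable means the same conversion can feed whichever monitoring tool a team already uses.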

Use PCF Healthwatch to “Manage the Panic”

We’ve been testing a beta of PCF Healthwatch with a few customers. Internal teams here at Pivotal have been helping us too. Here are a few initial impressions from operators:

  • “Healthwatch helps us ‘manage the panic’.” One ops team had just completed an upgrade of the platform. Healthwatch flagged an unhealthy job: `clock global`. This is an obscure process, without obvious documentation. But thanks to Healthwatch, the team could see that the platform was working normally. The error wasn’t affecting other parts of the platform. The team confidently concluded that the error wasn’t critical. The issue was later resolved during normal working hours.

  • “A Pivotal-provided service is just what we need.” Pivotal knows its products best. An opinionated platform monitoring solution that reflects the company’s expertise is a welcome enhancement.

  • “Automatic updates that track new KPIs with each PCF version are a big help.” Operations teams don’t have time to adjust homegrown dashboards with each new release. This work is toil that doesn’t help the business. With PCF Healthwatch, ops have the new metrics tracked immediately after an upgrade.

Let’s Learn Together

Pivotal is a learning organization.

Our goal with the launch of PCF Healthwatch is the same as with any other release: to learn from you. We look forward to partnering with you on your journey to get better at software. While we’re at it, let’s make distributed systems that much easier!

Ready to try out PCF 2.0? Check out Small Footprint, PCF Dev, or spin up a free trial on Pivotal Web Services.

About the Author

Amber Alston

Amber Alston is a Principal Product Manager for Pivotal Cloud Foundry. From consumer apps to immersive training simulators, she has 10+ years of experience in identifying and translating strategic business objectives into the delivery of highly useful and usable technology solutions. She holds dual Master's Degrees in Engineering & Communication. She's been described as a learner, a gamer, a foodie, and a random hobby skills collector.
