How To: Configuring Nagios to Monitor Pivotal Cloud Foundry

February 25, 2015 Guest Blogger

In this post, guest blogger Chris Mattingly, a solutions architect at EMC, shares how to set up Nagios-based monitoring for Pivotal Cloud Foundry, based on his experience setting it up in the field.

For those of us that have worked with Pivotal Cloud Foundry, a good monitoring solution to watch over the Pivotal Cloud Foundry infrastructure is, simply put, a wonderful thing. How else will you know that the persistent disk on the RiakCS nodes is close to filling up or that there is a bad query driving up the CPU utilization on the MySQL nodes?

Nagios to the rescue! If you’re not familiar with Nagios, you can check it out at http://www.nagios.org. Basically, Nagios provides a solid, open source infrastructure monitoring solution that can be used with Pivotal Cloud Foundry. I won’t cover the initial Nagios installation and setup here; the setup is very straightforward, and I will leave it to the reader to research. The GitHub examples, along with the script explanations below, provide additional details on the configuration.

Background

The lack of monitoring became very real for my team recently in our development space. We had installed the RiakCS tile with the defaults, which meant we only had 10 GB for the persistent disk on each of the cluster nodes. The complaints from the developers started rolling in when they were unable to retrieve anything from RiakCS and did not know why. A status check in Ops Manager showed the problem rather quickly: the disks were 100% full, and in turn, the CPU was also fully spiked.

It is operationally unrealistic to task someone with checking the status page for all installed tiles on a daily basis (or even more frequently). If you try, this will become real clear, real fast. The only option here is automation, and so we realized we needed a monitoring solution in place to start watching some key metrics for our Pivotal Cloud Foundry infrastructure and immediately alert us to problems. This led us to the familiar territory of Nagios.

So How Can Nagios Monitor Pivotal Cloud Foundry?

Good question! Especially considering that we don’t want to:

  1. Have to modify the Pivotal Cloud Foundry stemcells.
  2. Install any agents on the Pivotal Cloud Foundry nodes—which could be lost or overwritten if Pivotal Cloud Foundry needed to rebuild the VM from the stemcell.

Before we can dig into the Nagios configuration for the Pivotal Cloud Foundry monitoring, the first prerequisite is that you have installed the “Ops Metrics” tile. Once the tile is installed and running, you will need to make note of:

  1. The “JMX Provider Credentials” you provided during the tile installation. You can retrieve them from the “Credentials” tab on the Ops Metrics tile if you didn’t write them down—or change them from the “Settings” tab.
  2. The Ops Metrics VM’s IP address, which is available from the “Status” tab on the Ops Metrics tile.

By having Ops Metrics available, we now have a JMX interface for host-level statistics on every VM running in our Pivotal Cloud Foundry environment. These statistics include CPU, disk, load average and an “is healthy” check.

Note: At the time of this writing, your Nagios monitoring server must be in the same subnet as the Pivotal Cloud Foundry Ops Metrics VM. To do this, assign the Nagios server an IP address from the exclusion range referenced in Step 7, Configuring Ops Manager Director for VMware vSphere, within the Pivotal Cloud Foundry documentation. JMX and RMI do not play well with NAT or port-forwarding. There may be workarounds, but I have not been able to test them thoroughly, and they would likely not be supported by Pivotal, so I will not discuss them here. Your Nagios host could certainly be multi-homed if your environment allows for that.

The Nitty Gritty Details

First, a disclaimer: the purpose of this write-up is not to serve as a Nagios installation guide. I will presume that you already have a Nagios environment configured and have a general understanding of the various Nagios configuration files.

The precise details of how you place the Nagios configuration files can certainly be different for any installation. For the purposes of this document, here is the layout for my installation:

  • Base Nagios configuration directory: /etc/nagios
  • Primary Nagios configuration file: /etc/nagios/nagios.cfg
  • Primary location for Pivotal Cloud Foundry-related Nagios configuration files (cfg_dir setting from nagios.cfg): /etc/nagios/conf.d
  • Nagios plugin directory ($USER1$ is set to this value in /etc/nagios/private/resource.cfg): /usr/lib64/nagios/plugins

The first step is to add the check_jmx plugin script and check_jmx.jar files to your Nagios installation. These can be downloaded from http://snippets.syabru.ch/nagios-jmx-plugin/download.html. Once you unzip the download, fix the permissions and ownership on the check_jmx script and check_jmx.jar files. The check_jmx script should be owned by root and set to 755 permissions. The check_jmx.jar file should be owned by root and set to 644 permissions. You can then place these files anywhere, but I would recommend keeping them in the same location as your other plugin scripts and binaries (e.g. /usr/lib64/nagios/plugins, aka $USER1$).
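
The permission and ownership steps above could be scripted along these lines. This is only a sketch: the function name and the source directory are hypothetical, and the default destination is simply the plugin directory from my layout, so adjust both to your installation.

```shell
#!/bin/sh
# Sketch: copy the two plugin files into place with the modes described above.
# install_check_jmx and its arguments are illustrative names, not part of the plugin.
install_check_jmx() {
    src_dir=$1                                   # hypothetical: where the download was unzipped
    plugin_dir=${2:-/usr/lib64/nagios/plugins}   # $USER1$ in the layout above

    # The wrapper script must be executable (755); the jar only needs to be readable (644).
    install -m 0755 "$src_dir/check_jmx" "$plugin_dir/check_jmx" || return 1
    install -m 0644 "$src_dir/check_jmx.jar" "$plugin_dir/check_jmx.jar" || return 1

    # Only root can hand ownership to root, so skip this step otherwise.
    if [ "$(id -u)" -eq 0 ]; then
        chown root:root "$plugin_dir/check_jmx" "$plugin_dir/check_jmx.jar"
    fi
}

# Example (run as root): install_check_jmx /tmp/nagios-jmx-plugin
```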

What I chose to do was to place all of my Pivotal Cloud Foundry monitoring-related files in the cfg_dir directory (/etc/nagios/conf.d). Then I created the following files, which are available on GitHub:

  • jmx-service.cfg: This file contains the Nagios command for the JMX checks.
  • pcf-hosts.cfg: This file contains the host definition for the Pivotal Cloud Foundry Ops Metrics VM.
  • pcf-hostgroup.cfg: This file defines a hostgroup which contains only the Pivotal Cloud Foundry Ops Metrics VM host defined in pcf-hosts.cfg.
  • pcf-<servicename>.cfg: I recommend creating a separate file for each Pivotal Cloud Foundry service you want to monitor to help keep things tidy. This is where you define which items you are actually going to monitor, along with the warning and alert thresholds. On GitHub, the example servicename is pcf-riak.

The first 3 files follow the standard Nagios format and should be simple to understand from the examples attached. The pcf-<servicename>.cfg, however, gets a bit more involved. I will detail how these values were determined.

For this sample, I want to watch the system memory usage percent for one of the DEA nodes. Here’s the stanza in the pcf-dea.cfg file:

define service {
use                               generic-service,srv-pnp
hostgroup_name      	pcf-jmx-checks
service_description  	DEA 5 Memory
check_command      	check_jmx!44444!-O "org.cloudfoundry:deployment=cf-69bac1c21671eb753df7,
job=dea-partition-2cb84343bb390d334dad,index=5,
ip=null" -A system.mem.percent --username <username> --password <password> -w 75 -c 90
}

Note: The ‘check_command’ entry has artificial line breaks for readability (lines 5-7), but should all be on a single line.

The first entry tells Nagios to use the generic-service definition as well as the srv-pnp definition. ‘generic-service’ is part of the default Nagios installation. ‘srv-pnp’ is part of an add-on which gives historical graphing capabilities to Nagios. See https://docs.pnp4nagios.org for more information on pnp4nagios.

The second and third lines should be familiar based on your current Nagios installation.

The fourth line, ‘check_command’, is the real meat. The first part of this line, ‘check_jmx’, is referencing the ‘check_jmx’ command defined in the jmx-service.cfg file. The second argument, ‘44444’, is the default port that JMX is configured to listen on in Ops Metrics. The last argument tells the check_jmx Nagios plugin everything else it needs to know. The ‘-O’ argument is the JMX object we are going to inspect. The ‘-A’ argument tells it the attribute inside the object we want to see. The <username> and <password> arguments are the admin credentials defined back in the Pivotal Cloud Foundry Ops Manager tile for Ops Metrics. Finally, the last two arguments tell Nagios the thresholds at which we want to be notified. In this example, we want to see a warning message when the system memory hits 75% and a critical alert (sound the alarms!) when it has hit 90%.
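
Before wiring a check into Nagios, it can help to run the plugin by hand with the same arguments. The sketch below assumes that the check_jmx command in jmx-service.cfg passes a standard JMX-over-RMI service URL through the plugin's -U option; the actual file on GitHub may build the URL differently. The function name run_jmx_check and the CHECK_JMX variable are hypothetical.

```shell
#!/bin/sh
# Sketch: run one JMX check manually, mirroring the arguments in the stanza above.
# Assumption: the plugin sits in the plugin directory from my layout.
CHECK_JMX=${CHECK_JMX:-/usr/lib64/nagios/plugins/check_jmx}

run_jmx_check() {
    host=$1; port=$2; object=$3; attribute=$4
    user=$5; pass=$6; warn=$7; crit=$8

    # Assumption: Ops Metrics is reachable at the conventional /jmxrmi JNDI path.
    "$CHECK_JMX" \
        -U "service:jmx:rmi:///jndi/rmi://${host}:${port}/jmxrmi" \
        -O "$object" -A "$attribute" \
        --username "$user" --password "$pass" \
        -w "$warn" -c "$crit"
}

# Example (fill in your own values):
# run_jmx_check <IP> 44444 "<mbean name>" system.mem.percent <username> <password> 75 90
```

Like any Nagios plugin, it should exit with 0 for OK, 1 for WARNING and 2 for CRITICAL, so `echo $?` right after the call shows which state Nagios would record.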

Finding the values for the ‘-O’ and ‘-A’ arguments is a little bit of an exercise, but nothing terribly difficult. To start the process, download the following jar onto your Nagios VM: http://crawler.archive.org/cmdline-jmxclient/cmdline-jmxclient-0.10.3.jar.

From the command-line on the Nagios VM, execute:

java -jar cmdline-jmxclient-0.10.3.jar <username>:<password> <IP>:<port>  > all-mbeans.txt

Where username, password, IP and port are the values for your Pivotal Cloud Foundry Ops Metrics tile.

You now have a text file to reference when determining the first value that needs to be plugged into pcf-<servicename>.cfg: the mbean name.

As an example, let’s say I want to add a new Nagios monitor for the MongoDB nodes. I would first execute:

grep mongodb all-mbeans.txt

This will give me as output:

[root@nagios ~]# grep mongodb all-mbeans.txt
org.cloudfoundry:deployment=p-mongodb-b54b22f7317dbf518fb4,index=0,ip=null,job=cf-mongodb-partition-2cb84343bb390d334dad
[root@nagios ~]#

This confirms that I have only one MongoDB node: the command returned a single line of output, and its index is 0.
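
For jobs with many nodes, reading the indexes off by eye gets tedious. A small helper along these lines could list them; list_indexes is a hypothetical name, and the sketch assumes every mbean line carries an index=N key as in the output above.

```shell
#!/bin/sh
# Sketch: print the index of every mbean in the dump whose name matches a pattern.
list_indexes() {
    pattern=$1
    file=${2:-all-mbeans.txt}   # the dump created earlier

    # Keep the matching lines, pull out the number after "index=", sort numerically.
    grep -- "$pattern" "$file" \
        | sed -n 's/.*[:,]index=\([0-9][0-9]*\).*/\1/p' \
        | sort -n
}

# Example: list_indexes mongodb all-mbeans.txt
```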

The next step is to determine which parameter you want to monitor and determine its mbean name. Continuing with the MongoDB example from above, the command looks like this:

java -jar cmdline-jmxclient-0.10.3.jar <username>:<password> <IP>:<port> org.cloudfoundry:deployment=p-
mongodb-b54b22f7317dbf518fb4,index=0,ip=null,job=cf-mongodb-partition-2cb84343bb390d334dad | grep system |
awk -F: '{print $1}'

Note: This should all be on a single line; it has been wrapped here for readability.

The extra pipes for the grep and awk just help clean up the output, which will look like:

       system.disk.ephemeral.percent
       system.cpu.wait
       system.mem.kb
       system.disk.persistent.inode_percent
       system.healthy
       system.swap.percent
       system.swap.kb
       system.load.1m
       system.cpu.sys
       system.disk.system.percent
       system.disk.ephemeral.inode_percent
       system.disk.persistent.percent
       system.mem.percent
       system.disk.system.inode_percent
       system.cpu.user
       system.disk.ephemeral.percent
       system.cpu.wait
       system.mem.kb
       system.disk.persistent.inode_percent
       system.healthy
       system.swap.percent
       system.swap.kb
       system.load.1m
       system.cpu.sys
       system.disk.system.percent
       system.disk.ephemeral.inode_percent
       system.disk.persistent.percent
       system.mem.percent
       system.disk.system.inode_percent
       system.cpu.user

This is the same format for all of the components you can monitor through JMX and Ops Metrics, the only difference being that some tiles/nodes will have persistent disks and some will not. That makes for good news: you can save this output and keep it as a reference, just as you do with all-mbeans.txt.

With these two bits of information, the mbean name and the attribute name, you now have what you need to create your own pcf-<servicename>.cfg files.
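
To show how the two values fit together, here is a sketch that prints a service stanza in the same shape as the DEA example. The function name is hypothetical, the thresholds in the usage line are purely illustrative, and the template values (generic-service,srv-pnp, pcf-jmx-checks, port 44444) are copied from the stanza shown earlier, so change them if your setup differs.

```shell
#!/bin/sh
# Sketch: print a Nagios service stanza for one mbean attribute.
# <username> and <password> are left as placeholders to fill in by hand.
make_service_stanza() {
    description=$1; mbean=$2; attribute=$3; warn=$4; crit=$5
    cat <<EOF
define service {
    use                  generic-service,srv-pnp
    hostgroup_name       pcf-jmx-checks
    service_description  $description
    check_command        check_jmx!44444!-O "$mbean" -A $attribute --username <username> --password <password> -w $warn -c $crit
}
EOF
}

# Example (thresholds are illustrative):
# make_service_stanza "MongoDB 0 Persistent Disk" \
#     "org.cloudfoundry:deployment=p-mongodb-b54b22f7317dbf518fb4,index=0,ip=null,job=cf-mongodb-partition-2cb84343bb390d334dad" \
#     system.disk.persistent.percent 80 90 >> /etc/nagios/conf.d/pcf-mongodb.cfg
```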

I hope this article has been helpful and please feel free to reach out to me with any questions in the comments below.


About the Author: Chris Mattingly is a Solutions Architect at EMC, supporting Pivotal Cloud Foundry for the Professional Services Application Solutions team, among other functions. Chris has worked for nearly 20 years in various IT roles, generally focused on the infrastructure and operational aspects. Roles have included System Programmer for NC State University, Systems and Security Engineer for an acquired startup, Middleware support for a top-3 financial institution, and most recently overseeing all infrastructure for Adaptivity. EMC acquired Adaptivity in 2013, at which point Chris joined the PS App Solutions team. Chris has a Bachelor’s degree in Statistics from NC State University.
