Chaos Lemur: Testing High Availability on Pivotal Cloud Foundry

May 1, 2015 Paul Harris

Chaos Lemur is a cousin to Chaos Monkey, but built for Pivotal Cloud Foundry (not AWS).

We’ve been working on deploying Spring XD on Pivotal Cloud Foundry (PCF) with a particular emphasis on high availability (HA). Here, we’re dealing with an application that requires several other components to function (e.g. Redis, RabbitMQ), which means there are many potential points of failure in our system. One of the expectations of highly available systems is that they will continue to function if an instance disappears without notice. To replicate this behavior during testing, we have created the Chaos Lemur project, which causes such failures in a controlled, but random, manner.

If you’ve ever had to test for high availability, there’s a good chance you’re already thinking of Netflix’s Simian Army, in particular Chaos Monkey. If you’re not familiar with it, Chaos Monkey is an application that destroys instances in a controlled but random manner. This simulates natural failures by terminating virtual machines without warning. If your system can survive a Chaos Monkey attack, it has a good chance of surviving unexpected failures.

Why Chaos Lemur Was Born

Chaos Monkey is great, but it makes assumptions that render it unsuitable for our use. It is tied to Amazon Web Services (AWS), whereas we need to work against any infrastructure that Pivotal Cloud Foundry supports: OpenStack, VMware, AWS, and others. It also assumes the use of AWS’s Auto Scaling Groups, a feature that Pivotal Cloud Foundry does not use. While the implementation did not meet our needs, the approach was exactly what we were looking for.

Chaos Lemur is an alternative to Chaos Monkey that was designed with Pivotal Cloud Foundry in mind. Using BOSH to determine the candidates for termination allows us to be agnostic with regard to infrastructure. A Service Provider Interface (SPI) for terminating instances ensures that additional infrastructure types can be added without major changes to the project. At the moment Chaos Lemur works against both AWS and vSphere, but additional implementations are welcome.
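
To make the SPI idea concrete, here is a minimal Python sketch of the pattern. It is an illustration only, not Chaos Lemur's actual code: the names `Member`, `Infrastructure`, `RecordingInfrastructure`, and `destroy_all` are all hypothetical, and the fields on `Member` simply mirror the deployment, job, and name values that appear in the log output later in this post.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """A candidate VM as reported by BOSH (field names are illustrative)."""
    id: str
    deployment: str
    job: str
    name: str


class Infrastructure(ABC):
    """Hypothetical SPI: one implementation per infrastructure type."""

    @abstractmethod
    def destroy(self, member: Member) -> None:
        """Terminate the VM backing this member."""


class RecordingInfrastructure(Infrastructure):
    """Stand-in implementation that only records what it was asked to destroy."""

    def __init__(self) -> None:
        self.destroyed = []

    def destroy(self, member: Member) -> None:
        self.destroyed.append(member.name)


def destroy_all(members, infrastructure: Infrastructure) -> None:
    # The caller never needs to know which infrastructure sits behind the SPI.
    for member in members:
        infrastructure.destroy(member)
```

Supporting a new infrastructure type would then mean adding one more `Infrastructure` subclass, leaving the selection logic untouched.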

Chaos Lemur is designed to run as an application on Pivotal Cloud Foundry itself. It can be pushed to a Pivotal Cloud Foundry instance just like any other app and is configured using environment variables.

Examples: Working With Chaos Lemur From The Command Line

By default Chaos Lemur will consider terminating any VM that BOSH tells it about. This includes VMs from supporting systems, such as Pivotal Elastic Runtime, that you probably don’t want to destroy. It’s a good idea to start Chaos Lemur in ‘Dry Run’ mode initially. This mode reports on VMs that would have been destroyed without actually terminating them. An example is shown in the CLI output below:

$ cf set-env chaos-lemur DRYRUN true
$ cf restart chaos-lemur
[CHAOS LEMUR] INFO Beginning run…
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: postgresql, name: postgresql-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: cf, job: nats, name: nats-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: cf, job: cloud_controller_worker, name: cloud_controller_worker-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/1]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
4 VMs destroyed:
• cloud_controller_worker-partition-default_az_guid/0
• nats-partition-default_az_guid/0
• postgresql-partition-default_az_guid/0
• redis-partition-default_az_guid/1
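
The idea behind dry-run mode can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the project's implementation: the `run` function and its `terminate` callback are hypothetical, and the "Destroyed:" wording for a real run is assumed (only the "Destroyed (Dry Run):" wording appears in the output above).

```python
def run(names, terminate, dry_run=True):
    """Return one report line per VM; call terminate(name) only when dry_run is False."""
    report = []
    for name in names:
        if dry_run:
            # Report the VM as if it had been destroyed, but leave it alone.
            report.append(f"Destroyed (Dry Run): {name}")
        else:
            terminate(name)
            report.append(f"Destroyed: {name}")  # assumed wording for a real run
    return report
```

Because the selection logic is identical in both modes, a dry run shows exactly which VMs a real run could pick.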

The list of candidates for termination will give you an idea of which deployments and jobs you might want to exclude from testing. From the sample above, for example, we might want to protect the ‘cf’ deployment, and the PostgreSQL job:

$ cf set-env chaos-lemur BLACKLIST cf,postgresql
$ cf restart chaos-lemur
[CHAOS LEMUR] INFO Beginning run…
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: xd-admin, name: xd-admin-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: sentinel, name: sentinel-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
3 VMs destroyed:
• redis-partition-default_az_guid/0
• sentinel-partition-default_az_guid/2
• xd-admin-partition-default_az_guid/2
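
A rough Python sketch of this kind of filtering follows. It assumes that a blacklist entry protects a VM when it exactly matches either the deployment name or the job name, which is consistent with the example above but may not reflect how Chaos Lemur matches entries internally. The function names are hypothetical.

```python
def parse_blacklist(value):
    """Split a comma-separated value such as "cf,postgresql" into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def is_protected(deployment, job, blacklist):
    # Assumption: an entry protects a VM if it equals the deployment or the job name.
    return deployment in blacklist or job in blacklist


# (deployment, job) pairs taken from the first dry run shown earlier.
candidates = [
    ("cf", "nats"),
    ("cf", "cloud_controller_worker"),
    ("p-spring-xd", "postgresql"),
    ("p-spring-xd", "redis"),
]
blacklist = parse_blacklist("cf,postgresql")
eligible = [c for c in candidates if not is_protected(c[0], c[1], blacklist)]
print(eligible)  # [('p-spring-xd', 'redis')]
```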

The same deployment and job information from BOSH also allows for different termination probabilities. For example, a probability of 0.1 means that a VM will be destroyed, on average, once every 10 times it’s considered for deletion:

$ cf set-env chaos-lemur REDIS_PROBABILITY 0.5
$ cf set-env chaos-lemur RABBITMQ_PROBABILITY 0.1
$ cf restart chaos-lemur
[CHAOS LEMUR] INFO Beginning run…
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: xd-container, name: xd-container-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: zookeeper, name: zookeeper-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
4 VMs destroyed:
• redis-partition-default_az_guid/0
• redis-partition-default_az_guid/2
• xd-container-partition-default_az_guid/2
• zookeeper-partition-default_az_guid/2
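
To see what a termination probability means in practice, the short Python simulation below treats each consideration as an independent random draw. This is an assumption about the mechanism made for illustration; `should_destroy` and `simulate` are hypothetical names, not part of Chaos Lemur.

```python
import random


def should_destroy(probability, rng):
    """Return True with the given probability (one independent draw per consideration)."""
    return rng.random() < probability


def simulate(probability, considerations, seed=0):
    """Count how often a VM would be destroyed across many considerations."""
    rng = random.Random(seed)
    return sum(should_destroy(probability, rng) for _ in range(considerations))


# With probability 0.1, roughly one in ten considerations ends in destruction.
print(simulate(0.1, 10_000))
```

With hourly runs and a probability of 0.1, a given VM would therefore be destroyed about once every ten hours on average, although any individual run remains random.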

Chaos Lemur runs a destroy on a schedule, by default once per hour, but the REST API also allows you to trigger an ad hoc destroy whenever you need one:

$ curl -X POST http:///chaos -d '{ "event": "DESTROY" }' -H "Content-Type: application/json"
[CHAOS LEMUR] INFO Beginning run…
- [05/03/2015:12:44:38 +0000] "POST /chaos HTTP/1.1" 202 0 "-" "curl/7.37.1" 10.85.30.254:58794 x_forwarded_for:"10.26.2.169, 10.85.30.254" vcap_request_id:50a15e90-6ab7-43ca-4479-6183c72899d7 response_time:0.012717088 app_id:f90e7801-8256-4c59-ae55-ae758f36cfb1
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/1]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: service-broker, name: service-broker-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
3 VMs destroyed:
• redis-partition-default_az_guid/0
• redis-partition-default_az_guid/1
• service-broker-partition-default_az_guid/2
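
If you would rather trigger a destroy from a script than from curl, the same request can be built with Python's standard library. The sketch below only constructs the request and does not send it. The host name is a placeholder you would replace with your own Chaos Lemur route, and `build_destroy_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_destroy_request(base_url):
    """Build (but do not send) the POST request that triggers an ad hoc destroy."""
    body = json.dumps({"event": "DESTROY"}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chaos",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_destroy_request("http://chaos-lemur.example.com")  # placeholder host
# urllib.request.urlopen(request) would send it; the log above shows a 202 response.
```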

Chaos Lemur is deliberately no more complicated than we need it to be, but it does have features we haven’t mentioned here, such as a reporting SPI. Check out the Readme for more information, including a guide to getting started. Chaos Lemur is a great tool for understanding how HA systems work, and we hope you are tempted to give it a try. As an open source project, forking is easy, and we look forward to seeing pull requests from you.
