Chaos Lemur: Testing High Availability on Pivotal Cloud Foundry

May 1, 2015 Paul Harris

Chaos Lemur is a cousin to Chaos Monkey, but built for Pivotal Cloud Foundry (not AWS).

We’ve been working on deploying Spring XD on Pivotal Cloud Foundry (PCF) with a particular emphasis on high availability (HA). Here, we’re dealing with an application that requires several other components to function (e.g. Redis, RabbitMQ), which means there are many potential points of failure in our system. One of the expectations of highly available systems is that they will continue to function if an instance disappears without notice. To replicate this behavior during testing, we have created the Chaos Lemur project, which causes such failures in a controlled, but random, manner.

If you’ve ever had to test for high availability, there’s a good chance you’re already thinking of Netflix’s Simian Army, in particular Chaos Monkey. If you’re not familiar with it, Chaos Monkey is an application that destroys instances in a controlled but random manner. This simulates natural failures by terminating virtual machines without warning. If your system can survive a Chaos Monkey attack, it has a good chance of surviving unexpected failures.

Why Chaos Lemur Was Born

Chaos Monkey is great, but it makes some assumptions that render it unsuitable for our use. It is tied to Amazon Web Services (AWS), whereas we need to work against any infrastructure that Pivotal Cloud Foundry supports: OpenStack, VMware, AWS, and others. It also assumes the use of AWS’s Auto Scaling Groups, a feature that Pivotal Cloud Foundry does not use. While the implementation did not meet our needs, the approach was exactly what we were looking for.

Chaos Lemur is an alternative to Chaos Monkey that was designed with Pivotal Cloud Foundry in mind. Using BOSH to determine the candidates for termination allows us to be agnostic with regard to infrastructure. A Service Provider Interface (SPI) for terminating instances ensures that additional infrastructure types can be added without major changes to the project. At the moment Chaos Lemur works against both AWS and vSphere, but additional implementations are welcome.
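To make the SPI idea concrete, here is a minimal Python sketch of what a pluggable "terminate an instance" interface could look like. It is not Chaos Lemur's actual code (the project is not written in Python), and every name in it (Member, Infrastructure, RecordingInfrastructure, destroy_all) is hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """A termination candidate as reported by BOSH (field names are illustrative)."""
    vm_id: str
    deployment: str
    job: str
    name: str


class Infrastructure(ABC):
    """Hypothetical SPI: one implementation per infrastructure type."""

    @abstractmethod
    def destroy(self, member):
        """Terminate the VM behind the given member."""


class RecordingInfrastructure(Infrastructure):
    """Stand-in implementation that only records what it was asked to destroy."""

    def __init__(self):
        self.destroyed = []

    def destroy(self, member):
        self.destroyed.append(member.name)


def destroy_all(infrastructure, members):
    """The caller depends only on the interface, never on a concrete provider."""
    for member in members:
        infrastructure.destroy(member)
```

The point of the design is that supporting a new infrastructure type would mean adding one more Infrastructure implementation, while the selection logic that calls destroy_all stays unchanged.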

Chaos Lemur is designed to run as an application on Pivotal Cloud Foundry itself. It can be pushed to a Pivotal Cloud Foundry instance just like any other app and is configured using environment variables.

Examples: Working With Chaos Lemur From The Command Line

By default Chaos Lemur will consider terminating any VM that BOSH tells it about. This includes VMs from supporting systems, such as Pivotal Elastic Runtime, that you probably don’t want to destroy. It’s a good idea to start Chaos Lemur in ‘Dry Run’ mode initially. This mode reports on VMs that would have been destroyed without actually terminating them. An example is shown in the CLI output below:

$ cf set-env chaos-lemur DRYRUN true
$ cf restart chaos-lemur
[CHAOS LEMUR] INFO Beginning run…
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: postgresql, name: postgresql-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: cf, job: nats, name: nats-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: cf, job: cloud_controller_worker, name: cloud_controller_worker-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/1]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
4 VMs destroyed:
• cloud_controller_worker-partition-default_az_guid/0
• nats-partition-default_az_guid/0
• postgresql-partition-default_az_guid/0
• redis-partition-default_az_guid/1
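The dry-run behavior shown above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the function name run, the destroy callback, and the "Destroyed:" message used for real runs are all hypothetical. Only the "Destroyed (Dry Run):" wording is taken from the output above.

```python
def run(candidates, destroy, dry_run):
    """Report on each candidate, calling destroy only when dry_run is off."""
    report = []
    for name in candidates:
        if dry_run:
            # Dry run: record what would have happened, touch nothing.
            report.append(f"Destroyed (Dry Run): {name}")
        else:
            destroy(name)
            report.append(f"Destroyed: {name}")
    return report


terminated = []
report = run(["redis/1", "nats/0"], terminated.append, dry_run=True)
```

After this call, report holds one line per candidate while terminated stays empty, which is the property that makes dry-run mode safe to leave running against a shared environment.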

The list of candidates for termination will give you an idea of which deployments and jobs you might want to exclude from testing. From the sample above, for example, we might want to protect the ‘cf’ deployment, and the PostgreSQL job:

$ cf set-env chaos-lemur BLACKLIST cf,postgresql
$ cf restart chaos-lemur
[CHAOS LEMUR] INFO Beginning run…
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: xd-admin, name: xd-admin-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: sentinel, name: sentinel-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
3 VMs destroyed:
• redis-partition-default_az_guid/0
• sentinel-partition-default_az_guid/2
• xd-admin-partition-default_az_guid/2
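The effect of the blacklist can be sketched as a simple filter over (deployment, job) pairs. This is a rough Python illustration, assuming a comma-separated value and exact, case-sensitive matching against either the deployment or the job name. Chaos Lemur's real matching rules may differ, and the function names here are hypothetical.

```python
def parse_blacklist(value):
    """Split a comma-separated BLACKLIST value into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def remaining_candidates(candidates, blacklist):
    """Keep candidates whose deployment and job are both absent from the blacklist."""
    return [
        (deployment, job)
        for deployment, job in candidates
        if deployment not in blacklist and job not in blacklist
    ]


# The four candidates from the first dry run, as (deployment, job) pairs.
candidates = [
    ("p-spring-xd", "postgresql"),
    ("cf", "nats"),
    ("cf", "cloud_controller_worker"),
    ("p-spring-xd", "redis"),
]
survivors = remaining_candidates(candidates, parse_blacklist("cf,postgresql"))
print(survivors)  # [('p-spring-xd', 'redis')]
```

Applied to the candidates from the first dry run, only the Redis VM remains eligible: the two 'cf' VMs are excluded by deployment and the PostgreSQL VM by job.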

The same deployment and job information from BOSH also allows for different termination probabilities. For example, a probability of 0.1 means that a VM will be destroyed, on average, once every 10 times it’s considered for deletion:

$ cf set-env chaos-lemur REDIS_PROBABILITY 0.5
$ cf set-env chaos-lemur RABBITMQ_PROBABILITY 0.1
$ cf restart chaos-lemur
[CHAOS LEMUR] INFO Beginning run…
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: xd-container, name: xd-container-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: zookeeper, name: zookeeper-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
4 VMs destroyed:
• redis-partition-default_az_guid/0
• redis-partition-default_az_guid/2
• xd-container-partition-default_az_guid/2
• zookeeper-partition-default_az_guid/2
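One way to read these probabilities is as an independent trial per VM on each run: with probability p, the expected number of runs until a given VM is destroyed is 1/p, so 0.1 gives 10 runs on average. The Python sketch below simulates that reading. It assumes independent trials and uses hypothetical function names, and it says nothing about how Chaos Lemur itself draws its random numbers.

```python
import random


def should_destroy(probability, rng):
    """One independent trial: True with the given probability."""
    return rng.random() < probability


def destruction_rate(probability, trials, seed=0):
    """Fraction of trials in which the VM would have been destroyed."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    hits = sum(should_destroy(probability, rng) for _ in range(trials))
    return hits / trials


rate = destruction_rate(0.1, 10_000)
```

Over many trials the rate settles near the configured probability, which is why a higher value for Redis (0.5) than for RabbitMQ (0.1) makes Redis VMs appear more often in the output above.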

Chaos Lemur destroys VMs on a schedule, by default once per hour, but the REST API allows you to trigger an ad hoc destroy as needed:

$ curl -X POST http:///chaos -d '{ "event": "DESTROY" }' -H "Content-Type: application/json"
[CHAOS LEMUR] INFO Beginning run…
- [05/03/2015:12:44:38 +0000] "POST /chaos HTTP/1.1" 202 0 "-" "curl/7.37.1" 10.85.30.254:58794 x_forwarded_for:"10.26.2.169, 10.85.30.254" vcap_request_id:50a15e90-6ab7-43ca-4479-6183c72899d7 response_time:0.012717088 app_id:f90e7801-8256-4c59-ae55-ae758f36cfb1
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/0]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: redis, name: redis-partition-default_az_guid/1]
[CHAOS LEMUR] INFO Destroyed (Dry Run): [id: vm-, deployment: p-spring-xd, job: service-broker, name: service-broker-partition-default_az_guid/2]
[CHAOS LEMUR] INFO Chaos Lemur Destruction ():
3 VMs destroyed:
• redis-partition-default_az_guid/0
• redis-partition-default_az_guid/1
• service-broker-partition-default_az_guid/2
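The same request can be built from a script. The sketch below uses only Python's standard library and mirrors the curl call above: a POST to /chaos with a JSON body and a Content-Type header. The function name build_destroy_request is hypothetical, and base_url is left for the caller to supply, since the route to your own Chaos Lemur app depends on your deployment.

```python
import json
import urllib.request


def build_destroy_request(base_url):
    """Build (but do not send) a POST /chaos request with a DESTROY event."""
    body = json.dumps({"event": "DESTROY"}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chaos",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it, pass the request to urllib.request.urlopen, for example:
#   with urllib.request.urlopen(build_destroy_request(base_url)) as response:
#       print(response.status)
```

Building the request separately from sending it keeps the payload easy to inspect. In the router log line above the call returned status 202, and the destruction log lines followed it.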

Chaos Lemur is deliberately no more complicated than we need it to be, but it does have features we haven’t mentioned here, such as a reporting SPI. Check out the Readme for more information, including a guide to getting started. Chaos Lemur is a great tool for understanding how HA systems work, and we hope you are tempted to give it a try. As an open source project, forking is easy, and we look forward to seeing pull requests from you.

Learning More:

Download Chaos Lemur on GitHub
More Pivotal Cloud Foundry Blog Posts
Pivotal Cloud Foundry Product Info, Downloads, and Documentation

