Restoring Pivotal Cloud Foundry After Disaster

May 4, 2015 Duncan Winn

This post is joint work by Duncan Winn, Ferran Rodenas, and Haydon Ryan.

I’ve been exploring how to recover from some disaster scenarios recently. It’s important to ensure Pivotal Cloud Foundry (PCF) is disaster-proof. If you’ve played around with Pivotal Cloud Foundry for a while, you should be well aware of its four levels of High Availability (HA).

1. Pivotal Cloud Foundry restarts your failed application processes.
2. Monit restarts your failed system processes.
3. Bosh looks after Pivotal Cloud Foundry, restarting failed VMs if the resurrector plugin is on.
4. Finally, you can stripe your apps and components across availability zones (AZs) for a higher level of HA if your diskstore or AZ goes down.

All that is great. It saves pain and the need for cumbersome multi-layer manual clustering techniques. It ultimately makes your app life better.

But what about the unexpected disaster scenarios? Are there any critical components that, if they were to disappear, would leave you in a bad way? What if someone with admin rights, say a disgruntled administrator on their way out the door, were to log in and delete key VMs? There are two VMs that are not managed by Bosh and so will not “just come back”. These are the PCF Ops Manager VM and the Bosh Director VM. Without these two VMs your system will still work and apps will still run, but you are stuck if you want to change your deployment. You also lose your Bosh level of HA, the ability to bring VMs back, because the Bosh Director will not be there to invoke the resurrector plugin.

So how should you protect your environment in order to recover from these two VMs getting deleted? It turns out there are two things you need to do. One, backing up the Ops Manager configuration, is well known. The other, restoring Bosh itself, is less well known. So let’s start with the easy step of backing up the Ops Manager.

In the PCF Ops Manager click on the icon and export the configuration (see http://docs.pivotal.io/pivotalcf/customizing/backup-settings.html#settings).

Then if your Ops Manager disappears you can simply create a new one and reimport the settings… easy!

So, what about restoring the Bosh Director VM? For PCF we are using a single-instance Bosh VM. If this goes away, you can’t just create a new Bosh and import all the deployment manifests again, because Bosh and Ops Manager are tightly coupled, as are Bosh and your deployment (via the Bosh agents and the Bosh Director). So the general thought is that if Bosh goes away you are, to put it mildly, in a bad way!

If you still have the original Bosh manifest (let’s call it bosh.yml) then you can bring your original Bosh back, provided you have a snapshot of the disk volume that contains the Bosh postgres DB and blobstore, or you use an external DB like MySQL on AWS-RDS and an external blobstore like AWS-S3. You must, however, bring that Bosh back on the original IP, as Ops Manager still stores the Bosh IP from the original Bosh.

But here lies both a problem and a staggeringly cool solution that not many folks are aware exists. First the problem: if you are using PCF, Ops Manager will delete the bosh.yml after bringing up Bosh. This is important because that manifest contains sensitive information like your AWS keys, and allowing it to exist on the file system in plain text is a security risk. So it’s right that it’s deleted, but then the question is, “Without the manifest, how can you bring back Bosh?” The solution is simple: Ops Manager will bring Bosh back for you.

The Technical Detail

Go into Ops Manager and make a small change to the settings, for example adjust the number of NTP servers (I deleted the fourth NTP server in a list of four). When you redeploy, Ops Manager will look for Bosh to change the PCF deployment, and if it can’t find Bosh it will recreate it. There is one additional step that’s required before the redeploy will succeed. You need to delete this file:

/var/tempest/workspaces/default/deployments/bosh-deployments.yml

This file contains a vm_cid: <VM_ID> entry, so Ops Manager (or rather the bosh cli running on Ops Manager) will look for this VM, believing there is a deployment, and it will fail when it does not find Bosh. Deleting this file causes Ops Manager to treat the deploy as a new deployment, recreating any missing VMs (namely Bosh) and ignoring any existing ones, such as your existing Pivotal Cloud Foundry deployment.
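As a rough illustration of the file removal described above, here is a minimal shell sketch. It moves the file into a backup directory instead of deleting it outright; that substitution is my own assumption (a cautious variant, not something we tested), and the function name and backup location are hypothetical. Only the default path comes from this post.

```shell
#!/bin/sh
# Sketch: move Ops Manager's record of the old Bosh VM out of the way.
# Assumption: moving the file to a backup directory has the same effect as
# deleting it, while keeping a copy in case it is needed later.

# Default path is the one named in this post.
STATE_FILE_DEFAULT=/var/tempest/workspaces/default/deployments/bosh-deployments.yml

set_aside_bosh_state() {
  state_file="${1:-$STATE_FILE_DEFAULT}"
  backup_dir="${2:-$HOME}"

  if [ ! -f "$state_file" ]; then
    echo "nothing to do: $state_file not found"
    return 0
  fi

  # Timestamp the copy so repeated runs never overwrite an earlier backup.
  backup="$backup_dir/bosh-deployments.yml.$(date +%Y%m%d%H%M%S)"
  mv "$state_file" "$backup" && echo "moved $state_file to $backup"
}
```

On the Ops Manager VM you would call set_aside_bosh_state with no arguments (possibly with elevated privileges, depending on file ownership) and then redeploy from the Ops Manager UI.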

Ferdy and I tested this out. His reaction encapsulated how we felt about deleting Bosh from our running PCF deployment:

Bosh Terminated

Bosh is Back

The result: we managed to bring back Bosh and a working Pivotal Cloud Foundry. For a while we could not log on to PCF, even though all apps were running and all VMs including the Cloud Controller and UAA were back, because as part of the deployment these processes were stopped and restarted. About 10 minutes later PCF was fully working. Bosh ensured the environment was eventually consistent, and during this time our running apps had no downtime (go Bosh)!

A Word of Caution

Bringing back Bosh using either of these two approaches (the bosh.yml or Ops Manager) assumes you are using an external database like MySQL on AWS-RDS and an external blobstore like AWS-S3. If you are not, there are some additional steps required to back up and restore Bosh’s internal database and blobstore.

Steps to Back Up Bosh’s Internal Database and Blobstore

1. SSH into Bosh via ssh -i key vcap@boship.
2. Become root via ‘su -’ (find the VM credentials from Ops man credentials -> Ops Man Director / vcap/creds).
3. Run ‘monit summary’ to view all Bosh processes.
4. Run ‘monit stop all’ to cleanly stop all Bosh processes.
5. Take a snapshot in AWS of the Bosh persistent disk volume.
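The backup steps above can be sketched as a small dry-run helper. It only prints the commands, because ‘su -’ prompts for a password and the snapshot needs real AWS credentials. The function name, the use of the AWS CLI for the snapshot, and the final restart line are my assumptions rather than part of the tested procedure.

```shell
#!/bin/sh
# Dry-run sketch: prints the backup commands rather than running them.
# Arguments are placeholders: SSH key file, Bosh Director IP, EBS volume ID.
print_bosh_backup_plan() {
  key="$1"; bosh_ip="$2"; volume_id="$3"
  cat <<EOF
# On your workstation: log in to the Bosh Director VM
ssh -i $key vcap@$bosh_ip
# On the Bosh VM: become root, check the processes, then stop them
su -
monit summary
monit stop all
# Back on your workstation: snapshot the persistent disk (AWS CLI assumed)
aws ec2 create-snapshot --volume-id $volume_id --description "Bosh persistent disk backup"
# Assumption, not a step from this post: restart Bosh once the snapshot is taken
monit start all
EOF
}
```

For example, print_bosh_backup_plan bosh-key.pem 10.0.0.6 vol-0123456789abcdef0 prints a checklist you can work through by hand; all three values here are made up.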

Steps to Restore Bosh’s Internal DB and Blobstore

1. Rebuild Bosh as described above (it will have created a new EMPTY persistent disk).
2. Repeat backup steps 1-4 above; you need to stop all processes before detaching the persistent disk.
3. Detach the persistent disk and delete it.
4. Create a new volume from the snapshot and then manually attach the new volume to the Bosh VM.
5. Start all processes again. Bosh will now have the same Bosh UUID (because it’s stored in the database).
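The disk swap in the restore steps can be sketched the same way, again as a dry run that only prints commands. This assumes you drive AWS through the AWS CLI; the function name, the NEW_VOLUME_ID placeholder, and the device name are hypothetical and will differ in your environment.

```shell
#!/bin/sh
# Dry-run sketch: prints the restore commands rather than running them.
# All arguments are placeholders; the device name depends on your VM.
print_bosh_restore_plan() {
  instance_id="$1"; empty_volume_id="$2"; snapshot_id="$3"; az="$4"; device="$5"
  cat <<EOF
# Stop Bosh first (backup steps 1-4), then swap the disks
aws ec2 detach-volume --volume-id $empty_volume_id
# Wait for the detach to finish before deleting the empty disk
aws ec2 delete-volume --volume-id $empty_volume_id
# create-volume reports the new volume's ID; substitute it for NEW_VOLUME_ID
aws ec2 create-volume --snapshot-id $snapshot_id --availability-zone $az
aws ec2 attach-volume --volume-id NEW_VOLUME_ID --instance-id $instance_id --device $device
# On the Bosh VM, as root
monit start all
EOF
}
```

The new volume has to be created in the same availability zone as the Bosh VM, which is why the zone is an explicit argument.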

Conclusion

So there we go. Using Pivotal Cloud Foundry keeps your apps, processes and VMs highly available, even more so if you use multiple AZs. However, you must ensure you back up your Ops Manager configuration so you can bring the Ops Manager back if a catastrophe or a malicious person causes it to die. Once Ops Manager is back you can use it to bring back Bosh. If you use an external database and blobstore for Bosh then you won’t have to worry about snapshotting Bosh’s persistent volume. There are other key things you should do to make your system secure that we will discuss in the next post, but this post’s focus has been on Bosh and the PCF Ops Manager.
