Joint work performed on this blog by Duncan Winn, Ferran Rodenas and Haydon Ryan.
I’ve been exploring how to recover from some disaster scenarios recently. It’s important to ensure Pivotal Cloud Foundry (PCF) is disaster proof. If you’ve played around with Pivotal Cloud Foundry for a while, you should be well aware of the four levels of High Availability (HA):
- Pivotal Cloud Foundry restarts your failed application processes.
- Monit restarts your failed system processes.
- Bosh looks after Pivotal Cloud Foundry, restarting failed VMs if the resurrector plugin is on.
- Finally, you can stripe your apps and components across availability zones for a higher level of HA if your diskstore or AZ goes down.
All that is great. It saves pain and the need for cumbersome multi-layer manual clustering techniques. It ultimately makes your app life better.
But what about the unexpected disaster scenarios? Are there any critical components that, if they were to disappear, would leave you in a bad way? What if someone with admin rights, say a disgruntled administrator on their way out the door, were to log in and delete key VMs? There are two VMs that are not managed by Bosh and so will not “just come back”. These are the PCF Ops Manager VM and the Bosh Director VM. Without these two VMs your system will still work and apps will still run, but you are stuck if you want to change your deployment, and you lose your Bosh level of HA (the ability to bring VMs back), as the Bosh Director will not be there to invoke the resurrector plugin.
So how should you protect your environment in order to recover from these two VMs getting deleted? It turns out there are two things you need to do. One, backing up the Ops Manager configuration, is well known; the other, restoring Bosh itself, is less well known. So let’s start with the easy step of backing up the Ops Manager.
In the PCF Ops Manager click on the icon and export the configuration (see http://docs.pivotal.io/pivotalcf/customizing/backup-settings.html#settings).
Then if your Ops Manager disappears you can simply create a new one and reimport the settings… easy!
So, what about restoring the Bosh Director VM? For PCF we are using a single-instance Bosh VM. If this goes away, you can’t just create a new Bosh and import all the deployment manifests again, because Bosh and Ops Manager are tightly coupled, as are Bosh and your deployment (via the Bosh agents and the Bosh Director). So the general thought is that if Bosh goes away you are, to put it mildly, in a bad way!
If you still have the original Bosh manifest (let’s call it bosh.yml) then you can bring your original Bosh back, provided you have a snapshot of the disk volume that contains the Bosh postgres DB and blobstore, or you use an external DB like MySQL on AWS-RDS and an external blobstore like AWS-S3. You must, however, bring that Bosh back on the original IP, as Ops Manager still stores the Bosh IP from the original Bosh.
But here lies both a problem and a staggeringly cool solution that not many folks are aware exists. First the problem: if you are using PCF, then Ops Manager will delete the bosh.yml after bringing up Bosh. This is important, as that manifest contains sensitive information like your AWS keys. Allowing that manifest to exist on the file system in plain text is a security risk. So it’s right that it’s deleted, but then the question is: without the manifest, how can you bring back Bosh? The solution is simple: Ops Manager will bring Bosh back for you.
The Technical Detail
Go into Ops Manager and make a small change to the settings; for example, adjust the number of NTP servers (I deleted the fourth NTP server in a list of four). When you redeploy, Ops Manager will look for Bosh to change the PCF deployment, and if it can’t find Bosh it will recreate it. There is one additional step that’s required. You need to delete:
This file contains a vm_cid: <VM_ID> entry, so Ops Manager (or rather the bosh cli running on Ops Manager) will look for this VM, believing there is a deployment, and will fail when it does not find Bosh. Deleting this file causes Ops Manager to treat the deploy as a new deployment, recreating any missing VMs (namely Bosh) and ignoring any existing ones, such as your existing Pivotal Cloud Foundry deployment.
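Since deleting the wrong file on the Ops Manager VM would be painful, it may be worth keeping a copy before removing it. Here is a minimal Python sketch of that idea. The file's location is not given here, so both paths are hypothetical parameters supplied by the caller, and the backup-first step is my own addition rather than part of the procedure we tested.

```python
import shutil
import time
from pathlib import Path


def back_up_then_delete(state_file: Path, backup_dir: Path) -> Path:
    """Copy a deployment state file to backup_dir, then delete the original.

    Both paths are hypothetical and supplied by the caller; nothing here
    knows where Ops Manager actually keeps the file.
    """
    text = state_file.read_text()
    # Guard against removing the wrong file: the one described above
    # contains a vm_cid entry.
    if "vm_cid" not in text:
        raise ValueError(f"{state_file} has no vm_cid entry; refusing to delete it")

    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    backup = backup_dir / f"{state_file.name}.{stamp}"
    shutil.copy2(state_file, backup)  # keep a copy in case the redeploy goes wrong
    state_file.unlink()               # the deletion step itself
    return backup
```

The function returns the path of the copy, so you can put the file back by hand if the redeploy does not behave as expected.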
Ferdy and I tested this out. His reaction encapsulated how we felt about deleting bosh from our running PCF deployment:
Bosh is Back
The result: we managed to bring Bosh and a working Pivotal Cloud Foundry back. For a while we could not log onto PCF, even though all apps were running and all VMs including the Cloud Controller and UAA were back, because as part of the deployment these processes were stopped and restarted. About 10 minutes later PCF was fully working. Bosh ensured the environment was eventually consistent, and during this time we had no downtime (go Bosh)!
A Word of Caution
Bringing back Bosh using either of these two approaches (the bosh.yml or Ops Manager) assumes you are using an external database like MySQL on AWS-RDS and an external blobstore like AWS-S3. If you are not, there are some additional steps required to back up and restore Bosh’s internal database and blobstore.
Steps to Back Up Bosh’s Internal Database and Blobstore
- SSH into bosh via ssh -i key vcap@boship.
- Become root via ‘su -’ (find vm credentials from Ops man credentials -> Ops Man Director / vcap/creds).
- Run ‘monit summary’ to view all Bosh processes.
- Run ‘monit stop all’ to cleanly stop all Bosh processes.
- Take a snapshot in AWS of the Bosh persistent disk volume.
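The backup steps above can be written down as a dry-run plan, which is handy for a runbook. This is only a sketch: it prints the commands rather than running them, the "bosh-vm" and "workstation" labels and the volume ID are made up for illustration, and it assumes you take the snapshot with the AWS CLI (`aws ec2 create-snapshot`) rather than the console.

```python
import shlex
from typing import List, Tuple

Step = Tuple[str, str]  # (where to run it, command line)


def backup_plan(volume_id: str, description: str = "bosh persistent disk") -> List[Step]:
    """Return the backup commands in the order they should be run.

    volume_id is the EBS volume backing Bosh's persistent disk; you have
    to look it up yourself, this sketch does not discover it.
    """
    snapshot = ["aws", "ec2", "create-snapshot",
                "--volume-id", volume_id, "--description", description]
    return [
        ("bosh-vm", "monit summary"),            # as root on the Bosh VM
        ("bosh-vm", "monit stop all"),           # stop cleanly before snapshotting
        ("workstation", shlex.join(snapshot)),   # assumes a configured AWS CLI
    ]


if __name__ == "__main__":
    # Hypothetical volume ID, for illustration only.
    for where, command in backup_plan("vol-0123456789abcdef0"):
        print(f"[{where}] {command}")
```

Keeping the plan as data makes the ordering explicit: the snapshot command always comes after the processes have been stopped.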
Steps to Restore Bosh’s Internal DB and Blobstore
- Rebuild Bosh as described above (it will have created a new EMPTY persistent disk).
- Repeat backup steps 1-4 above; you need to stop all processes before detaching the persistent disk.
- Detach the new, empty persistent disk and delete it.
- Create a new volume from the snapshot and then manually attach the new volume to the Bosh VM.
- Start all processes again. Bosh will now have the same bosh uuid (because it’s stored in the database).
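The restore steps can be sketched the same way, as a dry-run list of commands. Again this rests on assumptions: the AWS CLI subcommands are real, but every ID, the availability zone, the device name and the "bosh-vm" / "workstation" labels are placeholders. The device Bosh expects its persistent disk on is not something I can vouch for, so check it on your own VM. The ID of the new volume is only known once `create-volume` has run, so the plan carries a placeholder for it.

```python
import shlex
from typing import List, Tuple

Step = Tuple[str, str]  # (where to run it, command line)


def restore_plan(empty_volume_id: str, snapshot_id: str, availability_zone: str,
                 instance_id: str, device: str = "/dev/sdf",
                 new_volume_id: str = "NEW_VOLUME_ID") -> List[Step]:
    """Return the restore commands in order (dry run, nothing is executed).

    device and new_volume_id are placeholders: replace new_volume_id with
    the ID that create-volume reports, and device with whatever your Bosh
    VM actually uses for its persistent disk.
    """
    def aws(*args: str) -> Step:
        return ("workstation", shlex.join(["aws", "ec2", *args]))

    return [
        ("bosh-vm", "monit stop all"),                         # stop before detaching
        aws("detach-volume", "--volume-id", empty_volume_id),  # the new, empty disk
        aws("delete-volume", "--volume-id", empty_volume_id),
        aws("create-volume", "--snapshot-id", snapshot_id,
            "--availability-zone", availability_zone),         # same AZ as the VM
        aws("attach-volume", "--volume-id", new_volume_id,
            "--instance-id", instance_id, "--device", device),
        ("bosh-vm", "monit start all"),                        # bring Bosh back up
    ]


if __name__ == "__main__":
    # All identifiers below are hypothetical.
    for where, command in restore_plan("vol-0empty", "snap-0123", "us-east-1a", "i-0123"):
        print(f"[{where}] {command}")
```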
So there we go. Using Pivotal Cloud Foundry keeps your apps, processes and VMs highly available, even more so if you use multiple AZs. However, you must ensure you back up your Ops Manager configuration so you can bring the Ops Manager back if a catastrophe or malicious person causes it to die. Once Ops Manager is back, you can use it to bring back Bosh. If you use an external database and blobstore for Bosh, then you won’t have to worry about snapshotting Bosh’s persistent volume. There are other key things you should do to make your system secure that we will discuss in the next post, but this post’s focus has been on Bosh and the PCF Ops Manager.