TUTORIAL: Automating ERT Backups with BBR and Concourse

August 17, 2017 Therese Stowell

Operators have a new way to back up and restore Pivotal Cloud Foundry deployments. BOSH Backup and Restore (BBR) is now generally available.

The product aims to simplify how Pivotal Cloud Foundry and related components are backed up and restored.

Backup and restore in distributed systems are thorny issues. Custom code, data services, and platforms constantly change across multiple infrastructure targets. How do you accurately capture the state of the system? How do you bring it back reliably?

BOSH helps ease these challenges, so we made it the backbone of the new product. After all, Cloud Foundry components are simply BOSH deployments. We discussed our design thinking in the BBR public beta launch blog.

BBR Performs Backups of PCF Easily and Reliably

BOSH Backup and Restore (BBR) is a CLI for orchestrating the backup and restore of BOSH deployments and BOSH Directors. It triggers the backup (or restore) process on the deployment or Director, and transfers the backup artifact to and from the deployment or Director.

How is it working for our customers so far? Here’s a slice of feedback from our beta users:

  • The reduced Cloud Foundry API downtime and easy compatibility are a big plus.

  • For private cloud users, we’ve confirmed that BBR works on a wide range of hardware.

  • Customers are replacing production-tested custom backup scripts with BBR.

  • Operators appreciate that BBR provides a framework for centrally backing up custom BOSH releases. BBR acts as a contract between the platform operators and these third-party release vendors.

There’s more exciting news: BOSH Backup and Restore is now in the Cloud Foundry incubator program. That means BBR is on track to become a full-fledged project under the guidance of the Cloud Foundry Foundation.

So how do you use BBR in practice? We’ll run a backup manually first, then show you how to automate the process!

Getting Started with BBR

First things first - here are the prerequisites:

  • A supported BOSH deployment or BOSH Director. BBR support was first added in PCF 1.11, so we’ll back up Elastic Runtime 1.11 (ERT). See our documentation for all the components BBR currently supports.

  • A jumpbox with the BBR binary. You can use the Ops Manager VM, as long as it has enough persistent disk to hold the backup artifacts. The Ops Manager VM has network access to the BOSH Director and the ERT VMs - these are also requirements for backup and restore with BBR. If you choose another jumpbox option, you will need to ensure these connections are in place.

To install the BBR binary onto the Ops Manager VM, let’s ssh into it, then download the BBR release from GitHub to the VM and unpack it:

:~$ wget https://github.com/cloudfoundry-incubator/bosh-backup-and-restore/releases/download/v1.0.0/bbr-1.0.0.tar

:~$ tar -xvf bbr-1.0.0.tar 
releases/
releases/bbr
releases/bbr-mac
releases/checksum.sha256

:~$ cp releases/bbr .
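Before running a downloaded binary, it is worth confirming that it matches a published digest. The sketch below is a generic helper, not part of BBR: `verify_sha256` is a hypothetical function name, and it assumes GNU `sha256sum` is available on the jumpbox and that you have obtained the expected digest yourself (for instance by reading `releases/checksum.sha256`, whose exact layout is not shown here).

```shell
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HEX
# Hypothetical helper: succeeds only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(sha256sum "$file" 2>/dev/null | awk '{print $1}')"
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got '${actual}')" >&2
    return 1
  fi
}
```

A call would look like `verify_sha256 releases/bbr "<expected digest>"`, where the digest placeholder is whatever value you trust for this release.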

Provide Arguments & Credentials

Next, we’ll gather the arguments we’ll pass to the BBR binary. We need to provide the name of the deployment to back up, and the credentials for connecting to the BOSH Director.

Let’s start with the BOSH deployment name for ERT, which tells BBR which deployment to back up. To find it, we connect to the BOSH Director and look for the deployment whose name starts with cf-.

Log into the Ops Manager BOSH Director from the Ops Manager VM. You can find the IP in the Ops Manager Director Status tab.

:~$ bosh2 --ca-cert /var/tempest/workspaces/default/root_ca_certificate  -e 10.0.0.5 login

Email (): director

Password ():

Successfully authenticated with UAA

Now, let’s see which deployments this Director manages:

:~$ bosh2 --ca-cert /var/tempest/workspaces/default/root_ca_certificate -e 10.0.0.5 deployments

Using environment '10.0.0.5' as user 'director' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

Name                     Release(s)                          Stemcell(s)                                    Team(s)  Cloud Config
cf-9d536bda70e40707c83d  push-apps-manager-release/661.1.24  bosh-google-kvm-ubuntu-trusty-go_agent/3421.3  -        latest

cf-9d536bda70e40707c83d is the BOSH deployment name for ERT. This is what we’ll use with BBR.
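If you script this step, the deployment name can be picked out of the table by matching the cf- prefix. The sketch below assumes the plain-text table layout shown above, with the name in the first column; `find_ert_deployment` is a hypothetical helper name, not a BOSH or BBR command.

```shell
#!/usr/bin/env bash
# find_ert_deployment: read the output of `bosh2 ... deployments` on stdin
# and print the first deployment name that starts with "cf-".
# Assumes the deployment name is the first whitespace-separated column.
find_ert_deployment() {
  awk '!found && $1 ~ /^cf-/ { print $1; found = 1 }'
}
```

For example, you could pipe the earlier command into it: `bosh2 --ca-cert /var/tempest/workspaces/default/root_ca_certificate -e 10.0.0.5 deployments | find_ert_deployment`.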

Now, the credentials. BBR needs UAA credentials to connect to the BOSH Director. You can find these credentials in the Credentials tab of Ops Manager.

Configure ERT to be in a Backup-Ready State

Lastly, we’ll check that ERT is configured for backups. We’ll examine the backup-prepare node.

Follow the instructions for enabling the backup-prepare node for MySQL on the Internal MySQL configuration page in ERT. BBR uses this node to generate a backup of the ERT MySQL database. (NOTE: we plan to improve this interface in a forthcoming version of PCF.)

Run BBR

Now that ERT can be backed up, we can run BBR! To the CLI:

:~$ ./bbr deployment --target 10.0.0.5 --username bbr_client --password <> --deployment cf-9d536bda70e40707c83d --ca-cert /var/tempest/workspaces/default/root_ca_certificate backup

BBR first identifies the scripts that are implemented in the deployment:

[bbr] 2017/08/10 15:27:21 INFO - Running pre-checks for backup of cf-9d536bda70e40707c83d...

[bbr] 2017/08/10 15:27:21 INFO - Scripts found:

[bbr] 2017/08/10 15:27:39 INFO - mysql/c0f85f8a-b9c7-4cd6-9c76-06615ae2677d/mysql-restore/metadata

[bbr] 2017/08/10 15:27:39 INFO - mysql/c0f85f8a-b9c7-4cd6-9c76-06615ae2677d/mysql-restore/restore

[bbr] 2017/08/10 15:27:43 INFO - uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd/uaa/backup

[bbr] 2017/08/10 15:27:43 INFO - uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd/uaa/restore

[bbr] 2017/08/10 15:27:43 INFO - uaa/eae565d8-7c72-44f4-bae2-911db56e53ff/uaa/backup

[bbr] 2017/08/10 15:27:43 INFO - uaa/eae565d8-7c72-44f4-bae2-911db56e53ff/uaa/restore

[bbr] 2017/08/10 15:27:48 INFO - cloud_controller/f67f2e0a-9e8b-4ec9-b17c-1a7f271ea3d7/cloud-controller-backup/post-backup-unlock

[bbr] 2017/08/10 15:27:48 INFO - cloud_controller/f67f2e0a-9e8b-4ec9-b17c-1a7f271ea3d7/cloud-controller-backup/pre-backup-lock

[bbr] 2017/08/10 15:27:49 INFO - cloud_controller/b10ccf4c-abaf-474a-bceb-da396ce0904a/cloud-controller-backup/post-backup-unlock

[bbr] 2017/08/10 15:27:49 INFO - cloud_controller/b10ccf4c-abaf-474a-bceb-da396ce0904a/cloud-controller-backup/pre-backup-lock

[bbr] 2017/08/10 15:27:54 INFO - nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942/blobstore-backup/backup

[bbr] 2017/08/10 15:27:54 INFO - nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942/blobstore-backup/restore

[bbr] 2017/08/10 15:27:56 INFO - backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d/mysql-backup/backup

[bbr] 2017/08/10 15:27:56 INFO - backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d/mysql-backup/metadata

Then, BBR calls the scripts, in a specific order:

  1. All lock scripts - these scripts prevent the data from changing during the backup operation, which keeps the data stores consistent with one another. For example, a lock script might stop an API server from serving requests.
  2. All backup scripts - these scripts capture the data, for example by performing a database dump.
  3. All unlock scripts - these scripts undo the corresponding lock script. For example, an unlock script might restart an API server.
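The three phases above can be sketched as a small shell function. This illustrates the ordering only and is not BBR's implementation: the array names and step functions are hypothetical, and the decision to attempt every unlock step even after a failure is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# run_phases: run hypothetical lock, backup, and unlock steps in that order.
# Steps are function or command names listed in the arrays LOCK_STEPS,
# BACKUP_STEPS and UNLOCK_STEPS (names chosen for this sketch).
run_phases() {
  local status=0 step

  # Phase 1: lock. Stop at the first failure; nothing is backed up unlocked.
  for step in "${LOCK_STEPS[@]}"; do
    "$step" || { status=1; break; }
  done

  # Phase 2: back up, but only if every lock step succeeded.
  if [ "$status" -eq 0 ]; then
    for step in "${BACKUP_STEPS[@]}"; do
      "$step" || status=1
    done
  fi

  # Phase 3: unlock. Always attempted, so a failed backup does not leave
  # the system locked (a design choice of this sketch).
  for step in "${UNLOCK_STEPS[@]}"; do
    "$step" || status=1
  done

  return "$status"
}
```

The function returns non-zero if any step failed, so a caller can tell a clean run from one where the unlock phase ran only as cleanup.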

BBR timestamps each step; operators can track how long the Cloud Foundry API is locked. And the detailed logging of each step helps operators identify and debug any failures.

[bbr] 2017/08/10 15:28:01 INFO - Starting backup of cf-9d536bda70e40707c83d...

[bbr] 2017/08/10 15:28:01 INFO - Running pre-backup scripts...

[bbr] 2017/08/10 15:28:01 INFO - Locking cloud-controller-backup on cloud_controller/f67f2e0a-9e8b-4ec9-b17c-1a7f271ea3d7 for backup...

[bbr] 2017/08/10 15:28:25 INFO - Done.

[bbr] 2017/08/10 15:28:25 INFO - Locking cloud-controller-backup on cloud_controller/b10ccf4c-abaf-474a-bceb-da396ce0904a for backup...

[bbr] 2017/08/10 15:28:51 INFO - Done.

[bbr] 2017/08/10 15:28:51 INFO - Done.

[bbr] 2017/08/10 15:28:51 INFO - Running backup scripts...

[bbr] 2017/08/10 15:28:51 INFO - Backing up uaa on uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd...

[bbr] 2017/08/10 15:28:52 INFO - Done.

[bbr] 2017/08/10 15:28:52 INFO - Backing up uaa on uaa/eae565d8-7c72-44f4-bae2-911db56e53ff...

[bbr] 2017/08/10 15:28:52 INFO - Done.

[bbr] 2017/08/10 15:28:52 INFO - Backing up blobstore-backup on nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942...

[bbr] 2017/08/10 15:28:52 INFO - Done.

[bbr] 2017/08/10 15:28:52 INFO - Backing up mysql-backup on backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d...

[bbr] 2017/08/10 15:32:13 INFO - Done.

[bbr] 2017/08/10 15:32:13 INFO - Running post-backup scripts...

[bbr] 2017/08/10 15:32:13 INFO - Unlocking cloud-controller-backup on cloud_controller/f67f2e0a-9e8b-4ec9-b17c-1a7f271ea3d7...

[bbr] 2017/08/10 15:32:32 INFO - Unlocking cloud-controller-backup on cloud_controller/b10ccf4c-abaf-474a-bceb-da396ce0904a...

[bbr] 2017/08/10 15:32:48 INFO - Done.

[bbr] 2017/08/10 15:32:48 INFO - Done.
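The timestamps in the log above are enough to estimate how long the Cloud Foundry API was locked. The sketch below treats the span from the "Running pre-backup scripts..." line to the last "Done." after "Running post-backup scripts..." as that window, which is an approximation. `lock_window_seconds` is a hypothetical helper; it assumes GNU `date` and the log line layout shown above.

```shell
#!/usr/bin/env bash
# lock_window_seconds: read a BBR backup log on stdin and print the number of
# seconds between the start of the pre-backup (lock) phase and the last
# "Done." of the post-backup (unlock) phase.
# Assumes GNU date (`date -d`) and lines shaped like:
#   [bbr] 2017/08/10 15:28:01 INFO - message
lock_window_seconds() {
  local re='^\[bbr\] ([0-9]{4})/([0-9]{2})/([0-9]{2}) ([0-9:]{8}) INFO - (.*)$'
  local line ts msg start="" end="" unlocking=0

  while IFS= read -r line; do
    [[ $line =~ $re ]] || continue
    # Rewrite 2017/08/10 as 2017-08-10 so date parses it unambiguously.
    ts="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}-${BASH_REMATCH[3]} ${BASH_REMATCH[4]}"
    msg="${BASH_REMATCH[5]}"
    case "$msg" in
      "Running pre-backup scripts"*)  start="$ts" ;;
      "Running post-backup scripts"*) unlocking=1 ;;
      "Copying backup"*)              unlocking=0 ;;
      "Done."*) if [ "$unlocking" -eq 1 ]; then end="$ts"; fi ;;
    esac
  done

  if [ -z "$start" ] || [ -z "$end" ]; then
    echo "lock or unlock phase not found in log" >&2
    return 1
  fi
  echo $(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
}
```

For the run shown above, the window is 15:28:01 to 15:32:48, so saving the output to a file and running `lock_window_seconds < backup.log` would print 287 (4 minutes 47 seconds).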

Finally, BBR copies the backup artifacts from the deployment to the jumpbox. BBR checksums the artifacts to verify their integrity:

[bbr] 2017/08/10 15:32:48 INFO - Copying backup -- 4.0K uncompressed -- from uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd...

[bbr] 2017/08/10 15:32:48 INFO - Finished copying backup -- from uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd...

[bbr] 2017/08/10 15:32:48 INFO - Starting validity checks

[bbr] 2017/08/10 15:32:48 INFO - Finished validity checks

[bbr] 2017/08/10 15:32:48 INFO - Copying backup -- 4.0K uncompressed -- from uaa/eae565d8-7c72-44f4-bae2-911db56e53ff...

[bbr] 2017/08/10 15:32:48 INFO - Finished copying backup -- from uaa/eae565d8-7c72-44f4-bae2-911db56e53ff...

[bbr] 2017/08/10 15:32:48 INFO - Starting validity checks

[bbr] 2017/08/10 15:32:49 INFO - Finished validity checks

[bbr] 2017/08/10 15:32:49 INFO - Copying backup -- 3.0G uncompressed -- from nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942...

[bbr] 2017/08/10 15:33:40 INFO - Finished copying backup -- from nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942...

[bbr] 2017/08/10 15:33:40 INFO - Starting validity checks

[bbr] 2017/08/10 15:35:28 INFO - Finished validity checks

[bbr] 2017/08/10 15:35:30 INFO - Copying backup -- 70M uncompressed -- from backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d...

[bbr] 2017/08/10 15:35:31 INFO - Finished copying backup -- from backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d...

[bbr] 2017/08/10 15:35:31 INFO - Starting validity checks

[bbr] 2017/08/10 15:35:33 INFO - Finished validity checks

[bbr] 2017/08/10 15:35:33 INFO - Backup created of cf-9d536bda70e40707c83d on 2017-08-10 15:35:33.545182089 +0000 UTC

Let’s examine the “origin” directory where we kicked off the BBR process. You can see below that BBR created a folder named after the deployment and a timestamp of when the backup occurred.

:~$ ls

bbr  bbr-1.0.0.tar  cf-9d536bda70e40707c83d_20170810T152801Z  releases

So what’s in this directory? A tar file for each backup script and its associated metadata. The metadata file has checksums for each backup artifact, plus the start and finish time of the backup.

:~$ ls cf-9d536bda70e40707c83d_20170810T152801Z/

metadata  mysql-artifact.tar  nfs_server-0-blobstore-backup.tar  uaa-0-uaa.tar    uaa-1-uaa.tar

That’s it, we now have a backup of ERT! (For those interested in a complete backup of PCF 1.11, check out the docs here.)
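Before relying on a backup directory like the one listed above, a quick structural check can catch an incomplete copy. The sketch below only confirms that a `metadata` file and at least one `.tar` artifact are present. It does not re-verify the checksums, because the layout of the metadata file is not described here. `check_artifact_dir` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_artifact_dir DIR
# Hypothetical sanity check for a BBR backup directory: DIR must exist,
# contain a file named "metadata", and contain at least one .tar artifact.
check_artifact_dir() {
  local dir="$1" tar_count

  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/metadata" ]; then
    echo "missing metadata file in $dir" >&2
    return 1
  fi

  # Count .tar files directly inside DIR; $(( )) strips any padding from wc.
  tar_count=$(( $(find "$dir" -maxdepth 1 -type f -name '*.tar' | wc -l) ))
  if [ "$tar_count" -eq 0 ]; then
    echo "no .tar artifacts in $dir" >&2
    return 1
  fi

  echo "$dir: metadata present, $tar_count tar artifact(s)"
}
```

Running `check_artifact_dir cf-9d536bda70e40707c83d_20170810T152801Z` before a restore gives an early failure if the directory was only partially copied to another host.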

You Haven’t Really Backed Up Until You Can Restore

Backing up is only half the job, so let’s run a restore. The arguments to BBR are the same: the deployment name and the UAA credentials.

The difference is that we specify ‘restore’ instead of ‘backup,’ and we provide the path to the backup artifact directory.

:~$ ./bbr deployment --target 10.0.0.5 --username bbr_client --password <> --deployment cf-9d536bda70e40707c83d --ca-cert /var/tempest/workspaces/default/root_ca_certificate restore --artifact-path cf-9d536bda70e40707c83d_20170810T152801Z

BBR takes the backup artifact created by each backup script in the deployment and passes it to the corresponding restore script.

[bbr] 2017/08/14 08:58:45 INFO - Starting restore of cf-9d536bda70e40707c83d...

[bbr] 2017/08/14 09:00:19 INFO - mysql/c0f85f8a-b9c7-4cd6-9c76-06615ae2677d/mysql-restore/metadata

[bbr] 2017/08/14 09:00:19 INFO - mysql/c0f85f8a-b9c7-4cd6-9c76-06615ae2677d/mysql-restore/restore

[bbr] 2017/08/14 09:00:26 INFO - uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd/uaa/backup

[bbr] 2017/08/14 09:00:26 INFO - uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd/uaa/restore

[bbr] 2017/08/14 09:00:26 INFO - uaa/eae565d8-7c72-44f4-bae2-911db56e53ff/uaa/backup

[bbr] 2017/08/14 09:00:26 INFO - uaa/eae565d8-7c72-44f4-bae2-911db56e53ff/uaa/restore

[bbr] 2017/08/14 09:00:29 INFO - cloud_controller/f67f2e0a-9e8b-4ec9-b17c-1a7f271ea3d7/cloud-controller-backup/post-backup-unlock

[bbr] 2017/08/14 09:00:29 INFO - cloud_controller/f67f2e0a-9e8b-4ec9-b17c-1a7f271ea3d7/cloud-controller-backup/pre-backup-lock

[bbr] 2017/08/14 09:00:29 INFO - cloud_controller/b10ccf4c-abaf-474a-bceb-da396ce0904a/cloud-controller-backup/post-backup-unlock

[bbr] 2017/08/14 09:00:29 INFO - cloud_controller/b10ccf4c-abaf-474a-bceb-da396ce0904a/cloud-controller-backup/pre-backup-lock

[bbr] 2017/08/14 09:00:36 INFO - nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942/blobstore-backup/backup

[bbr] 2017/08/14 09:00:36 INFO - nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942/blobstore-backup/restore

[bbr] 2017/08/14 09:00:40 INFO - backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d/mysql-backup/backup

[bbr] 2017/08/14 09:00:40 INFO - backup-prepare/3591d800-f728-46e6-988a-8d518dfae90d/mysql-backup/metadata

[bbr] 2017/08/14 09:00:41 INFO - Copying backup to mysql/0...

[bbr] 2017/08/14 09:00:43 INFO - Done.

[bbr] 2017/08/14 09:00:43 INFO - Copying backup to uaa/0...

[bbr] 2017/08/14 09:00:43 INFO - Done.

[bbr] 2017/08/14 09:00:43 INFO - Copying backup to uaa/1...

[bbr] 2017/08/14 09:00:43 INFO - Done.

[bbr] 2017/08/14 09:00:43 INFO - Copying backup to nfs_server/0...

[bbr] 2017/08/14 09:04:51 INFO - Done.

[bbr] 2017/08/14 09:04:51 INFO - Running restore scripts...

[bbr] 2017/08/14 09:04:51 INFO - Restoring mysql-restore on mysql/c0f85f8a-b9c7-4cd6-9c76-06615ae2677d...

[bbr] 2017/08/14 09:06:18 INFO - Done.

[bbr] 2017/08/14 09:06:18 INFO - Restoring uaa on uaa/7d4ec7b9-9625-4714-9b57-69bc0bbf42bd...

[bbr] 2017/08/14 09:06:18 INFO - Done.

[bbr] 2017/08/14 09:06:18 INFO - Restoring uaa on uaa/eae565d8-7c72-44f4-bae2-911db56e53ff...

[bbr] 2017/08/14 09:06:18 INFO - Done.

[bbr] 2017/08/14 09:06:18 INFO - Restoring blobstore-backup on nfs_server/0c3c574f-e704-474d-ad85-5165ec03e942...

[bbr] 2017/08/14 09:06:57 INFO - Done.

[bbr] 2017/08/14 09:06:57 INFO - Completed restore of cf-9d536bda70e40707c83d

Of course, backing up PCF components is something you should do regularly. Sure, you can do this process manually if you like. But we recommend automating it!

Automate Your Backups with Concourse!

Why use Concourse for your backups? Three reasons.

  1. Concourse excels at task automation. That’s a great description of the backup process with BBR, as it involves a set of steps that will be run repeatedly.
  2. Concourse is pluggable. Operators can readily customize their backup policies with Concourse primitives. ‘Resources’ in Concourse automate workflows when there’s a change in an external resource (like a Git repo, an S3 bucket, or even time itself). Concourse offers a library of existing Resources, or operators can build their own. ‘Tasks’ in Concourse encapsulate an activity: the execution of a script in an isolated environment, with dependent resources available to it. Running BBR to back up a PCF component fits well with this construct.
  3. Concourse visualizes the state of the automation. Failures are easily discoverable. The BBR team at Pivotal runs a continuous, end-to-end backup and restore of PCF. This pipeline is shown below.

Each rectangle is a task, and a red rectangle indicates failure. Zoom in on a rectangle to inspect the task details.

To Automate Backups, Customize Our Reference Pipelines

We’ve created a reference pipeline and tasks that you can use as a starting point for backing up PCF via Concourse. Use these, and you’ll only have to configure three things: how often backups run (frequency), when the job should run (time of day), and where to store the backups.

Concourse Resources can help you automate your chosen configuration. For example, the time-resource can trigger backups on a regular schedule. The S3 resource can copy backup artifacts from the jumpbox to S3 storage.
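To make the idea concrete, here is a minimal sketch of a script that a Concourse task could run. It is not taken from the reference pipeline. The environment variable names (`BOSH_TARGET`, `BOSH_USERNAME`, `BOSH_PASSWORD`, `BOSH_DEPLOYMENT`, `BOSH_CA_CERT`, `OUTPUT_DIR`, `BBR_BIN`) and the function name are assumptions made for this example. It runs the same `bbr deployment ... backup` command used in the manual walkthrough, then bundles the resulting directory into a single tarball that a later step, such as a `put` to an S3 resource, could upload.

```shell
#!/usr/bin/env bash
# Hypothetical Concourse task script (not from the reference pipeline).
# Expects these variables, all named for this sketch only:
#   BOSH_TARGET, BOSH_USERNAME, BOSH_PASSWORD, BOSH_DEPLOYMENT, BOSH_CA_CERT
#   OUTPUT_DIR  directory where the tarball should be written
#   BBR_BIN     path to the bbr binary (defaults to ./bbr)
run_backup_task() {
  local bbr_bin="${BBR_BIN:-./bbr}"
  local artifact_dir tarball

  # Same invocation as the manual backup; bbr's own output goes to stderr
  # so that this function prints only the tarball path on stdout.
  "$bbr_bin" deployment \
    --target "$BOSH_TARGET" \
    --username "$BOSH_USERNAME" \
    --password "$BOSH_PASSWORD" \
    --deployment "$BOSH_DEPLOYMENT" \
    --ca-cert "$BOSH_CA_CERT" \
    backup 1>&2 || return 1

  # BBR writes <deployment>_<timestamp> into the working directory. The
  # timestamp format sorts chronologically, so the last match is the newest.
  artifact_dir="$(printf '%s\n' "${BOSH_DEPLOYMENT}"_*/ | sort | tail -n 1)"
  artifact_dir="${artifact_dir%/}"
  if [ ! -d "$artifact_dir" ]; then
    echo "no backup directory found for $BOSH_DEPLOYMENT" >&2
    return 1
  fi

  tarball="${OUTPUT_DIR}/${artifact_dir}.tar"
  tar -cf "$tarball" "$artifact_dir" || return 1
  echo "$tarball"
}
```

In a pipeline, the values would typically come from task parameters rather than being typed on a command line, and a time-based resource would decide when the job containing this task runs.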

We’re working on reference pipelines and tasks for the restoration of PCF. The restore process is the same, regardless of whether the process is done manually or via automation.

The Road Ahead

As we often say, software is never done, just shipped. To this end, the BBR team has an exciting roadmap! We plan to:

  • Support additional restore scenarios (that don’t require creating a new PCF installation to restore into)

  • Support backup and restore of ERT configured with external databases and external blobstores

  • Enable operators to validate their backup artifacts without performing a full restore

  • Work with the data services teams (the new BOSH-deployed versions of RabbitMQ, Redis, MySQL, and Pivotal Cloud Cache) to add support for BBR

Have thoughts on our roadmap or feedback on the product? Please get in touch!

About the Author

Therese Stowell

Therese Stowell is a Product Manager at Pivotal. She has worked in the software industry for 20+ years as a programmer, interface designer, and product manager. She worked on Windows, developing the command line environment; founded a successful social enterprise; and was part of a startup team that won a Nesta Open Data Institute £40,000 prize. She also has an MA in Fine Art.
