Implementing Comprehensive PCF Automation Pipelines

July 22, 2019 Bright Zheng

[NOTE: This post refers to outdated components. We recommend Platform Automation for this use case instead.]

Pivotal Cloud Foundry (PCF) has proven itself an amazing cloud platform through its adoption by most of the Fortune 500 and many more enterprises, and a huge ecosystem has grown around it.

If you, as a platform engineer, browse Pivotal Network, you can easily find, download, and install a range of enterprise-grade services for your developers, be it a cache/DB service, a message queue service, an APM service, or anything else. Developers can then consume these services in a unified, streamlined way to build cloud-native applications, simply by issuing commands like:

$ cf create-service p.mysql db-small mysql-for-spring-music
$ cf bind-service spring-music mysql-for-spring-music

$ cf restage spring-music

Everything just works: your app (“spring-music” in the commands above) is now backed by the DB service instance (“mysql-for-spring-music”) and serving your users.

That is already a good start!

What About Operability?

Now, some questions naturally arise:

  • How do we install these services?

  • More importantly, looking at long-term operations, how do we upgrade and patch them in an easy but consistent way whenever updates or patches are released?

  • How do we roll these practices out to more foundations, say “dev”, “qa”, “prod”, and beyond?

Frankly speaking, installing, upgrading, and patching PCF and its products, which are called tiles, is already easy. Using the Ops Manager UI and following a consistent manual process, all of this can be achieved within a few clicks.

Below are the typical steps to install a tile:

  1. Browse Pivotal Network and download the right product you're going to install;

  2. Click the "Import a Product" button on Ops Manager to upload and import it;

  3. Once it's been uploaded and imported, click the "+" button to stage it;

  4. A tile will be shown on the Ops Manager dashboard; click it and configure it by following the well-documented product docs;

  5. Once configuration is done, click "REVIEW PENDING CHANGES" and then "APPLY CHANGES";

  6. After a while, the tile is fully installed and the service offered by it is ready to use.

How about upgrades and patches? Well, the processes are almost the same.

Behind the scenes, BOSH (think of it as the “IaaS orchestrator for PCF” if you’re hearing about it for the first time) does all the magic for installs, upgrades, and patches. By employing a canary deployment mechanism, it introduces no downtime to the services and applications running on top.

But think of the vision of “platform as a product” and of Site Reliability Engineering (SRE) practices, like “Google places a 50% cap on the aggregate ‘ops’ work for all SREs”. Can we further improve platform operability by reducing risk and toil while installing, upgrading, and patching PCF and its products?

Fortunately, the short answer is “yes”, and I’ll walk you through the “how” in detail.

Introducing Platform Automation for PCF

Recently Pivotal announced the GA of Platform Automation for PCF, which provides essential building blocks for automating the installation, upgrade, and patching of PCF foundations and services. Platform engineers can realize the benefits of small, constant platform upgrades: significantly reduced risk, streamlined upgrading, and improved stability while operating PCF.

So let’s get started on the journey to implementing comprehensive PCF automation pipelines using the building blocks offered by Platform Automation for PCF.

The Model To Pursue

First, we need to agree on the model we’re going to pursue. The model must be easy, consistent, and sustainable.

The Categories in PCF World

In the PCF world, Ops Manager provides the UI and APIs that drive all of the tiles’ lifecycle management. And don’t forget: even Pivotal Application Service (PAS) and Pivotal Container Service (PKS) are just “another” tile from Ops Manager’s perspective.

So, at a very high level, we can simply categorize the PCF components as Ops Manager (or simply “OpsMan”) and tiles.

The Operational Types

From a generic operational lifecycle management view, the operational types are obviously “install”, “upgrade”, and “patch”.

In most cases, “install”, “upgrade”, and “patch” are different things. For example, “install” has to build something from scratch, while “upgrade” and “patch” build on a baseline established by an “install” or a previous “upgrade” or “patch”. Meanwhile, from a Semantic Versioning perspective, a “patch” shouldn’t introduce breaking changes, while an “upgrade” across major versions may.

The Operational Stages

It may be a good idea to start with the current operational activities I mentioned while describing the manual process and map them to some generic stages.

From a high-level perspective, the stages can be the following (a rough mapping of these stages to Platform Automation tasks follows the list):

  1. Download, upload and stage a desired product, or a set of products;

  2. Generate the product(s)’ config (so that we have the “code” to work with);

  3. Configure the product(s) (with templatized/parameterized “code”);

  4. Apply the changes;

  5. Export current settings (as a way of backing up our “code”).
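For reference, these stages map quite naturally onto the tasks shipped with Platform Automation for PCF. The mapping below is only a rough, illustrative sketch, not a real config file; verify the exact task names against the version of Platform Automation you are running:

# Rough, illustrative mapping of the generic stages to Platform Automation tasks
stages-to-tasks:
  download-upload-stage: [download-product, upload-and-stage-product]
  generate-config:       [staged-config, staged-director-config]
  configure:             [configure-product, configure-director]
  apply-changes:         [apply-changes, apply-director-changes]
  export-settings:       [export-installation]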

In summary, the model can be illustrated as below:

The GitOps Experience

Unfortunately, there is no standard definition of GitOps yet. But the idea of GitOps, which can be considered a way of doing continuous delivery with a Git repository as the source of truth, has been widely accepted and practised.

Platform Automation for PCF introduces a recommended file structure as a starting point for managing and versioning the configs.

Let’s enhance it a little, based on some real-world experience, to make it more GitOps friendly.

A Concourse Resource To Fulfil Semantic Version Requirements

We need a mechanism to drive semantic-version-based upgrades and patches of the products we choose along the platform journey.

For a specific product, there is a way to “describe” it, as required by the “download-product” task. Taking PKS as an example, it can be described as something like:

pivotal-container-service:
  product-version: "1.4.0"
  pivnet-product-slug: pivotal-container-service
  pivnet-api-token: ((pivnet_token))
  pivnet-file-glob: "*.pivotal"
  stemcell-iaas: google

So we need a Concourse Resource Type to fully support semantic version requirements:

  1. To differentiate semantic versions between “upgrades” and “patches”;

  2. To retrieve a set of configs based on the desired semantic version.

To achieve this, we need a mechanism to indicate which semantic version changes you care about, using configurable patterns.

For upgrades, let’s start a bit conservatively: we should be able to track changes with patterns like “m.n.*”, which simply means “for upgrades, I care about major and/or minor version changes”. So upgrades look like “1.12.2” -> “2.0.3”, or “2.4.7” -> “2.5.3”.

For patches, let’s track only patch version changes, so a pattern like “*.*.p” simply means “I care only about patch version changes”. So patches look like “2.4.7” -> “2.4.12”.

And we should put all of a foundation’s products together within one single YAML file, so that tracking the products and versions stays human friendly. Let’s name it “products.yml” here.
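For illustration, a “dev” foundation’s products.yml might look like the sketch below. Each product block reuses the “download-product” describe format shown earlier; treat the exact set of keys, and how the version patterns are wired in, as assumptions to verify against the Semver Config Resource’s documentation:

# dev/products.yml -- illustrative sketch only
pivotal-container-service:
  product-version: "1.4.0"           # "m.n.*" changes are tracked as upgrades; "*.*.p" changes as patches
  pivnet-product-slug: pivotal-container-service
  pivnet-api-token: ((pivnet_token))
  pivnet-file-glob: "*.pivotal"
  stemcell-iaas: google
p-redis:
  product-version: "2.2.1"
  pivnet-product-slug: p-redis
  pivnet-api-token: ((pivnet_token))
  pivnet-file-glob: "*.pivotal"
  stemcell-iaas: google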

Once a desired change is detected, the desired set of configs should be retrieved to drive the proper pipelines.

Note: I already implemented a Concourse Resource, named “Semver Config Resource”, to fulfill these requirements.

Multi-foundation Support

There are many ways to support multiple PCF foundations across an organization, as described here.

To make things easier, I’d suggest having one single Git repository support multiple foundations, with a “root” folder per foundation.

For example, say we have two foundations: a development foundation, with the code “dev”, and a production foundation, with the code “prod”. We can start building the Git repository with something like this:

├── README.md
├── dev
│   └── products.yml
└── prod
    └── products.yml

Establish Naming Conventions

The “convention over configuration” pattern is still valid here.

By following the recommended file structure, we can build the Git repository a bit further:

├── README.md
├── dev
│   ├── config
│   │   ├── auth.yml
│   │   └── global.yml
│   ├── env
│   │   └── env.yml
│   ├── state
│   │   └── state.yml
│   └── products.yml
└── prod
    └── <OMITTED FOR BREVITY>

For products, we should use the right identifier to name the files, and the product slug is the natural candidate. Three more folders are created for product-related config management:

├── README.md
├── dev
│   ├── config
│   │   ├── auth.yml
│   │   └── global.yml
│   ├── env
│   │   └── env.yml
│   ├── generated-config
│   │   └── <PRODUCT-SLUG>.yml
│   ├── products
│   │   └── <PRODUCT-SLUG>.yml
│   ├── state
│   │   └── state.yml
│   ├── vars
│   │   └── <PRODUCT-SLUG>-vars.yml
│   └── products.yml
└── prod
    └── <OMITTED FOR BREVITY>

Notes:
1. There is already a sample Git repository hosted in my GitHub for your reference, here.
2. The product-related files can be generated and populated later, after “flying” the pipelines, during the templatization and parameterization process.

The S3 Service and Buckets

An S3 service should be used to host the artifacts of Platform Automation for PCF, which include a pre-baked Docker image file (e.g. platform-automation-image-3.0.1.tgz) and a tasks file (e.g. platform-automation-tasks-3.0.1.zip).

The S3 service is also a great place to host the exported installation settings files.

So we should have the following buckets pre-created (a sketch of the corresponding Concourse resources follows the list):

  1. platform-automation: The bucket to host platform-automation artifacts, e.g. the Docker image (.tgz) and the tasks (.zip);

  2. <FOUNDATION-CODE>, e.g. dev: one bucket per foundation is recommended for hosting the exported installation files.
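As a sketch, the corresponding Concourse resources could be declared with the standard S3 resource as below. The bucket names follow the conventions above, while the regexp patterns, credential names, and endpoint are assumptions to adapt to your environment:

resources:
- name: platform-automation-image
  type: s3
  source:
    bucket: platform-automation
    regexp: platform-automation-image-(.*).tgz   # e.g. platform-automation-image-3.0.1.tgz
    access_key_id: ((s3_access_key_id))
    secret_access_key: ((s3_secret_access_key))
    endpoint: ((s3_endpoint))
- name: platform-automation-tasks
  type: s3
  source:
    bucket: platform-automation
    regexp: platform-automation-tasks-(.*).zip
    access_key_id: ((s3_access_key_id))
    secret_access_key: ((s3_secret_access_key))
    endpoint: ((s3_endpoint))
- name: installation
  type: s3
  source:
    bucket: dev                                  # one bucket per foundation
    regexp: installation-(.*).zip
    access_key_id: ((s3_access_key_id))
    secret_access_key: ((s3_secret_access_key))
    endpoint: ((s3_endpoint))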

Implement The Pipelines

Guided by the model we established, let’s implement the pipelines.

For OpsMan:

  1. Install OpsMan

  2. Upgrade OpsMan

For Tiles/Products:

  1. Install Products

  2. Upgrade Products

  3. Patch Products

Notes:

1. For those who are impatient, you may simply check out the pipelines from this GitHub repository.
2. After real-world exercises, I found that installing and upgrading products can be identical, so these two pipelines can be merged into one.

Install OpsMan

By reusing the tasks offered by Platform Automation for PCF, implementing pipelines becomes trivial, especially when you have a clear design and know what you’re going to achieve.

The purpose of this pipeline is to install OpsMan, configure OpsMan’s authentication, and then configure the BOSH Director, so two products are involved:

  • Ops Manager

  • BOSH Director

The Concourse jobs we should consider include:

  1. create-opsman-and-configure-auth: use the “download-product” task to download the Ops Manager artifact; use the “create-vm” task to create the Ops Manager instance; then use the “configure-authentication”, “configure-ldap-authentication”, or “configure-saml-authentication” task to configure the desired authentication method for the first time, before logging into Ops Manager (see the sketch after this list);

  2. generate-director-config: use the “staged-director-config” task to generate the staged BOSH Director config. Please note that it’s a good practice to configure the BOSH Director through the Ops Manager UI first and then generate its config for further templatization and parameterization;

  3. configure-director: use the “configure-director” task to apply the BOSH Director’s config file from the Git repository, so we follow the GitOps spirit of treating the Git repository as the source of truth;

  4. apply-director-changes: use “apply-director-changes” task to apply changes only to BOSH Director;

  5. export-installation: use the “export-installation” task to export the installation settings as a zip file and then push it to S3. This is a good practice and should be embedded in every pipeline so that we always have a backup of the settings whenever there are changes.
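To make this more concrete, here is a heavily simplified sketch of the first job wired up with Platform Automation’s task files. The resource names, the file paths under “dev/”, and the param values are assumptions based on the repository layout above; verify the task inputs and params against the Platform Automation for PCF docs for your version:

jobs:
- name: create-opsman-and-configure-auth
  serial: true
  plan:
  - in_parallel:
    - get: platform-automation-image
      params: {unpack: true}
    - get: platform-automation-tasks
      params: {unpack: true}
    - get: config                  # the Git repository holding the per-foundation configs
  - task: download-opsman-image
    image: platform-automation-image
    file: platform-automation-tasks/tasks/download-product.yml
    input_mapping: {config: config}
    params:
      CONFIG_FILE: dev/products/ops-manager.yml   # assumed describe file for Ops Manager
  - task: create-vm
    image: platform-automation-image
    file: platform-automation-tasks/tasks/create-vm.yml
    input_mapping:
      image: downloaded-product
      state: config
      config: config
    params:
      OPSMAN_CONFIG_FILE: dev/products/opsman.yml # assumed Ops Manager VM config
      STATE_FILE: dev/state/state.yml
  - task: configure-authentication
    image: platform-automation-image
    file: platform-automation-tasks/tasks/configure-authentication.yml
    input_mapping: {env: config, config: config}
    params:
      ENV_FILE: dev/env/env.yml
      AUTH_CONFIG_FILE: dev/config/auth.yml       # check the exact param name for your release

In the real pipeline, the state file produced by “create-vm” also has to be persisted back to the Git repository (or another store) so that later runs and the “upgrade-opsman” pipeline can find the Ops Manager VM.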

There is a very important process, which I simply call “templatization and parameterization”. It means that:

  1. Once a product is configured, its config file can be generated, automatically or on demand, by triggering the “generate-director-config” job, and is pushed into the Git repository as “<FOUNDATION-CODE>/generated-config/<PRODUCT-SLUG>.yml”;

  2. This <PRODUCT-SLUG>.yml file is a raw file, which should be copied into two folders: 

    1. “<FOUNDATION-CODE>/products/<PRODUCT-SLUG>.yml”. In this case the “<PRODUCT-SLUG>.yml” is “director.yml”

    2. “<FOUNDATION-CODE>/vars/<PRODUCT-SLUG>-vars.yml”. In this case the “<PRODUCT-SLUG>-vars.yml” is “director-vars.yml”

  3. Then templatize and parameterize both files to make sure that:

    1. The files under “/products” are reusable across foundations, if possible;

    2. The files under “/vars” keep only the variables, be they static values or credential-manager-supported tokens, so credentials and sensitive info can be stored and managed properly through a credential manager like CredHub (see the sketch after this list).
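As a tiny, purely hypothetical illustration of the split: a generated director.yml may contain literal IaaS values that we replace with ((placeholders)), while the values themselves move into director-vars.yml (or a credential manager):

# dev/products/director.yml -- templatized snippet (illustrative only)
properties-configuration:
  iaas_configuration:
    project: ((gcp_project_id))
    default_deployment_tag: ((deployment_tag))

# dev/vars/director-vars.yml -- the foundation-specific values
gcp_project_id: my-dev-project
deployment_tag: pcf-dev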

We should reserve some decision-making points in this pipeline to make it a fully “controlled” process:

  • configure-director: we can idempotently configure the Director by applying the config file, “director.yml” here, once the “templatization and parameterization” process is done, or any time after you’ve made sensible changes to “director.yml” and/or “director-vars.yml” in the Git repository.

  • apply-director-changes: whether to apply the changes to the Director is always a decision-making point. If you have configured the BOSH Director correctly, applying director changes will always perform a check and, should there be any changes, rebuild the BOSH Director without introducing any downtime to PCF’s services and applications.

This pipeline can be illustrated by the example below:

Upgrade OpsMan

Having walked through “install-opsman”, building “upgrade-opsman” becomes much easier, as the baseline has already been established.

Before upgrading OpsMan, it’s always a good idea to export the current settings just in case, even though every pipeline should already have the export job embedded.

So the Concourse jobs we should consider include:

  1. export-installation-before: use “export-installation” task to export current installation settings and push it to S3;

  2. upgrade-opsman: use the “download-product” task to download the desired version of the Ops Manager artifact; use the “upgrade-opsman” task to upgrade the Ops Manager instance; once the Ops Manager upgrade is done, use the “staged-director-config” task to generate the latest BOSH Director config and push it back to the Git repository’s “/generated-config/director.yml” too;

  3. configure-director: as usual, use the “configure-director” task to apply the BOSH Director’s config file from the Git repository, so we follow the GitOps spirit of treating the Git repository as the source of truth;

  4. apply-director-changes: use “apply-director-changes” task to apply changes only to BOSH Director;

  5. export-installation: use “export-installation” task again to export installation settings as a zip file and then push it to S3. 

Please note that we should use the version pattern “m.n.p” here, so the “upgrade-opsman” pipeline takes care of any Ops Manager version change, be it major, minor, or patch. In other words, the process pays the same attention to Ops Manager upgrades and patches alike.

This pipeline can be illustrated by the example below:

Install Products

I originally thought of breaking this down into a “one pipeline per product” pattern so that we could have granular control over each specific product.

It turns out that this may not be a good idea, for a few reasons:

  1. Some products may depend on other product(s). For example, CredHub Service Broker depends on Pivotal Application Service. There is a mechanism for tile developers to declare dependencies in the tile metadata, like:

requires_product_versions:
- name: cf
  version: '>= 2.4.0'

So it’s better to let Ops Manager discover the dependencies automatically.

  2. In the real world, bulk installation of several products is very common;

  3. And don’t forget: with the newly built “Semver Config Resource”, granular control over which product(s) to proceed with is still always at hand.

As we march towards handling multiple configurable products in one pipeline, YAML templating becomes a must-have requirement.

“ytt” is a great templating tool that understands YAML structure, allowing you to focus on your data instead of how to properly escape it.

So we can break down the pipeline requirements to several parts:

  1. groups: about how to group the jobs for better logical view/focus

  2. resource_types: the custom resource types we’re using, if any

  3. resources: the resources involved in our pipelines

  4. jobs: the Concourse jobs being orchestrated by our pipelines

We’d build all of these as templates, one file each, and use a master pipeline template file, say “install-products.yml”, to sketch out the skeleton, like this:

#@ load("groups.lib.yml", "groups")

#@ load("resource-types.lib.yml", "resource_types")

#@ load("resources.lib.yml", "resources")

#@ load("jobs.lib.yml", "jobs")

---

groups: #@ groups()

resource_types: #@ resource_types()

resources: #@ resources()

jobs: #@ jobs()

Building the templates then becomes trivial and straightforward. I’d recommend checking out ytt’s website for details about its features, syntax, and tips.
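For instance, a jobs.lib.yml could define a ytt function that loops over the products we care about and emits the per-product jobs plus the shared ones. The sketch below is deliberately stripped down: the product list is a placeholder and the actual task steps are elided:

#! jobs.lib.yml -- a much-simplified sketch; the product list and the elided plans are placeholders
#@ def jobs():
#@ for p in ["pas", "csb", "redis"]:
- name: #@ "download-upload-stage-product-" + p
  plan: []
- name: #@ "generate-product-config-" + p
  plan: []
- name: #@ "configure-product-" + p
  plan: []
#@ end
- name: apply-product-changes
  plan: []
- name: export-installation
  plan: []
#@ end

In the real templates, the product list would come from ytt data values (so it can be injected per foundation), and each plan would invoke the corresponding Platform Automation task.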

Similarly, for each product, there should be jobs like:

  1. download-upload-stage-product

  2. generate-product-config

  3. configure-product

And they should be aggregated into the jobs below:

  1. apply-product-changes

  2. export-installation

Please note that each product tracks its own section of semantic-version-based config within “products.yml”, so again, granular control stays with you.

The decision-making points are almost the same as in the Ops Manager pipelines:

  • configure-product-<PRODUCT-ALIAS>: we can idempotently configure a product by applying its config file, “/products/<PRODUCT-SLUG>.yml” here, once the “templatization and parameterization” process is done, or any time it makes sense to you after config changes.

  • apply-product-changes: whether to apply the changes is always a decision-making point. Applying product changes employs the canary deployment mechanism to roll the changes through, without introducing any downtime to the applications (see the sketch after this list).
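As a sketch of that aggregation, “apply-product-changes” can use Concourse’s “passed” constraint so it only runs against config that has flowed through the per-product configure jobs. The job names and paths below are placeholders following the conventions above:

- name: apply-product-changes
  serial: true
  plan:
  - in_parallel:
    - get: platform-automation-image
      params: {unpack: true}
    - get: platform-automation-tasks
      params: {unpack: true}
    - get: config
      passed: [configure-product-pas, configure-product-redis]   # placeholders for the per-product configure jobs
  - task: apply-product-changes
    image: platform-automation-image
    file: platform-automation-tasks/tasks/apply-changes.yml
    input_mapping: {env: config}
    params:
      ENV_FILE: dev/env/env.yml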

This pipeline can be illustrated by the example below:

Note: Clicking a specific “group”, say “pas”, “csb”, or “redis” here, focuses the view on that product’s Concourse jobs.

Upgrade Products

Interestingly, you will eventually find that installing and upgrading products are identical, including the decision-making points.

So it might be the right time to rename the pipeline from “install-products” to “install-upgrade-products” to indicate that this pipeline caters for both product installs and upgrades.

Patch Products

As per the earlier discussion about semantic versioning, patches shouldn’t introduce any breaking changes, so this pipeline can be simplified a little compared to installing/upgrading products.

Similarly, for each product, there should be jobs like:

  1. download-upload-stage-product

  2. configure-product

And they should be aggregated into the jobs below too:

  1. apply-product-changes

  2. export-installation

Patching products can be fully automated: once you merge the pull request (PR), or update “products.yml” directly after patch planning, the pipeline simply triggers and walks through the patching process.

But it might still be a good idea to have a “psychologically safe” window for applying the changes to the production environment, which you can configure with a “patch-products-schedule” resource like this:

- name: patch-products-schedule
  type: time
  source: ((schedule.patch))

And you can inject the preferred schedule through a vars file with:

schedule:
  patch:
    start: 12:00 AM            # by following Time Resource https://github.com/concourse/time-resource
    stop: 1:00 AM
    location: Asia/Singapore

This simply instructs the pipeline: “please patch the configured products at midnight for me, instead of right now during office hours”.
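In the pipeline itself, this is achieved by having the apply job trigger on the time resource, roughly like the sketch below (the remaining steps stay the same as in the install/upgrade pipeline):

- name: apply-product-changes
  plan:
  - get: patch-products-schedule
    trigger: true            # the job only auto-triggers within the configured window
  # ... the usual get steps and the apply-changes task follow here ...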

This pipeline can be illustrated by the example below:

Note: Similarly, clicking a specific “group”, say “pas”, “csb”, or “redis” here, focuses the view on that product’s Concourse jobs.

The Concourse With The Control Plane

The control plane is another big topic, which is out of this article’s scope. You may refer to the Control Plane Reference Architectures here to see what works best for you.

In general, we need some major components, as already mentioned above:

  1. Concourse (of course, that’s what we’re building with)

  2. S3 Service

  3. Git Repository

This idea can be illustrated by the diagram below:

Conclusion

I’ve put all of these ideas together and implemented them as a set of workable pipelines, here, purely by orchestrating the Concourse tasks offered by Platform Automation for PCF, without needing to build a single custom task. This is clear evidence that the building blocks offered by Platform Automation for PCF are great and mature enough to start with.

The highlights of this implementation can be summarized as:

  • It’s an end-to-end PCF automation solution, built on top of Platform Automation for PCF, with a series of best practices embedded;

  • There are literally only FOUR (4) pipelines per ONE (1) foundation, covering whatever products you desire to install and operate;

  • It’s designed for multi-foundation use, so rolling it out to more PCF foundations just works too;

  • It’s fully compatible with the GA’ed Platform Automation for PCF v3.x.

This implementation has been attracting more and more interest and encouraging, positive feedback from the PCF community and from the customers I engage with. It’s still a community-driven project, so pull requests are always welcome. I believe you will see an officially curated set of pipelines offered by Pivotal very soon, but the ideas should be similar and the product config files should be reusable.

I hope this post helps you realize the power of Platform Automation for PCF and Concourse, and that you start innovating on how to drive automation not only for applications, but also for cloud-native platforms like PCF!

About the Author

Bright Zheng

Bright Zheng is a Senior Solutions Architect at Pivotal.
