PCF 2.1 and the Quest for a One Pizza Ops Team

March 28, 2018 Amit Gupta

The Pivotal Cloud Foundry product family makes a few promises to platform engineers and operators. First, we help you deliver high availability across your software estate. That means all your apps and containers can meet and exceed your SLAs and target SLOs. Second, we deliver operational efficiency. You don’t need a large team to run your application portfolio on PCF, even across multiple data centers. (In fact, several of our customers are delivering at remarkable scale with a developer to ops ratio of 500:1.)

The third promise is multi-cloud. PCF runs cleanly on your chosen cloud, and it runs the same way everywhere. All three promises have been central to Pivotal Cloud Foundry since its inception.

In the years since, our customers have offered a great deal of feedback. They told us they wanted these same promises to apply to more use cases, to solve more ‘jobs to be done’ as their PCF deployments scaled up.

It’s fitting, then, that Pivotal Cloud Foundry 2.1 - just released - includes a slew of enhancements that reinforce our core promises in new, useful ways.

Let’s take a deeper look at what you can expect from Ops Manager and BOSH Backup & Restore. For you BOSH power users out there, we’ll review a few new capabilities you’ll want to try out!

Operations Manager

Multi-tenancy Improvements

We’ve talked to many of you about multi-tenancy in Ops Manager. We tend to explore two major dimensions:

  • actions - limiting the types of things (read, write, administer) that a user can do

  • products - limiting the tiles a user can touch

For actions, Ops Manager 2.0 added support for 5 profiles (Admin, Full Control, Restricted Control, Full View, and Restricted View). In 2.1, Ops Manager takes the next major step; it allows multiple Full View and Restricted View users to be logged into Ops Manager at the same time. (Previously, if your colleague was logged in, and you wanted to take a look at something in Ops Manager, you would have to kick your colleague out.)

So what about the “products” dimension? Keep reading!

[Alpha] Support for Individual Tile Deploys - Feedback Wanted

Over the last year, we’ve refactored several areas of Ops Manager. (For example, we reduced the coupling between tiles, and pushed down the “source of truth” for shared configuration between products from Ops Manager into BOSH.)

We’re now exploring a scenario in which operators can deploy tiles individually. This workflow may be more attractive than “Apply Changes.” An “individual tile deploy” feature could help you in a few ways:

  • Enjoy shorter feedback loops

  • Gain more control over the “blast radius” of unplanned downtime during upgrades

  • Have more predictability over when - and how long - you may incur planned downtime when performing an upgrade (e.g. some of our services incur non-zero downtime for the apps that consume them, during stemcell patches).

This feature, in alpha, is also a step towards better multi-tenancy and separation of concerns in the product dimension alluded to previously.

But we have more work to do here, and we’d like your help. This feature is “hidden” in PCF 2.1, as we gather feedback about the edge cases. If you’d like to try this feature, we’d love to talk with you. Contact your platform architect!

More Control over Floating Stemcells

Like the prior features, this enhancement gives you more granular control over how changes get applied. And, you can limit the cascading impacts of a single change.

The existing floating stemcells feature makes it extremely simple for an operator to roll out a new stemcell to an entire foundation. You can perform this task as soon as you receive the PivNet alert that a new stemcell is available. This rapid deployment feature is essential when there’s a critical CVE. When a new stemcell has the fix, our customers want to roll it out as quickly and simply as possible. (After all, rapid patching is a key feature for Pivotal Cloud Foundry.)

But there are times when an operator may want to defer applying a new stemcell to a particular tile. This way, they control when that tile gets patched. Ops Manager’s new floating stemcell feature affords you that flexibility.

NOTE: The feature clearly indicates that you have unpatched software. Don’t forget to come back to this task!
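
To make the deferral idea concrete, here is a minimal Python sketch of a rollout plan. It is only a model of the behavior described above, not Ops Manager's implementation: the tile names, the `floating` flag, and the stemcell version strings are all hypothetical.

```python
def plan_stemcell_rollout(tiles, new_version):
    """Split tiles into those that take the new stemcell and those left unpatched."""
    to_patch, unpatched = [], []
    for name, tile in tiles.items():
        if tile["stemcell"] == new_version:
            continue  # already on the new stemcell, nothing to do
        if tile["floating"]:
            to_patch.append(name)   # floats along with the new stemcell
        else:
            unpatched.append(name)  # deferred by the operator, needs follow-up
    return sorted(to_patch), sorted(unpatched)


# Hypothetical foundation: the operator has deferred patching the "mysql" tile.
tiles = {
    "pas": {"stemcell": "3468.21", "floating": True},
    "mysql": {"stemcell": "3468.21", "floating": False},
    "rabbitmq": {"stemcell": "3468.21", "floating": True},
}

to_patch, unpatched = plan_stemcell_rollout(tiles, "3468.25")
print("will patch:", to_patch)
print("still unpatched:", unpatched)
```

The second list plays the role of the reminder in the NOTE: anything in it is still running on the older stemcell.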

New IaaS Enhancements

We’re always updating Ops Manager so our customers can take advantage of new compute, storage, and networking features. (Support for native encryption and security features is often important too.) To this end, we have added new capabilities for popular IaaS providers.

AWS

  • Application Load Balancers (a.k.a. ELB v2)

  • User-provided KMS keys for all disk encryption

  • Multi-VPC and multi-region support (caveat: setting up VPNs between VPCs is an exercise for the operator)

Google Cloud Platform

  • Google Cloud Storage for BOSH Director blobstore

vSphere

  • Ops Manager availability zones can consist of multiple vSphere clusters

We’re also working with Microsoft to support the Azure Application Gateway service, for more powerful load balancing. Stay tuned!

Keeping Up with IaaS Targets

You can now use the Ops Manager API to apply custom vm_extensions to instance groups.  This feature unlocks a huge number of capabilities “for free” in these IaaS use cases:

  • Using spot instance pricing on AWS

  • Defining finely-tuned IaaS security groups on a per-instance-group basis (rather than applying globally to the whole foundation)

  • Applying IAM instance profiles on AWS

  • Associating certain instances with GCP internal load balancers

This makes the PCF experience more flexible, even as IaaS targets change and improve rapidly. BOSH operators have long enjoyed these benefits. Now Ops Manager users can do the same.
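
As a rough illustration, the sketch below builds a vm_extension definition and references it from an instance group, using plain Python dictionaries. The extension name `spot-workers`, the bid price, the starting resource config, and the `additional_vm_extensions` key are assumptions made for this example; `spot_bid_price` is meant to mirror the AWS CPI cloud property of that name. Check the Ops Manager API documentation for the actual endpoints and payload shapes before relying on any of this.

```python
import json


def make_vm_extension(name, cloud_properties):
    """Return a vm_extension: a named bundle of IaaS-specific cloud properties."""
    if not name:
        raise ValueError("a vm_extension needs a name")
    return {"name": name, "cloud_properties": dict(cloud_properties)}


def attach_extension(resource_config, extension_name):
    """Return a copy of an instance group's resource config that references the extension."""
    updated = dict(resource_config)
    # "additional_vm_extensions" is an assumed key name for this sketch.
    existing = list(updated.get("additional_vm_extensions", []))
    if extension_name not in existing:
        existing.append(extension_name)
    updated["additional_vm_extensions"] = existing
    return updated


# Hypothetical example: run one instance group on AWS spot instances.
spot = make_vm_extension("spot-workers", {"spot_bid_price": 0.10})
compute_config = attach_extension({"instances": 3}, spot["name"])

print(json.dumps({"vm_extension": spot}, indent=2))
print(json.dumps(compute_config, indent=2))
```

Because the extension is attached per instance group, the other groups in the foundation keep their existing IaaS settings.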

Improved Handling of Networks and IPs

One of the original reasons we created Ops Manager was to handle the complexity of IP address management (IPAM). No two networks ever look the same, as experienced network admins will tell you.

In recent years, Ops Manager and BOSH haggled over the “source of truth” for IPAM. As of PCF 2.1, Ops Manager delegates all of its IPAM to BOSH. This unlocks several nice improvements for operators:

  • Ops Manager used to artificially limit the number of usable IPs for “bookkeeping” purposes. This was especially painful for large deployments on small CIDRs; that limitation is now gone!

  • A large class of issues, where certain operations could lead to IP collisions (two things expecting to use the same IP), can no longer occur!

  • Operators no longer need to specify a special “services network” for on-demand broker tiles. This is a big convenience for cases where a tile offers both on-demand and pre-provisioned modes of operation and the operator really only wants to use the pre-provisioned mode.
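
The reason a single “source of truth” rules out collisions can be shown with a small Python sketch. It is a toy allocator, not BOSH's actual IPAM logic; the class name, the subnet, the reserved address, and the instance names are all hypothetical.

```python
import ipaddress


class IpAllocator:
    """Single source of truth for IP assignments within one subnet."""

    def __init__(self, cidr, reserved=()):
        self.network = ipaddress.ip_network(cidr)
        self.reserved = {ipaddress.ip_address(ip) for ip in reserved}
        self.assigned = {}  # maps each allocated IP to its owner

    def allocate(self, owner):
        """Hand out the lowest free host address, or raise if the subnet is full."""
        for ip in self.network.hosts():
            if ip not in self.reserved and ip not in self.assigned:
                self.assigned[ip] = owner
                return ip
        raise RuntimeError(f"no free IPs left in {self.network}")

    def release(self, ip):
        """Return an address to the pool."""
        self.assigned.pop(ipaddress.ip_address(ip), None)


# A small CIDR: a /29 has six host addresses, one of which we reserve.
allocator = IpAllocator("10.0.0.0/29", reserved=["10.0.0.1"])
first = allocator.allocate("router/0")
second = allocator.allocate("diego_cell/0")
print(first, second)  # prints "10.0.0.2 10.0.0.3"
```

Every assignment goes through one table, so two owners can never be handed the same address, and no addresses need to be held back for a second bookkeeper.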

Platform Recovery

The impact of BOSH Backup & Restore is being felt in more PCF components. Here’s how that comes through in PCF 2.1.

Reduced PAS API Downtime During Backups

In PCF 1.11, we introduced BOSH Backup and Restore (bbr) -- a framework and SDK for BOSH releases to define their own backup and restore logic. (We also included a CLI for operators to easily execute all that logic.)  We hoped that this new mechanism would dramatically improve upon the existing backup methods. Signs are promising on this front. We recently learned that one of our customers has gone from 10 hours of PAS API downtime with an older backup method, down to only 15 minutes with bbr!

We know that the best way to make upgrades boring (while keeping the platform secure) is to apply updates to Ops Manager, stemcells, and all tiles as soon as they’re available. Further, we want you to be able to take backups every time you hit Apply Changes.

At the same time, we know that PAS API downtime can impact productivity for thousands of developers. What’s the solution? Well, in PCF 2.1, we’ve made PAS API downtime during backups even shorter. If you have ambitious SLOs (Service Level Objectives) for the platform, you no longer have to choose between meeting them and staying up to date on security patches. (NOTE: You will need the latest versions of PCF and bbr for these shorter windows to be realized.)

PAS Backup Supports External MySQL

PAS needs a MySQL database. It’s used by Cloud Controller, UAA, Diego, and other components. PAS allows you to use either the “internal” MySQL Galera cluster that comes with PAS or an “external” MySQL service (e.g. AWS RDS, GCP Cloud SQL, or an existing on-premises service). Whether your MySQL is internal or external, you can use bbr for backup and restore with the same guarantees around consistency and correctness. (Previously, only internal MySQL was supported.)

PAS Backup Supports S3 Versioned Blobstores

PAS also needs a blobstore to store the files associated with applications. One option is the “internal” WebDAV, which has been supported since PCF 1.11. Now, PCF 2.1 adds bbr backup and restore support for PAS configured with an external S3-compatible blobstore, provided versioning is enabled on the buckets.

BOSH

BOSH vm_resources -- three benefits in one!

BOSH now allows compute and disk resources for VMs in an instance group to be specified directly in a deployment manifest. Previously, operators had to specify a vm_type in the manifest, and then specify the compute resources for the vm_type in a cloud config.

Who does this help?

  1. Cloud Providers can optimize their respective CPIs to select their recommended instance types for each BOSH-deployed VM. Each CPI can be more streamlined and more efficient, based on required resources and other factors (like the availability of certain instance types in a given region). This is a big win for our partners at Google and Microsoft.

  2. OSS Software Vendors targeting BOSH can more easily ship manifests as a standalone representation of their product (e.g. cf-deployment, mysql-deployment, concourse-deployment, etc.). Now, the consumer does not have to ensure that their cloud config meshes well with the provided manifest.

  3. Once Ops Manager leverages this feature in a future PCF release, customers using Ops Manager will benefit from optimized instance type selection logic implemented in the CPIs.
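
To give a feel for the kind of selection logic a CPI could apply, here is a minimal Python sketch that picks the cheapest instance type satisfying a vm_resources request. It is not taken from any real CPI: the catalog, its prices, and the `available` filter are invented for illustration, and the request keys (`cpu`, `ram`, `ephemeral_disk_size`) are assumed to follow the BOSH manifest convention.

```python
# Hypothetical catalog: (name, vCPUs, RAM in MB, hourly price). Not real IaaS data.
CATALOG = [
    ("small", 1, 2048, 0.02),
    ("medium", 2, 4096, 0.04),
    ("large", 4, 16384, 0.16),
    ("xlarge", 8, 32768, 0.32),
]


def pick_instance_type(vm_resources, catalog=CATALOG, available=None):
    """Return the name of the cheapest catalog entry that satisfies cpu and ram.

    The ephemeral disk size is ignored here; this sketch only matches compute.
    """
    candidates = [
        entry for entry in catalog
        if entry[1] >= vm_resources["cpu"]
        and entry[2] >= vm_resources["ram"]
        and (available is None or entry[0] in available)
    ]
    if not candidates:
        raise LookupError(f"no instance type satisfies {vm_resources}")
    return min(candidates, key=lambda entry: entry[3])[0]


requested = {"cpu": 2, "ram": 8192, "ephemeral_disk_size": 10240}
print(pick_instance_type(requested))                        # prints "large"
print(pick_instance_type(requested, available={"xlarge"}))  # prints "xlarge"
```

The second call models a region where only some instance types are offered. The manifest stays the same, and the choice adapts to what the IaaS can provide.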

Multiple CPI Configs - Coming Soon!

With BOSH, the CPI configuration specifies what IaaS to target, and also specific things like:

  • What region or datacenter (e.g. Azure region or vCenter URL) to target

  • What credentials to use when BOSH rolls out a deployment to the chosen region or datacenter.

We recently improved BOSH so that operators can specify multiple CPI configs. If you construct manifests, cloud configs, and CPI configs just right, you can use this feature to:

  • Have service instances from different on-demand service brokers be deployed “on behalf of” different IaaS tenants. In this scenario, new service instances draw down against separate IaaS tenant quotas. This helps with overall capacity planning.

  • Treat entire vCenters or OpenStack installations as a single availability zone. This is useful when your on-premises IaaS architecture doesn’t easily support scaling up compute capacity within a single vCenter/OpenStack. (Customers in this predicament have to add capacity by stamping out whole new integrated hardware stacks that include chassis, storage, networking, server blades, hypervisor, and vCenter appliance or OpenStack APIs.)

This is a foundational feature that will make its way into PCF in the coming months. Stay tuned for updates!
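
The vCenter-per-availability-zone scenario can be sketched in Python as a lookup from an availability zone to the CPI entry it names. The structure below assumes that each availability zone in the cloud config carries a `cpi` key matching a `name` in the CPI config; the zone names, CPI names, and hostnames are hypothetical.

```python
def resolve_cpi(az_name, cloud_config, cpi_config):
    """Look up the CPI entry that an availability zone points at."""
    cpis = {cpi["name"]: cpi for cpi in cpi_config["cpis"]}
    for az in cloud_config["azs"]:
        if az["name"] == az_name:
            cpi_name = az.get("cpi")
            if cpi_name not in cpis:
                raise KeyError(f"AZ {az_name!r} references unknown CPI {cpi_name!r}")
            return cpis[cpi_name]
    raise KeyError(f"unknown availability zone {az_name!r}")


# Hypothetical configs: two vCenters, each treated as its own availability zone.
cpi_config = {
    "cpis": [
        {"name": "vcenter-east", "type": "vsphere",
         "properties": {"host": "vc-east.example.com"}},
        {"name": "vcenter-west", "type": "vsphere",
         "properties": {"host": "vc-west.example.com"}},
    ]
}
cloud_config = {
    "azs": [
        {"name": "z1", "cpi": "vcenter-east"},
        {"name": "z2", "cpi": "vcenter-west"},
    ]
}

print(resolve_cpi("z2", cloud_config, cpi_config)["properties"]["host"])
# prints "vc-west.example.com"
```

Adding capacity then amounts to adding one more CPI entry and one more availability zone, which matches the “stamp out a whole new stack” pattern described above.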

Solving Your Jobs to be Done

In recent months we’ve honed the operator experience in the platform. Recent highlights include an expanded set of APIs, faster upgrades, collocated errands, and the launch of BOSH Backup & Restore. With these enhancements - plus those in PCF 2.1 - you have the tools to deliver breakthrough efficiencies and availability, on your chosen cloud.

We look forward to helping you achieve superior business outcomes.

About the Author

Amit Gupta

Amit joined Pivotal in 2012, where he works as Director of Product Management, Pivotal Cloud Foundry. His focus is the platform operator experience.
