The timeline to update Pivotal Web Services with the Meltdown fix. The total elapsed time: 16 hours. The total engineer-time was less than an hour.
There’s nothing certain in life except death and taxes. We can safely add a third item to this list—security updates.
On the heels of a massive corporate breach, the business world was rocked by Meltdown and Spectre. It’s a certainty that more threats will arise in the coming months. You’ll need to protect your systems and data from those too.
How can you survive an onslaught of unpredictable threats? It’s easier than you think.
Change how you think about repairing vulnerabilities. This is a wonderful summary of where things stand today:
“I was in the middle of creating this slide (wrt patch hygiene) and had to stop half-way through and ask myself - aren’t we all just making this worse?” (Sam Newman, @samnewman, January 14, 2018, pic.twitter.com/fCTAYDc3Pn)
Let’s focus on the operating system level. Why? This layer can have a large attack surface. It’s where many attackers spend their time probing for weaknesses.
Your risk profile plummets when you stay up to date and quickly apply security fixes to your stack. With the right tools, processes, and automation, it’s trivial to do so.
Pivotal helps our customers improve their security posture with a rapid update capability. We give them confidence in their software supply chain (more on this later). The process works like this:
- A CVE is identified. A fix is supplied, and a new OS image is created.
- Pivotal conducts end-to-end tests with the new OS image.
- After the image has passed these tests, it is posted as a new “stemcell” on Pivotal Network.
- Customers are automatically notified about the availability of these updates.
- Platform engineers download the new files, then add them to PCF Ops Manager. Mature platforms with multiple PCF foundations tend to have automation pipelines for this flow. These pipelines allow engineers to manage versions and configuration for several foundations centrally. (We recommend Concourse.)
- Once the deployment is kicked off through Ops Manager or a Concourse pipeline, updated stemcells are automatically rolled out to the Pivotal Cloud Foundry installation.
The human involvement in the process takes less than an hour. The whole process may take a few hours to a day or more, depending on deployment size. No downtime is required, because a canary-style, rolling deployment process is used, and platform engineers apply the update during business hours. (No more emergency requests for a downtime window!) Your customers, partners, and developers don’t notice a lick of difference.
Applying security updates to Pivotal Cloud Foundry is highly automated and requires no downtime.
Pivotal Cloud Foundry customers run this way today. The visionaries at these companies have realized that rapid security updates are the must-have feature in 2018.
The threat landscape changes rapidly. You need to move fast to stay ahead of bad actors.
Case Study: The Rapid Update to Pivotal Web Services
Pivotal’s Cloud Operations team thinks this way. The engineers that look after Pivotal Web Services (our multi-tenant version of Pivotal Cloud Foundry) deploy to production daily.
When Meltdown and Spectre hit, the team did what it always does. Engineers kicked off the stemcell update using their Concourse deployment pipeline, then went home for the evening. The tooling, automation, and platform worked per usual.
The Pivotal Web Services deployment pipeline, managed by Concourse.
The only eventful thing that happened during the deployment: the fix itself was applied, boosting the security posture of the system.
The timeline at the top of the post illustrates how tens of thousands of application instances got updated in a matter of hours, with zero downtime. The total elapsed time from “stemcell is ready” to “deployed on all containers” (where it matters most): about 16 hours for several hundred VMs. The total engineer-time was less than an hour; manual tasks were just pipeline configuration and initiation.
Four Things Make Security Updates Push-Button Simple
Do you want security fixes applied this quickly to your systems? Here are the four things you need.
1. Confidence in Your Software Supply Chain
A reliable software supply chain helps you quickly apply new security updates. With trusted vendors, you don’t have to fiddle with extensive compatibility testing and other toil.
Remember those 6 steps from earlier? That’s a good example of a supply chain you can feel good about.
Here’s how the process works in the Cloud Foundry ecosystem. Cloud Foundry components feature an embedded operating system, provided by a stemcell. As a result, adopters don’t have to deal with the operating system (OS) directly.
- Canonical, the publisher of Ubuntu, releases a fix for the vulnerability.
- The Cloud Foundry stemcell pipeline detects the new OS change. The pipeline automatically produces, validates, and publishes the new stemcell.
- The new stemcell is then consumed by “downstream” engineering teams to ensure it does not introduce issues.
- Once downstream teams have confirmed there are no issues with the stemcell, it can then be applied to mitigate the CVE in Cloud Foundry deployments.
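The steps above amount to a chain of gates that a stemcell must clear before anyone applies it. Here is a minimal Python sketch of that idea. It is not the actual Cloud Foundry stemcell pipeline: the `StemcellCandidate` class, the `promote` function, and the gate names are all hypothetical, chosen only to illustrate "stop at the first failed check."

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StemcellCandidate:
    """Hypothetical record for a stemcell moving through the supply chain."""
    version: str                      # illustrative version label
    os_fix: str                       # label for the upstream OS fix it carries
    passed: List[str] = field(default_factory=list)


# A gate is a (name, check) pair; the check returns True when the gate passes.
Gate = Tuple[str, Callable[[StemcellCandidate], bool]]


def promote(candidate: StemcellCandidate, gates: List[Gate]) -> bool:
    """Run the gates in order and stop at the first failure.

    Returns True only if every gate passes, which is the point at which
    the stemcell would be considered safe to apply.
    """
    for name, check in gates:
        if not check(candidate):
            return False
        candidate.passed.append(name)
    return True


if __name__ == "__main__":
    # Hypothetical gates mirroring the steps in the list above.
    gates: List[Gate] = [
        ("build", lambda c: bool(c.os_fix)),
        ("validate", lambda c: True),
        ("downstream-teams", lambda c: True),
    ]
    candidate = StemcellCandidate(version="example-1", os_fix="upstream-kernel-fix")
    print(promote(candidate, gates), candidate.passed)
```

The ordering matters more than the individual checks: a candidate that fails validation never reaches downstream teams, so nothing unvetted reaches a running deployment.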
Pivotal curates these new updates for its Cloud Foundry customers. We also perform comprehensive tests with the updated OS images. That means that Pivotal confidently deploys the updated images to running systems, and our customers do too. (We also published detailed vulnerability reports.)
Our approach is holistic and tests the operating system and platform components together. This full-stack testing reduces risk, as it ensures that all layers in the stack interact and perform as expected.
So, for a PCF customer, it’s simple and fast. Download the new bits, do lightweight testing, get on with applying the fix.
With automated notification of new stemcells and rapid, largely automated consumption of those stemcells, your “mean time to repair” KPI plummets.
2. The Right Organizational Structure
Want to move fast? Organize your technical talent into smaller teams. Empower them to focus on frequent, small-batch deployments. Ideally, your teams are pushing code to production daily. Does that mean you need to go hire a bunch of new people? Absolutely not. Just rethink how your developers and operators work today.
Use value-stream mapping and lean principles to find ways to go faster. Most high-velocity organizations work in small, cross-functional teams. Further, those teams are incentivized to deliver value to customers.
Pivotal’s Cloud Operations team, which runs PWS, consists of seven engineers and one product manager. Cloud Ops deploys daily, but they’re always striving for more frequent, smaller deployments. In the spirit of continuous improvement, the team is investigating ways to fully automate some types of deployment. The goal: to deploy multiple times a day, including overnight.
Pivotal Cloud Foundry customers use this same model to go faster too.
3. A Modern Platform
High-velocity vulnerability fixes are made possible with an automated platform that’s built on immutable infrastructure concepts. You should also use a platform that allows for seamless, no-downtime updates. Armed with this capability, you can use a canary deployment model for your operating system updates.
With a canary deployment, you can roll out a new update slowly to some instances, in order to ensure it performs properly. If an issue occurs, it’s easy to roll back the changes. If all is well, you simply continue to deploy to the rest of the instances.
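To make the canary model concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how Cloud Foundry or BOSH implements rolling updates: the `canary_rollout` function, the in-memory `current` version map, and the `healthy` callback are all hypothetical stand-ins for real instance management and health checks.

```python
from typing import Callable, Dict, List


def canary_rollout(
    instances: List[str],
    current: Dict[str, str],
    new_version: str,
    healthy: Callable[[str, str], bool],
    canary_count: int = 1,
    batch_size: int = 2,
) -> bool:
    """Update a few canary instances first, then the rest in batches.

    Returns True if every instance ends up on new_version. If any updated
    instance reports unhealthy, every instance touched so far is restored
    to its previous version and False is returned.
    batch_size must be at least 1.
    """
    previous = dict(current)          # snapshot used for rollback
    updated: List[str] = []

    # The first batch is the canary set; later batches are batch_size wide.
    batches = [instances[:canary_count]]
    rest = instances[canary_count:]
    batches += [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]

    for batch in batches:
        for name in batch:
            current[name] = new_version
            updated.append(name)
            if not healthy(name, new_version):
                # Roll back everything changed so far, then stop.
                for touched in updated:
                    current[touched] = previous[touched]
                return False
    return True


if __name__ == "__main__":
    # Hypothetical instance names and version labels.
    versions = {f"cell-{i}": "old-stemcell" for i in range(1, 6)}
    ok = canary_rollout(list(versions), versions, "new-stemcell",
                        healthy=lambda name, version: True)
    print(ok, versions)
```

Because only a small batch changes at any moment, the untouched instances keep serving traffic, which is what makes a no-downtime update possible in this model.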
Canary deployments are essential for zero downtime updates. For the Meltdown fix—and other updates, for that matter—the Cloud Ops team rolls out the update gradually across “cells” (collections of containers).
A pipeline view of the phased PWS container deployment.
Cloud Foundry adopters use these deployment models daily, usually as part of a larger automated process. It’s also worth noting that these “paths to production” are indicative of a set of platform characteristics. For example, bundled monitoring, metrics, and logging keep your apps (and the platform itself) observable. Multiple layers of high availability deliver scalability and stability.
“MyPOV on how #PaaS systems like @cloudfoundry and CI/CD pipelines can eliminate patch firedrills: Meltdown and Spectre underscore the ongoing need for infrastructure automation.” (Kurt Marko, @krmarko, January 15, 2018, https://t.co/nZUicohaND)
4. Automation and Tooling
You need a frictionless path for new updates to get to production. If you have change control boards and go/no-go meetings, you’re not going fast enough.
One way to re-evaluate your deployment stack: start with the end goal in mind. For Pivotal’s Cloud Ops team, that means embracing Site Reliability Engineering practices.
An important part of SRE is the idea of a Service Level Indicator (SLI) - a metric that will tell you if users of your service are experiencing an issue. For Cloud Foundry, engineers have two core SLIs, each of which represents a fundamental user expectation for the platform.
- Is a well-understood, scaled test application routable? (Expectation: my app will stay up.)
- Are we able to cf push a test application? (Expectation: I will be able to update an existing app or bring up a new app.)
“Between them, these two indicators give us a high-fidelity signal representing the customer experience on our platform,” said Marie Cosgrove-Davies, Cloud Ops PM. “We collect a new data point every minute for each SLI, and continuously monitor the system with tools like Healthwatch. We also use them as inputs for our deploys. If an SLI starts to fall during a deploy, our pipeline will automatically pause.”
“This way, we can find and fix the root cause of failures before potentially rolling out a bad config or code bug to the full platform.”
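The pause-on-falling-SLI behavior described in the quote can be sketched in a few lines of Python. This is a simplified illustration, not the Cloud Ops pipeline or the Healthwatch API: `deploy_with_sli_gate`, the `read_sli` callback, and the 0.99 threshold are hypothetical assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple


def deploy_with_sli_gate(
    steps: Iterable[str],
    apply_step: Callable[[str], None],
    read_sli: Callable[[], float],
    threshold: float = 0.99,          # hypothetical success-rate floor
) -> Tuple[List[str], bool]:
    """Apply deploy steps in order, checking the SLI before each one.

    If the SLI reading is below the threshold, the deploy pauses: no further
    steps are applied. Returns (completed_steps, paused).
    """
    completed: List[str] = []
    for step in steps:
        if read_sli() < threshold:
            return completed, True    # pause before touching more of the platform
        apply_step(step)
        completed.append(step)
    return completed, False


if __name__ == "__main__":
    # Simulated SLI readings: healthy twice, then a drop.
    readings = iter([1.0, 1.0, 0.5, 1.0])
    done, paused = deploy_with_sli_gate(
        steps=["cells-a", "cells-b", "cells-c", "cells-d"],
        apply_step=lambda step: None,
        read_sli=lambda: next(readings),
    )
    print(done, paused)
```

Gating each step on a customer-facing signal limits how far a bad config or code bug can spread: at most the steps already applied are affected when the pause triggers.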
Cosgrove-Davies goes on to note the importance of observability.
“Knowing that our monitoring will alert us if the customer experience starts to degrade is a huge factor in Cloud Ops’s willingness to walk away from running deploys. If there’s customer impact, and there’s rarely customer impact, we know we’ll get paged.”
So what about those “pipelines” we’ve talked so much about? Enter Concourse.
Concourse is used extensively by Pivotal Cloud Foundry adopters for this scenario. Why? Concourse is ideal for:
- Teams that practice test-driven development with extensive automated testing
- Multiple code versions that must maintain compatibility
- Many code derivatives, such as multiple IaaS targets or configurations
- High-velocity teams that deliver code frequently
That’s why Corey Innis, staff software engineer for Cloud Ops, says, “Our pipeline config is the bill of materials for any given update to the platform. We can simply reconfigure, kick off the pipeline, and walk away.”
With the power of pipelines, complex automation becomes achievable.
So Advanced, It’s Simple
You can see how a DevOps culture, a modern platform, observability, and continuous integration start to fit together. And of course, you have a range of options when it comes to these four factors. But you’re going to need all four of them.
“Biggest story out of all this: applying the Meltdown fix is so trivial, we no longer find it impressive,” said Cosgrove-Davies. “We have the Spectre fix now, and just did the same process. The world will be a much less stressful place when every company can operate this way.”
Ready to think differently about how you update your systems? Try Pivotal Cloud Foundry! Check out Small Footprint, PCF Dev, or spin up a free trial on Pivotal Web Services. While you’re at it, check out our cloud-native security topic page and the whitepaper “Security and Compliance with Pivotal Cloud Foundry.”
About the Author
Jared Ruckle works in product at Pivotal.