Running Your Platform—Upgrade Much?

September 30, 2015 Simon Elisha

A common question for operations teams considering running a platform for the first time is: what do you do to the platform itself while all your apps are running on it? Be it patches, upgrades, routine maintenance, or the ever-present security updates—how do you do this easily and without downtime?

Pivotal Cloud Foundry has the answer!

TRANSCRIPT

Speaker 1:
Welcome to the Pivotal Perspectives podcast, the podcast at the intersection of agile, cloud, and big data. Stay tuned for regular updates, technical deep dives, architecture discussions, and interviews. Let’s join Pivotal’s Australia and New Zealand CTO, Simon Elisha, for the Pivotal Perspectives podcast.

Simon Elisha:
Hello friends, and welcome back to the podcast, so glad that you could join me again today. It’s been a little while, so I thought I’d catch up with some hopefully useful information. Today, I want to speak with you a little bit about managing Pivotal Cloud Foundry, and specifically what you do to the platform while you’re running it. This topic often comes up because people who are building or running their own platform don’t tend to think about it very much. As with many things, we get excited about the initial design and all the bells and whistles and the componentry, but the boring part that takes the majority of the time is running the platform: patching, checking for security vulnerabilities, upgrades, and maintenance. All that care and feeding that you need to do.

This is really important because it can consume a great amount of time if you’re not careful, or it can be so hard to do, and leave you with such a brittle platform, that you can’t make the changes you need to make when necessary. For example, if a security vulnerability comes out, you need to apply a fix almost immediately, but you can’t, because you’d break your whole platform. That’s not a great position to be in, and it’s certainly the antithesis of the DevOps ideal and the agile lifestyle that we’re all looking to lead. Fortunately, the people who build Pivotal Cloud Foundry think about this a great deal. They think about it all the time.

They do this because, firstly, they run a platform themselves in the form of Pivotal Web Services, but also because they serve customers who want to run a platform, either on the cloud of their choice or internally, with 24x7 requirements, very strong security requirements, and so on. So what do we do from a platform perspective? How do we manage the platform? We need to think about the platform first in terms of how it’s divided up, and the areas of concern. If we think about the application layer, obviously the application developer creates deployable artifacts that then get deployed into production or into development, depending on what you want to do, onto the Pivotal Cloud Foundry platform.

They’re responsible for determining how many instances of the application should exist, but the platform takes care of all the other stuff, like the load balancing and the routing and the health checking etc. This platform doesn’t exist in isolation. It’s not an entity that just magically appears. It is an underlying set of components that operate and run. This is where the operations team comes into play. This could be a separate operations team, it could be operators who are part of a project team, it really doesn’t matter, because the operational component is reasonably lightweight when it comes to Cloud Foundry.

Basically, the platform operations team is responsible for deploying the platform in the first place; defining the standard runtimes and services available to developers; monitoring the platform to ensure its ongoing health and correct behavior (although the platform itself does a lot of self-maintenance); scaling, to ensure there is sufficient capacity, neither over- nor under-capacity, in the environment; and, of course, upgrading the platform with zero downtime. That last one is the area I want to speak about in a little bit of detail. Why would you need to update your platform? The most obvious reason people think about is a new version coming out. Version 1.6 comes out. “I want to get all the goodness that it brings, so I need to update the platform.”

That’s very true, and a very viable change that needs to be made on a regular basis, and we’ll cover that shortly. But the more important one, and the more common one, is when you need to address some sort of security vulnerability. These typically come in the form of CVEs, or Common Vulnerabilities and Exposures: publicized, understood risks that exist in different components of code on the internet. Because these things come out on a regular basis, we want to patch for them on a regular basis. When you’re using Pivotal Cloud Foundry, we take care of a lot of that heavy lifting for you.

Our security team works constantly to monitor and understand the CVEs that come out, and which ones may be applicable to components within the platform. When they have a security vulnerability they want to fix, they release a new version of the affected components, the Elastic Runtime, for example. That gets released to the support website, which is network.pivotal.io, and customers can then download the new componentry from there. That will include stemcells and the other components that are required, bundled and packaged for you. Once you download the new tile, you install it via Ops Manager, so you just import it into Ops Manager. There’ll be links to all these things, of course, in the show notes, and you then apply that change to the system.

If you’re doing something that affects the underlying system, the Elastic Runtime service, for example, where we’re affecting the DEAs, or Droplet Execution Agents, and applying upgraded versions of operating systems and packages, this is done as a fully automated, rolling upgrade through the good offices of BOSH. We’ve spoken about BOSH previously; BOSH is working under the covers, and you don’t have to drive it yourself—Ops Manager does that for you. What it will do, in the case of the DEAs, is go through and do a canary deploy. The canary deploy is a very important concept.
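For those curious about what BOSH is doing under the covers: a BOSH deployment manifest describes this rollout behavior in its `update` block. The field names below are real BOSH manifest keys, but the specific values are illustrative examples, not the settings Ops Manager actually applies.

```yaml
# Illustrative "update" block from a BOSH deployment manifest.
# Field names are real; the values shown here are examples only.
update:
  canaries: 5                      # upgrade this many instances first, as canaries
  canary_watch_time: 30000-60000   # ms to wait for each canary to report healthy
  max_in_flight: 5                 # instances updated concurrently once canaries pass
  update_watch_time: 30000-60000   # ms to wait for each subsequent updated instance
```

If a canary fails its health check within the watch window, BOSH halts the deployment rather than rolling the change across the rest of the fleet.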

It’s analogous to the canary in the coal mine: miners used to go down with a canary in a cage, and because the canary’s lungs are much smaller, any gas leak that threatened the humans would kill the canary first, and the miners would get out of there very quickly. We like canary deploys because we try a little bit and make sure it’s okay, rather than having a disaster. In a software sense, this means that when we’re upgrading a distributed system, instead of upgrading everything at the same time, we do a rolling upgrade. We upgrade parts of it as we go, and the canary component means we only do a certain percentage of it up front, and then we ascertain whether the upgrade was successful. Let’s say we’ve got a fleet of 50 DEAs: we may choose to upgrade the first five as our canary deploy.

Should that deploy prove successful, which we hope it would, then the deployment continues, and we continue to roll out until we finish. If the deployment has failed, what it means is that we’ve got five nodes that we will analyze, and see what went wrong and do any rectification. Our platform is still up and running, and everything is still operational, because we have not done a complete roll through and broken everything. This is a very powerful concept that’s taken care of for you. The other part that happens here is we can do this in a zero downtime approach, because if you’re deploying cloud native applications they should be automatically deployed across multiple DEAs. As each individual DEA comes offline and online, the workload is redistributed automatically, and is taken care of automatically for you.
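To make the canary-then-roll pattern concrete, here is a small sketch in Python. This is a hypothetical model of the logic described above, not BOSH’s actual implementation; the function names and the `upgrade`/`healthy` callbacks are invented for illustration.

```python
# Hypothetical sketch of canary-style rolling-upgrade logic.
# This models the behavior described in the episode; it is NOT BOSH code.

def rolling_upgrade(nodes, upgrade, healthy, canary_count=5):
    """Upgrade `canary_count` nodes first; abort if any canary is
    unhealthy, otherwise roll the upgrade through the remaining nodes."""
    canaries, rest = nodes[:canary_count], nodes[canary_count:]

    for node in canaries:
        upgrade(node)
    if not all(healthy(n) for n in canaries):
        # Canary failed: stop here. Only the canaries need analysis,
        # and the rest of the fleet is still running the old version.
        return {"status": "aborted", "upgraded": canaries}

    for node in rest:
        upgrade(node)
    return {"status": "complete", "upgraded": canaries + rest}

# Example: a fleet of 50 DEAs where every upgraded node reports healthy.
fleet = [f"dea-{i}" for i in range(50)]
done = set()
result = rolling_upgrade(fleet, upgrade=done.add, healthy=lambda n: n in done)
print(result["status"], len(result["upgraded"]))  # complete 50
```

The key property is visible in the abort branch: a failed canary leaves at most five nodes to analyze while the remaining 45 keep serving traffic on the old version.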

The health checking takes place, the traffic is routed to DEAs that are still operational for the application, and when new DEAs come online, the application will spin up additional instances again. What this means, is that you can do significant roll through of patches and security updates without affecting your operational components. This becomes really exciting if you’re operating 24/7 and if you don’t want to have disruption. The other nice thing, is you’re moving out of the business of maintaining and managing operating systems. In fact, my colleague Andrew Clay Shafer often says no CEO ever said, “Hey, IT team, way to manage those operating systems this year.” It just doesn’t happen. It’s not a business you should be in any more.

We’ve taken it off your hands and said, “Treat the platform as a platform, and we’ll maintain the security posture of the underlying platform to best practice.” This means that your management component is now point and click. You’re basically saying, “I’ve got the new version I want to roll out, and I indicate when I want to make that change,” and the platform rolls it out with zero downtime. Of course, when you’re deploying a platform, you need to deploy it with redundancy, and there are instructions about how to deploy the various components of the platform in a resilient way. Really, that’s a case of setting how many instances of each component you want, and I’ll link to that in the show notes. Also, for certain components, there may be a small pause while they’re being updated, or you may not be able to perform a particular action.

What doesn’t change, is that the applications on the platform operate all the time. You can still monitor the applications, you can still manage those applications, they’re still resilient, they’re still operational, which is really really powerful. The other part of this is a convenience component. It means when a vulnerability is identified and the press is full of it and management is asking, “Are we at risk, are we vulnerable?” Once you’ve accepted the CVE update and rolled it through your environment, you can say, hand on heart, “Yes. We have solved the vulnerability, we have patched all those systems, we are good to go.”

That’s again done in an automated, rolling fashion. Imagine trying to hand-tool that and roll it yourself: recompile packages, distribute application components, distribute framework components, verify that every single component is changed. The other wrinkle you get, particularly if you’re cobbling together a platform from lots of different bits, is working out which upstream bits have been patched, which haven’t, which are lagging behind, whether you can update them all together, and whether you’re now running into compatibility issues. A nightmare scenario. What we do is take all the security fixes and roll them through our entire integration pipeline. Everything is verified and validated: nothing has broken, there are no regressions, and the latest patches are, in fact, there across the board.

The work is done for you, and then you simply have to deploy, which makes things a lot easier. Really, it’s about that self-service, instant-gratification type of approach, without overhead and without management complexity. This is why we see people who are operating the Pivotal Cloud Foundry platform spending a lot more time thinking strategically about which services they’d like to offer, how they’d like to scale the platform, where they want to deploy the platform, and where certain applications should run, rather than having to think about patching, compilation, integration, and other nuances. I hope that’s been useful and interesting to you. Again, I’ll link in the show notes to some information about how to upgrade the products and what the processes are.

Of course, whenever you’re doing an upgrade, what must you do first? Back up your environment, and we give you instructions about doing that as well. Hopefully that was useful, and certainly, if your platform can’t do all this, you need to give it some thought. Until next time, keep on building.

Speaker 1:
Thanks for listening to the Pivotal Perspectives podcast with Simon Elisha. We trust you’ve enjoyed it and ask that you share it with other people who may also be interested. We’d love to hear your feedback, so please send any comments or suggestions to podcast@pivotal.io. We look forward to having you join us next time, on the Pivotal Perspectives podcast.

About the Author

Simon Elisha is CTO & Senior Manager of Field Engineering for Australia & New Zealand at Pivotal. With over 24 years’ industry experience in everything from mainframes to the latest cloud architectures, Simon brings a refreshing and insightful view of the business value of IT. Passionate about technology, he is a pragmatist who looks for the best solution to the task at hand. He has held roles at EDS, PricewaterhouseCoopers, VERITAS Software, Hitachi Data Systems, Cisco Systems and Amazon Web Services.
