A friend of mine just got back from a leadership offsite where one of the “fun” activities involved teams rowing a boat. As one might expect, the final race produced mixed, and somewhat hilarious, results. Some teams found themselves literally going in circles. Others had a promising start, but faded when weaker rowers couldn’t sustain the rhythm. And a couple of teams made it across the finish line thanks to synchronized strokes. What drove success was the performance of the entire unit, together. The same goes for benchmarking a platform.
Case in point: the Cloud Foundry team just wrapped up a performance test that sustained an astonishing 250,000 containers in one environment. What was more impressive was that such a benchmark depended on reliable performance from the entire platform.
Cloud Foundry is the sum of many parts. It’s not just a premium container runtime, but also a collection of subsystems for dynamic routing, app staging, service mediation, log aggregation, and much more. Performance testing any one of those platform components in isolation wouldn’t be representative of the real world. A useful performance test of anything calling itself a platform requires flexing all the components that the customer depends on during the application lifecycle.
With that non-negotiable criterion in mind, any worthwhile platform performance test has to do two things: run a mix of applications with different sizes and stability profiles, and sustain the application load while simulating day-to-day operations. Otherwise, the results can’t be transferred to your real-life scenarios. You should be confident that you can spend all your time above the “value line” while your platform transparently handles increasing load. The goal of our Cloud Foundry test: prove that 250,000 real-life application containers could be deployed and maintained in a single environment for a sustained period with minimal user interaction.
For our Cloud Foundry test, we designed multiple modes for the same example app. Some app instances requested lots of memory, a few crashed at random intervals, and others made lots of outbound requests. Why does this mix matter? Because this is what your production environment probably looks like! We constructed batches of 25 application instances that mixed these modes. To conduct the performance test, we pumped 10,000 of these batches into the Cloud Foundry cluster.
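To make the arithmetic concrete, here is a small sketch of the batch math. The post doesn’t specify the per-mode breakdown inside each 25-instance batch, so the mix below is purely hypothetical; only the 25-per-batch and 10,000-batch figures come from the test.

```python
# Hypothetical composition of one 25-instance batch; the real
# per-mode breakdown isn't specified in the post.
BATCH_MIX = {
    "baseline": 13,        # plain app instances
    "memory_hungry": 6,    # instances requesting lots of memory
    "crashing": 3,         # instances crashing at random intervals
    "outbound_heavy": 3,   # instances making lots of outbound requests
}

INSTANCES_PER_BATCH = sum(BATCH_MIX.values())
BATCHES = 10_000

total_instances = INSTANCES_PER_BATCH * BATCHES
print(INSTANCES_PER_BATCH)  # 25
print(total_instances)      # 250000
```

Whatever the actual mix, the invariant is that each batch sums to 25 instances, so 10,000 batches yield the 250,000-container target.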
During the test, we paused at certain intervals—5,000, 50,000, and 250,000 containers—to measure utilization and scale the environment. As the size of the cluster increased, it was vital to keep validating that container scheduling and deployment times weren’t dramatically impacted. We saw the “time to stage and start all the apps” metric go from about 18 seconds in the first batch of 10,000 apps to about 4.5 minutes in the last batch of apps. Why did that happen? The Diego auction—which decides which cell should host the app—still only took 2 seconds during the last batch of 10,000 apps. But we did see a performance impact from containerizing these final workloads on busy machines. This wasn’t a surprise given the cell density, and is something you should expect with containerized platforms.
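A quick back-of-the-envelope check of those measurements shows why container creation, not scheduling, drove the slowdown. The figures below are the ones reported above; nothing else is assumed.

```python
first_batch_s = 18        # ~18 s to stage and start the first batch of 10,000 apps
last_batch_s = 4.5 * 60   # ~4.5 minutes for the last batch
auction_s = 2             # Diego auction time, roughly constant throughout

slowdown = last_batch_s / first_batch_s
print(f"end-to-end slowdown: {slowdown:.0f}x")                       # 15x
print(f"share of last batch spent in the auction: {auction_s / last_batch_s:.1%}")
```

The end-to-end time grew about 15x, yet the auction accounted for under 1% of the final batch’s time, which is consistent with container creation on densely packed cells being the bottleneck.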
The final cluster size needed to support 250,000 app containers: 1,250 “Diego Cell” nodes running applications; 305 nodes running core platform components (50 edge routers, 80 load balancers, databases, and log aggregators); and 28.5 TB of allocated memory. We also made sure to fully test the application instance lifecycle, so the cluster easily handled 10,000 inbound app requests per second, and successfully rescheduled all the intentionally crashed application instances. In the image below, notice that we always had close to 250,000 “long-running processes” (application instances), with thousands always intentionally crashing and coming back online automatically.
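Taking those sizing figures at face value, the steady-state load per component works out as follows. All inputs come from the numbers above; the per-node averages are simple division and assume an even spread across nodes.

```python
requests_per_s = 10_000   # sustained inbound app requests per second
edge_routers = 50         # edge routers in the 305 core-platform nodes
cells = 1_250             # "Diego Cell" nodes running applications
instances = 250_000       # long-running application instances
allocated_tb = 28.5       # total allocated memory, in TB

print(requests_per_s / edge_routers)   # ~200 requests/s per edge router
print(instances / cells)               # ~200 app instances per cell
print(allocated_tb * 1024 / cells)     # ~23 GB of allocated memory per cell
```

Averages like these hide hot spots, of course, but they give a feel for the density each cell and router sustained during the test.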
We purposely did a staggered deployment in order to simulate reality. Nobody starts with an empty platform and immediately fills it to the gills. That doesn’t prove anything! Rather, it’s important to fill it gradually, and see that the running (and crashing!) apps don’t wobble as new workloads are deployed to the platform. And that’s exactly what happened here. Not only did the Diego container scheduler handle this load like a champ, but the rest of Cloud Foundry proved up to the task.
This entire performance test ran atop Google Compute Engine. Pivotal Cloud Foundry is fully supported on the Google Cloud Platform (GCP), which was an ideal choice for such an intensive benchmark. Between extremely fast virtual machine provisioning times and pre-warmed load balancers, GCP made it possible to stand up this massive environment in just hours, and to quickly scale up without any pre-planning. Pivotal will continually run performance tests across all PCF-supported IaaS platforms to give our customers confidence that their preferred cloud will handle their growth.
We definitely learned a few things during this test. Our initial sizing estimates were almost spot on. The final cluster only needed half (50) of the planned number of edge cloud controllers, but we ended up using more RAM (60GB) than expected on the relational databases that the cloud controllers depended on. As you might expect, the cell density impacted how fast Garden created containers (see above), but Diego’s scheduling decisions were fast regardless of cluster size. Finally, we saw that this test stressed our Loggregator subsystem, but it collected the vast majority of logs that came its way.
Cloud Foundry has proven to be among the most scalable platforms available. But reaching 250,000 containers in production isn’t the headline here. What mattered was that a unified platform—made up of container scheduling, infrastructure management, log aggregation, traffic routing, app deployment, and more—gracefully scaled to support an unprecedented workload. That should give you confidence that Cloud Foundry is ready for you, regardless of where you are on your software journey.