For well over 100 years, engineers have utilized wind tunnels to study the effects of aerodynamic forces on airplanes, buildings, automobiles, and other structures. This testing is conducted as part of a full suite of analyses intended to demonstrate, among other things, the ability of these structures to withstand various forms of stress.
As part of our discussion of the different approaches to building container-based cloud application platforms, we designed an experiment analogous to applying a wind tunnel to Cloud Foundry, utilizing Pivotal Web Services (PWS). PWS is a publicly available instance of Cloud Foundry operated on Amazon Web Services by Pivotal’s Cloud Operations team and used as the production environment for continuous delivery by Pivotal’s Cloud Foundry engineering teams. Not only does PWS support our efforts to continuously drive higher quality into the Cloud Foundry platform, but it also provides a powerful production runtime platform for many of our customers.
As part of this experiment a nascent codebase was created that allows us to do two important things:
- Using a Spring Boot Java application with a built-in “kill switch” endpoint, simulate the crash of n application instances in parallel;
- Submit n parallel requests to a “health check” endpoint on the same application.
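The application's source isn't reproduced here, but the two endpoints can be sketched with nothing beyond the JDK. The real test app is a Spring Boot application; the class name and port handling below are illustrative, and only the "/health" and "/killswitch" paths come from the experiment:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;

public class WindTunnelTarget {

    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);

        // Liveness probe: Wind Tunnel polls this with n parallel requests.
        server.createContext("/health", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });

        // Kill switch: acknowledge the request, then crash this instance so
        // the platform's health manager has to notice and restart it.
        server.createContext("/killswitch", exchange -> {
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
            new Thread(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
                System.exit(1); // hard crash, not a graceful shutdown
            }).start();
        });

        server.start();
        return server;
    }
}
```

The kill switch deliberately delays the `System.exit(1)` for a moment so the HTTP response flushes before the process dies; the caller sees a 200, then the instance vanishes.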
We have now christened this codebase Platform Wind Tunnel. Over time this tool will be extended such that it will not only attack application instances, but also the underlying platform (e.g. by randomly powering off container hosting machines). The eventual goal will be to provide Netflix Chaos Monkey-like services to help evaluate cloud application platforms, as well as to stress production applications!
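The tool's core mechanic, firing n requests in parallel and tallying the responses, can be sketched as follows (the class and method names here are ours, not Wind Tunnel's actual API):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelProbe {

    // Submit n GET requests to the given URL at once and report how many
    // came back with HTTP 200. Pointed at "/health" this is the liveness
    // check; pointed at "/killswitch" it becomes the attack.
    public static long countOkResponses(String url, int n) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();

        // Fire all n requests asynchronously before waiting on any of them.
        List<CompletableFuture<HttpResponse<String>>> inFlight = IntStream.range(0, n)
                .mapToObj(i -> client.sendAsync(request, HttpResponse.BodyHandlers.ofString()))
                .collect(Collectors.toList());

        return inFlight.stream()
                .map(CompletableFuture::join)
                .filter(response -> response.statusCode() == 200)
                .count();
    }
}
```

Issuing all sends before joining any of them is what makes the requests genuinely concurrent rather than sequential.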
Over the next few weeks, we’re planning to deploy our test application to multiple cloud application platforms, including Google App Engine, Amazon Elastic Beanstalk, Red Hat OpenShift, and Cloud Foundry’s next-generation runtime layer, Diego. We’ll then perform the same wind tunnel experiments on each of these platforms and share our observations with you. These will be non-scientific, point-in-time tests, and should not be used to set expectations of general results you can expect from these platforms. However, it’s our hope that this will provide users with a better understanding of how the existing ecosystem of platforms aligns with our vision of automated operations at scale.
We’ll begin this week by testing out Heroku. Founded in 2007 as a cloud platform for hosting Ruby on Rails applications, Heroku was one of the first cloud application platforms; it eventually added support for multiple language runtimes by developing buildpacks (an approach since adopted by Cloud Foundry), and was acquired by Salesforce.com in 2010. In order to deploy our test application to Heroku, we first had to convert it from a Spring Boot CLI application to a standard Spring Boot application with a Maven pom. This conversion was completed on a branch, and included changing the diagnostic output to reflect the information available to a running Heroku application.
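The converted project itself isn't reproduced here, but a minimal Maven pom for that kind of conversion looks roughly like the following (the coordinates and version are illustrative, not the project's actual pom):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>io.pivotal.windtunnel</groupId>    <!-- illustrative coordinates -->
  <artifactId>wind-tunnel-target</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.1.8.RELEASE</version>
  </parent>
  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- repackages the jar as an executable "fat" jar -->
      <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
      </plugin>
    </plugins>
  </build>
</project>
```

Heroku also expects the app to bind to the port it assigns via the `PORT` environment variable, typically wired up with a `Procfile` entry along the lines of `web: java -jar target/wind-tunnel-target-0.0.1-SNAPSHOT.jar --server.port=$PORT` (jar name illustrative).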
As we continue the Wind Tunnel series, we’ll integrate these changes into an application which will detect the target platform from its runtime environment and display the appropriate information.
Heroku applications are deployed to Linux containers called “dynos.” Upon pushing changes to the appropriate Git remote, Heroku detects the application type, selects the appropriate buildpack, and then builds the application “slug.” Slugs are compressed and pre-packaged copies of your application that are optimized for distribution to Heroku’s dyno manager. A copy of this slug is deployed inside the dyno. In this first test, we’ll pick up after this staging process has completed.
Heroku limits its customers to 100 dynos per application by default. We begin the analysis by timing the start of 25 instances, followed by a 4X increase to the maximum of 100 instances.
Heroku App Instance Scaling Test
Cloud Foundry App Instance Scaling Test
As you can see, Heroku’s scaling performance is quite good and compares extremely well with that of Cloud Foundry:
| | Heroku | Cloud Foundry |
|---|---|---|
| Scale 1 to 25 | 34 seconds | 48 seconds |
| Scale 25 to 100 | 153 seconds | 68 seconds |
One anomaly related to Heroku scaling that you’ll notice in the video is that we get to 99 running instances in 58 seconds, but getting the 100th instance running takes an additional 95 seconds. This anomaly has been consistently duplicated over many test runs.
Application Health Management
Heroku App Health Management Test
Next we’ll test Heroku’s ability to heal crashed applications. We’ll start where we left off, with 100 instances of our application running. We then use Wind Tunnel to submit 25 concurrent requests to our test application’s “/killswitch” endpoint, causing 25 of our application instances to crash (theoretically there’s no guarantee we’ll get routed to 25 unique instances given Heroku’s random routing policy, but the probability is pretty good).
We’re looking for two primary behaviors:
- HTTP clients will be unaffected by harshly removing a large percentage of capacity, and requests will not be routed to failed instances.
- Application crashes will be quickly and automatically remediated.
We test for the first behavior by using Wind Tunnel to repeatedly submit 200 concurrent requests to our application’s “/health” endpoint. As you can see in the video, all 200 of our requests repeatedly receive HTTP 200 OK responses, so Heroku passes this test.
Heroku is also able to heal 23 of our crashed instances within 34 seconds. The final two instances are in very different states:
- Dyno 89 seems to be stuck in the “restarting” state. After an additional 88 seconds passes, the logs appear to show Heroku toggling the dyno from “starting” to “down” and back to “starting.”
- Dyno 34 is in the “crashed” state. Further investigation of the logs shows that the application crashed during Heroku’s attempt to restart it, due to some type of network failure. According to Heroku’s Dyno Crash Restart Policy, this dyno is now in a 10 minute cool-off period, so another restart attempt won’t come until that period expires.
Cloud Foundry App Health Management Test
We finally attempt the same exercise on Cloud Foundry. Interestingly enough, even with “random routing,” 25 “/killswitch” requests repeatedly terminate only 15 to 20 instances of the application. Some basic log analysis using Papertrail shows that the router often selects the same application instance to serve more than one of these requests before that instance has crashed. Expect to hear more about this as we do a deeper exploration of gorouter’s randomness.
As you can see in the video, Cloud Foundry also passes our HTTP client test, as all 200 of our requests repeatedly receive HTTP 200 OK responses.
Cloud Foundry is also able to heal all 15 of our crashed instances within 86 seconds, with no “stuck” anomalies. So, while Cloud Foundry’s per-instance healing in this test is roughly 4X slower than Heroku’s (about 5.7 seconds per instance versus about 1.5), 86 seconds is still a really fast recovery for 15 app instances. As we’ve said before, imagine how long it would take to manually restore these app instances!
So, to summarize the health management story:
| | Instances Healed | Time Elapsed |
|---|---|---|
| Heroku | 23 | 34 seconds |
| Cloud Foundry | 15 | 86 seconds |
While Heroku and Cloud Foundry perform very similarly with respect to app instance scaling and health management, there are some interesting policy differences with respect to “cooling off” periods for applications that repeatedly crash. For reference, here is Heroku’s restart policy (from https://devcenter.heroku.com/articles/dynos#dyno-crash-restart-policy):
If a dyno crashes within the first 10 minutes of launch, Heroku will immediately attempt to restart it. If a dyno again crashes within 10 minutes of starting, Heroku will continue to attempt to restart it, but the attempts will be spaced apart by increasing intervals. If the second restart attempt results in the dyno crashing within 10 minutes of start, there will be a 10 minute cool-off period before the third attempt. If the third restart attempt fails, there will be a 20 minute cool-off period, followed by a 40 minute cool-off period and so forth up to a maximum 24 hour cool-off period between restart attempts.
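Read literally, that schedule can be written down as a small function. This is our encoding of the quoted policy, not code from Heroku; the policy doesn't give exact numbers for the second attempt's “increasing intervals,” so it is modeled as immediate here:

```java
public class HerokuRestartPolicy {

    // Cool-off (in minutes) before the given restart attempt, per the
    // quoted policy: attempts 1 and 2 are effectively immediate, attempt 3
    // waits 10 minutes, and each later attempt doubles the wait, capped at
    // 24 hours.
    public static long cooloffMinutes(int restartAttempt) {
        if (restartAttempt <= 2) {
            return 0; // "immediately" / unspecified short intervals
        }
        long minutes = 10L << (restartAttempt - 3); // 10, 20, 40, ...
        return Math.min(minutes, 24 * 60);          // 24-hour ceiling
    }
}
```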
Understanding this, in one experiment, we repeatedly used Wind Tunnel’s kill function to get an application into a state where all instances are crashed and remain crashed for some time t < 10 minutes, where t is equal to the time remaining for the oldest crashed process. We can remediate this state by either scaling to num_instances + 1, or by scaling down to one instance followed by scaling to n. The latter scenario requiring two operations happens because, if we scale down to 1 instance that was previously crashed, that instance remains crashed. Scaling up to two has the effect of starting both the new instance and the crashed instance.
Interestingly enough, the same holds in one further scenario: if we scale down to one instance, that instance was previously in a crashed state, and Heroku subsequently “idles” it (which it does for apps running only one dyno), then Heroku will not bring it out of the idle state (as it normally does when requests arrive for an idled application). We have to scale to two to get it back.
Cloud Foundry’s Health Manager, HM9000, takes a less conservative approach to its crash restart policy:
- Instances are restarted immediately after each of their first three crashes.
- For each subsequent crash, a 30-second cool-off period is applied.
- An exponential back-off is applied, up to a maximum 16-minute cool-off period between restart attempts.
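HM9000's schedule can be encoded the same way. This is our reading of the bullets above, with the 30-second cool-off doubling on each crash until it reaches the 16-minute ceiling:

```java
public class Hm9000RestartPolicy {

    // Cool-off (in seconds) after the given crash: the first three crashes
    // restart immediately, then a 30-second cool-off doubles on each
    // subsequent crash, capped at 16 minutes. The exact doubling sequence
    // is our interpretation of "exponential back-off".
    public static long cooloffSeconds(int crashCount) {
        if (crashCount <= 3) {
            return 0; // restarted immediately
        }
        long seconds = 30L << (crashCount - 4); // 30, 60, 120, ...
        return Math.min(seconds, 16 * 60);      // 16-minute ceiling
    }
}
```

Under this reading, a repeatedly crashing instance reaches the 16-minute ceiling far sooner than Heroku's 24-hour one, which is why we call HM9000's policy the less conservative of the two.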
Overall, Heroku aligns quite well with our vision of automated scaling and recovery. Heroku’s app instance scaling and health management performance are comparable with that of Cloud Foundry. And while Heroku takes a more conservative approach to crash restart policy, its overall behavior is extremely consistent with that policy.
We hope this article has been useful to help educate you about the ecosystem of available cloud application platforms, and we look forward to providing you with further analyses. Over time we’ll continue to refine our wind tunnel and in turn use the results to improve Cloud Foundry. For us this journey has only just begun. Stay tuned for more episodes of Platform Wind Tunnel!
About the Author: Matt Stine