Root Cause of an Application Outage on Kubernetes, and How We Fixed It

February 7, 2019

A lot of the work I do with customers centers around helping them become successful, as quickly as possible. For some, that's helping install the PKS platform, or automating build pipelines. For others, it’s just helping containerize applications, and getting them ready to run on Kubernetes (and all that entails).

Outages happen to all applications—it’s just a part of running imperfect software and imperfect humans. This is the story of an outage one application took while running on Kubernetes, how we determined the root cause, and how we fixed it.

The Setup

This application is a scale-out Java application, running on-premises atop Pivotal Container Service (PKS). It receives requests from outside K8s, makes appropriate database calls (against a DB also outside K8s), and returns results. The app is accessed via an Ingress, and has a standard Service endpoint. The one goofy thing about this application is that it does a ton of processing upon startup. It warms up its cache and performs a slew of other tasks. That's about 20-24 minutes' worth of startup time before it can take requests. That's fine: we can deal with that via liveness and readiness checks.
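To make the distinction between the two checks concrete, here is a minimal sketch in Python (the real application is Java, and none of these names come from it). It assumes a warm-up of 22 minutes, the midpoint of the range above, and models readiness as purely time-based; a real app would more likely flip a flag once its cache is actually warm. The point is only that a Pod can be alive, and so not restarted, while still not ready, and so not sent traffic.

```python
from dataclasses import dataclass

# Assumption: midpoint of the 20-24 minute startup described in the text.
WARMUP_SECONDS = 22 * 60


@dataclass
class App:
    """Hypothetical stand-in for the application's health endpoints."""

    started_at: float  # seconds on any monotonic clock

    def liveness(self, now: float) -> int:
        # The process is up, so report healthy even during warm-up;
        # a failing liveness check would get the container restarted.
        return 200

    def readiness(self, now: float) -> int:
        # Not ready until warm-up has finished, so the Pod stays out of
        # the Service's endpoints and receives no traffic.
        return 200 if now - self.started_at >= WARMUP_SECONDS else 503


app = App(started_at=0.0)
for minute in (0, 10, 21, 22, 30):
    now = minute * 60.0
    print(f"t={minute:>2} min  liveness={app.liveness(now)}  readiness={app.readiness(now)}")
```

During the whole warm-up window the Pod counts as running, which matters later in this story.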

It ran fine for weeks, scaled out with a Deployment to about 30 Pods.

The Challenge

The underlying cluster (about eight nodes) needed to be upgraded from PKS 1.2.6 to 1.3. Now, with PKS, that's a mostly automated process that usually goes something like this:

Initiate the upgrade;
BOSH does the rest:

Node by node:
- Drain + cordon the Node
- Delete the node from the IaaS
- Create new Node from new image on IaaS
- Add node to cluster

In PKS, each node really only takes about 3-4 minutes to cycle through these steps. It’s all handled by the system.

The Event

About 25 minutes after the upgrade was started, the application monitoring systems started noticing that transactions were failing. Over 95% of requests to the application were timing out. After about 27 minutes, the application was entirely unresponsive.

Kubernetes should have rescheduled those Pods onto other nodes as each node was drained, and restarted them there. The process should have been invisible to end users.

How did this happen?

The Root Cause

The core of the problem turned out to be the aforementioned 20-24 minute startup time. Each of the 8 nodes in the cluster ran about 3 of the Pods in the Deployment. When the first Node was drained and its ~3 Pods evicted, they restarted on other Nodes. But those rescheduled Pods needed to perform their own 20-24 minute startup sequence before they could serve traffic again.

After 3-4 minutes of working on the first node, the upgrade moved on to the second node, which experienced the same fate. Its ~3 Pods were evicted and began their startup sequence elsewhere. Even worse: some of the Pods on the second node were ones carried over from the first node that still hadn't finished starting, and they just got killed again.

Roll through the entire cluster at that pace, and you end up in a situation where all the Pods are running, just as the Deployment specifies, but none of them are ready to accept traffic, so none of them are behind the Service or Ingress yet. This resulted in the complete failure of the application.
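The cascade is easier to see with a toy simulation. The sketch below is not how PKS or the Kubernetes scheduler actually works; it is a simplified model built on assumptions I picked from the numbers in this post: 8 nodes, 30 Pods spread round-robin, one node drained every 3 minutes, a 22-minute startup, each replaced node back (and empty) by the time the next drain starts, and evicted Pods landing on the least-loaded remaining node.

```python
NODES = 8
PODS = 30
DRAIN_INTERVAL = 3  # minutes per node (the text says 3-4); assumption
STARTUP = 22        # minutes until a Pod is ready (the text says 20-24); assumption


def simulate(nodes=NODES, pods=PODS, interval=DRAIN_INTERVAL, startup=STARTUP):
    """Return the number of ready Pods at each minute of a rolling drain."""
    # pod -> (node, minute at which it becomes ready); everything starts ready.
    placement = {p: (p % nodes, 0) for p in range(pods)}
    timeline = []
    for minute in range(nodes * interval + startup + 1):
        if minute % interval == 0 and minute // interval < nodes:
            draining = minute // interval
            evicted = [p for p, (n, _) in placement.items() if n == draining]
            for p in evicted:
                # Count Pods on every node except the one being drained.
                load = {n: 0 for n in range(nodes) if n != draining}
                for n, _ in placement.values():
                    if n in load:
                        load[n] += 1
                # Reschedule onto the least-loaded node and restart warm-up.
                target = min(load, key=lambda n: (load[n], n))
                placement[p] = (target, minute + startup)
        ready = sum(1 for _, ready_at in placement.values() if ready_at <= minute)
        timeline.append(ready)
    return timeline


timeline = simulate()
for minute in range(0, len(timeline), DRAIN_INTERVAL):
    print(f"t={minute:>2} min  ready Pods: {timeline[minute]:>2} / {PODS}")
```

Under these assumptions the ready count drops with every drain and bottoms out at zero once the last node is drained, even though all 30 Pods are scheduled and running the entire time.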

The Fix

Thankfully, there is a technique in Kubernetes that prevents this. What we want to be able to say is, "Don't drain this node if the overall application would be overly impacted." This is measured (generally) as a disruption budget—that is, how many components of this application can be down simultaneously without impacting our SLA?

For this particular application, the answer is about 30%. We could suffer a loss of about 30% of the Pods without a significant negative impact.

The way that Kubernetes can handle this problem is with a Pod Disruption Budget, or PDB, where we can declare our value, and apply it to any given set of selectors.

In this case, the fix was to create a PDB that looks something like this:

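# policy/v1beta1 was the PDB API version at the time of writing;
# Kubernetes 1.21 and later serve it as policy/v1.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  # At most 30% of the matching Pods may be voluntarily disrupted at once.
  maxUnavailable: 30%
  selector:
    matchLabels:
      app: myappname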

With this in place, when a drain starts, each eviction is checked against the Pod Disruption Budget, which looks at the current state of all the Pods matching that selector (app: myappname). The upgrade workflow will then wait to drain the node until enough Pods are ready that evicting more of them stays within the maxUnavailable value of 30%.
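The arithmetic behind that check can be sketched in a few lines of Python. This is my simplified reading of how the budget is evaluated, not the controller's actual code: the percentage is converted to a Pod count (rounded up), subtracted from the expected replica count to get the number of Pods that must stay healthy, and an eviction is allowed only while the healthy count is above that floor. The function name and parameters are made up for illustration.

```python
def allowed_disruptions(expected_pods: int, healthy_pods: int, max_unavailable_pct: int) -> int:
    """How many more Pods could be evicted right now under a percentage maxUnavailable."""
    # Convert the percentage to a Pod count, rounding up (integer ceiling).
    max_unavailable = (expected_pods * max_unavailable_pct + 99) // 100
    # This many Pods must remain healthy at all times.
    desired_healthy = expected_pods - max_unavailable
    # Evictions are allowed only while we are above that floor.
    return max(0, healthy_pods - desired_healthy)


# The Deployment in this story: 30 Pods with maxUnavailable: 30%.
for healthy in (30, 26, 21, 18):
    print(f"{healthy} healthy Pods: {allowed_disruptions(30, healthy, 30)} evictions allowed")
```

With 30 Pods and a 30% budget, the floor is 21 healthy Pods. Once the drains have pushed the ready count down to 21, further evictions are refused until earlier Pods finish their warm-up, so the upgrade slows to the pace the application can tolerate.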

Conclusion

Fortunately, in this case, by the time the last node was upgraded and replaced, the first node's Pods were nearly complete with their startup. The outage in total only lasted about 14 minutes.

It was a fairly inexpensive lesson for the team and we’re now doing a few things differently:

- Larger scale destructive testing: Rather than the basic tests that were tried, like killing a couple Pods manually, the team now is developing tests that include entire AZ failures and upgrades in the staging environment as part of their release QA process.
- Documenting the best practices like PDBs in a central place.
- Figuring out how to reduce the startup time of the application.

Overall, this was a good experience for this customer to understand some of the features Kubernetes includes to keep applications in proper working order, and how multiple layers of a stack can influence each other in unexpected ways.
