SLIs and Error Budgets: What These Terms Mean and How They Apply to Your Platform Monitoring Strategy

June 1, 2018 Amber Alston

This is the first post in a series about monitoring distributed systems. We introduce several important concepts for readers who may be newer to the topic.

Modern software engineering teams seem to have their own language. Two terms that folks seem to be using a lot lately? Service Level Indicators (SLIs) and Error Budgets.

We use these terms a lot within Pivotal, and with our customers. Based on dozens of conversations, we thought it might be useful to put together a primer of sorts. We wanted to define what these terms really mean when we talk about platform observability and management goals.

Terminology

Let’s start with some definitions.

KPI (Key Performance Indicator). A given metric, usually in the form of a counter or gauge value. The metric helps convey the health/status, performance, or usage of a given component or a set of related components.

SLI (Service Level Indicator). A “derived result measurement” from a purposeful validation test. An SLI has the goal of confirming that a specific, high-value user workflow is both available and acceptably performant for your end-users. You can think of an SLI as a measurement of your user’s expectations. For example, the end-users of my API would expect it to be available and return the requested response within 10 seconds. If my API fails to respond (or responds more slowly), my end-users will be unhappy with my API service.
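
To make that concrete, here is a minimal sketch of such a check in Python; the endpoint URL is a placeholder, and the 10-second threshold comes from the expectation above (this is an illustration, not any particular product’s implementation):

    # Minimal SLI check sketch: pass only if the API responds successfully
    # AND within the 10-second expectation stated above.
    import time
    import urllib.request

    SLI_ENDPOINT = "https://api.example.com/orders"   # hypothetical endpoint
    LATENCY_THRESHOLD_SECONDS = 10

    def check_sli() -> bool:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(SLI_ENDPOINT, timeout=LATENCY_THRESHOLD_SECONDS) as response:
                ok = 200 <= response.status < 300
        except Exception:
            ok = False  # unreachable, errored, or timed out counts as a failed check
        return ok and (time.monotonic() - start) <= LATENCY_THRESHOLD_SECONDS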

Measurements of user expectations should be written in plain, easily understood language and agreed upon by the entire team. Further, each SLI should be backed by something that can measure it programmatically.

SLO (Service Level Objective). A threshold you establish for your defined SLI, i.e. the percentage of your SLI checks that must pass for your users to be generally satisfied with your service. When defining your SLO target, it helps to think about what happens when your service doesn’t meet its defined SLI. If your SLI measures an internal-facing business enablement service, where brief outages may be more acceptable, you can potentially choose a lower SLO target than for a service used in a popular customer-facing application.

Error Budget. Directly related to your target SLO percentage, your Error Budget represents the quantified amount of downtime, or lowered performance, that you are willing to allow within a rolling 30-day window. Each time your SLI check fails, you consume some of your allowed error budget. (See below for a reference chart showing how an SLO Target Percentage maps to an Error Budget value.)
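
As a quick illustration of the accounting (assuming the SLI check runs once per minute, a simplification for this sketch):

    # Error budget accounting sketch: a 99.9% SLO over a rolling 30-day
    # window allows 43.2 minutes of failure (see the reference chart below).
    ERROR_BUDGET_MINUTES = 43.2

    # With one SLI check per minute, each failed check consumes roughly
    # one minute of the budget.
    failed_check_minutes = 12
    remaining_minutes = ERROR_BUDGET_MINUTES - failed_check_minutes
    print(f"Consumed {failed_check_minutes} min; {remaining_minutes:.1f} min of budget remain")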

By now, you might wonder how all these terms relate to each other. Let’s examine how a platform engineer may use these terms in the real world.

Example: An SLI for Pivotal Application Service (PAS), part of PCF

This is a Service Level Indicator used by an internal PCF Ops team.

SLI: As an App Developer, I expect to be able to successfully CF Push my app within 2 minutes.

Availability Measure  | SLO Target | Error Budget per 30 days | Unit of Measure
CF Push Availability  | 99.9%      | 43.2 minutes             | Success/Fail metric from PCF Healthwatch
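
PCF Healthwatch emits this success/fail result for us, but purely as an illustration of what such a check involves, here is a sketch that drives the cf CLI directly (the app name and path are hypothetical):

    # Illustrative CF Push SLI check: push a small canary app and count the
    # attempt as a pass only if `cf push` succeeds within 2 minutes.
    import subprocess

    def cf_push_sli(app_name: str = "sli-canary-app", app_path: str = "./sli-canary") -> bool:
        try:
            result = subprocess.run(
                ["cf", "push", app_name, "-p", app_path],
                capture_output=True,
                timeout=120,  # the 2-minute expectation in the SLI above
            )
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False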
 
We monitor the SLO over a rolling 30-day window, then check how actual performance measures up to our target (a sketch of this decision appears after the list):
  • If availability is *above* our SLO target - yay! We release new features.

  • If availability is *below* our SLO target - we halt releases and focus on reliability.
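
A sketch of that decision, treating measured availability over the window as the fraction of SLI checks that passed:

    # Release-gate sketch for a rolling 30-day window.
    def deployment_decision(passed_checks: int, total_checks: int, slo_target: float = 0.999) -> str:
        availability = passed_checks / total_checks
        if availability >= slo_target:
            return "above SLO: release new features"
        return "below SLO: halt releases and focus on reliability"

    # e.g. 43,150 of 43,200 one-minute checks passed in the last 30 days (~99.88%)
    print(deployment_decision(43_150, 43_200))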

Deploying new code, or an environmental configuration change, into production always carries some inherent risk, no matter how well the change has been tested. As long as we are within our SLO target, we have created the space to accept some level of downtime risk in exchange for the benefits of deploying desired features.

SLIs or KPIs? It’s Really SLIs *AND* KPIs

SLI monitoring is a more meaningful measure of user impact. That does not mean, however, that you should stop monitoring the critical metrics your system emits about its own performance. Key performance indicators (KPIs) can be important for deeper troubleshooting. They are also useful signals for decisions like when to increase resources for a given component.

Think about the difference between KPIs and SLIs this way:

KPIs often change as the system changes. If your underlying system architecture changes, you should expect the KPIs of high operational value to change along with those components.

To wit: Pivotal Cloud Foundry. We modify and improve the platform every quarter. As a result, the KPIs that our customers need to care about are slightly different in each version. So we update the PCF Healthwatch product on the same cadence. The most relevant KPIs are always visible to platform engineers in the dashboard.

SLIs should not change if user needs are the same. As a representation of user value, and not the underlying technology, SLIs should be architecture agnostic. Assuming your system purpose stays the same, an existing SLI should remain valid through any underlying re-architecture.

Why SLI is the Preferred First-Level Monitoring Option

Focusing on SLI monitoring allows you to reduce the overall amount of monitoring work for your system.

Let’s consider another example. If our Pivotal Ops team is paged because the CF Push SLI test is repeatedly failing, our SLO goals are at risk. A page framed this way is instantly more meaningful; the on-call engineer immediately understands the end-user impact. If I’m on call, I don’t want to be paged in the middle of the night for a noticeable latency spike in a component metric. But I *do* want to be paged if my Application Uptime SLI (via canary testing) starts failing.

One continuously-run, user-functionality measurement test can tell you much more about the underlying performance of your system than monitoring and alerting on dozens of metrics in isolation.
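
As a rough sketch of that idea, the continuously running check below pages the on-call engineer only when the user-facing SLI fails repeatedly; the three-failure threshold is an assumption for the example, not a Healthwatch setting:

    # Continuously-run SLI loop sketch: page on sustained user-facing failure,
    # not on individual component metric spikes.
    import time

    CONSECUTIVE_FAILURES_BEFORE_PAGE = 3  # assumed threshold for this sketch

    def run_sli_loop(check, page, interval_seconds: int = 60) -> None:
        failures = 0
        while True:
            if check():
                failures = 0
            else:
                failures += 1
                if failures >= CONSECUTIVE_FAILURES_BEFORE_PAGE:
                    page("SLI failing repeatedly: end-user impact likely")
            time.sleep(interval_seconds)

    # Example wiring, reusing the earlier probe sketch: run_sli_loop(check_sli, page=print)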

Choosing an Error Budget: It’s a Balance

What should your Error Budget be? To figure that out, we need a Target SLO first. Your Target SLO, often described in terms like “three nines” or “five nines”, directly determines your error budget in a given time window (see the reference chart below).
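
The “Allowable Downtime” column in that chart is simple arithmetic: the failure fraction allowed by the target, multiplied by the minutes in a 30-day window. A quick sketch:

    # Deriving the reference chart below from the SLO target.
    WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

    for slo in (0.99999, 0.9999, 0.9995, 0.999, 0.995, 0.99):
        allowable_minutes = (1 - slo) * WINDOW_MINUTES
        print(f"{slo:.3%} -> {allowable_minutes:.2f} minutes of allowable downtime per 30 days")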

At Pivotal, we have measured our own systems for SLO adherence for many months, and we have found it significantly more meaningful to talk in terms of the Error Budget left or spent. In our experience, humans grasp the impact of “time consumed” and “time remaining” more quickly than adherence to a displayed percentage. We have subsequently shifted our own internal communications to focus on how much error budget consumption is acceptable to “trade” for deploying a big or risky change.

It may be tempting to slow the pace of change: if I don’t deploy, then I don’t risk my error budget. This temptation must be avoided, as it quickly leads to stagnation of the platform.

Teams need to be able to deploy new features into the system; security patches are critically necessary. Any sort of update introduces the possibility of instability. Use error budget conversations to set realistic expectations. You should publish these measures and targets freely within your organization. For public apps, consider posting your goals externally for your customers. Without these, users often default to an unrealistic expectation of 100% system reliability.

By establishing shared SLO Targets & Error Budgets as an organization, you put the focus on the right balance between innovation and reliability. And you create a shared language for prioritized investments.

This shared language helps in another scenario. If you’re frequently violating your SLO and consuming an agreed error budget, your teams can have a meaningful discussion around the need for additional investments. A common output is to expand efforts that increase resiliency and performance. Or perhaps your team will consider the feasibility of lowering the agreed-upon reliability objectives, which would make more space for riskier, but necessary, innovation work.

Selecting a Target SLO

Target SLO          | Allowable Downtime (per 30 days) | Likely Requires
99.999% (5 nines)   | 0.43 minutes                     | Automated Failover
99.99% (4 nines)    | 4.32 minutes                     | Automated Rollback
99.95% (3.5 nines)  | 21.6 minutes                     |
99.9% (3 nines)     | 43.2 minutes                     | Comprehensive monitoring and on-call system in place
99.5% (2.5 nines)   | 216 minutes (~3.5 hours)         |
99% (2 nines)       | 432 minutes (~7 hours)           |

Thanks to the Pivotal teams that contributed to this article, including the Pivotal Platform Reliability Engineering practice and Pivotal Cloud Ops.

About the Author

Amber Alston

Amber Alston is a Principal Product Manager for Pivotal Cloud Foundry. From consumer apps to immersive training simulators, she has 10+ years of experience in identifying and translating strategic business objectives into the delivery of highly useful and usable technology solutions. She holds dual Master's Degrees in Engineering & Communication. She's been described as a learner, a gamer, a foodie, and a random hobby skills collector.
