SRE and the value of treating operations as a software problem

September 3, 2019 Derrick Harris

Site reliability engineering, better known as SRE, has been around since about 2004, but really took off around 2016, when Google engineers wrote a book describing how they use SRE principles to keep the company's applications online. Since then, SRE practices and "site reliability engineer" roles have popped up in technology companies large and small, and the topic comes up in any meaningful conversation about modern IT operations.

In this episode of Cloud Native in 15 Minutes, Dave Rensin, a senior director of engineering at Google, explains some of the key SRE principles and why he thinks of SRE as "a business practice that evolved in a technical culture." In addition to IT-centric topics such as what to measure and how SRE relates to DevOps, Rensin also discusses in some detail the theories behind error budgeting and service-level objectives (you should listen just for that), and how they can apply across the entire organization.

Here are a few quotes from the episode, in which Rensin covers the basics of what SRE is and how to think about what to measure to really make a difference.

SRE is machines working for humans

“The way I like to think of it is this: You can live in one of two worlds. In the first world, a machine, called a pager, wakes you up at 3:00 in the morning because some other machine is having a hard time. In that world, you work for the computers.

“The world you want to live in is one where some system you’re responsible for is having a problem, it sort of mitigates itself, and then it writes a bunch of information out for you to debug the next morning after your morning coffee. That’s a world where the machines work for you. SRE is a world where the machines work for you. …

“That’s the difference between staring at a monitor and poking at a keyboard, versus trying to write software or implement systems that fix themselves.”

SRE and DevOps are not so far apart

“SRE and DevOps developed mostly independently of one another, mostly at the same time, in response to exactly the same set of problems. So, unsurprisingly, they landed in really similar spaces. They share 99 percent of the same principles, so I don’t like to argue about chronology and history and all that stuff. The way I like to think of it mentally is that SRE is a concrete, opinionated—and it certainly is—implementation of DevOps principles.

“The thing I like about SRE is if you do SRE work at Google, and then go to LinkedIn or Netflix or some other place with an SRE culture, the activities will rhyme with one another. You will recognize the things.”

There is such a thing as too good (in error budgeting)

“If you determine that you can tolerate 43 minutes of error, or bad minutes, over 30 days, and you’re consistently only spending, say, 10 minutes, that’s not a victory. That means you’ve over-engineered reliability. You have more reliability than your users need, and that’s time and resource and expense you could be applying to innovation or risk or some other thing.”
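The arithmetic behind this example can be sketched in a few lines of Python. The 43-minute figure is consistent with a 99.9 percent availability target over a 30-day window, but the quote does not state the target, so the 99.9 percent value below is an assumption, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of tolerated 'bad' time in the window for an availability target."""
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (1.0 - slo_target)


def budget_spent_fraction(bad_minutes: float, budget_minutes: float) -> float:
    """Share of the error budget consumed so far."""
    return bad_minutes / budget_minutes


# Assumption: a 99.9% target, which yields roughly the 43 minutes in the quote.
budget = error_budget_minutes(0.999)
spent = budget_spent_fraction(10, budget)
print(f"budget: {budget:.1f} min, spent: {spent:.0%}")
# prints "budget: 43.2 min, spent: 23%"
```

Consistently spending only about a quarter of the budget is the situation Rensin describes as over-engineered reliability: the unused remainder could be traded for faster releases or riskier changes.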

Measure what matters

“No one’s users care about CPU load or memory pressure or disk fullness. … They care about, ‘How long did the thing I want take, and did I get the correct answer?’ … You want to measure the things your users care about.

“We like to say the important things are the symptoms, not the causes. The causes are important because you need the data to be able to debug and fix the thing, but your users care about the symptoms.”
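One way to turn the two user-facing questions (how long did it take, and was the answer correct?) into a single number is a "good request" ratio. The sketch below is a minimal illustration, not a description of how Google measures it: the `Request` record, the `good_request_ratio` function, and the 300 ms threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the user waited
    correct: bool      # whether the user got the right answer


def good_request_ratio(requests: list[Request], latency_threshold_ms: float) -> float:
    """Fraction of requests that were both fast enough and correct."""
    if not requests:
        raise ValueError("need at least one request to compute a ratio")
    good = sum(
        1 for r in requests
        if r.correct and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


# Hypothetical sample: one slow request and one wrong answer out of four.
sample = [
    Request(120, True),
    Request(480, True),   # correct but too slow
    Request(95, False),   # fast but wrong
    Request(210, True),
]
sli = good_request_ratio(sample, latency_threshold_ms=300)
print(f"{sli:.0%} of requests were fast and correct")
# prints "50% of requests were fast and correct"
```

Note that nothing in this measurement refers to CPU load, memory, or disk usage. Those cause-level metrics still matter for debugging, but the ratio tracks only the symptoms users experience.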

Subscribe here

Cloud Native in 15 Minutes publishes bi-weekly, and you can find it on most of your favorite apps and platforms.

Learn more about SRE

Google's SRE homepage

Google SRE books (free to read)

Reliability Engineering for Humans (presentation)

Principles and Best Practices for Site Reliability Engineering (webinar)

Scale and velocity are driving the next generation of DevOps

About the Author

Derrick Harris is a product marketing manager at VMware.

