Death and DevOps

December 19, 2017 Dormain Drewitz

Dealing with unfortunate things shapes a culture.

There are certain universal human experiences that transcend cultures. The birth of a child. The coupling of two people in marriage. And death. From baptisms to burials, how we handle these moments is a reflection of our culture.

For better or worse, how death is dealt with leaves a lasting legacy that we can study. You can learn a lot about a culture — a mindset — by how people treat their dead. In medieval Europe, bodies were buried intact, facing east, so that they could rise facing Jerusalem upon the Resurrection. In Judaism, the dead are typically buried within a day and are never left unattended until burial. And, if you saw Coco, you saw a version of how families remember and honor the dead in Mexico.

Image from Pixar’s Coco.

So, what does that all have to do with DevOps?

Well, even in the DevOps world, things die. Servers die. VMs die. Applications die. With the right kind of automation and abstraction, these events matter less. But complex events and human error still cause outages, even at the most sophisticated of cloud-native companies. So, what happens when there’s an outage?

I noticed that, of all my live-tweeting from SpringOne Platform, that tweet was getting the most reactions. This got me wondering: Why does this four-step process resonate with people? Then it hit me: this is a concrete example of DevOps “culture.” And we need those concrete examples.

What are your rituals?

At SpringOne Platform, many speakers said cultural change was the hardest part of their transformation. As Niki Allen from Boeing quoted, “Culture will eat strategy for lunch.” But for all the talk of how important culture is for transformation, examples were scarce — there were a few ping-pong references and photos of open office spaces.

But, as Ben Horowitz and Jason Rosenthal discussed on this great podcast episode, culture really matters when things go wrong. Culture has a lot to do with how we deal with the hard stuff, from the day-to-day to the life-changing. Most societies deal with hard stuff through a set of rituals. Rituals help us know what to do in really stressful situations (like a death) when it can be hard to think clearly.

Google has a lot of ritual around failure. Listening to Andrew Clay Shafer’s digest of the Google SRE book, I got the sense that the culture around learning from failure was borderline obsessive. It’s like a mourning process, one that ultimately lets people move on.

You almost certainly already have a set of rituals for when things go wrong. They may not look like Matt’s list and you may not even be aware of them. These rituals may vary from department to department, but they exist. They also may not be healthy and may be creating resistance to change.

For example, your rituals for an outage may involve an urgent conference call. That turns into a blamestorming session, where a couple of people or teams get thrown under the bus. Once the issue is resolved, a manager writes an email, sent only to execs, with a high-level overview. After that, the issue is considered buried and put to rest, only spoken of again on the next blamestorming call.

Defining a new way to mourn the outage

Unfortunately, you can’t just declare a new set of rituals and claim a cultural change. I mean, you could, but it probably wouldn’t work very well. If you want to have any hope of changing a culture, you have to first understand it. Then you can introduce new rituals to change the culture by teaching and practicing.

1) Learn and observe.

Abigail Stason has an insightful framework for what she calls Conscious Commitment. It’s intended (to my knowledge) for individuals, but can apply to teams and organizations as well. She starts with noticing patterns of what’s already happening, then allowing time to really study the behavior.

Observe what happens when something goes wrong. Include the good, bad, and ugly. Document the steps. Figure out what triggers different stages in the process. From there, you will be in a better position to recognize this pattern (and, if need be, head it off) once you’ve committed to a new set of rituals.

2) Define some new rituals and iterate.

Matt’s four steps are great: short and simple, making them easy to absorb as a multi-step process. They also emphasize humility and empathy, which can be easier said than done. That takes practice.

@mcrowther talking about culture and security along with values, behaviors, and practices. #SpringOne #pivotal

 — @Ash_Hathaway

You can also take inspiration from analysis of public outages, like one at GitLab. Google’s SRE book sounds like a good source of ideas (full disclosure: I haven’t read it). But, be careful not to adopt concepts, like “blameless post-mortem,” without internalizing what they mean.

I’m going to go out on a limb here and say there’s a lean approach to defining new cultural rituals as well. Don’t try to craft the perfect outage response guide. Identify a couple of things that you think would be an improvement and start trying them out. See how they go. Learn from them and change or add to them over time. When continuous improvement (iterating on solutions) is part of your culture, you move faster because you don’t get stuck in analysis paralysis trying to nail it the first time.

3) Teach and practice.

So, now you have a couple of new things you want to see happen in the event of an outage. How do you make that happen? You need to educate people. You can’t expect them to know any other way than what they are already doing.

We aren’t born knowing what to do at a funeral. For example, I had my first experience with a Jewish funeral this summer. There were little printed handouts at all the seats of the temple. Each one explained the practices, along with the translated Hebrew prayers. The rabbi explained a lot of what was happening and why as the funeral progressed. Elder members of the community leaned over to fill me in on this or that. Although I had never participated in that part of the culture, there were several ways for me to learn.

Write down the “this is what we do when there’s an outage” list or manifesto. You’re (probably) not writing it in stone, so it can change. Don’t wait for an outage to spring the list on people. To lift an idea from Adrian Cockcroft, you may even want to run a few fire drills where you can reinforce new rituals.

Enough death. What are some happy rituals?

Okay, okay. I get that death and mourning are heavy topics. I’m asking you to put your cultural anthropologist hat on. There’s something to be learned from comparing cultural responses to death with your culture around outages. Both are stressful events, and both are a question of when, not if, they will happen.

But we can also apply the same thinking to other moments. What about the “birth” of a new product? How do you celebrate its launch? How does the community support the new “parents”? These are questions for another post, but I’ll leave you with this thought: If you go to observe your rituals for new products and can’t find any, is your birth rate too low?

About the Author

Dormain Drewitz

Dormain leads Product Marketing and Content Strategy for VMware Tanzu. Before VMware, she was Senior Director of Pivotal Platform Ecosystem, including RabbitMQ, and Customer Marketing. Previously, she was Director of Product Marketing for Mobile and Pivotal Data Suite. Prior to Pivotal, she was Director of Platform Marketing at Riverbed Technology. Prior to Riverbed, she spent over 5 years as a technology investment analyst, closely following enterprise infrastructure software companies and industry trends. Dormain holds a B.A. in History from the University of California at Los Angeles.
