Google's New Book: The Site Reliability Workbook

High Scalability

Jul 25, 2018 — 2 min read

Google has released a new book: The Site Reliability Workbook — Practical Ways to Implement SRE.

It's the second book in their SRE series. How is it different from the previous Site Reliability Engineering book?

David Rensin, an SRE at Google, says:

It's a whole new book. It's designed to sit next to the original on the bookshelf and for folks to bounce between them -- moving between principle and practice.

And from the preface:

The purpose of this second SRE book is (a) to add more implementation detail to the principles outlined in the first volume, and (b) to dispel the idea that SRE is implementable only at “Google scale” or in “Google culture.”

The Site Reliability Workbook weighs in at a hefty 508 pages and roughly follows the structure of the first book. It's organized into three parts: Foundations, Practices, and Processes. There are three appendices: Example SLO Document, Example Error Budget Policy, and Results of Postmortem Analysis.

The table of contents is quite detailed, but here are the chapter titles:

1. How SRE Relates to DevOps
2. Implementing SLOs
3. SLO Engineering Case Studies
4. Monitoring
5. Alerting on SLOs
6. Eliminating Toil
7. Simplicity
8. On-Call
9. Incident Response
10. Postmortem Culture: Learning from Failure
11. Managing Load
12. Introducing Non-Abstract Large System Design
13. Data Processing Pipelines
14. Configuration Design and Best Practices
15. Configuration Specifics
16. Canarying Releases
17. Identifying and Recovering from Overload
18. SRE Engagement Model
19. SRE: Reaching Beyond Your Walls
20. SRE Team Lifecycles
21. Organizational Change Management in SRE

What makes this book a tour de force is its examples and case studies. You aren't just stuck with high-level principles; you're given worked examples that make the principles concrete. That's hard to do and takes a lot of work.

In Chapter 2, Implementing SLOs, there's a detailed example involving the architecture for a mobile phone game. First, you must learn how to think "about how users interact with the system, and what sort of SLIs (Service Level Indicators) would measure the various aspects of a user’s experience." You're then taken through a number of SLIs and how to implement and measure them. Given the SLIs, you learn how to calculate SLOs (Service Level Objectives). And once you have the SLO, you're shown how to derive the error budget. That's not the end. You have to document the SLO and error budget policy. Then you need reports and dashboards that provide point-in-time snapshots of the SLO compliance of your services. Is that the end? No. You must continuously improve your SLO targets and learn how to make decisions using that information. And that's not the end either, but for the rest you'll need to read the book.
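To make the SLI to SLO to error budget chain concrete, here's a minimal sketch of the arithmetic. It is not taken from the book: the function names, the 99.9% target, and the request counts are all hypothetical, and it assumes a simple request-based availability SLI (good events divided by total events over a window).

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events in the window that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(slo_target, total_events)
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests in the window, 500 of them failed.
    slo = 0.999
    total, good = 1_000_000, 999_500

    print(f"SLI:              {availability_sli(good, total):.4%}")
    print(f"Error budget:     {error_budget(slo, total):.0f} failed requests")
    print(f"Budget remaining: {budget_remaining(slo, good, total):.1%}")
```

With numbers like these, half of the budget is spent. That's the kind of figure an error budget policy can key its decisions on, such as whether to keep shipping features or shift effort to reliability work.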

In Chapter 3, SLO Engineering Case Studies, Evernote and The Home Depot tell the story of their journey into SRE.

In Chapter 4, Monitoring, there are examples of moving information from logs to metrics, improving both logs and metrics, and keeping logs as the data source.

In Chapter 6, Eliminating Toil, there are two detailed case studies: "Reducing Toil in the Datacenter with Automation" and "Decommissioning Filer-Backed Home Directories."

And so it goes through nearly every chapter.

As you can see, it's a very detailed and thorough book. The preface modestly contends it's a necessarily limited book, but I'd hate to see how many pages would be in the unlimited version.

Like the first book, the writing is clear, purposeful, and well organized. For a company well known for its influential publications, this is another winner.

Best of all? It's free until August 23rd!
