How does Google keep all its services up and running? They almost never seem to fail. If you've ever wondered we get a wonderful peek behind the curtain in a talk given at GCP NEXT 2016 by Melissa Binde, Director, Storage SRE at Google: How Google Does Planet-Scale Engineering for Planet-Scale Infrastructure.
Melissa's talk is short, but it's packed with wisdom and delivered in a no nonsense style that makes you think if your service is down Melissa is definitely the kind of person you want on the case.
Oh, just what is SRE? It stands for Site Reliability Engineering, but a definition is more elusive. It's like the kind of answers you get when you ask for a definition of the Tao. It's more a process than a thing, as is made clear by Ben Sloss 24x7 VP, Google, who defines SRE as:
what happens when a software engineer is tasked with what used to be called operations.
Let that bounce around your head for awhile.
Above and beyond all else one thing is clear: SREs are the custodian of production. SREs are the custodian of customer experience, for both google.com and GCP.
Some of the highlights of the talk for me:
- The Destructive Incentives of Pitting Uptime vs Features. SRE is an attempt to solve the natural tension between developers who want to push features and sysadmins that want maintain uptime by not pushing features.
- The Error Budget. This is the idea that failure is expected. It's not a bad thing. Users can't tell if a service is up 100% of the time or 99.99%, so you can have errors. This reduces the tension between dev and ops. As long as the error budget is maintained you can push out new features and the ops side won't be blamed.
- Goal is to restore service immediately. Troubleshooting comes later. This means you need a lot of logging and tooling to debug after a service has been restored. For some reason this made flash on a line from an earlier article, also based on a talk from a Google SRE: Backups are useless. It’s the restore you care about.
- No Boredom Philosophy of Paging. When a page comes in it should be for an interesting and new problem. You don't want SREs being bored handling repetitive problems. That's what bots are for.
Other interesting topics in the talk are: How is SRE structured organizationally? How are devs hired into a role focussed on production and keep them happy? How do we keep the team valued inside of Google? How do we help our teams communicate better and resolve disagreements with data rather than with assertions or power grabs?
Let's get on with it with it. Here's how Google does Planet-Scale Engineering for a Planet-Scale Infrastructure...