How Does Google do Planet-Scale Engineering for a Planet-Scale Infrastructure?
Monday, July 18, 2016 at 9:15AM
Todd Hoff in Example, google

 

How does Google keep all its services up and running? They almost never seem to fail. If you've ever wondered, we get a wonderful peek behind the curtain in a talk given at GCP NEXT 2016 by Melissa Binde, Director, Storage SRE at Google: How Google Does Planet-Scale Engineering for Planet-Scale Infrastructure.

Melissa's talk is short, but it's packed with wisdom and delivered in a no-nonsense style that makes you think that if your service is down, Melissa is definitely the kind of person you want on the case.

Oh, just what is SRE? It stands for Site Reliability Engineering, but a definition is more elusive. It's like the kind of answer you get when you ask for a definition of the Tao. It's more a process than a thing, as is made clear by Ben Sloss, 24x7 VP, Google, who defines SRE as:

what happens when a software engineer is tasked with what used to be called operations.

Let that bounce around your head for a while.

Above and beyond all else, one thing is clear: SREs are the custodians of production. SREs are the custodians of customer experience, for both google.com and GCP.

Some of the highlights of the talk for me:

Other interesting topics in the talk are: How is SRE structured organizationally? How are devs hired into a role focussed on production, and how are they kept happy? How is the team kept valued inside of Google? How can teams be helped to communicate better and resolve disagreements with data rather than with assertions or power grabs?

Let's get on with it. Here's how Google does Planet-Scale Engineering for a Planet-Scale Infrastructure...

Maintaining the Balance: The Destructive Incentives of Pitting Uptime vs Features

Skills: SREs are a Combination Seal Team and Priesthood

Organization: Give Devs a Reason Not to Let Operational Work Build Up

Environment: How do you keep devs happy in a production team?

Budget: Error Budget You Can Spend However You Want

SRE Support is on a Spectrum

What Makes Things Go? Culture and Processes

Incident Management

Who do you blame?

Blameless Post Mortems

The No Boredom Philosophy of Paging

The Need for Much Stronger Debugging Tools

Stackdriver Error Reporting


Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.