How Does Google Do Planet-Scale Engineering for a Planet-Scale Infrastructure?

How does Google keep all its services up and running? They almost never seem to fail. If you've ever wondered, we get a wonderful peek behind the curtain in a talk given at GCP NEXT 2016 by Melissa Binde, Director, Storage SRE at Google: How Google Does Planet-Scale Engineering for Planet-Scale Infrastructure.

Melissa's talk is short, but it's packed with wisdom and delivered in a no-nonsense style that makes you think that if your service is down, Melissa is definitely the kind of person you want on the case.

Oh, and just what is SRE? It stands for Site Reliability Engineering, but a definition is more elusive. It's like the kind of answer you get when you ask for a definition of the Tao. SRE is more a process than a thing, as is made clear by Ben Sloss, 24x7 VP at Google, who defines SRE as:

what happens when a software engineer is tasked with what used to be called operations.

Let that bounce around your head for a while.

Above and beyond all else, one thing is clear: SREs are the custodians of production. SREs are the custodians of the customer experience, for both google.com and GCP.

Some of the highlights of the talk for me:

  • The Destructive Incentives of Pitting Uptime vs Features. SRE is an attempt to solve the natural tension between developers who want to push features and sysadmins who want to maintain uptime by not pushing features.
  • The Error Budget. This is the idea that failure is expected. It's not a bad thing. Users can't tell if a service is up 100% of the time or 99.99%, so you can have errors. This reduces the tension between dev and ops. As long as the error budget is maintained you can push out new features and the ops side won't be blamed.
  • Goal is to restore service immediately. Troubleshooting comes later. This means you need a lot of logging and tooling to debug after a service has been restored. For some reason this made me flash on a line from an earlier article, also based on a talk from a Google SRE: Backups are useless. It's the restore you care about.
  • No Boredom Philosophy of Paging. When a page comes in it should be for an interesting and new problem. You don't want SREs being bored handling repetitive problems. That's what bots are for.

Other interesting topics in the talk are: How is SRE structured organizationally? How are devs hired into a role focused on production, and how are they kept happy? How do we keep the team valued inside of Google? How do we help our teams communicate better and resolve disagreements with data rather than with assertions or power grabs?

Let's get on with it. Here's how Google does Planet-Scale Engineering for a Planet-Scale Infrastructure...

Maintaining the Balance: The Destructive Incentives of Pitting Uptime vs Features

  • Sysadmins get cookies for uptime, for when a site stays up. When a site stays up we get visitors, and visitors give us money.

  • Developers get cookies for features. Release a new feature, visitors come, they give us money.

  • Production freezes, that is, freezes on new features, usually map to increased uptime.

  • There’s a natural tension between devs and sysadmins. Developers get cookies for releasing features. Sysadmins get cookies for uptime.

  • So sysadmins are rewarded for preventing new features from going out. And the developers will be rewarded if they can figure a way around the sysadmins.

  • Developers do what they call betas as a way of getting features out sooner.

  • Sysadmins do what they call launch reviews to slow down new features.

  • Your teams spend all their time fighting each other, so you get increased outages, increased risk, chaos, and anarchy.

  • What you want is to take whim and fiat out of the process. Handle it by rules so teams can have goals and work together.

  • There is a way to have dev and operations work together, as in devops. The problem is that devops has a different meaning wherever you go. In contrast, SRE (Site Reliability Engineering) is well defined.

  • SRE: what happens when you ask a software engineer to design and run operations -- Ben Sloss 24x7 VP, Google

    • Software engineer - it turns out services run better when people who know software also run the services. They have a deep understanding of what makes it tick.

    • Design and run - actually design your production environment rather than have it be a happy accident.

  • Let’s say there are 1000 SREs working on Google’s infrastructure: network, compute, storage, etc. How many are responsible for cloud?

    • All of them.

    • There's no distinction between what makes google.com run and what makes GCP (Google's Cloud Platform) run. Google doesn't want the overhead of having the cloud teams and the internal teams trying to communicate. They've created one environment that helps everything work together.

Skills: SREs are a Combination SEAL Team and Priesthood

  • The section title is my characterization. Skill-wise, SREs must be elite. Job-wise, they are devoted only to this quasi-mystical thing called Production.

  • SREs must be more skilled than developers to do the same job:

    • They need a greater breadth of skills.

    • All SREs must pass a full software developer interview to be hired.

    • All SREs must pass a non-abstract large system design interview.

  • SREs must have the same software skills; it's just a different domain of application.

    • Devs are beholden to product managers and make features.

    • SREs are beholden to production, to making production as good as it can be.

  • When both the dev and production oriented perspectives are combined the resulting design is much stronger.

  • An example of what SREs bring to the table comes from the onboarding process, which happens when a team's project is brought under SRE's responsibility. When one team's software was evaluated, they found:

    • It was going to fail in Production when it hit scale.

    • The devs had implicitly assumed a certain type of call would never fail.

    • They had assumed the distribution of requests was uniform.

    • They had assumed they wouldn’t be hot spotted by users.

    • They assumed all requests were of an average size.

    • They failed on the two tails (no explanation given).

Organization: Give Devs a Reason Not to Let Operational Work Build Up

The system must be designed not to let operational work build up, because if devs aren’t doing the work they won’t care as much.

Devs budget for SRE. If you have a system with a large operational overhead, you don't get as many devs and you don't get to push as many features.

SREs have a completely different chain of command. They have their own VP, separate from the dev VPs. This gives them authority and power. It allows them to say no when production means they need to say no. A bunch of pager monkeys they are not.

When devs say they can donate headcount, SRE does not have to accept it. SRE can say a service is not important enough: keep supporting it yourselves.

SREs are a scarce resource. Not every team at Google has SREs. Cloud does, but not every other team, and not even every single small service in cloud, just the important ones.

Environment: How do you keep devs happy in a production team?

  • At least 50% of work needs to be project work. Not on-call. Not tickets. Not meetings. Actually doing project work.

  • If there's too much operational work, either dev gives more headcount to SRE or the extra work flows over to the dev team.

  • What is project work?

    • Improving the latency of a service by switching the underlying database technology.

    • Writing automation to speed up deployments.

    • Projects that span services. Google has a service internally that can be queried by other services, usually by software bots, that returns whether it is safe to take a machine down, a rack down, or a datacenter down (a hypothetical sketch of such a check follows this list).

  • SREs are a volunteer army. There is no draft.

    • You can transfer into another SRE team at any time.

    • You can transfer into dev at any time.   

    • Mission Control is a program where devs can try out SRE and see if they like it.

  • Teams are fluid. People are coming in and out of teams, sharing experiences and sharing perspectives.
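
The safety-check service mentioned in the project work list is a nice example of encoding operational judgment as an API that other systems and bots can query. Here is a minimal, hypothetical sketch of such a check; none of this is Google's actual service, and the policy, names, and data model are invented for illustration:

```python
# A hypothetical "safe to take down?" check, in the spirit of the internal service
# described above. A bot asks whether draining a failure domain (machine, rack, or
# datacenter) would leave any replica set below its required healthy replica count.
from dataclasses import dataclass

@dataclass
class ReplicaSet:
    name: str
    min_healthy: int           # replicas required to keep serving safely
    replicas: dict[str, bool]  # failure domain (machine/rack/DC) -> currently healthy?

def safe_to_take_down(domain: str, replica_sets: list[ReplicaSet]) -> tuple[bool, list[str]]:
    """Return (ok, reasons); ok is False if draining `domain` would leave any
    replica set below its minimum healthy replica count."""
    reasons = []
    for rs in replica_sets:
        remaining = sum(healthy for d, healthy in rs.replicas.items() if d != domain)
        if remaining < rs.min_healthy:
            reasons.append(f"{rs.name}: {remaining} healthy replicas left, needs {rs.min_healthy}")
    return (not reasons, reasons)

# Example: a maintenance bot asks before draining "rack-3".
sets = [
    ReplicaSet("user-index", 2, {"rack-1": True, "rack-2": True, "rack-3": True}),
    ReplicaSet("session-db", 2, {"rack-2": False, "rack-3": True, "rack-4": True}),
]
ok, reasons = safe_to_take_down("rack-3", sets)
print(ok, reasons)  # False: session-db would drop to 1 healthy replica
```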

Budget: Error Budget You Can Spend However You Want

  • If you have three nines of availability, the goal is not to push it to four nines; you have a 0.1% error budget, go for it (a worked example of the arithmetic follows this list).

  • If you want to push out features faster and make GCP even better, then do it. Until you run out of error budget.

  • If you would prefer to have poor tests, have your software failing regularly, and have to roll back constantly, then you can choose that too, but you'll run out of error budget much faster and you won't be able to launch.

  • Error budgets go on quarterly cycles.

  • There’s an escape valve: three silver bullets.

    • A dev can say I really need to push, I’d like a silver bullet please.

    • The SRE will say OK, but you have to convince the VP that you actually need to push.

    • This ritual may sound silly, but it’s very powerful. It puts control in the hands of the devs. They have three silver bullets and it’s their VP that decides if it’s appropriate to push.

  • Error budgets are on a per-service basis. So if multiple dev teams are on the same service, they share the same budget.

    • SRE does not get in the middle of warring dev teams. They have to work out how the error budget is going to be spent.

  • Off-boarding. If all else fails and devs and SRE really can’t agree, SREs can offboard a dev team.

    • Like an amicable divorce.

    • It's a critical escape valve so teams don't have festering disagreements over long periods.

    • It's rare, but it has happened. An example scenario: a dev team doesn't want to use Spanner for a project that needs ACID semantics and says they want to build their own database. The SRE team can say they won't support the team if it goes and builds its own database, because that's not good for production.

  • SREs are the custodians of production. SREs are the custodians of the customer experience, for both google.com and GCP.
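
To make the error budget arithmetic concrete, here is a small worked sketch of my own (not from the talk): a 99.9% availability target over a roughly 90-day quarter allows about 129.6 minutes of downtime, and a launch check simply asks how much of that budget remains.

```python
# A small sketch of error budget arithmetic (my own illustration, not from the talk).
# An availability target implies a fixed amount of allowed downtime per quarter,
# and launches are gated on whether any of that budget is left.
QUARTER_MINUTES = 90 * 24 * 60  # a ~90-day quarter, in minutes

def error_budget_minutes(availability_target: float) -> float:
    """Allowed downtime per quarter for a given availability target."""
    return (1.0 - availability_target) * QUARTER_MINUTES

def can_launch(availability_target: float, downtime_so_far: float) -> bool:
    """A launch is allowed only while some error budget remains."""
    return downtime_so_far < error_budget_minutes(availability_target)

# Three nines: 0.1% of a quarter is roughly 129.6 minutes of allowed downtime.
print(round(error_budget_minutes(0.999), 1))     # 129.6
print(can_launch(0.999, downtime_so_far=45.0))   # True: budget remains
print(can_launch(0.999, downtime_so_far=140.0))  # False: budget exhausted, no launch
```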

SRE Support is on a Spectrum

  • Chat and Consultation. Chatting with a dev. Having a whiteboard session.

  • Co-Designing. Creating a design with the dev.

  • Full Ownership. Fully owning the service. All the capacity, all the provisioning, all of the pages.

  • Pages are a way of keeping you honest. They are not what SRE is about.

    • People responsible for production should take the pages because that keeps their skin in the game.

    • It also helps keep the SRE’s skill, expertise, and perspective up-to-date.

What Makes Things Go? Culture and Processes

  • Google does the usual sort of training and on-call shadowing.

  • Google also has a process called the Wheel of Misfortune - a role-playing game.

    • One person is the dungeon master and they have a victim and the team takes turns trying to guess what’s going on.

    • Google runs very complex systems. It's rare that someone other than the person running the training session actually knows what is going on and what the answer is.

    • It's good for new oncallers. It lets them test things out in a controlled environment.

    • Some teams have scenarios where they break something in production and have the newbies fix it.

    • It’s also good for the veterans. It’s good to refresh your knowledge, especially when working with very involved systems.

Incident Management

  • Scenario: you are on call for gmail and you get a ticket saying users can see other users' emails. What do you do? Shut gmail down.

  • Oncallers are fully empowered to do whatever it takes to protect users, to protect information, to protect Google. If that means shutting down gmail or even shutting down all of google.com, then as an SRE you are going to be supported by your VP and your SVP for protecting Google.

  • Goal is to restore service immediately. Troubleshooting comes later.

    • There are records of the binary state. There are logs.

    • Troubleshoot when awake, when devs are in the office, when everyone is present. During the outage the goal is just to get the service back up and running.

Who do you blame?

  • When a "new dev" pushes code and breaks google.com for three hours, who do you blame? a) The new dev. b) The code reviews. c) The lack of tests (or ignored tests). d) The lack of a proper canary process for the code. e) The lack of rapid rollback tools.

    • Everything except the new dev. If the new dev writes code that takes down the site it’s not the fault of the dev. It’s the fault of all the gates between the dev and working prod.

    • Human error should never be allowed to propagate beyond the human. Look at the process that allows the broken code to be deployed.

Blameless Post Mortems

  • Avoiding a blame culture is critical.

  • Studies show most incidents are caused by human error.

  • Incidents are best solved by knowing what actually happened. The best way to not know what happened? Open every incident by trying to find someone to blame.

  • People are really good at hiding, and making sure there’s no trail, and making sure you don’t actually know what happened. Trying to find blame just makes your job in finding out what happened much much harder.

  • At Google, whoever screwed up writes the post mortem. This avoids naming and shaming. It gives them the power to make it right. Everyone who contributed to the failure goes in, as honestly as possible, and writes down how they screwed up.

  • Bonuses have been given out at all-hands meetings to people who took down the site, because they owned up immediately that they did it. They got on IRC and said to roll it back. They got a bonus for speaking up and taking care of it so quickly.

  • Blameless doesn't mean there are no names and details. It means we are not picking the people as the reason things went wrong. There shouldn't be any such thing as an outage that deserves a firing.

  • Defense in Depth

    • A post mortem template separates actions out into prevent, detect, and mitigate, because the strategy is defense in depth (a hypothetical example follows this list).

    • We want to prevent outages, we want to detect them faster, and we want to mitigate the impact.

    • If something like this happens again it won’t spread as far, or last as long, or impact as many customers.
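
As a purely hypothetical illustration of that prevent/detect/mitigate split, a postmortem's action items might be bucketed like this (the items themselves are invented):

```python
# A hypothetical example of postmortem action items bucketed the way the template
# above describes: prevent the class of outage, detect it faster, mitigate its impact.
action_items = {
    "prevent": [
        "Add a presubmit test covering the config value that triggered the crash",
    ],
    "detect": [
        "Alert when the error rate in any single cell exceeds 1% for 5 minutes",
    ],
    "mitigate": [
        "Add a one-command rollback for this service's config pushes",
        "Roll out to one datacenter at a time so a bad push can't spread everywhere",
    ],
}

for bucket, items in action_items.items():
    print(bucket.upper())
    for item in items:
        print(f"  - {item}")
```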

The No Boredom Philosophy of Paging

What kind of pages does a team like to see? New and interesting ones.

Pages you know how to solve are boring. You should create a bot to handle the problem.

Google invents lots of robots. They don’t like being bored.

If you can write down the steps to fix it then you can probably write the automation to fix it.

Don’t be doing things that robots can be doing for you.
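
As a toy illustration of writing the steps down and then writing the bot (this is not Google's tooling; the alert signatures and handlers are invented), a paging bot might map known alert signatures to their runbook fixes and only page a human for problems it doesn't recognize:

```python
# A toy auto-remediation bot: alerts whose fix is already written down get handled
# automatically; anything without a runbook entry is escalated to a human, so the
# pages people actually see stay new and interesting.
def restart_task(alert: dict) -> str:
    return f"restarted task {alert.get('task', '<unknown>')}"

def expand_quota(alert: dict) -> str:
    return f"raised disk quota for {alert.get('service', '<unknown>')}"

RUNBOOK = {
    "task_stuck": restart_task,           # steps we know how to do -> automated
    "disk_quota_exceeded": expand_quota,
}

def handle_alert(alert: dict) -> str:
    handler = RUNBOOK.get(alert["signature"])
    if handler is not None:
        return f"auto-remediated: {handler(alert)}"
    return "page a human: novel problem, no runbook entry"

print(handle_alert({"signature": "task_stuck", "task": "frontend-42"}))
print(handle_alert({"signature": "users_can_see_other_users_email"}))
```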

The result of the build-a-bot approach is that each page is ideally really new, so there isn't a chance to get bored. Even experienced engineers are probably seeing something new every time their pager goes off.

This is a fundamental change in philosophy. If nothing is routine and few incidents are repeated it means you can’t lean as heavily on previous experience when debugging the system.

The Need for Much Stronger Debugging Tools

  • If all your problems are new it means you need much stronger debugging tools to find problems.

  • Text logs are not a debugging tool. Standard debugging by looking for patterns in log files doesn't scale if you don't know what to look for. With a platform the size of GCP, how many logs would you have to look through to find the one that is failing?

  • Google relies heavily on various visualization tools to troubleshoot unfamiliar problems and restore service as quickly as possible.

  • Graphing Tools: Graphite, InfluxDB + Grafana, OpenTSDB.

    • These and the other tools mentioned aren’t the tools Google uses and they aren’t being recommended, but they are Open Source examples of useful tooling.

    • Great to look at an aggregate of what's going on. Google has billions and billions of processes, so you need that aggregate view to make sense of things.

    • Google puts a lot of instrumentation in their binaries. With a novel situation you don’t always know what you are looking for.

  • Create a framework that makes it easy for devs to plug into the monitoring framework.

  • A huge amount of storage is dedicated to storing monitoring data.

    • The idea is you don’t want to be troubleshooting during an outage. An outage is all about restoring service.

    • Troubleshooting is what you do later when you are awake. Devs are often involved in the troubleshooting process as they have the deeper knowledge of the system.

    • Historical data must be available so troubleshooting can occur after service is restored. Restoring service shouldn't cause the monitoring data from the outage to be lost.

    • This approach allows outages to be kept as short as possible while being able to fix problems later.

  • Event Graphing - really useful for correlating events.

    • Take advantage of your human ability to pattern match, it’s hard to write robots to do this.

    • An example is given of a graph where each row is a datacenter, each column is a time bucket, and the color in the cell is the event type (a toy rendering of this layout follows this list).

    • This can help you find patterns that aren't a single incident: a software rollout that causes a cascading failure, a cluster of errors repeating together, or a latency spike with an error spike immediately after it, repeated over time. These all help identify the root cause of the problem.

  • Visual Process Tracing - sometimes you need to get down to the process level to identify performance problems.

    • There aren't many open source options: Performance Co-Pilot + Vector.

    • Google has a very elaborate framework that pulls sample queries into storage and provides a full trace of them.

    • The advantage of a visual tool is that making sense of timestamps is hard. A visual tool lets you collapse, expand, and compare events more easily.

  • Network Flows and Capacity

    • Open Source options: Cacti, Observium, Nagios

    • It turns out a lot of storage-is-slow problems are really network problems.

    • If you are looking at your storage system and can’t figure out why it’s slow look to the network.

    • You need a tool to quickly look at the state of the network. What links are overloaded? How many packet errors do you see? Is the link down?

  • Log Files - When All Else Fails

    • Open Source: ElasticSearch + Logstash (+Kibana)

    • You don't want to be grepping through log files. You need a system with more SQL-like queries so you can dig into the logs.

    • Logs should be easy to consume and easy to understand.
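
The event-graphing idea is concrete enough to sketch. Below is a minimal, hypothetical rendering of that grid (not Google's tooling; it assumes numpy and matplotlib and uses toy data): each row is a datacenter, each column a time bucket, and the cell color is the event type, so a staggered rollout followed by error spikes shows up as a diagonal pattern you can spot by eye.

```python
# A minimal sketch of the event grid described under "Event Graphing": rows are
# datacenters, columns are time buckets, and the cell color is the event type.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, BoundaryNorm

datacenters = ["dc-a", "dc-b", "dc-c", "dc-d"]
event_types = ["quiet", "rollout", "latency spike", "error spike"]
colors = ["#eeeeee", "#4c72b0", "#dd8452", "#c44e52"]

# Toy data: a rollout sweeping across datacenters, each followed by an error spike.
events = np.zeros((len(datacenters), 24), dtype=int)
for row in range(len(datacenters)):
    events[row, 4 + row] = 1  # staggered rollout
    events[row, 5 + row] = 3  # error spike right after the rollout

cmap = ListedColormap(colors)
norm = BoundaryNorm(np.arange(-0.5, len(event_types) + 0.5), cmap.N)

fig, ax = plt.subplots(figsize=(10, 2.5))
im = ax.imshow(events, aspect="auto", cmap=cmap, norm=norm)
ax.set_yticks(range(len(datacenters)))
ax.set_yticklabels(datacenters)
ax.set_xlabel("time bucket")
cbar = fig.colorbar(im, ticks=range(len(event_types)))
cbar.ax.set_yticklabels(event_types)
plt.show()
```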

Stackdriver Error Reporting

  • If you would like to see an example of the kind of tools SRE has, you are in luck: take a look at Google Stackdriver Error Reporting.

    • It was an internal tool that they were able to make into a service.

    • Errors are grouped and deduplicated by analyzing stack traces (a rough sketch of the idea follows this list).

    • The system knows about the common frameworks that are used and groups errors accordingly.

  • The plan is to do more of this. Google has an extensive set of tools internally that they want to make available to cloud customers.
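
As a rough sketch of the grouping idea (the real service is framework-aware and far more sophisticated than this), deduplication can start by hashing a normalized form of each stack trace, so that incidental differences such as line numbers don't split one logical error into many groups:

```python
# A rough sketch of grouping errors by stack trace. Frames are normalized so that
# line numbers and memory addresses don't split the same logical error into many
# groups; the normalized trace is hashed to form the group key.
import hashlib
import re
from collections import defaultdict

def normalize(frame: str) -> str:
    frame = re.sub(r":\d+", "", frame)            # drop line numbers
    frame = re.sub(r"0x[0-9a-f]+", "0x?", frame)  # drop memory addresses
    return frame.strip()

def group_key(stack_trace: list[str]) -> str:
    normalized = "\n".join(normalize(f) for f in stack_trace)
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

groups: dict[str, int] = defaultdict(int)
traces = [
    ["main.py:10 in handle_request", "db.py:42 in query", "pool.py:7 in checkout"],
    ["main.py:11 in handle_request", "db.py:42 in query", "pool.py:9 in checkout"],
    ["api.py:3 in render", "cache.py:88 in get"],
]
for trace in traces:
    groups[group_key(trace)] += 1

for key, count in groups.items():
    print(key, count)  # the first two traces collapse into a single group
```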