Strategy: Planning for a Power Outage Google Style
We can all learn from problems. The Google App Engine team has created a teachable moment with its remarkably honest and forthcoming post-mortem for the February 24th, 2010 outage, chronicling in elaborate detail a power failure that took down Google App Engine for a few hours.
The world is ending! The cloud is unreliable! Jump ship! Not. This is not the story of a beautiful, powerful, supposedly unsinkable ship going down on its maiden voyage. Stuff happens, no matter how well you prepare. If you think private datacenters don't go down, well, then I have some rearrangeable deck chairs to sell you. The goal is to keep improving and to keep shrinking those failure windows. From that perspective there is a lot to learn from the problems the Google App Engine team encountered and how they plan to fix them.
Please read the article for all the juicy details, but here's what struck me as key:
- Power fails. Plan for it. This seems to happen with unexpected frequency for such expensive and well-tended systems. The point here is to consider how a complex distributed system responds to power failures, not to dissect the merits of various diesel-powered backup systems.
- Double faults happen. Expect it. It's quite common when disaster planning to assume single, independent failures when creating scenarios of doom. This makes for nice MTTR and MTTF numbers. It's also BS. Stuff fails in the strangest and most unexpected ways. The GAE team experienced a double failure: hardware and procedure. This isn't an outlier. Black swan events happen; we just can't be sure of the distribution.
- Distributed ops responsibility. Drill it. There are many ways to structure operations. Developers can be responsible for dealing with problems, or a separate ops team can be responsible. GAE apparently uses a separate ops team, which is nice because it means no beepers for developers. It's also a problem though. When a separate ops team is involved, a very complicated set of procedures must be put in place so people who know nothing about the software can manage it. It also requires a very high level of software automation so a system can be treated as a black box. This process broke down and the ops team didn't know what to do when unexpected problems arose. When you read that it took 2 hours to reach "an engineer with familiarity with the unplanned failover procedure", that's not good. This happens for a lot of reasons. Software evolves something like 10 times faster than ops procedures do, so something was left undone. Software releases need to become much more structured in their handoffs to the ops team so that these problems have less of a chance of happening. GAE also committed to a key practice: drills. You have to practice handling failures constantly. Soldiers drill endlessly precisely so they can deal with emergency situations without thinking. A similar motivation applies to the ops-developer dynamic as well.
- Measure and correlate the right things. The first clue that there was a problem seemed to be a traffic drop. One can imagine many reasons for a traffic drop. Weren't there higher-level alarms indicating which systems were failing? That might have led to a quicker diagnosis. A sketch of that kind of symptom-to-cause correlation follows this list.
- Partial failure. Handle it. There's a line of thought that says you can assume partitioning will never happen inside a datacenter. The network infrastructure should just work and you can assume all-or-nothing reliability. The GAE folks found this was not true. Not all of their machines failed, which must have confused the failure detection and failover systems. The architecture GAE has chosen may have made the problem worse. They have a primary datacenter backed up by a secondary datacenter. Failover is an all-or-nothing affair, which makes the failover system itself a single point of failure. Failing over smaller units of service can be more robust, but it also involves more failure scenarios. Pick your poison. In response GAE has decided to add "a new option for higher availability using synchronous replication for reads and writes, at the cost of significantly higher latency." Interestingly, Amazon has already done something similar with SimpleDB. The latency cost of synchronous writes is also sketched after this list.
- Be complete, honest, make meaningful improvements. GAE has chosen to take the high road and be honest and forthcoming with the problems they've had and how they plan to fix them. Some of the problems they shouldn't have had, but they told us about them anyway. That helps build trust. Expecting 100% uptime is lunacy. What we can expect is people who are apologetic, who are sincere, who care, who try, who are honest, and who are dedicated to improving. That seems to be the case here.
- Have a backup plan. If you are a GAE client, what do you do when their site goes down? It will. Unfortunately there's no clear answer for that today. If you have your own datacenter the answer is probably even harder, because it's likely your software isn't componentized as well. Big balls of mud are hard to move. The last sketch after this list shows the kind of componentization that at least keeps the option open.
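To make the correlation point concrete, here's a minimal sketch in Python of turning a raw symptom (a traffic drop) into an alarm that names the likely failing subsystem. The Signal and diagnose names, the metric names, and the thresholds are all hypothetical; this is not GAE's monitoring stack, just one way the idea could look.

```python
# Hypothetical symptom-to-cause correlation: a traffic drop alone says
# "something is wrong"; correlating it with subsystem health signals says
# "where to look".
from dataclasses import dataclass


@dataclass
class Signal:
    name: str          # e.g. "datastore_error_rate" (made-up metric name)
    value: float
    threshold: float

    def firing(self) -> bool:
        return self.value > self.threshold


def diagnose(traffic_drop_pct: float, subsystem_signals: list[Signal]) -> str:
    """Turn a raw symptom (traffic drop) into an actionable alarm by
    correlating it with lower-level health signals."""
    if traffic_drop_pct < 20:
        return "OK: traffic within normal variation"
    failing = [s.name for s in subsystem_signals if s.firing()]
    if failing:
        return (f"ALARM: traffic down {traffic_drop_pct:.0f}%, "
                f"likely cause(s): {', '.join(failing)}")
    # A big traffic drop with no subsystem alarm is itself a red flag:
    # either monitoring coverage has a gap or the failure is upstream.
    return (f"ALARM: traffic down {traffic_drop_pct:.0f}%, "
            f"no subsystem alarm firing; check monitoring coverage")


if __name__ == "__main__":
    signals = [
        Signal("datastore_error_rate", value=0.35, threshold=0.05),
        Signal("memcache_error_rate", value=0.01, threshold=0.05),
    ]
    print(diagnose(traffic_drop_pct=60, subsystem_signals=signals))
```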
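The synchronous replication option mentioned in the partial-failure item trades latency for availability: a write isn't acknowledged until both datacenters have committed it. Here's a minimal sketch of that tradeoff, assuming a made-up Replica class with simulated commit latencies; nothing here reflects GAE's actual implementation.

```python
# Hypothetical synchronous replication: ack a write only after every
# replica has it, so losing a datacenter never loses acknowledged data,
# but every write pays the cross-datacenter round trip.
import time
import concurrent.futures as futures


class Replica:
    def __init__(self, name: str, latency_s: float):
        self.name = name
        self.latency_s = latency_s
        self.store: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        time.sleep(self.latency_s)   # simulate network + commit latency
        self.store[key] = value

    def read(self, key: str) -> str:
        return self.store[key]


def synchronous_write(replicas: list[Replica], key: str, value: str) -> float:
    """Ack only after *every* replica has committed; latency is set by the
    slowest replica, typically the remote datacenter."""
    start = time.monotonic()
    with futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        list(pool.map(lambda r: r.write(key, value), replicas))
    return time.monotonic() - start


if __name__ == "__main__":
    primary = Replica("dc-primary", latency_s=0.002)
    secondary = Replica("dc-secondary", latency_s=0.050)  # long-haul link
    elapsed = synchronous_write([primary, secondary], "k", "v")
    print(f"write acked in {elapsed*1000:.0f} ms; "
          f"readable from either datacenter")
    print(primary.read("k"), secondary.read("k"))
```

The numbers make the point: the write is only as fast as the slowest datacenter, which is exactly the "significantly higher latency" the post-mortem warns about.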
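And for the backup-plan item, the thing that makes any fallback possible is a seam between your application and the platform it runs on. What follows is a minimal sketch, assuming a hypothetical Store interface and a read-only fallback policy; your components and degradation strategy will differ.

```python
# Hypothetical componentization: the app talks to a narrow storage
# interface, so a degraded or alternate backend can be swapped in when
# the primary platform is down.
from abc import ABC, abstractmethod


class Store(ABC):
    @abstractmethod
    def get(self, key: str) -> str | None: ...

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...


class PrimaryStore(Store):
    """Stand-in for the platform datastore (e.g. the GAE datastore)."""
    def __init__(self):
        self._data: dict[str, str] = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise ConnectionError("primary datastore unreachable")
        return self._data.get(key)

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("primary datastore unreachable")
        self._data[key] = value


class ReadOnlyFallback(Store):
    """Serves the last known snapshot; writes are refused, not lost silently."""
    def __init__(self, snapshot: dict[str, str]):
        self._snapshot = dict(snapshot)

    def get(self, key):
        return self._snapshot.get(key)

    def put(self, key, value):
        raise RuntimeError("running in degraded read-only mode")


def get_store(primary: PrimaryStore, snapshot: dict[str, str]) -> Store:
    try:
        primary.get("healthcheck")   # cheap availability probe
        return primary
    except ConnectionError:
        return ReadOnlyFallback(snapshot)


if __name__ == "__main__":
    primary = PrimaryStore()
    primary.put("greeting", "hello")
    primary.available = False        # simulate the outage
    store = get_store(primary, snapshot={"greeting": "hello"})
    print(type(store).__name__, store.get("greeting"))
```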