This is a guest post by Steve Newman, co-founder of Writely (Google Docs), tech lead on the Paxos-based synchronous replication in Megastore, and founder of cloud service provider Scalyr.com.
Microsoft’s Azure service suffered a widely publicized outage on February 28th / 29th, 2012. Microsoft recently published an excellent postmortem. For anyone trying to run a high-availability service, this incident can teach several important lessons.
The central lesson is that, no matter how much work you put into redundancy, problems will arise. Murphy is strong and, I might say, creative; things go wrong. So preventative measures are important, but how you react to problems is just as important. It’s interesting to review the Azure incident in this light.
The postmortem is worth reading in its entirety, but here’s a quick summary: each time Azure launches a new VM, it creates a “transfer certificate” to secure communications with that VM. There was a bug in the code that determines the certificate expiration date, such that all VMs launched on February 29th (Leap Day) were inoperable. Beginning at 4:00 PM PST on February 28th (12:00 AM February 29th GMT), all Azure clusters worldwide were unable to launch new VMs. In the face of repeated VM failures, Azure mistakenly decided that machines were physically broken, and attempted to migrate healthy VMs off of them, compounding the problem. Identifying the bug, fixing it, and pushing the new build required roughly 13 hours.
The outage was quite embarrassing for Azure, but Microsoft comes off fairly well in the postmortem. The root cause was of the “it could happen to anyone” variety. The calendar bug was dumb, but it’s the sort of subtle, one-off dumbness that can happen even to good engineers at good companies.
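To see how easy this kind of bug is to write, here is a minimal Python sketch of a "valid for one year" calculation that simply adds one to the year. The postmortem describes the faulty calculation in those terms, but the code below is only an illustration; the function names are hypothetical and it is not Azure's implementation. Clamping to February 28th is one possible policy for the fix, not necessarily the one Microsoft chose.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Add one to the year and keep month and day (fails on February 29th)."""
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Add one year, falling back to February 28th when the day does not exist."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only February 29th can fail here. Clamping to the 28th is one
        # possible policy (an assumption); March 1st would be another.
        return start.replace(year=start.year + 1, day=28)


leap_day = date(2012, 2, 29)
try:
    naive_one_year_later(leap_day)
    naive_failed = False
except ValueError:
    # There is no February 29th, 2013, so the naive version blows up.
    naive_failed = True
```

The naive version works on 1,460 days out of every 1,461, which is exactly why it survives code review and ordinary testing.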
Time is a single point of failure
It is a commonplace that things will go wrong no matter how careful you are. This is why the mantra of reliable systems design is “no single point of failure” (SPOF): use multiple power supplies, multiple copies of data, multiple servers, multiple data centers. Last April’s AWS outage was magnified by the fact that the supposedly independent “availability zones” in Amazon’s us-east region turned out to share a SPOF in the EBS control servers. But Amazon’s other regions did not share control servers with us-east, and so were protected.
I’m not especially familiar with Windows Azure, but it appears to follow good practice in this regard, using multiple data centers that are independent at both the hardware and software level. Yet the February 29th outage affected all regions. Why? Because the root cause was a bug that only manifests on leap days, and all regions entered Leap Day simultaneously. In other words, all regions share the same calendar, so the calendar is a SPOF.
This is hard to avoid (see: Y2K bug). You can distribute your data centers in space, but not in time; they’re all in the same “now”. (I see Dr. Einstein in the back raising an objection, but he’s out of order.) So how do you prevent time-related bugs from causing a global, correlated outage? There’s no great answer. You could run your servers on local time instead of GMT, but that’s messy, will probably cause more grief than it avoids, and at best it only spreads things out by a few hours. You could run a test cluster using a clock that’s set several days ahead, but that’s a lot of work to maintain.
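A cheaper cousin of the clock-ahead test cluster is to sweep a simulated "today" through date-dependent code in a unit test. The sketch below assumes the code under test accepts the current date as a parameter rather than reading the system clock, which is a design choice you would have to make yourself. `find_failing_dates` and `issue_certificate` are hypothetical names used only for illustration.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple


def find_failing_dates(
    check: Callable[[date], object], start: date, days: int
) -> List[Tuple[date, str]]:
    """Call check(today) for each simulated date and collect any exceptions."""
    failures = []
    for offset in range(days):
        simulated_today = start + timedelta(days=offset)
        try:
            check(simulated_today)
        except Exception as exc:  # record every failure rather than stopping
            failures.append((simulated_today, type(exc).__name__))
    return failures


def issue_certificate(today: date) -> date:
    """Hypothetical code under test: expiry date one year from today."""
    return today.replace(year=today.year + 1)


# Sweep a little over four years so that at least one leap day is covered.
failures = find_failing_dates(issue_certificate, date(2011, 1, 1), 4 * 366)
```

This only catches bugs in code you thought to route through the sweep, so it complements a time-shifted test environment rather than replacing it.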
You can, at least, avoid making major changes on Leap Day. More generally, avoid rocking the boat on any unusual occasion. Many companies have a policy of not pushing new builds, performing maintenance, etc. near a major holiday. Leap days, daylight saving time transitions, and other calendar events may also be good occasions to leave things alone, as suggested in the Hacker News discussion of the outage. In this case, Microsoft was in the process of rolling out a new version of their server platform, which complicated the crisis.
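A freeze policy like this is easier to follow if the deploy tooling enforces it. Here is a minimal sketch, assuming your deploy script can call a check before pushing. The function names and the freeze list are hypothetical, and the two listed dates are just examples of a holiday and a US daylight saving time transition.

```python
from datetime import date
from typing import FrozenSet


def is_leap_day(day: date) -> bool:
    """True only for February 29th."""
    return day.month == 2 and day.day == 29


def deploy_allowed(day: date, freeze_dates: FrozenSet[date] = frozenset()) -> bool:
    """Return False on leap days and on any explicitly listed freeze date."""
    return not is_leap_day(day) and day not in freeze_dates


# Hypothetical freeze list: a holiday and a US daylight saving time transition.
FREEZE_DATES = frozenset({date(2012, 12, 25), date(2012, 3, 11)})
```

A real version would probably want an override for emergency fixes; after all, the Leap Day fix itself had to ship on Leap Day.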
Response speed is critical
You can’t always prevent problems, so it’s important that you quickly repair the problems that do occur. In this case, it took quite a while for Microsoft to sort things out. A timeline of the key events (all times PST):
- 4:00 PM — bug first manifests; no new VMs can be created from this point.
- 5:15 PM — first wave of machines marked bad; alerts trigger.
- 6:38 PM — root cause identified.
- 10:00 PM — remediation plan complete.
- 11:20 PM — bugfix code ready.
- 1:50 AM — bugfix code tested in a test cluster; production rollout begins.
- 2:11 AM — fix completely pushed to one production cluster.
- 5:23 AM — fix pushed to most clusters, Microsoft announces that the majority of clusters are healthy again.
In all, thirteen hours for what sounds like a one-line fix. If Microsoft had been able to respond more quickly, the impact could have been considerably reduced.
From the outside, it’s hard to second-guess the details of Microsoft’s response. But it’s worth asking yourself: in an emergency, how long would it take for you to produce a new build, run some basic tests, and push the fix into production? Protip: if you haven’t actually done it, you don’t know the answer. It’s a good idea to go through the exercise, and clearly document the precise steps involved, bearing in mind that your junior engineer may someday be following those instructions in a 3:00 AM daze.
In a crisis, keep things simple
When the crisis hit, Microsoft was almost done rolling out a new release of the server platform, but seven clusters had only just started deploying it. When pushing the fix, Microsoft decided to revert these clusters to the old release. This meant creating a build of the old release with the Leap Day bugfix. This build was done incorrectly, incorporating a mix of old and new components that did not work together. When Microsoft pushed the bad build, all servers in those seven clusters went offline.
Given that most clusters were already running the new release, and even these seven clusters had already started receiving it, it might have been better to use the new release everywhere and avoid the extra work of building and testing a Leap Day fix for the old release. Perhaps some factor not mentioned in the postmortem ruled out this approach. But in general, actions performed during a crisis should be kept as simple as possible. Don’t make a new build if you can muddle through with a configuration tweak; don’t make two new builds if you can get by with one.
Avoid compounding mistakes
Because the Leap Day fix to the old release was felt to be safe, and the build had passed some quick tests, Microsoft decided to bypass their normal slow-roll procedure and “blast” it to all servers on all seven clusters simultaneously. The result was a catastrophic outage in those clusters, as well as “a number of servers … in corrupted states as a result of the various transitions”.
In a crisis, there’s always a huge temptation to take shortcuts. And sometimes they’re necessary. But it’s important to keep a careful eye on the tradeoffs involved. In this case, the push went out at 2:47 AM, nine and a half hours after the first alerts fired, and the team was probably running on caffeine and fumes. That point, with everyone exhausted and the finish line in sight, is when mistakes are most likely to happen. These mistakes can cause problems worse than the original incident. So it’s important not to race ahead too quickly, and to weigh the risks of each action before you take it.
Rate limit dangerous actions
When VMs fail repeatedly on a server, Azure marks the machine as bad and migrates its VMs to other machines. The Leap Day bug made this happen on every machine that tried to launch a new VM, so healthy VMs were migrated off of those machines, triggering a failure cascade. When a certain number of machines were marked bad, Azure entered an emergency mode and stopped attempting to migrate VMs. This is an excellent defensive measure; without it, the entire Azure platform might have gone down. Such a cascade effect was at the heart of the April AWS outage. Kudos to Microsoft for having a cap in place.
There’s a general design principle here: rate limit dangerous actions. Marking a machine bad is potentially dangerous, as it reduces the cluster’s capacity and is disruptive to VMs on that server. Some bad servers are to be expected in normal operation, but if many servers are being marked bad then something deeper may be wrong, and it’s best to do nothing and request manual intervention.
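Here is a minimal sketch of that kind of cap, in Python. It is not how Azure implements it; `BadMachineTracker`, `max_bad`, and the method name are all hypothetical. The idea is simply that automation may mark machines bad up to a limit, after which it stops acting on its own and waits for a human.

```python
class BadMachineTracker:
    """Marks machines bad automatically until a cap is reached, then halts."""

    def __init__(self, max_bad: int):
        self.max_bad = max_bad
        self.bad_machines = set()
        self.halted = False

    def report_repeated_failure(self, machine_id: str) -> bool:
        """Return True if the machine is marked bad, False if automation is halted."""
        if self.halted:
            return False
        if machine_id in self.bad_machines:
            return True
        if len(self.bad_machines) >= self.max_bad:
            # Too many machines look bad at once: suspect a systemic problem,
            # stop acting automatically, and wait for manual intervention.
            self.halted = True
            return False
        self.bad_machines.add(machine_id)
        return True


tracker = BadMachineTracker(max_bad=3)
results = [tracker.report_repeated_failure(f"machine-{i}") for i in range(5)]
```

In this run the first three machines are marked bad and the remaining reports are refused. A production version would also page someone when `halted` flips, and would need a deliberate, manual way to reset it.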
Another place this arises is data deletion. I have seen a major production service experience a bug that caused it to begin madly deleting database records. By the time someone noticed the problem, catastrophic damage had been done. The team was able to recover the data, but only through extreme effort and some luck. A cap, or at least an alert, on the rate of record deletion would have caught the problem much sooner.
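A deletion cap can be as simple as a sliding-window counter in front of the delete path. The sketch below is one possible shape, not a description of any particular service; `DeletionGuard`, its parameters, and the injected `clock` are hypothetical names chosen for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque


class DeletionRateExceeded(Exception):
    """Raised when deletions exceed the configured cap for the window."""


class DeletionGuard:
    def __init__(self, max_deletes: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_deletes = max_deletes
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the guard can be tested without sleeping
        self.timestamps: Deque[float] = deque()

    def check(self) -> None:
        """Call before each delete; raises once the cap for the window is hit."""
        now = self.clock()
        # Forget deletions that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_deletes:
            raise DeletionRateExceeded(
                f"more than {self.max_deletes} deletes in {self.window_seconds}s"
            )
        self.timestamps.append(now)
```

The caller invokes `guard.check()` immediately before each delete and treats `DeletionRateExceeded` as a signal to stop and alert rather than retry. Picking the threshold takes some care: it should sit well above normal traffic but far below "delete the whole table".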
Milk each crisis for every lesson you can
It’s obvious that the root cause of a crisis — in this case, the faulty code for generating expiration dates — should be fixed. The Microsoft postmortem goes well beyond this, listing a dozen measures they have identified to better detect bugs before they trigger in production, increase the system’s resilience, improve their ability to repair problems quickly, and improve communication with customers during a crisis.
Every crisis carries multiple lessons. Like Microsoft, you should attempt to learn as many as possible. The more you learn from each crisis, the fewer “educational” crises you’ll have to suffer through.