Strategy: Disk Backup for Speed, Tape Backup to Save Your Bacon, Just Ask Google

In Stack Overflow Architecture Update - Now At 95 Million Page Views A Month, a commenter expressed surprise about Stack Overflow's backup strategy:

Backup is to disk for fast retrieval and to tape for historical archiving.

The comment was:

Really? People still do this? I know some organizations invested a tremendous amount in automated, robotic tape backup, but seriously, a site founded in 2008 is backing up to tape?

The Case of the Missing Gmail Accounts

I admit that I was surprised by this strategy too. In this age of copying data to disk three times for safety, I also wondered whether tape backups were still necessary. Then, like in a movie, an event happened that made sense of everything: Google suffered the quintessential #firstworldproblem, Gmail accounts went missing! Cue dramatic music. And what's more, they were taking a long time to come back. There was a palpable fear in the land that email accounts might never be restored. Think about that. They might never be restored...

The Hero: Tape Restoration

But then, still like in the movies, a miracle happened. Over a period of a few days the accounts were restored, and the hero this time was tape. Tape? Yes, the email accounts were restored from tape. Quite an unexpected plot twist.

The story was told in an official Google blog post, Gmail back soon for everyone:

I know what some of you are thinking: how could this happen if we have multiple copies of your data, in multiple data centers? Well, in some rare instances software bugs can affect several copies of the data. That’s what happened here. Some copies of mail were deleted, and we’ve been hard at work over the last 30 hours getting it back for the people affected by this issue.
To protect your information from these unusual bugs, we also back it up to tape. Since the tapes are offline, they’re protected from such software bugs. But restoring data from them also takes longer than transferring your requests to another data center, which is why it’s taken us hours to get the email back instead of milliseconds. So what caused this problem? We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their email. When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version.
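To make the trade-off concrete, here is a toy sketch, not Google's actual design, of why an offline copy survives a bug that every online replica shares: the buggy update path can only reach storage that is online and writable. All names and structures here are hypothetical.

```python
# Toy model: three online replicas in different data centers plus one offline
# tape copy that live software cannot modify.
online_replicas = [
    {"alice": "mail-1"},  # data center A
    {"alice": "mail-1"},  # data center B
    {"alice": "mail-1"},  # data center C
]
tape_archive = {"alice": "mail-1"}  # offline; the bad update never runs against it

def buggy_storage_update():
    """Simulate the bad release: it is deployed to every online replica,
    so all three copies lose the same data at once."""
    for replica in online_replicas:
        replica.pop("alice", None)

def restore(user):
    """Fast path: read any online replica (milliseconds in real life).
    Slow path: fall back to the offline tape copy (hours in real life)."""
    for replica in online_replicas:
        if user in replica:
            return replica[user], "online replica"
    return tape_archive.get(user), "tape (slow, but isolated from the bug)"

buggy_storage_update()
print(restore("alice"))  # ('mail-1', 'tape (slow, but isolated from the bug)')
```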

The Moral: Disk Only is Risky

The moral of the story: storing all your replicas online is a risk. A single bug can wipe out everything.

In this case it was a software update that caused the problem. That is to be expected. More disasters have probably been caused by software updates than by any other cause. The reason: bugs that affect control are more powerful than bugs that affect data.

The Villain: Software Update Induced Amnesia

It wasn't that all three data copies went bad simultaneously. That is unlikely. But software bugs at the control plane layer of a system are not a low-probability event at all. Software updates on a live running system operate in an almost unimaginably complex world. The success path is clear and usually works flawlessly, but so many faults can happen in such unexpected ways that failure is common and can be devastating.

For an analogy, think of how DNA works. Change the DNA for a gene that helps build something and the damage is contained: corruption that happens in a single cell is usually caught by the immune system and destroyed. But when a mutation happens in the regulatory region of a gene, all hell can break loose. All the mechanisms that stop a cell from replicating, for example, can be destroyed, and the result is cancer.

So it's the programs controlling the system that are the most dangerous areas, as Google's Gmail problem showed.

Handling it Better

While Google did a great job in having a tape backup and then diligently working to restore accounts from it, communication with users could have gone better. Wtallis summed up the problem succinctly:

People who were affected had their entire Google accounts disabled, and upon trying to log in, they got exactly the same messages they would have gotten if Google had decided to delete the account for ToS abuse. Additionally, since the entire account (not just GMail) had to be disabled in order to repair things, stuff like shared google calendars went offline, so users whose accounts were not directly affected were getting misleading error messages, too. And the bounces didn't stop until the end of the working day on Monday on the east coast.

Some possible fixes:

  • Programs should give finer-grained error messages rather than generic ones (see the sketch after this list).
  • Use finer-grained account lockout, so services for an entire account aren't taken down when just one service has a problem.
  • More and better communication.
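The first two fixes could be as simple as distinguishing the actual condition in the status a user sees and tracking lockout per service rather than per account. A minimal sketch, with hypothetical names and statuses throughout (this is not how Google's account system works):

```python
from enum import Enum

class ServiceStatus(Enum):
    OK = "ok"
    DISABLED_TOS = "account disabled for a terms-of-service violation"
    RESTORING = "temporarily unavailable while your data is restored from backup"

# Per-service status for one user: only mail is locked during the restore;
# calendar and docs keep working, and shared calendars stay visible to others.
user_status = {
    "gmail": ServiceStatus.RESTORING,
    "calendar": ServiceStatus.OK,
    "docs": ServiceStatus.OK,
}

def login_message(service: str) -> str:
    """Return a message specific to the real condition instead of a generic
    error that looks identical to a ToS ban."""
    status = user_status.get(service, ServiceStatus.OK)
    if status is ServiceStatus.OK:
        return f"{service}: welcome back"
    return f"{service}: {status.value}"

print(login_message("gmail"))     # gmail: temporarily unavailable while ...
print(login_message("calendar"))  # calendar: welcome back
```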

The Happy Ending: Use Protection

With today's huge datasets, isn't backing up to tape impractical? Seth Weintraub, in Google goes to the tape to get lost emails, estimated 200K tapes are needed to back up Gmail accounts. Wow. Others suggest the number is much lower in practice thanks to techniques like compression, deduplication, and incremental backups, plus the fact that most users fill only a small fraction of their quota. So tape backup is workable in practice, even for very large datasets.
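A rough back-of-envelope sketch shows how those factors shrink the count. Every number below is an assumption chosen for illustration, not a figure from Weintraub's article or from Google:

```python
users = 200e6                # assumed number of accounts
quota_gb = 7.5               # assumed quota per account
utilization = 0.10           # assume most users fill ~10% of their quota
dedup_compress = 0.5         # assume compression + dedup roughly halve the data
tape_capacity_gb = 1500.0    # assume an LTO-5-class tape, ~1.5 TB native

naive_gb = users * quota_gb                            # everyone at full quota
realistic_gb = naive_gb * utilization * dedup_compress

print(f"naive estimate:     {naive_gb / tape_capacity_gb:,.0f} tapes")
print(f"realistic estimate: {realistic_gb / tape_capacity_gb:,.0f} tapes")
# Incremental backups cut ongoing tape consumption further, since only
# changed data is written after the first full pass.
```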

Why Google's own "immune system" didn't detect this problem isn't clear, but since problems like this are to be expected, Google protected themselves against them by backing up to tape. If you have a really important dataset, consider backing up to tape instead of relying only on disk. Google was glad they did.