
Data Replication in NoSQL Databases

This is the third guest post (part 1, part 2) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

blekko's home-grown NoSQL database was designed from the start to support a web-scale search engine, with 1,000s of servers and petabytes of disk. Data replication is a very important part of keeping the database up and serving queries. Like many NoSQL database authors, we decided to keep R=3 copies of each piece of data in the database, and not use RAID to improve reliability. The key goal we were shooting for was a database which degrades gracefully when there are many small failures over time, without needing human intervention.

Why don't we like RAID for big NoSQL databases?

Most big storage systems use RAID levels like 3, 4, 5, or 10 to improve reliability. A conservative choice for a big NoSQL database might be to use a bunch of nodes with RAID, and then replicate the data to several nodes. That doesn't seem to be common among big NoSQL databases. Here are some reasons why RAID is disliked:

  • RAID requires more disks per server than I might want to buy. If I want single disk servers, RAID means I have to buy twice as many disks.
  • RAID requires prompt maintenance when things fail. If I have one online spare in a RAID set, I need to replace a failed disk promptly to avoid a second failure, and an outage.
  • RAID rebuilds cause degraded performance for the whole RAID set for a long time when a single disk fails.
  • RAID degrades performance of both reads and writes, compared to the aggregate raw performance of all of the disks in the RAID set.
  • Single-server RAID doesn't provide protection against a server failure.

R=3: An alternative

The alternative is to keep R=3 copies of each piece of data, on 3 different servers, and to use software consistency built into the database to keep the data consistent. This R=3 strategy fixes all of our RAID complaints:

  • R=3 scales down to 1 disk per server.
  • Rebuilds can use the entire cluster to finish quickly, without degrading performance of part of the cluster severely.
  • R=3 clusters can use 100% of the raw read capacity of all disks in a cluster.
  • R=3 protects against server failures. A failure of a single server means at least 2 copies of everything exist somewhere else in the cluster.
  • Finally, R=3 degrades gracefully as disks fail, if you get all of the details right.

Why 3 copies? It's a trade off between expense and availability. With R=3, my cluster is available as long as I can re-copy data on the failed disks before 3 failures happen.

Isn't 3 copies of everything expensive?

It's true that keeping 3 copies of everything involves more disks than the typical RAID setup. However, if you stay in the sweet spot of direct-attached storage, the first 8-12 direct-attach disks are pretty inexpensive compared to what you're spending on CPUs, RAM, flash disk, and your network. Having more disks helps performance -- it gives you more read bandwidth. The extra disks don't help write bandwidth, since everything must be written R=3 times. Network bandwidth also takes a hit, although that's necessary in any scheme which protects against server failures.

Rebuilds in the R=3 World

Rebuilds tend to be expensive in any system preserving reliability. A 5 disk, fully-used RAID-5 volume with a failure needs to read 4 disks completely and write one disk completely to return to full protection and performance. This can take quite a while, and is getting worse over time as disk capacities increase faster than disk bandwidth.

In an R=3 cluster, a single disk contains chunks of data replicated to other disks elsewhere in the cluster. For rebuild purposes, it's nice to have at least 10 chunks on a disk, and it's also nice for the 2 other replicas of each of these 10 chunks to be on as many servers as possible. A rebuild then involves copying 1/10 of a disk between 10 pairs of systems, and the entire rebuild can be done in 1/10 the time of reading an entire disk. The rebuild of all the data on a failed server with 10 disks would involve 100 pairs of servers. While this eats up network bandwidth in comparison to any rebuild that takes place within a single server, the performance impact can be spread out over an entire cluster, instead of creating a performance hot-spot.

While we've chosen 10 chunks per disk at blekko, some other NoSQL databases have as many as a million chunks per disk. Having a small number of chunks per disk leads to slow rebuilds that have a big performance impact on a small number of nodes, creating a performance hot-spot. Failing to spread out chunks properly also increases the time to rebuild and leaves hot-spots.
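The rebuild arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name, the disk size, and the per-pair copy bandwidth are illustrative assumptions, not blekko's figures, and it assumes equal-sized chunks and identical bandwidth for every pair of servers.

```python
def rebuild_hours(disk_bytes, chunks_per_disk, parallel_pairs, pair_bandwidth):
    """Estimate the hours needed to re-replicate every chunk of one failed disk.

    Assumptions (hypothetical model, not a measured one): chunks are equal
    in size, each chunk is copied by one source/destination pair of servers,
    and every pair sustains `pair_bandwidth` bytes per second.
    """
    chunk_bytes = disk_bytes / chunks_per_disk
    # Chunks are copied in waves of `parallel_pairs` simultaneous copies.
    waves = -(-chunks_per_disk // parallel_pairs)  # ceiling division
    return waves * chunk_bytes / pair_bandwidth / 3600


# Made-up numbers: a 2 TB disk and 100 MB/s per pair of servers.
whole_disk = rebuild_hours(2e12, chunks_per_disk=1, parallel_pairs=1, pair_bandwidth=100e6)
spread_out = rebuild_hours(2e12, chunks_per_disk=10, parallel_pairs=10, pair_bandwidth=100e6)
print(whole_disk, spread_out)
```

With these made-up numbers, one pair copying a whole disk takes roughly 5.6 hours, while 10 pairs each copying one chunk finish in roughly 0.56 hours. If the 10 chunks can only be copied one at a time, the advantage disappears, which is why spreading replicas over many servers matters.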

Some jargon: Arrr!

There's a fair bit of jargon which isn't standardized yet in the world of R=3 NoSQL databases, so let me give you a few terms that we use inside blekko. We call the R-level of a cluster the minimum replica count of any chunk in the cluster, but just to confuse things, we also refer to the R-level goal of the cluster (usually 3) or the R-level of an individual chunk of data. So if our R3 (a goal of 3 copies of everything) cluster has a single chunk that's R2 (2 copies of this chunk), we call the entire cluster R2 (the worst R-level of any chunk in the cluster.) While you might think this triple-usage is confusing, in practice it seems to work out OK. As long as we're at R1 or better, we can read all of the data in the database. R1 is a scary place to be, though, since a single failure might cause the cluster to go to R0, where some data can't be read. Also, at some point between R3 and R1, we stop being able to write new data into the cluster efficiently. If all chunks are at R1, for example, and we return to R3, a lot of data will need to be replicated for all the replica chunks to become current.
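To make the three uses of "R-level" concrete, here is a minimal sketch. The data layout (a dictionary from chunk id to the locations of its live replicas) and all function names are assumptions made for illustration; they are not taken from blekko's database.

```python
def chunk_r_level(live_replicas):
    """R-level of one chunk: the number of distinct live replicas."""
    return len(set(live_replicas))


def cluster_r_level(chunks, goal=3):
    """R-level of the cluster: the worst R-level of any chunk.

    `chunks` maps a chunk id to the locations of its live replicas.
    An empty cluster is reported at the goal level.
    """
    return min((chunk_r_level(r) for r in chunks.values()), default=goal)


def all_data_readable(chunks):
    """At R1 or better, every chunk still has at least one readable copy."""
    return cluster_r_level(chunks) >= 1


# An R3-goal cluster in which one chunk has lost a replica is an R2 cluster.
chunks = {"chunk-a": ["s1", "s2", "s3"], "chunk-b": ["s4", "s5"]}
print(cluster_r_level(chunks), all_data_readable(chunks))
```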

Chunk placement

In an ideal world, each failure in an R3 cluster would only reduce the overall R-level of the cluster by one. Failures aren't random: they happen at the disk, server, PDU (power circuit), and network switch levels. If we let several replicas of a chunk live on the same disk, same server, same PDU, or same non-redundant network switch domain, we reduce the resilience of our database. Improving resilience in this circumstance is easy: let's make a rule that only one replica of a chunk can live on any disk, server, PDU (1 rack in our datacenter) or leaf network switch (3 racks in our datacenter.) Our central switch infrastructure is redundant, and brings everything down when it fails, so we don't have to worry about it in this context.

Now that we've made the cluster behave more gracefully in response to failures, let's also consider administrative convenience. It's nice to be able to take down a large fraction of the cluster at once in order to do things like add RAM or upgrade software. We already have the ability to take down 3 racks (1 leaf network switch) at once without reducing the R-level by more than 1, but in theory we should be able to take down up to 1/3 of our cluster at once. To make this possible, we divide the cluster into 3 or 4 "zones", and make sure no chunk is replicated twice in a zone.
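The placement rule can be expressed as a simple check: a new replica is acceptable only if it shares no failure domain with an existing replica of the same chunk. The sketch below is an assumption about how such a check might look, not blekko's code; the `Location` type and its field names are hypothetical, and disk identifiers are assumed to be unique across the whole cluster.

```python
from collections import namedtuple

# One replica's position in the failure-domain hierarchy (hypothetical fields).
Location = namedtuple("Location", "disk server pdu leaf_switch zone")


def placement_ok(existing, candidate):
    """Return True if `candidate` shares no failure domain with any
    existing replica of the same chunk."""
    for loc in existing:
        for field in Location._fields:
            if getattr(loc, field) == getattr(candidate, field):
                return False
    return True


first = Location("s1:d0", "s1", "rack1", "leaf1", "zoneA")
elsewhere = Location("s40:d2", "s40", "rack7", "leaf3", "zoneB")
same_zone = Location("s15:d1", "s15", "rack4", "leaf2", "zoneA")
print(placement_ok([first], elsewhere), placement_ok([first], same_zone))
```

Because a zone contains whole racks and leaf switches, the zone check alone would already imply the others in a strictly nested layout; checking every level keeps the sketch correct even if the hierarchy is not perfectly nested.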

Properties of the Repair Daemon

Our NoSQL database has a separate daemon, the Repair Daemon, responsible for duplicating, moving, and removing chunks in the database, subject to the placement rules in the previous section. This daemon also has some additional abilities related to "draining" a server or disk known to be failing. The information that something is about to fail can come from sources such as disk errors and ECC errors in syslog, SMART daemon complaints, and failed fans or high temperature events reported by the BMC. Draining chunks off before the disk or server fails completely can reduce the time a cluster spends at R2, which makes it less likely that the cluster will go to R1 or R0 due to an unlucky rapid series of failures. The repair daemon drains a disk or server by making a temporary 4th copy of the chunks involved, and then deleting the chunks on the drained disk or server. When all chunks are removed, the disk or server can be repaired. Failing disks happen often enough in our environment that we have fully automated the decision to drain a disk, or an entire server if the failing disk is the system disk.
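The key point of draining is the order of operations: the extra copy is made first, and only then is the copy on the failing device deleted, so a chunk never drops below its current replica count. The following is a toy in-memory sketch of that ordering; the `drain` function, the `pick_target` callback, and the data layout are all hypothetical, and a real repair daemon would choose targets according to the placement rules.

```python
def drain(chunks, failing, pick_target):
    """Move every replica off `failing` without lowering any chunk's
    replica count along the way.

    `chunks` maps a chunk id to the set of locations holding a replica.
    `pick_target(chunk_id, replicas)` returns a new location for the copy.
    """
    moved = []
    for chunk_id, replicas in chunks.items():
        if failing not in replicas:
            continue
        target = pick_target(chunk_id, replicas)
        replicas.add(target)       # temporary extra copy (the 4th, for an R3 chunk)
        replicas.discard(failing)  # only now delete the copy on the failing device
        moved.append((chunk_id, target))
    return moved


chunks = {"c1": {"s1", "s2", "s3"}, "c2": {"s2", "s4", "s5"}, "c3": {"s6", "s7", "s8"}}
moved = drain(chunks, "s2", lambda chunk_id, replicas: "s9")
print(moved, chunks)
```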

Failures which can easily cause R0

With all this talk of reliability, it's important to remember that there will always be classes of errors which can easily cause R0. One such problem is bugs in the NoSQL database itself; if data in a row causes the database to fail, all 3 replicas of the chunk containing the row can fail. This sort of bug has happened to us several times.

Other NoSQL-related storage topics

There are a few storage-related topics that I'd like to briefly talk about, even though they aren't directly related to R=3 databases.

  • Checksums. Most systems rely on the storage system to give back the same bits that you wrote a long time ago. While there is a lot of ECC and other error detection in disks, anyone with a large number of disks is very aware that undetected errors happen. We use strong checksums to protect all of the data we write to spinning rust or flash disks. A failed checksum causes the entire chunk to fail. Having checksums also allows scrubbing data, similar to how RAID systems scrub disks in the background.
  • Network checksums. The same arguments about disk checksums apply to network checksums -- TCP and IP have weak checksums, but with modern systems, checksum offload means that you aren't protected against errors between the checksum offload engine and the CPU.
  • Booting. In an R=3 system, it's important that servers with failed disks -- disk failures often happen at boot or reboot -- come up with as many disks as possible, instead of waiting for admin intervention due to a single failed disk. We use a custom initscript to mount our data disks. It also runs a short performance test on each disk, to find the occasional disk which has become slow despite not reporting errors.
  • Trash. Like most NoSQL databases, ours stores the table pieces in chunks in a series of files, occasionally merging several files into a new file, deleting the old ones. Deleting big files is a high-latency operation in Linux and most other OSes. We avoid this issue by renaming files to be deleted into a trash directory, where a dedicated trash daemon for each disk takes one file at a time, truncating it slowly down to nothing. This spreads the load of file deletion over a long enough period of time that the latency hit disappears.
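The trash technique in the last bullet can be sketched as follows. This is an illustration under assumptions, not blekko's daemon: the function names, step size, and pause length are made up, and the rename only avoids copying data if the trash directory is on the same filesystem as the file.

```python
import os
import time


def move_to_trash(path, trash_dir):
    """Rename a file into the trash directory (cheap on the same filesystem)."""
    target = os.path.join(trash_dir, os.path.basename(path))
    os.rename(path, target)
    return target


def slow_delete(path, step_bytes=64 * 1024 * 1024, pause_s=0.1):
    """Truncate `path` a little at a time, then remove the empty file.

    Spreading the truncation out over time trades a longer total
    deletion for a smaller latency hit at any one moment.
    """
    size = os.path.getsize(path)
    while size > 0:
        size = max(0, size - step_bytes)
        os.truncate(path, size)
        time.sleep(pause_s)
    os.remove(path)
```

A per-disk trash daemon would then loop over the trash directory, calling `slow_delete` on one file at a time. The step size and pause would need tuning against the latency actually observed on a given filesystem.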

blekko's track record

You can probably tell from reading this post that we've learned a lot of lessons the hard way. We've run the system as described for about 4 years now, starting with 150 servers and growing to 1,500, scaling our raw disk storage from 1 petabyte to 15. During this time, we have never gone below R1 due to hardware failures. We've had site downtime, but it's always been related to software bugs (the "baseball game incident" was very unfortunate), or network issues.

Reader Comments (7)

Interesting read but I was puzzled by the 1/10th comment:

A rebuild then involves copying 1/10 of a disk between 10 pairs of systems, and the entire rebuild can be done in 1/10 the time of reading an entire disk.

Is your scenario here assuming that only a single chunk failed? If so, ok I get the math. In my initial read I thought you were talking about rebuilding for all chunks on a failed disk which would require 10 servers to reduce the contention to the point where you could realize 1/10th (am I missing something?)

So basically you are "just" talking about mirroring at a sub-disk level? Not that I take issue with it... I like the idea of pushing the redundancy to the layer you are describing for all the reasons you gave. Makes a lot of sense... improve the system's MTBF *and* the recovery time at the same time.

July 9, 2012 | Unregistered Commenter Charlie

Sorry that the example wasn't clear!

The scenario is that an entire disk failed. It had 10 chunks, which now only have 2 replicas each. To fix this, we need to restore all 10 chunks to 3 replicas, which involves 10 pairs of servers copying one chunk. This can be done in parallel, with the total time taking 1/10 the time of a pair of servers copying an entire disk.

July 9, 2012 | Unregistered Commenter Greg Lindahl

It can only be done in parallel if you can read/write the chunks in parallel

July 9, 2012 | Unregistered Commenter Charlie

Right.. Which is the case here, because we spread them all out over different servers and disks.

July 9, 2012 | Unregistered Commenter Greg Lindahl

If I'm not missing something, application-level replication with R=3 is equivalent to RAID1 with 3 disks. Like replication, RAID-1 with 3 disks does not require the immediate replacement of failed disks. RAID-1 with 3 disks also has better performance when 1 disk fails, because application reads can be served from one of the two remaining disks. It's more economical because it requires 3 x more disks instead of 3 x more servers (hard disks are cheaper than servers). And while 3 disk RAID-1 doesn't protect against complete server failure, such failures are very rare compared to disk failures, and when they occur, you can simply transfer the disks to a new server.

If 99.9% uptime isn't good enough, I would combine the two approaches. RAID1 with 2 disks on each primary server and a DB replica with no RAID for each server. This will handle server failures as well as disk failures with just 2 servers and 3 disks versus 3 servers and 2 disks for the R=3 solution.

July 11, 2012 | Unregistered Commenter Seun Osewa

RAID1 with 3 disks is very different from R=3: no protection against server failure, long rebuild times, less graceful degradation over a long time with no human intervention. For this last point, if I have 10% free disk space in my R=3 cluster, I can survive quite a few disk failures before I have a problem. In the RAID1x3 cluster, not so much.

Your suggestion of RAID1 with 2 disks and primary/secondary databases is very popular in the SQL world (really RAID 5 or 10 plus primary/secondary.) I don't know of anyone with big NoSQL clusters doing it that way.

July 11, 2012 | Unregistered Commenter Greg Lindahl

Using RAID is not all that evil. Did you consider using a RAID0 + DB replica setup instead of JBOD + DB replica?
RAID0 can nearly double your overall disk read/write throughput, because the I/O bandwidth can be split across different disks in parallel.

July 26, 2012 | Unregistered Commenter Dino Ciuffetti
