Update: Streamy Explains CAP and HBase's Approach to CAP. We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent.
Ryan Barrett, Google App Engine datastore lead, gave this talk Transactions Across Datacenters (and Other Weekend Projects) at the Google I/O 2009 conference.
While the talk doesn't necessarily break new technical ground, Ryan does an excellent job explaining and evaluating the different options you have when architecting a system to work across multiple datacenters. This is called multihoming, operating from multiple datacenters simultaneously.
As multihoming is one of the most challenging tasks in all computing, Ryan's clear and thoughtful style comfortably leads you through the various options. On the trip you learn:
- lowish latency writes
- datacenter failure survival
- strong consistency guarantees.
There's still a lot more to learn. Here's my gloss on the talk:
Consistency - What happens happens after you read after a write?Read/write data is one of the hardest kinds of data to run across datacenters. Users a expect a certain level of reliability and consistency.
Transactions - Extended form of consistency across multiple operations.
Why Operate in Multiple Datacenters?
Why Not Operate in Multiple Datacenters?
Your Different Architecture Options
- Better, but not great.
- Data are usually replicated asynchronously so there's a window of vulnerability for loss.
- Data in your other datacenters may not be consistent on failure.
- Popular with financial institutions.
- You get geolocation to serve reads. Consistency depends on the technique. Writes are still limited to one datacenter.
- So some choose to do it with just two datacenters. NASDAQ has two datacenters close together (low latency) and perform a two-phase commit on every transaction, but they have very strict latency requirements.
- Using more than two datacenters is fundamentally harder. You pay for it with queuing delays, routing delays, speed of light. You have to talk between datacenters. Just fundamentally slower with a smaller pipe. You may pay for with capacity and throughput, but you'll definitely pay in latency.
How Do You Actually Do This?What are the techniques and tradeoffs of different approaches? Here's the evaluation matrix:
- M/S = master/slave, MM - multi-master, 2PC - 2 Phase Commit
- What kind of consistency, transactions, latency throughput do we get for a particular approach? Will we lose data on failure? How much will we lose? When we failover for maintenance or we want to move things, say decommissioning a datacenter, how well do we do that, how well do the techniques support it?
- Replication is asynchronous so good for latency and throughput.
- Weak/eventual consistency unless you are very careful.
- You have multiple copies in the datacenters, so you'll lose a little data on failure, but not much. Failover can go read-only until the master has been moved to another datacenter.
- Datastore currently uses this mechanism. Truly multihoming adds latency because you have to add the extra hop between datacenters. App Engine is already slow on writes so this extra hit would be painful. M/S gives you most of the benefits of better forms while still offering lower latency writes.
- You figure out how to merge all the writes later when there's a conflict. It's like asynchronous replication, but you are serving writes from multiple locations.
- Best you can do is Eventual Consistency. Writes don't immediately go everywhere. This is a paradigm shift here. We've assumed with a strongly consistent system that backup and M/S that they don't change anything. They are just techniques to help us multihome. Here it literally changes how the system runs because the multiple writes must be merged.
- To do the merging you must find away to serialize, impose an ordering on all your writes. There is no global clock. Things happen in parallel. You can't ever know what happens first. So you make it up using timestamps, local timetamps + skew, local version numbers, distributed consensus protocol. This is the magic and there are a number of ways to do it.
- There's no way to do a global transaction. With multiple simultaneous writes you can't guarantee transactions. So you have to figure out what to do afterward.
- AppEngine wants strong consistency to make building applications easier, so they didn't consider this option.
- Failover is easy because each datacenter can handle writes.
- Semi-distributed because there's always a master coordinator for a given 2PC transaction. Because there are so few datacenters you tend to go through the same set of master coordinators.
- It's synchronous. All transactions are serialized through that master which kills your throughput and increases latency.
- Never serious considered this option because write throughput is very important to them. No single point of failure or serialization point would work for them. Latency is high because of the extra coordination. Writes can be in the 200msec area.
- This option does work though. You write to all datacenters or nothing. You get strong consistency and transactions.
- Need N+1 datacenters. If you take one down then you still have N to handle your load.
- Protocol: there's a propose step and then an agree step. You only need a majority of nodes to agree to say something is persisted for it to be considered persisted.
- Unlike 2PC it is fully distributed. There's no single master coordinator.
- Multiple transactions can be run in parallel. There's less serialization.
- Writes are high latency because of the 2 extra round coordination trips required in the protocol.
- Wanted to do this, but the they didn't want to pay the 150msec latency hit to writes, especially when competing against 5msec writes for RDBMSes.
- They tried using physcially close datacenters but the built-in multi-datacenter overhead (routers, etc) was too high. Even in the same datacenter was too slow.
- Paxos is still used a ton within Google. Especially for lock servers. For coordinating anything they do across datacenters. Especially when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another that coordination is done through Paxos. It's used also in managing memcache and offline processing.
DiscussionA few things I wondered through the talk. Did they ever consider a distributed MVCC approach? That might be interesting and wasn't addressed as an option. Clearly at Google scale an in-memory data grid isn't yet appropriate.
A preference for the strong consistency model was repeatedly specified as a major design goal because this makes the job of the programmer easier. A counter to this is that the programming model for Google App Engine is already very difficult. The limits and the lack of traditional relational database programming features put a lot of responsibility back on the programmer to write a scalable app. I wonder if giving up strong consistency would have been such a big deal in comparison?
I really appreciated the evaluation matrix and the discussion of why Google App Engine made the choices they did. Since writes are already slow on Google App Engine they didn't have a lot of headroom to absorb more layers of coordination. These are the kinds of things developers talk about in design meetings, but they usually don't make it outside the cubicle or conference room walls. I can just hear Ryan, with voiced raised, saying "Why can't we have it all!" But it never seems we can have everything. Many thanks to Ryan for sharing.