
Google Megastore - 3 Billion Writes and 20 Billion Read Transactions Daily

A giant step into the fully distributed future has been taken by the Google App Engine team with the release of their High Replication Datastore. The HRD is targeted at mission critical applications that require data replicated to at least three datacenters, full ACID semantics for entity groups, and lower consistency guarantees across entity groups.

This is a major accomplishment. Few organizations can implement a true multi-datacenter datastore. Other than SimpleDB, how many other publicly accessible database services can operate out of multiple datacenters? Now that capability can be had by anyone. But there is a price, literally and otherwise. Because the HRD uses three times the resources of Google App Engine's Master/Slave datastore, it will cost three times as much. And because it is a distributed database, with all that implies in the CAP sense, developers will have to be very careful in how they architect their applications: costs have increased, reliability has increased, complexity has increased, and performance has decreased. This is why HRD is targeted at mission critical applications. You gotta want it; otherwise the Master/Slave datastore makes a lot more sense.

The technical details behind the HRD are described in this paper, Megastore: Providing Scalable, Highly Available Storage for Interactive Services. This is a wonderfully written and accessible paper, chock-full of useful and interesting details. James Hamilton wrote an excellent summary of the paper in Google Megastore: The Data Engine Behind GAE. There are also a few useful threads in Google Groups that go into more detail about how it works, costs, and performance (the original announcement, performance comparison).

Some Megastore highlights:

  • Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS. It has been used internally by Google for several years, on more than 100 production applications, to handle more than three billion write and 20 billion read transactions daily, and store a petabyte of data across many global datacenters.
  • Megastore is a storage system developed to meet the storage requirements of today's interactive online services. It is novel in that it blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS. It uses synchronous replication to achieve high availability and a consistent view of the data. In brief, it provides fully serializable ACID semantics over distant replicas with low enough latencies to support interactive applications. We accomplish this by taking a middle ground in the RDBMS vs. NoSQL design space: we partition the datastore and replicate each partition separately, providing full ACID semantics within partitions, but only limited consistency guarantees across them. We provide traditional database features, such as secondary indexes, but only those features that can scale within user-tolerable latency limits, and only with the semantics that our partitioning scheme can support. We contend that the data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable, and that a small, but not spartan, set of features can substantially ease the burden of developing cloud applications.
  • Paxos is used to manage synchronous replication between datacenters. This provides the highest level of availability for reads and writes at the cost of higher-latency writes. Typically Paxos is used only for coordination; Megastore also uses it to perform write operations.
  • Supports 3 levels of read consistency: current, snapshot, and inconsistent reads.
  • Entity groups are now a unit of consistency as well as a unit of transactionality. Entity groups seem to be like little separate databases. Each is independently and synchronously replicated over a wide area. The underlying data is stored in a scalable NoSQL datastore in each datacenter.
  • The App Engine Datastore doesn't support transactions across multiple entity groups, because doing so would greatly limit write throughput, though Megastore itself does support these operations via two-phase commit.
  • Entity groups are an a priori grouping of data for fast operations. Their size and composition must be balanced. Examples of entity groups: an email account for a user; a blog would have a profile entity group and more groups to hold the posts and metadata for each blog. Each application will have to find natural ways to draw entity group boundaries. Fine-grained entity groups will force expensive cross-group operations. Groups with too much unrelated data will cause unrelated writes to be serialized, which degrades throughput. This is a process that ironically seems a little like normalizing, and it will probably prove just as frustrating.
  • Queries that require strongly consistent results must be restricted to a single entity group. Queries across entity groups may return stale results. This is a major change for programmers. The Master/Slave datastore defaulted to strongly consistent results for all queries, because reads and writes went to the master replica by default. With multiple datacenters the world is a lot more complicated. This is clear from some of the Google Groups comments too. Performance will vary quite a bit depending on where entities are located and how they are grouped.
  • Applications will remain fully available during planned maintenance periods, as well as during most unplanned infrastructure issues. The Master/Slave datastore was subject to periodic maintenance windows. If availability is job one for your application the HRD is a big win.
  • Backups and redundancy are achieved via synchronous replication, snapshots, and incremental log backups.
  • The datastore API does not change at all.
  • Writes to a single entity group are strongly consistent.
  • Writes are limited to an estimated 1 per second per entity group, so HRD is not a good match when a high write rate to a single entity group is expected. This number is not a strict limit, but a good rule of thumb for write performance.
  • With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
  • Only new applications can choose the HRD option. An existing application must be moved to a new application.
  • Performance can be improved at the expense of consistency by setting the read_policy to eventually consistent. This brings performance similar to that of the Master/Slave datastore. Writes are not affected by this flag; it only improves read performance, which is already fast in the 99% case, so it doesn't have a very big impact. Applications doing batch gets will notice impressive speedups from using this flag.
  • One application can't mix Master/Slave with HRD. The reasoning is that HRD can serve out of multiple datacenters and Master/Slave cannot, so there's no way to ensure in failure cases that apps are running in the right place. So if you planned to use the more expensive HRD for critical data and the less expensive Master/Slave for less critical data, you can't do that. You might think of delegating Master/Slave operations to another application, but splitting up applications that way is against the TOS.
  • Once HRD is selected your choice can't be changed. So if you would like to start with the cheaper Master/Slave datastore for customers who want to pay less, and move to HRD for customers who want to pay for a premium service, you can't do that.
  • There's no automated migration of Master/Slave data to the HRD. The application must write that code. The reasoning is that the migration will require a read-only period, and the application is in the best position to know how to minimize that downtime. There are some tools and code provided to make migration easier.
  • Moving to a caching-based architecture will be even more important to hide some of the performance limitations of HRD. Caches can include memcache, cookies, or state put in a URL.
  • Though HRD is based on Megastore, they do not seem to be the same thing; not every feature of Megastore may be made available in HRD. To what extent this is true I'm not sure.
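As a rough mental model of the bullet points above, the entity-group consistency rules can be sketched in a few lines of plain Python. This is a toy simulation with invented names, not the real App Engine API or Megastore internals: it only illustrates that writes are tracked (and would be serialized) per group, that "current" reads within a group are strongly consistent, and that "eventual" reads may return stale results until replication catches up.

```python
# Toy sketch of the HRD consistency model. All names here are
# hypothetical; this is not the App Engine datastore API.
import collections

class EntityGroupStore:
    def __init__(self):
        self.committed = {}    # group -> {key: value}; synchronously replicated truth
        self.stale_view = {}   # group -> possibly lagging snapshot seen by eventual reads
        # Per-group write counter: all writes to one group are serialized,
        # which is why ~1 write/sec per entity group is the rule of thumb.
        self.write_log = collections.defaultdict(int)

    def write(self, group, key, value):
        # Strongly consistent within the entity group: the committed view
        # updates immediately. The stale view has not yet caught up.
        self.committed.setdefault(group, {})[key] = value
        self.write_log[group] += 1

    def replicate(self, group):
        # Simulate the asynchronous catch-up that eventual reads depend on.
        self.stale_view[group] = dict(self.committed.get(group, {}))

    def read(self, group, key, read_policy="current"):
        if read_policy == "current":
            # Current reads inside a single entity group are strongly consistent.
            return self.committed.get(group, {}).get(key)
        # "eventual": may return stale (or missing) data until replicate() runs.
        return self.stale_view.get(group, {}).get(key)
```

For example, immediately after a write, a current read returns the new value while an eventual read of the same key can still miss it; once replication catches up, both agree. This mirrors the article's point that cross-group queries may be stale while single-group operations stay ACID.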


Reader Comments (8)

Hey Todd, App Engine Product Manager here. Thanks for writing such a great article on the High Replication Datastore! You've definitely pointed out a lot of the reasons we wanted to offer this configuration.

There's a ton of info here, but a few factoids look like they may be missing some context. Let me see if I can address them quickly here for you and your readers:

1. "Entities within an entity group are mutated with single phase ACID transactions. Two-phase commit is used for cross entity group updates, which will greatly limit the write throughput when not operating on an entity group." The App Engine Datastore doesn't actually support transactions across multiple entity groups at all, partially for this reason.

2. "Writes are limited to 1 per second, so HRD is not a good match when high usage is expected."
I think that was supposed to say that writes are limited to 1 per second per entity group, not across the entire Datastore instance. Even that number is not a strict limit, but a good rule of thumb we see for write performance.

3. "There's no automated migration of Master/Slave data, the HRD data. The application must write that code."
That's not completely true. We do provide some tools and code to make migration fairly easy, though admittedly not as easy as a button press.

4. "Performance can be improved at the expense of consistency by setting the read_policy to eventually consistent. This will be bring performance similar to that of Master/Slave datastore. "
This is only true for read performance. Writes are not affected by this. Regular reads are already pretty fast in the 99% case, so this doesn't have a very big impact. If your app is doing batch gets, however, we've seen some pretty impressive speedups from using this flag.

Thanks again for compiling all of this. Please let me know if you have any more questions about the High Replication Datastore or App Engine in general.

January 12, 2011 | Unregistered CommenterSean Lynch

Great, thanks for the clarifications Sean. I had already made the change for (2) but I integrated in the rest of your comments as well. It's always a challenge to pick through all this stuff and not make too many mistakes.

Especially for a first release, I think most of your choices make a lot of sense. The magic tunable knob in all of this is entity group selection and design. Like for Goldilocks, it needs to be just right. It would be very interesting to see a greater expansion of the part of the paper that deals with application design and data modeling with entity groups in an HRD context. Some of the work on creating dynamic groups of keys to form transactional entities seems like it would help here. Anything like that planned?

January 13, 2011 | Registered CommenterTodd Hoff

Yeah, I completely agree that entity group selection really holds the performance key (or bottleneck, as the case may be) for a lot of applications. There's more we want to do around providing guidance on how to do that for Datastore. We're also really excited by the growing trend of NoSQL projects and adoption, because that means more and more people are educating themselves on these sorts of issues. I'm really looking forward to the day when college students learn about entity group design in their databases class.

I don't know if there are any plans to update the paper to tackle the topics of group selection/dynamic groups, but I'll pass the question along to the authors.

January 13, 2011 | Unregistered CommenterSean Lynch

I have read the paper on Megastore. It sounds like a promising data storage option that blends the qualities of RDBMS and NoSQL.

Is Megastore downloadable like the CouchDB or Cassandra NoSQL databases? Could you please share more details on how to try out Megastore?

January 13, 2011 | Unregistered Commenterkarthik

karthik, Megastore is an internal Google service that they've exposed to the outside world via Google App Engine. It's not downloadable, and it's not a product in the traditional sense; it's a service. So give GAE a look-see.

January 13, 2011 | Registered CommenterTodd Hoff


Thank you for the reply.

Can I ask you for help? I am working on a project developing a transaction-based service. We are looking for a NoSQL option for data storage.

With your expertise in NoSQL, could you please suggest a suitable NoSQL database with the features below:

1) Availability
2) Consistency
3) low latency
4) Scalability
5) Reads and writes in equal ratio.
6) Service is written in java (Spring and JBoss)

Thanks in advance,

January 13, 2011 | Unregistered Commenterkarthik

You've probably seen these, but here are some places to look: http://nosql-database.org/ and http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis and http://news.ycombinator.com/item?id=2052852. Number 6 might be a little overly specific.

January 13, 2011 | Registered CommenterTodd Hoff

Hi Sean,
I would be interested in which kind of routing algorithm Megastore uses. I mean, how does the client (the app doing reads and executing updates) know where the replicas containing a particular entity group are? Every distributed database must somehow solve this basic problem, and I haven't found how in the paper.
Do you use something similar to BigTable? Where is that replica location data stored, and when does it become a bottleneck?
How do you handle replica location changes (I assume you replicate them asynchronously), and what do you do if you hit a non-existent replica? Do you plan replica location changes in advance, or do you just do them and let the system handle it?

January 14, 2011 | Unregistered CommenterJan Kmetko
