Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?

So far every massively scalable database is a bundle of compromises. For some the weak guarantees of Amazon's eventual consistency model are too cold. For many the strong guarantees of standard RDBMS distributed transactions are too hot. Google App Engine tries to get it just right with entity groups. Yahoo! is also trying to get is just right by offering per-record timeline consistency, which hopes to serve up a heaping bowl of rich database functionality and low latency at massive scale:

We describe PNUTS [Platform for Nimble Universal Table Storage], a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of con-current requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

Some of the cool things about PNUTS are:

  • They actually talk about the hard problem of how to scale a system to 10 different data centers (each with 1,000+ servers) while supporting secondary indexes, materialized views, the ability to create multiple tables, and hash-organized tables. Multi-datacenter operation is so difficult it's usually ignored. PNUTS is designed specifically to operate in many datacenters with a strongish consistency model, which makes it a very interesting design point.
  • You can subscribe to a reliable ordered stream of updates on a table. This is massively convenient. For many applications numerous processes are tied to data changes and this is normally a pain to implement.
  • The consistency model is a per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order. This provides a consistency model that is between the two extremes of serialized transactions and eventual consistency. Conflicting records can't exist at the same time as is allowed by Dynamo.
  • Supports records, but accepts queries only on individual tables. There's no fixed schema for records and columns can be typed or be blobs. Transactions exist only at the record level. This means like with other NoSQL databases denormalization is the modeling strategy as there's no way to have transactions across tables.
  • The degree of read consistency desired can be specified. Records are versioned. You can ask for the latest record version or allow for potentially stale records.
  • Asynchronous replication is used to ensure low write latency while providing geographic replication. Reads should be fast everywhere and may return older versions, writes should be fast locally (in the same datacenter). [3]
  • A message broker that serves both as the replication mechanism and redo log of the database. The message broker guarantees no replica can receive updates out of order because it provided a reliable, totally ordered message channel. This also means you can't have transactions across tables, consistent joins, or foreign keys. They chose this approach over a gossip mechanism (like Dynamo) because it can be optimized for geographically distant replicas and because replicas do not need to know the location of other replicas.
  • PNUTS is hosted and centrally managed. It was built to reduce the overhead of creating and maintaining new applications rather than every property creating their own system. Failover, adding capacity, resharding, performance isolation and supporting applications with different usage profiles are all completely automated.
  • Predicate queries are supported using a scatter-gather mechanism which sends the query to every relevant storage tablet at once. The are gathered sent back to the client.
  • Performance is 1-10ms/request when caching layers are in place.

    From a system perspective PNUTS offers a lot of the good things: hosted, reliability, lowish latency, automation, scalability, supports many application models, and there's a lot of room to improvement that all applications will be able to take advantage of when available.

    From a programmer perspective the are also a lot of good things: it's hosted so fewer worries, notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.

    Unfortunately Goldilocks still needs to keep searching for just right, though she may be getting closer. From a system perspective Yahoo!'s ideas are good, but they don't help you as the system isn't available for you to use. From a programmer perspective the programmer's job is still way too hard. To be just right programmer's need low latency aggregate operators, complex transactions, scalable counters, automatic relationship management, and all the other features that will help them just buy instant porridge and be done with it.

    Related Articles

  • Anti-RDBMS: A list of distributed key-value stores
  • Details on Yahoo's distributed database by Greg Linden
  • Thoughts on Yahoo's PNUTS distributed database by Marton Trencseni
  • Data Challenges at Yahoo! - Ricardo Baeza-Yates & Raghu Ramakrishnan
    Yahoo! Research
  • PNUTS: Yahoo!'s Hosted Data Serving Platform by Lucian
  • Yahoo’s PNUTS by Henry Robinson. A very thoughtful and informative overview of the paper.
  • How robust are gossip-based communication protocols? by Lorenzo Alvisi et al.
  • Trading consistency for scalability in distributed architectures.
  • Asynchronous View Maintenance for VLSD Databases by Parag Agrawal et al.
  • BigTable
  • Dynamo
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • The Story of Goldilocks and the Three Bears
  • PNUTS - Platform for Nimble Universal Table Storage