DynamoDB Talk Notes and the SSD Hot S3 Cold Pattern

My impression of DynamoDB before attending an Amazon DynamoDB for Developers talk was that it’s the usual quality service produced by Amazon: simple, fast, scalable, geographically redundant, expensive enough to make you think twice about using it, and delightfully NoOp.

After the talk my impression has become more nuanced. The quality impression still stands. Look at the forums and you’ll see the typical issues every product has, but no real surprises. And as a SimpleDB++, DynamoDB seems to have avoided second system syndrome and produced a more elegant design.

What was surprising is how un-cloudy DynamoDB appears to be. The cloud pillars of pay-for-what-you-use pricing and quick elastic response to bursty traffic have been abandoned, for some understandable reasons, but the result is that you really have to consider your use cases before making DynamoDB the default choice.

Here are some of my impressions from the talk...

  • DynamoDB is a clean, well-lighted place for key-value data. The interface is simple, consisting of operations like CreateTable, PutItem, BatchWriteItem, GetItem, Scan, and Query. Transactions are not supported in batch writes. Access is by a simple primary key or by a composite hash key / range key. The maximum item size is 64KB. It’s schemaless. The supported types are strings, numbers, and sets of strings and numbers. Reads can be consistent or eventually consistent; writes are always consistent; consistent reads cost 2x more. (The first sketch after this list shows these basic operations.)
  • Operations are usually under 10 msecs, typically 3-4 msecs: roughly ~1.5 msecs for the network path and ~1.5 msecs for the database component, depending on the size of your data. Customers wanted one or two millisecond response times, so Amazon took the radical step of using SSDs for storage. Initially they ran into performance problems due to SSD garbage collection cycles. Extensive work with vendors solved the problem, but because of IP issues we have no idea how.
  • Provisioned throughput model. This is the most innovative feature of DynamoDB and also its most problematic, taking the “make it simple” Apple design aesthetic to 11. Programmers are in charge of specifying the number of database requests per second their application should be capable of supporting, and DynamoDB does the rest: it automatically spreads data over however many servers are required to provide the desired performance. You can change the rate on the fly, but the catch is that it can take up to 10 minutes for the system to support the new request rate. (A back-of-the-envelope capacity sketch appears after this list.)
  • Solving the noisy neighbor problem. A problem with the cloud is that noisy neighbors can ruin the performance of everyone else on the same hardware. To ensure DynamoDB can meet SLAs in a multi-tenant environment, a maximum size of 64KB is put on values. In addition, requests are priced in units of read/write capacity, one unit being one read (or write) per second of an item up to 1KB in size. By bounding item sizes and requiring a guaranteed throughput rate to be specified up front, DynamoDB has all the information it needs to schedule workloads so that they meet their SLAs with a high degree of confidence.
  • Pay for what you don’t use. A downside of the provisioned throughput model is that you pay for the throughput you have reserved, not for the traffic you have actually generated. From Amazon’s perspective they are reserving the capacity, so in a sense you are using it, but this is exactly the capacity-planning problem that made the elastic nature of the cloud so attractive in the first place. How do you specify the correct throughput? If you underspecify you lose customers. If you overspecify you lose money. And if your traffic is at all bursty you technically would have to reserve for peak usage, which is fiscal insanity. You can adjust your throughput reservation, but not fast enough to meet a burst, and the notification mechanism is clumsy: you get an email alert and then have to make the adjustment by hand.
  • Some cool features. DynamoDB supports an atomic increment operation and compare-and-swap, both of which are quite useful (both appear in the first sketch after this list).
  • Not NoOp. Hard stuff is left for developers to solve. You’ll still have operations work with DynamoDB; it will just be in different areas.
    • The lack of transactions means developers have to implement idempotency and garbage collection themselves, and pay for all the accesses needed to build features that DynamoDB should be supporting. Secondary indexes and search, for example, are not supported and are difficult and error-prone to implement without direct system support.
    • Hot keys / hot partitions are not handled. The provisioning model is not really automatic, because hot keys are not automatically load balanced. If a lot of reads go to one key or partition, it’s up to the programmer to fix the problem; DynamoDB won’t do it. The NoOp promise falls a little short here.
    • Backups not supported. To back up data you have to use Hadoop. This is a problem that will likely be fixed, but it’s a hole for now.
    • > 64KB. The tricks needed to store data larger than 64KB without transactions invite corruption and hurt throughput.
  • The Intellectual Property shield. The talk was annoying at certain points because the IP excuse was used to dodge certain questions. Is hinted handoff used, for example? Sorry, that’s IP. An advantage many other products have is that they are open: if you want to know something about Cassandra, VoltDB, Riak, Redis, MongoDB, etc., you’ll get the straight dope.
  • Limited to a single region. Several questions were about the need to operate across multiple regions, but if Amazon built this feature in it would become a SOP, so Amazon quite rightly stays out of the multi-region business.
  • Use Hive to browse data. DynamoDB is targeted at large data sets, so you need to use Hadoop/Hive to take a look at the data. Eventually more functionality will be built into the dashboard, but Hive is a powerful way to look at your data; just factor this kind of management traffic into your throughput requirements.
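
To make the API and increment / compare-and-swap bullets above concrete, here is a minimal sketch in Python with boto3. The table name (events), key schema (user_id hash key, ts range key), and attribute names are all hypothetical, and the expression syntax is the modern SDK's rather than anything shown in the talk:

```python
# Minimal sketch of basic DynamoDB operations; names are hypothetical.
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("events")  # hash key: user_id, range key: ts

# PutItem: schemaless, values are strings, numbers, or sets of them.
table.put_item(Item={"user_id": "u123", "ts": 1650000000,
                     "payload": "click", "click_count": 0, "doc_rev": 1})

# GetItem: a consistent read costs 2x an eventually consistent one.
item = table.get_item(Key={"user_id": "u123", "ts": 1650000000},
                      ConsistentRead=True).get("Item")

# Query: composite hash key / range key access.
recent = table.query(
    KeyConditionExpression=Key("user_id").eq("u123") & Key("ts").gt(1640000000)
)["Items"]

# Atomic increment: bump a counter without a read-modify-write cycle.
table.update_item(
    Key={"user_id": "u123", "ts": 1650000000},
    UpdateExpression="ADD click_count :inc",
    ExpressionAttributeValues={":inc": 1},
)

# Compare-and-swap: the write only succeeds if doc_rev still has the expected value.
try:
    table.update_item(
        Key={"user_id": "u123", "ts": 1650000000},
        UpdateExpression="SET payload = :p, doc_rev = :new",
        ConditionExpression="doc_rev = :expected",
        ExpressionAttributeValues={":p": "click-v2", ":new": 2, ":expected": 1},
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        pass  # someone else updated the item first; retry or give up
    else:
        raise
```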
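
And here is the back-of-the-envelope capacity math behind the provisioned throughput bullets, using the 1KB-per-capacity-unit figure quoted in the talk (the service's current unit sizes differ) and the hypothetical events table from the previous sketch:

```python
# Rough capacity-unit math plus the API call that changes the reservation.
import math

import boto3


def read_capacity_units(item_size_kb: float, reads_per_sec: int,
                        consistent: bool = True) -> int:
    """Units needed to read `reads_per_sec` items of `item_size_kb` each second."""
    units_per_read = math.ceil(item_size_kb)           # each unit covers up to 1KB
    units = units_per_read * reads_per_sec
    return units if consistent else math.ceil(units / 2)  # eventual consistency is half price


# 3.2KB items read 100 times/sec with consistent reads -> 4 * 100 = 400 units.
needed = read_capacity_units(3.2, 100, consistent=True)

# Raising (or lowering) the reservation is one API call, but per the talk it
# can take up to ~10 minutes to take effect, so it won't absorb a sudden burst.
boto3.client("dynamodb", region_name="us-east-1").update_table(
    TableName="events",
    ProvisionedThroughput={"ReadCapacityUnits": needed, "WriteCapacityUnits": 100},
)
```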

Store Hot Data in DynamoDB, Cold Data in S3, and use Hadoop/Hive to Make them Look the Same

One of the most interesting ideas from the talk is how a new Hadoop/Hive ecosystem is being used as a unifying bridge between DynamoDB and S3. The idea is that data stored in DynamoDB costs 10 times as much as data in S3, so what you want to do is move historical or cold data to S3 as soon as possible and keep only the hot data in DynamoDB. For example, time series data is often stored by day, week, or month, so rather than keep all that historical time series data in DynamoDB, move it to S3 and save some money.
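
A minimal sketch of that move, continuing with the hypothetical events table from the earlier sketches: query rows older than a cutoff, write them to S3 as tab-delimited text (which Hive can read later), then delete them from DynamoDB. The bucket name and cutoff policy are illustrative, not from the talk:

```python
# Archive cold time-series rows from DynamoDB to S3, then delete them.
import time

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")
table = dynamodb.Table("events")  # hash key: user_id, range key: ts

CUTOFF = int(time.time()) - 30 * 24 * 3600  # keep the last 30 days hot


def archive_user(user_id: str, bucket: str = "my-cold-archive") -> None:
    # Query this user's cold slice of the range key. (Pagination via
    # LastEvaluatedKey is omitted for brevity.)
    items = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id) & Key("ts").lt(CUTOFF)
    )["Items"]
    if not items:
        return

    # Write the slice to S3 as one tab-delimited object per archive run.
    lines = "\n".join(
        f"{i['user_id']}\t{int(i['ts'])}\t{i.get('payload', '')}" for i in items
    )
    s3.put_object(Bucket=bucket,
                  Key=f"events/{user_id}-{CUTOFF}.tsv",
                  Body=lines.encode("utf-8"))

    # Only then delete the archived rows from DynamoDB. The deletes consume
    # write capacity, so schedule this against the throughput reservation.
    with table.batch_writer() as batch:
        for i in items:
            batch.delete_item(Key={"user_id": i["user_id"], "ts": i["ts"]})
```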

The problem is that you now have two very different ways to access data: DynamoDB is purely programmatic access via tables, while S3 access is via files. How do you bridge that gap without writing a lot of code?

Using EMR and Hive, sophisticated queries can be run against data in both DynamoDB and S3, giving you a common data access layer over the cheapest appropriate storage option. It’s a good example of how well all these tools work together to provide a powerful ecosystem.
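
A sketch of what that bridge might look like, again with hypothetical names: one Hive external table backed by the DynamoDB events table, one backed by the S3 archive written above, and a view that unions them so queries don't care where the data lives. The storage handler class and table properties are the ones documented for Hive on EMR, but check your EMR version; the HiveQL is wrapped in a small Python script that writes a file you would run with `hive -f unified_events.q` on the EMR master node:

```python
# Write out a Hive script that bridges the DynamoDB table and the S3 archive.
HIVEQL = r"""
-- Hot data: Hive table backed directly by the DynamoDB table.
CREATE EXTERNAL TABLE dynamo_events (user_id string, ts bigint, payload string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "events",
  "dynamodb.column.mapping" = "user_id:user_id,ts:ts,payload:payload"
);

-- Cold data: Hive table over the archived tab-delimited files in S3.
CREATE EXTERNAL TABLE s3_events (user_id string, ts bigint, payload string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-cold-archive/events/';

-- One logical view over both stores.
CREATE VIEW all_events AS
SELECT * FROM dynamo_events
UNION ALL
SELECT * FROM s3_events;

SELECT user_id, COUNT(*) FROM all_events GROUP BY user_id;
"""

with open("unified_events.q", "w") as f:
    f.write(HIVEQL)
```

Remember that the queries against the DynamoDB-backed table consume read capacity, so, as noted earlier, this kind of management traffic has to be factored into the throughput reservation.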