How would you implement a key-value storage system if you were starting from scratch? The approach Basho settled on with Bitcask, their new backend for Riak, is an interesting combination of using RAM to store a hash map of file pointers to values and a log-structured file system for efficient writes. In this excellent Changelog interview, some folks from Basho describe Bitcask in more detail.
The essential Bitcask:
- Keys are stored in memory for fast lookups. All keys must fit in RAM.
- Writes are append-only, which means writes are strictly sequential and do not require seeking. Writes are write-through. Every time a value is updated the data file on disk is appended and the in-memory key index is updated with the file pointer.
- Read queries are satisfied with O(1) random disk seeks. Latency is very predictable if all keys fit in memory because there's no random seeking around through a file.
- For reads, the file system cache in the kernel is used instead of writing a complicated caching scheme in Riak.
- Old values are compacted or "merged" to free up space. Bitcask has windowed merges: Bitcask performs periodic merges over all non-active files to compact the space being occupied by old versions of stored data. In certain situations this can cause some memory and CPU spikes on the Riak node where the merge is taking place. To that end, we've added the ability to specify when Bitcask will perform merges.
- Get and set concurrency are implemented using vector clocks by the software layer above Bitcask.
- The key to value index exists in memory and in the filesystem in hint files. The hint file is generated when data files are merged. On restart the index only needs to be rebuilt for non-merged files which should be a small percentage of the data.
Eric Brewer (CAP theorem) came up with idea with Bitcask by considering if you have the capacity to keep all keys in memory, which is quite likely on modern systems, you can have a relatively easy to design and implement storage system. The commit log can be used as the database itself, providing atomicity and durability. Only one write is required to persist the data. Separate writes to a data file and a commit log is not necessary.
When a value is updated it is first appended to the on-disk commit log. Then the in-memory hash table that maps keys to disk pointers is updated to point to the file and the offset of the record in the file. So a read takes just one file I/O. The hash key locates the file pointer and you just seek to the offset and read the value. For writes it's just an append to the file. Pretty slick. It's good compromise between a pure in-memory database and a disk based data store backed by a virtual memory layer.
Some potential issues:
- If you suspect you will have more keys than RAM then an architecture that keeps a working set in memory would be a better choice.
- It will be slower than a pure in-memory database.
- Problems commonly occur during the garbage collection phase as resource spike while space for deleted values are reclaimed. Bitcask hopes to lessen this cost by enabling the scheduling of garbage collection to certain periods, though in a 24x7 property with an international set of users this may not be sufficient.
- Syncing on every write could be a little painful. Write throughput could be increased if writes were buffered and the data was replicated synchronously to a backup node for high availability.
- I trust operating system caches not. An OS cache can't know your access patterns. A custom cache might introduce complexity, but it's hard to believe it wouldn't perform better or be more tunable when things go wrong. Basho seems happy with this approach, but it still makes me queasy. What happens when the traffic has a uniform distribution, or a pareto-like distribution? Benchmark your app!
- Changelog interview where Bitcask is described.
- Bitcask - A Log-Structured Hash Table for Fast Key/Value Data by Justin Sheehy and David Smith with inspiration from Eric Brewer.
- Redis Diskstore. Here's how Redis will do it.
- RethinkDB internals: the caching architecture. Here's how RethingDB does it.
- Bitcask Rocks by Jeff Darcy. Bitcask performs as advertised, at least for these simple tests. Without tuning, both reads and writes seem to be amortizing the cost of seeks over a very large number of operations.
- Beating the I/O Bottleneck: A Case for Log-Structured File Systems by J Ousterhout.
- Riak SmartMachine Benchmark: The Technical Details
- The Log-Structured Merge-Tree (LSM-Tree) by P O'Neil
- Damn Cool Algorithms: Log structured storage by Nick Johnson
- HBase Architecture 101 - Write-ahead-Log by Lars George.
- Cassandra Wiki Durability Entry
- Alfred - is a fast in-process key-value store for node.js. Alfred stores data on append-only files.
- Bigtable: A Distributed Storage System for Structured Data