Masstree - Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

The EuroSys 2012 systems conference has excellent live blog summaries of its talks: Day 1, Day 2, Day 3 (thanks to Henry at the Paper Trail blog). Summaries of each of the accepted papers are here.

One of the more interesting papers from a NoSQL perspective was Cache Craftiness for Fast Multicore Key-Value Storage, a wonderfully detailed description of the low-level techniques used to implement Masstree:

A storage system specialized for key-value data in which all data fits in memory, but must persist across server restarts. It supports arbitrary, variable-length keys. It allows range queries over those keys: clients can traverse subsets of the database, or the whole database, in sorted order by key. On a 16-core machine Masstree achieves six to ten million operations per second on parts A–C of the Yahoo! Cloud Serving Benchmark, more than 30 times as fast as VoltDB [5] or MongoDB [2].

If you are looking for innovative, detailed, high-performance design, this paper is for you. Here's an example from the section on writer-writer coordination:

Masstree writers coordinate using per-node spinlocks. A node’s lock is stored in a single bit in its version counter. Any modification to a node’s keys or values requires holding the node’s lock. Some data is protected by other nodes’ locks, however. A node’s parent pointer is protected by its parent’s lock, and a border node’s prev pointer is protected by its previous sibling’s lock. This minimizes the simultaneous locks required by split operations; when an interior node splits, for example, it can assign its children’s parent pointers without obtaining their locks.
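To make the lock-bit idea concrete, here is a minimal sketch of a per-node lock stored in one bit of a version word. The bit layout, names, and memory orderings are my own illustration, not Masstree's actual code, which also packs insert/split counters into the same word:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical node version word: one bit serves as the writer spinlock,
// the remaining bits are free for version counters readers can check.
struct NodeVersion {
    static constexpr uint32_t LOCK_BIT = 1u << 0;
    std::atomic<uint32_t> version{0};

    void lock() {
        for (;;) {
            uint32_t v = version.load(std::memory_order_relaxed);
            if (v & LOCK_BIT) continue;  // another writer holds the node; spin
            if (version.compare_exchange_weak(v, v | LOCK_BIT,
                                              std::memory_order_acquire))
                return;                  // lock bit acquired
        }
    }

    void unlock() {
        // Clearing the bit releases the node; a real implementation would also
        // bump a change counter here so that optimistic readers can retry.
        version.fetch_and(~LOCK_BIT, std::memory_order_release);
    }
};
```

Readers never take this lock; as the paper describes, they snapshot the version word before and after reading the node and retry if it changed, which is the optimistic side of the scheme.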

Here's the live blog writeup:

Let’s build a new fast KV-store! Would like to perform well on hard workloads! Should support range queries, skewed key popularity, small K-V pairs, many puts, and arbitrary keys. Their first attempt was a fast binary tree, achieving 3.7M qps if sufficiently high-bandwidth network and disk hardware is provisioned. The bottleneck turns out to be DRAM, so they optimize with cache craftiness and get 1.5x better performance. Their optimized thing is called “Masstree”. They evaluated it on a 16-node cluster with a bunch of SSDs, 64 GB RAM and 10 Gbps NIC per machine. In fact, when looking at local performance (without network/disk bottlenecks), cache craftiness results in a 1.7x improvement (I thought they said that they’d already evaded those bottlenecks?!).

To reduce DRAM latency, they constructed a lock-free, unbalanced 4-way tree with the same concurrency properties as a binary tree, but only half as deep for the same amount of data -- plus each node fits into a cache line. However, it is pessimal (O(N)) for sequential inserts, so they look at a balanced B+tree instead. Their particular implementation uses optimistic concurrency with versioning. It turns out, however, that the B+tree is 11% slower than the 4-way tree (the 4-way tree is cache-line optimized and all its nodes are full, while only ~75% of the B+tree’s nodes are)! Now they realize they can use software prefetch to read multiple cache lines at a time in the B+tree, and things get 9% better than the 4-way tree. However, there are consistency issues with concurrent inserts in the B+tree (not with the 4-way tree, as keys don’t move), so they prepend a “permuter” (an index list array) to each node so that keys can be swapped around atomically. But there is an issue with long keys -- they require multiple memory accesses, and throughput drops rapidly as keys get longer. So they use a trie of B+trees, with each level (a B+tree) responsible for 8 bytes of the key. Now throughput scales much better. “Masstree” is the union of all these optimizations (I think), and despite having a trie of B+trees in it, it is 8% more efficient than a single B+tree with all the other optimizations.
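The permuter is worth pausing on. Below is a rough, simplified sketch of the idea, assuming an insert-only node with 15 fixed key slots; the bit layout and helper names are my own, not the paper's. Keys are written into free slots and never moved; a single 64-bit permutation word records how many keys are live and in what sorted order to read the slots, so publishing the new word with one atomic store makes the inserted key visible in its sorted position all at once:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bits 0-3 hold the number of live keys, and each
// subsequent 4-bit field holds a slot index, listed in key-sorted order.
struct Permuter {
    static unsigned size(uint64_t p) { return p & 0xF; }
    static unsigned slot(uint64_t p, unsigned i) { return (p >> (4 + 4 * i)) & 0xF; }
    static uint64_t set_slot(uint64_t p, unsigned i, uint64_t s) {
        unsigned shift = 4 + 4 * i;
        return (p & ~(uint64_t{0xF} << shift)) | (s << shift);
    }
};

struct Node {
    uint64_t keys[15];                     // fixed slots; keys never move
    std::atomic<uint64_t> permutation{0};

    // Called with the node's write lock held; readers need no lock.
    void insert_sorted(uint64_t key) {
        uint64_t p = permutation.load(std::memory_order_relaxed);
        unsigned n = Permuter::size(p);
        assert(n < 15 && "a full node would split in a real implementation");

        // Find the sorted position for the new key.
        unsigned pos = 0;
        while (pos < n && keys[Permuter::slot(p, pos)] < key) ++pos;

        // In this insert-only sketch, slot n is always free. Write the key
        // before publishing the permutation that points at it.
        unsigned free_slot = n;
        keys[free_slot] = key;

        // Shift later entries up by one and splice in the new slot.
        uint64_t np = p;
        for (unsigned i = n; i > pos; --i)
            np = Permuter::set_slot(np, i, Permuter::slot(np, i - 1));
        np = Permuter::set_slot(np, pos, free_slot);
        np = (np & ~uint64_t{0xF}) | (n + 1);

        permutation.store(np, std::memory_order_release);  // publish atomically
    }
};
```

A concurrent reader snapshots the permutation word with one load and walks the first `size` slots in the order it dictates; it either sees the new key in its sorted place or not at all, because the keys themselves never shift between slots.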

Evaluation finds that they are much faster than common NoSQL databases (MongoDB, VoltDB), faster than Redis, and competitive with memcached.

Now, how do we scale this to multi-core? Other systems have an instance per core and partition the key space. Masstree uses a single, shared tree that is accessed by all cores (presumably using the OCC+versioning techniques they alluded to). They find that their performance is basically constant as they scale load when running on 16 cores, while a statically partitioned Masstree performs better under low load but loses out subsequently (it didn’t become entirely clear how they varied the load). They also find that they scale fairly well to 16 cores (about a ⅕ drop in performance relative to perfect scalability).

From the conclusion of the paper:

Masstree is a persistent in-memory key-value database. Its design pays particular attention to concurrency and to efficiency for short and simple queries. Masstree keeps all data in memory in a tree, with fanout chosen to minimize total DRAM delay when descending the tree with prefetching. The tree is shared among all cores to preserve load balance when key popularities are skewed. It maintains high concurrency using optimistic concurrency control for lookup and local locking for updates. For good performance for keys with long shared prefixes, a Masstree consists of a trie-like concatenation of B+ trees, each of the latter supporting only fixed-length keys for efficiency. Logging and checkpointing provide consistency and durability.
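As a rough illustration of that trie-like concatenation, here is one way a variable-length key could be cut into the fixed-length 8-byte slices that successive B+ tree layers handle. This is a simplification under my own assumptions; the real scheme also records each slice's length so that a key ending at a layer is distinguished from one that continues into the next:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Encode one 8-byte slice as a big-endian integer, zero-padded at the end,
// so that integer comparison matches lexicographic comparison of the bytes.
uint64_t slice_at(const std::string& key, size_t offset) {
    uint8_t buf[8] = {0};
    if (offset < key.size())
        std::memcpy(buf, key.data() + offset,
                    std::min<size_t>(8, key.size() - offset));
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | buf[i];
    return v;
}

// The sequence of fixed-length keys the successive tree layers would see.
std::vector<uint64_t> key_slices(const std::string& key) {
    std::vector<uint64_t> slices;
    for (size_t off = 0; off == 0 || off < key.size(); off += 8)
        slices.push_back(slice_at(key, off));
    return slices;
}
```

Because each layer compares only 64-bit integers, lookups on long keys avoid byte-by-byte comparisons of the full key at every node, descending into the next layer's B+ tree only when keys share the current 8-byte slice.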