Hamsterdb: An Analytical Embedded Key-value Store
In this post, I’d like to introduce you to hamsterdb, an Apache 2-licensed, embedded analytical key-value database library similar to Google's leveldb and Oracle's BerkeleyDB.
hamsterdb is not a new contender in this niche. In fact, it has been around for over 9 years. In that time it has grown dramatically, and its focus has shifted from a pure key-value store to an analytical database offering functionality similar to a column store database.
hamsterdb is single-threaded and non-distributed, and users usually link it directly into their applications. hamsterdb offers a unique (at least, as far as I know) implementation of Transactions, as well as other unique features similar to column store databases, making it a natural fit for analytical workloads. It can be used natively from C/C++ and has bindings for Erlang, Python, Java, .NET, and even Ada. It is used in embedded devices and on-premise applications with millions of deployments, as well as in cloud instances for caching and indexing.
hamsterdb has a unique feature in the key-value niche: it understands schema information. While most key-value databases neither know nor care what kind of keys are inserted, hamsterdb supports typed keys: binary keys (fixed or variable length) and numeric keys (uint32, uint64, real32, real64).
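Here is a minimal sketch of what this looks like with the C API (error handling omitted; flag and parameter names as in the 2.x headers):

```cpp
#include <ham/hamsterdb.h>
#include <stdint.h>

int main() {
  ham_env_t *env;
  ham_db_t *db;

  // Declare the key type up front so the Btree can specialize for it
  ham_parameter_t params[] = {
    {HAM_PARAM_KEY_TYPE, HAM_TYPE_UINT64},
    {0, 0}
  };

  ham_env_create(&env, "test.db", 0, 0664, 0);
  ham_env_create_db(env, &db, 1, 0, &params[0]);

  uint64_t k = 42;
  ham_key_t key = {0};
  key.data = &k;
  key.size = sizeof(k);

  ham_record_t record = {0};
  ham_db_insert(db, 0, &key, &record, 0);

  ham_env_close(env, HAM_AUTO_CLEANUP);
  return 0;
}
```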
hamsterdb's databases are Btree indexes that can be stored either in a file or in memory. The Btree implementation is what makes hamsterdb so analytical. The Btree indexes are implemented with C++ templates, and the template parameters depend on the configuration of the database: the key type, the record size (fixed length vs. variable length), and whether duplicate keys are enabled. Each Btree node is therefore highly optimized for its workload. For fixed length keys there is zero per-key overhead, and the keys are serialized as a simple array. On the lowest level, a database with uint64 keys is implemented as a plain C array of type uint64_t[].
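Conceptually, the parametrization looks roughly like this (a simplified sketch with made-up class names, not the actual internals):

```cpp
#include <stddef.h>
#include <stdint.h>

// Conceptual sketch only; the real class names differ. The concrete
// node type is selected at database-open time based on the configured
// key type, record size and duplicate setting.
template<typename T>
struct PodKeyList {      // fixed-length keys: zero per-key overhead
  T *data;               // points into the page; a plain C array
  size_t count;
  T &key(size_t slot) { return data[slot]; }
};

template<typename KeyList, typename RecordList>
struct BtreeNodeImpl {   // every configuration gets its own instance
  KeyList keys;
  RecordList records;
};
```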
This representation is extremely compact, reduces I/O, and exploits CPU caches much better. Modern CPUs are optimized for processing sequential memory, and hamsterdb takes advantage of this: when searching through leaf nodes, for example, binary search switches to linear search as soon as the remaining range shrinks below a certain threshold. hamsterdb also has APIs equivalent to the SQL operations COUNT, COUNT DISTINCT, SUM and AVERAGE, all of which are extremely fast on fixed length keys because they run directly in the Btree.
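The crossover looks roughly like this (an illustrative sketch; the threshold value below is invented, the real one is tuned per key type):

```cpp
#include <stddef.h>
#include <stdint.h>

// Binary search narrows the range; a linear scan finishes once the
// range is small enough that sequential access beats further halving.
static const size_t kLinearThreshold = 32;

int find_key(const uint64_t *keys, size_t count, uint64_t needle) {
  size_t lo = 0, hi = count;
  while (hi - lo > kLinearThreshold) {
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] <= needle)
      lo = mid;              // needle is at mid or to its right
    else
      hi = mid;              // needle is to the left of mid
  }
  for (size_t i = lo; i < hi; i++)  // sequential scan: cache-friendly
    if (keys[i] == needle)
      return (int)i;
  return -1;                 // not found
}
```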
hamsterdb also supports variable length keys. For these, each Btree node has a very small index up front pointing into the node's payload. This can lead to fragmentation when existing keys are resized or deleted, so the node has to be "vacuumized" periodically to avoid wasting space. The vacuumize operation turned out to be a performance killer, and it was a big challenge to make it fast and to call it as rarely as possible.
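A simplified sketch of the idea (the layout and field sizes are invented for illustration, not the actual on-disk format):

```cpp
#include <stdint.h>

// Each slot in the small upfront index records where its
// variable-length key lives in the node's payload area. Deleting or
// resizing keys leaves holes in the payload; the periodic "vacuumize"
// pass compacts them.
struct IndexSlot {
  uint16_t offset;   // start of the key, relative to the payload
  uint16_t size;     // key length in bytes
};

struct NodeHeader {
  uint16_t slot_count;
  // IndexSlot index[slot_count] follows, then the payload bytes with
  // the key data (possibly fragmented after deletes and resizes)
};
```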
hamsterdb also supports duplicate keys, meaning a given key can have several records assigned to it. All records of a key are grouped together. These groups are managed with the same upfront index structure as variable length keys. (If a single key has too many duplicate records, they are removed from the Btree leaf and stored in a separate overflow area.)
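In the C API, duplicates are just an insert flag. A sketch, continuing the earlier snippet (assuming an open environment env; flag names as in the 2.x headers):

```cpp
// A database created with duplicate support can store several records
// under one key; HAM_DUPLICATE appends another record to the group
ham_env_create_db(env, &db, 2, HAM_ENABLE_DUPLICATE_KEYS, 0);

uint32_t k = 7;
ham_key_t key = {0};
key.data = &k;
key.size = sizeof(k);

ham_record_t rec = {0};
rec.data = (void *)"first";
rec.size = 6;                      // includes the terminating zero
ham_db_insert(db, 0, &key, &rec, 0);

rec.data = (void *)"second";
rec.size = 7;
ham_db_insert(db, 0, &key, &rec, HAM_DUPLICATE);  // same key, new record
```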
hamsterdb supports ACID Transactions with a read-committed isolation level. Transactional updates (that are not yet flushed to disk) are stored as delta-operations in memory. Each Database has its own Transaction index as an "overlay", and updates in that Transaction index have higher priority than those in the Btree. Aborting a Transaction simply means dropping the transaction's updates from the Transaction index, and a commit flushes the transaction's updates to the Btree.
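A sketch of the flow, continuing the snippets above (the environment has to be created with HAM_ENABLE_TRANSACTIONS; error handling omitted):

```cpp
ham_txn_t *txn;
ham_txn_begin(&txn, env, 0, 0, 0);

// The update is stored as a delta in the in-memory Transaction index;
// the Btree itself is not touched yet
ham_db_insert(db, txn, &key, &rec, 0);

ham_txn_commit(txn, 0);   // flushes the update to the Btree
// ham_txn_abort(txn, 0) would simply have dropped it instead
```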
This unique design choice has a few significant advantages. Transactional updates stay in RAM instead of requiring I/O. Also, no undo logic is required because aborted Transactions are never persisted. Redo logic (in case of a crash) is performed with a simple logical journal. But there is also a major challenge: at runtime, both trees have to be consolidated. This turned out to be very complex; imagine using a database cursor (hamsterdb has very fast cursors) to perform a full scan. Some keys are in the Btree, while others are in the Transaction tree. Keys in the Btree can be overwritten or even deleted in the Transaction tree, or another Transaction could currently modify them. Things get even harder when duplicate keys are involved.
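For illustration, here is a sketch of such a scan; the loop looks simple from the outside, while hamsterdb consolidates both trees behind each cursor step:

```cpp
ham_cursor_t *cursor;
ham_cursor_create(&cursor, db, 0, 0);

ham_key_t key = {0};
ham_record_t record = {0};

// Each ham_cursor_move merges keys from the Btree with pending
// updates from the Transaction index
while (ham_cursor_move(cursor, &key, &record,
                       HAM_CURSOR_NEXT) == HAM_SUCCESS) {
  // process key.data / record.data
}
ham_cursor_close(cursor);
```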
hamsterdb's driving feature, however, is testability. The essential criterion for a database – even more important than performance! – is that it must not lose data. Critical bugs are forbidden. In addition, one side effect of a 9+ year development history is the occasional bad design decision I'd like to remove in order to keep technical debt low, remain agile (with a lower case 'a'), and react quickly to users' wishes or new ideas. I'm constantly rewriting parts of the code or trying out new ideas, and high test coverage is essential to give me the confidence that those changes do not break anything.
Focusing on testability and a high level of automation enables me to do all this. At the lowest level, hamsterdb's debug build is cluttered with asserts and many integrity checks. There are about 1,800 unit tests and about 35,000 acceptance tests (many scripted, others randomly generated). The acceptance tests run in dozens of different configurations and are executed in parallel against BerkeleyDB; both databases are continuously checked for equality, so any newly introduced bug shows up immediately. In addition, each test emits a long list of metrics: memory consumption, number of heap allocations, number of allocated pages, blobs, Btree splits and merges, and more.
Some of those tests are executed under valgrind. Others are timed, and their performance results are compared to those of the previous release; this way I can quickly spot and fix performance regressions.
Then there's another set of tests simulating crashes to exercise hamsterdb's recovery features. And last but not least, I use QuviQ's QuickCheck, a property-based testing tool for Erlang: you describe properties of your software, and QuickCheck executes fancy pseudo-randomized sequences of operations while verifying that those properties still hold.
For static code analysis I use Coverity's free service for open source projects and clang's scan-build. Both have found a few minor issues.
All tests are fully automated and run before each release. A full release cycle takes a few days, and the tests kill a hard drive every couple of months.
If I’ve learned any lessons I can offer, it’s that writing tests really can be fun if you see the benefits! And iterative development without reliable tests is simply impossible.
Let me also mention hamsterdb pro, the commercial twin. hamsterdb pro offers heavy-weight compression (zlib, snappy, lzf, and lzo) for keys, records, and the journal, as well as AES encryption and SIMD-optimized lookups in leaf nodes. More compression algorithms (bitmap compression, prefix compression) are planned or already in progress. The webpage has more information.
So far so good, but what about hamsterdb's performance? I ran hamsterdb 2.1.8 against Google's leveldb 1.15 using leveldb's own benchmark program. Compression was disabled (hamsterdb does not offer compression; hamsterdb pro does), as were fsyncs and hamsterdb's recovery feature (implemented with a write-ahead log). I used test sizes ranging from small keys/small records to medium sized keys with larger records, and inserted between 100k and 100m items. In addition, I ran two analytical functions of hamsterdb (sum, and count with a predicate) and implemented the same for leveldb. All tests ran with cache sizes of 4 MB and 1 GB, on machines equipped with an HDD and an SSD.
hamsterdb was always configured with fixed length keys, and for the 8 byte keys it stores uint64 numbers. This is a slight advantage for hamsterdb, since leveldb requires the number to be converted to a string.
I also added tests with small records (size 8), because small records are typical for secondary indexes, where the record contains the primary key. Each test ran on a machine with an HDD (Core i7-950 with 8 cores and 8 MB cache) and on a machine with an SSD (Core i5-3550 with 4 cores and 8 MB cache). I am not showing all the benchmark results here; if you are interested, you can download them here.
A few representative charts:
[Chart: Sequential writes; key size: 16, record size: 100 (HDD, 1 GB cache)]
[Chart: Sequential reads; key size: 16, record size: 100 (HDD, 1 GB cache)]
[Chart: Random writes; key size: 16, record size: 100 (HDD, 1 GB cache)]
[Chart: Random reads; key size: 16, record size: 100 (HDD, 1 GB cache)]
[Chart: Calculating the sum of all keys (HDD, 4 MB cache)]
[Chart: Counting all keys which end with "77" (SSD, 1 GB cache)]
For random reads, hamsterdb performs better than leveldb. For random writes, hamsterdb is faster as long as the data set is not too large; starting at around 10 million keys, hamsterdb suffers from the typical problems of a Btree database: lots of non-sequential I/O with high disk seek latencies.
That said, the tests nicely demonstrate hamsterdb's analytical strengths. In particular, both the sum and the count calculations scale quite nicely. hamsterdb also shines when inserting and scanning sequentially, and stays extremely fast regardless of the data size.
The benchmarks also show my homework for the coming months: optimize random read/write operations by making hamsterdb concurrent. This is a big one. I'm working on it now; the design is sketched out, and the refactoring began a few releases ago.
I hope this post has given you a good introduction to hamsterdb. If you're interested, go have a look at the webpage (http://hamsterdb.com) and the software. The source code is available on GitHub. I appreciate all kinds of feedback!