« Stuff The Internet Says On Scalability For August 8th, 2014 | Main | Sponsored Post: Apple, Gawker, FoundationDB, Monitis, Cie Games, BattleCry, Surge, Cloudant, CopperEgg, Logentries, Couchbase, MongoDB, BlueStripe, AiScaler, Aerospike, AppDynamics, ManageEngine, Site24x7 »

Tokutek White Paper: A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing

What data structure does your database use? It's not something the typical database user spends much time pondering. Since data structure is destiny, what data structure your database uses is a key point to consider in your selection process.

We know CouchDB uses a modified B+ tree. We've learned a lot fascinating details over the years about the use of Log-structured merge-trees in Cassandra, HBase and LevelDB. So B+ trees and LSMs seem familiar by now.

What may not be so familiar is Tokutek's Fractal Tree Indexing technology that is supposed to be even better than B+ trees and LSMs.

As a comparison between Fractal Tree Indexing and LSMs, Bradley Kuszmaul, Chief Architect at Tokutek, has written a detailed paper, a must read for the algorithmically inclined or someone interested in database internals: A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing.

Here's a quick intro to Fractal Tree (FT) indexes:

The idea behind FT indexes is to maintain a B tree in which each internal node of the tree contains a buffer. When a data record is inserted into the tree, instead of traversing the entire tree the way a B tree would, we simply insert the Eventually the root buffer will fill up with new data records. At that point the FT index copies the inserted records down a level of the tree. Eventually the newly inserted records will reach the leaves, at which point they are simply stored in a leaf node as a B tree would store them. The data records descending through the buffers of the tree can be thought of as messages that say “insert this record”. FT indexes can use other kinds of messages, such as messages that delete a record, or messages that update a record.

The paper starts with an explanation of write amplification, read amplification, and space amplification, the metrics that will be used to compare B trees, FT indexes, and LSMs. These are nicely detailed sections and well worth the price of admission.

Write amplification: the amount of data written to storage compared to the amount of data that the application wrote. Read amplification: the number of I/O’s required to satisfy a particular query. Space amplification: the space required by a data structure can be inflated by fragmentation or requirements for temporary copies of the data.

The paper then has sections explaining in great detail the inner magic of B trees, FT indexes, and LSMs. Which are also worth the price of admission.

B-tree index: a search tree in which every node of the tree can have many children. LSM: The idea behind LSM trees is to maintain a collection of sorted runs of data, each run containing key-value pairs with the keys sorted in ascending order. FT indexes were explained above.


As you might expect the analysis shows FT indexes the winner:

  • FT indexes provide excellent write amplification, read amplification, and space amplification, both asymptotically and in practice.
  • B trees have much higher write amplification than the other alternatives, and are good at read amplification, and pretty good at space amplification. Very few applications should use B trees.
  • Leveled LSM’s incur significantly more read amplification than FT indexes both asymptotically and in practice. Very few applications should use leveled LSM’s. 
  • Size-tiered LSM’s can provide much lower write amplification than Leveled LSM’s or FT indexes at a horrible cost for range query and pretty bad space amplification. A workload that performs few reads and is not sensitive to space amplification, but is very sensitive to write costs might benefit from size-tiered LSM’s. In my experience, it is an unusual application that is so short on write bandwidth that it can gladly give up a factor of 13 on read amplification. 

A solid data structure concept is not enough. The paper points out that the data structure a database uses is only one part of entire product. You must also consider features like MVCC, transactions with ACID recovery, two-phase distributed commit, backup, and compression.

Paper: A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing.


On Hacker News

Reader Comments (6)

Hey Todd,

LSM read amplification is made better by use of Bloom Filters. Without Bloomfilters, LSMs are useless. But, most LSM implementations do have Bloom filters. So, we should consider LSM+Bloomfilter for write, read, space amplification than LSM alone in comparison. Otherwise, comparison should be treated as biased and improper.


The paper mentions the use of bloom filters for LSM 6 times, so I don't think it can be called "biased & improper" for ignoring them.

August 10, 2014 | Unregistered CommenterMark Callaghan

Why should bloom filter be a must in LSM implementation?
I worked a lot with Cassandra and unfortunately, in most cases bloom filter just eat memory and do not save any I/O.

January 18, 2015 | Unregistered CommenterNikolay

Is not supporting foreign keys [1] inherent in fractal tree design, or did Tokutek just didn't implement it yet?

1. from
Q. Does TokuDB enforce foreign key constraints?
A. No, TokuDB ignores foreign key declarations.

March 22, 2015 | Unregistered CommenterZ.T.

From the authors of High Performance MySQL, Third Edition:

Instead of using foreign keys as constraints, it’s often a good idea to constrain the values in the application. Foreign keys can add significant overhead. We don’t have any benchmarks to share, but we have seen many cases where server profiling revealed that foreign key constraint checks were the performance problem, and removing the foreign keys improved performance greatly.

March 25, 2015 | Unregistered CommenterGani Mendoza

... but we have seen many cases where server profiling revealed that foreign key constraint checks were the performance problem, and removing the foreign keys improved performance greatly.

Would that reflect more upon MySQL though, rather than FK constraints in general? A FK should not require validation unless it is the FK fields being updated, or upon insert. Common FK values should be available in the DB page cache, avoiding I/O for common cases. SQL statements should be pre-prepared to allow compilation to be amortized etc.

And remember, make it right, then make it fast. Incorrect data will always be incorrect no matter how much hardware you throw at it.

May 26, 2015 | Unregistered CommenterChristian Smith

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>