« The VeriScale Architecture - Elasticity and efficiency for private clouds | Main | How is Berkely DB fare against other Key-Value Database »

Paper: A practical scalable distributed B-tree

We've seen a lot of NoSQL action lately built around distributed hash tables. Btrees are getting jealous. Btrees, once the king of the database world, want their throne back. Paul Buchheit surfaced a paper: A practical scalable distributed B-tree by Marcos K. Aguilera and Wojciech Golab, that might help spark a revolution.

From the Abstract:

We propose a new algorithm for a practical, fault tolerant, and scalable B-tree distributed over a set of servers. Our algorithm supports practical features not present in prior work: transactions that allow atomic execution of multiple operations over multiple B-trees, online migration of B-tree nodes between servers, and dynamic addition and removal of servers. Moreover, our algorithm is conceptually simple: we use transactions to manipulate B-tree nodes so that clients need not use complicated concurrency and locking protocols used in prior work. To execute these transactions quickly, we rely on three techniques: (1) We use optimistic concurrency control, so that B-tree nodes are not locked during transaction execution, only during commit. This well-known technique works well because B-trees have little contention on update. (2) We replicate inner nodes at clients. These replicas are lazy, and hence lightweight, and they are very helpful to reduce client-server communication while traversing the B-tree. (3)We replicate version numbers of inner nodes across servers, so that clients can validate their
transactions efficiently, without creating bottlenecks at the root node and other upper levels in the tree.

Distributed hash tables are scalable because records area easily distributed across a cluster which gives the golden ability to perform many writes in parallel. The problem is keyed access is very limited.

A lot of the time you want to iterate through records or search records in a sorted order. Sorted could mean time stamp order, for example, or last name order as another example.

Access to data in sorted order is what btrees are for. But we simply haven't seen distributed btree systems develop. Instead, you would have to use some sort of map-reduce mechanism to efficiently scan all the records or you would have to maintain the information in some other way.

This paper points the way to do some really cool things at a system level:

  • It's distributed so it can scale dynamically in size and handle writes in parallel.
  • It supports adding and dropping servers dynamically, which is an essential requirement for architectures based on elastic cloud infrastructures.
  • Data can be migrated to other nodes, which is essential for maintenance.
  • Multiple records can be involved in transactions which is essential for the complex data manipulations that happen in real systems. This is accomplished via a version number mechanism that looks something like MVCC.
  • Optimistic concurrency, that is, the ability to change data without explicit locking, makes the job for programmers a lot easier.

    These are the kind of features needed for systems in the field. Hopefully we'll start seeing more systems offering richer access structures while still maintaining scalability.
  • References (1)

    References allow you to track sources for this article, as well as articles that were written in response to this article.

    Reader Comments (5)

    Any real-life implementation in C/C++ ?


    November 29, 1990 | Unregistered CommenterRaine

    I only briefly skimmed over the paper but this sounds a lot like what we've actually implemented and released in our distributed database project - MckoiDDB ( MckoiDDB implements a Btree structure where tree nodes are distributed over machines in a network. Each transaction is a snapshot version of the current Btree. There are no locks needed on reads or writes and MckoiDDB implements optimistic concurrency control (which means conflicts in concurrent transactions are detected at commit time).

    It's a really nice structure for low latency transactional distributed database systems. Implementing MVCC is very simple when using a Btree.

    November 29, 1990 | Unregistered CommenterTobias Downer

    My understanding is that the number of index pages you have to touch with a b-tree expands as the log of the number of rows. And that for a very large number of rows this means long insertion times.

    I could not understand how this is overcome by this approach.

    November 29, 1990 | Unregistered Commenterartsrc

    Well if you are inserting a lot of data with a random probability then you do hit a lot of pages. Fortunately in real data, you rarely see random probability.

    If you expect to experience this, you could always perform your indexing offline or pre-sort your data. There really isn't any other alternatives to using a tree for making sorted indexes anyway. Hash tables can only do equivalence queries.

    November 29, 1990 | Unregistered CommenterTobias Downer

    The paper was published in 2007

    November 4, 2009 | Unregistered CommenterCosmin

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>