advertise
« Scale-out vs Scale-up | Main | Pomegranate - Storing Billions and Billions of Tiny Little Files »
Wednesday
Sep012010

Paper: The Case for Determinism in Database Systems  

Can you have your ACID cake and eat your distributed database too? Yes explains Daniel Abadi, Assistant Professor of Computer Science at Yale University, in an epic post, The problems with ACID, and how to fix them without going NoSQL, coauthored with Alexander Thomson, on their paper The Case for Determinism in Database Systems. We've already seen VoltDB offer the best of both worlds, this sounds like a completely different approach.

The solution, they propose, is: 

...an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.

The problem they are trying to solve is:

In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for to ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.

Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.

The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B, etc.

The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. 

I certainly don't understand how this works yet. But whenever you say coordination it means a protocol which must run across multiple nodes, which means latency, which means it's sensitive to node failure, it's slow, and doesn't scale as nodes are increased. So I'll be very curios to see how it works as all the details come out. 

References (1)

References allow you to track sources for this article, as well as articles that were written in response to this article.

Reader Comments (3)

Hi Todd,

Thanks for talking a look at our research. Our approach to scaling ACID requires that input transactions feed through a layer that decides on the order that the DBMS will guarantee equivalence to. You can scale pretty far with this only being a single node (since all it has to do is receive xacts, batch them together, and forward them to the DBMS), or you can scale out this layer using traditional well known approaches. If you scale out this layer, you will incur additional latency, but the important thing is that this latency is entirely outside the critical path of the transaction, so you don't pay the lock contention costs (a huge barrier to scalability) that additional latency along the critical path of the transaction (within the DBMS) will cause.

September 3, 2010 | Unregistered CommenterDaniel Abadi

Hi Todd,

What VoltDB does right now by doing agreement on a global timestamp order is part of what a scaled out version of the preprocessor described in the paper might do. It does trade some latency for throughput, but the latency still manages to be single digit milliseconds.

What VoltDB does not do is the fine grained (and possibly higher order) lock analysis at the time the transaction is submitted as well as reordering the txns based on that analysis. If you think of a partitions as the lock (since they are single threaded and do not interleave transactions) then a transaction acquires a single lock (single partition transaction) or all locks (multi-partition txn).

Things are always in flux, but we are looking at implementing optimizations that are similar in that some "lock" analysis is performed at the time the txn is submitted.

One specific case we are looking at optimizing is a txn that operates on a small number of partitions (usually 2 or 3) with all data necessary to generate the queries and identify the involved partitions available at the time the txn is submitted. Such a txn could enter the system and a VoltDB preprocessor equivalent could run stored procedure code which can generate exactly one batch of SQL statements. At that point the txn can be entered into the global timestamp order and the plan fragments for the SQL statements can be sent to just the involved partitions which can execute them immediately and continue on to other work without blocking. The results are then returned to the procedure which has the option to commit or rollback, but not to generate new multi-partition SQL statements. This approach has the benefit of speculative execution without the need for locking or maintaining read write sets.

Usage for a process that transfers a balance from once account to another would look something like this:
txn #1
Get the balance from account A

txn #2
Get the balance from account B

txn #3
Decrement the balance in account A
Increment the balance in account B
Verify that the balance in A and B is as expected and commit or rollback

Such a transaction would scale linearly with cluster size because it only involves a constant number of partitions and does not cause any blocking. The advantage of the preprocessor approach described in the paper is that it can handle higher-order dependent transactions.

FYI Yang Zhang has a blog post responding to the paper where he says:

Here’s a half-baked idea: partition the central coordinator, and establish an ordering among the partitions. Time is divided into epochs, and the transactions received in each epoch are ordered according to the partition ordering. To prevent clock skew from drifting apart each partition’s notion of the same epoch (causing transactions to block on other transactions in the same epoch), the partitions would need to synchronize themselves every so often. This approach introduces some transaction latency, but hopefully not substantially more than typical transaction batching.

I think I need to see a written example to understand this.

Ariel Weisberg
VoltDB Developer

September 3, 2010 | Unregistered CommenterAriel Weisberg

The paper only references the Kemme/Allonso paper of 2000. Kemme later generalised that paper using snapshot isolation: paper here It seems to me the solution this paper suggests is the same as that second Kemme solution. If not, the paper should be more clear in explaining the difference between the two.

September 5, 2010 | Unregistered CommenterPaul

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>