
RAFT - In Search of an Understandable Consensus Algorithm

If, like many humans, you've found even Paxos Made Simple a bit difficult to understand, you might enjoy RAFT, as described in In Search of an Understandable Consensus Algorithm by Stanford's Diego Ongaro and John Ousterhout. The video presentation of the paper is given by John Ousterhout. Both the paper and the video are delightfully accessible.

mcherm has a good summary of the paper:

A consensus algorithm solves this problem: a cluster of servers should record a series of records ("log entries") in response to requests from clients of the cluster. (It may also take action based on those entries.) It does so in a way that guarantees the responses seen by clients of the cluster will be consistent EVEN in the face of servers crashing in unpredictable ways (but not losing data that was synced to disk) and networks introducing unpredictable delays or communication blockages.
Here's what Raft does. First, it elects a leader. The leader then records the master version of the log, telling the other cluster servers what's in that master record, and "commits" a log entry, responding to the client of the cluster to acknowledge that entry, only when more than half the cluster has recorded it. That works unless the leader crashes or loses communication with too many others; in such a case Raft elects a new leader. The election process is designed to guarantee that any newly elected leader will have (at least) all of the already-committed entries.
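The majority-commit rule in that summary can be sketched in a few lines. This is a hypothetical illustration, not code from the paper: `match_index` (a name borrowed from the paper's leader state) maps each follower to the highest log index it has acknowledged, and the leader counts itself toward the majority.

```python
def majority(cluster_size):
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1

def highest_committed_index(match_index, cluster_size):
    """Hypothetical sketch of the commit rule: the highest log index
    stored on a majority of servers (the leader always counts as one).
    Real Raft adds a further restriction: the entry at that index must
    also be from the leader's current term."""
    for n in sorted(set(match_index.values()), reverse=True):
        # The leader (1) plus every follower that has acknowledged index n.
        acks = 1 + sum(1 for m in match_index.values() if m >= n)
        if acks >= majority(cluster_size):
            return n
    return 0
```

For a 5-server cluster where the four followers have acknowledged indices 3, 3, 1, and 0, the leader plus two followers hold index 3, so entries up to index 3 are committed.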

We also have a treat in the form of a great roundtable discussion of the topic via a Think Distributed hangout, featuring several folks from Basho, Peter Bailis, and Diego Ongaro. 

Perhaps the most interesting part of the talk came late in the discussion, when Peter commented that he was astounded that an academic paper already has so many open source implementations. RAFT has 40 or so implementations in many different languages. 

The key lesson others can learn from is understandability. Most academic papers are opaque, to put it generously. Diego talks about this, saying:

  • Understandability came from the perspective of building the best system possible: ask what you need to know if you are building this system.
  • RAFT makes simplifying assumptions: it has a leader, it has a log, and leader election works in a specific way. These simplifications hold most of the time during normal operation, and by simplifying, the authors could write down a concrete explanation that people can understand. 
  • They built the system. Explaining without building leads to a lot of hand waving.
  • Academically this paper has been hard to publish. It lacks academic novelty in that it solves the same problem as Paxos. The key difference is that they took a systems-building approach, which obviously appeals to programmers more than academics. That is very sad for the industry; understandability should be valued in the academic world. 
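The election guarantee mentioned earlier, that any new leader already holds every committed entry, rests on one small rule from the paper: a server grants its vote only to a candidate whose log is at least as up-to-date as its own, comparing the terms of the last entries first and the log lengths second. A minimal sketch of that comparison (function and parameter names are mine):

```python
def log_is_up_to_date(cand_last_term, cand_last_index,
                      my_last_term, my_last_index):
    """Sketch of Raft's voting restriction: a candidate's log is
    "up-to-date" if its last entry has a higher term, or the same term
    and at least as high an index. Refusing to vote for candidates that
    fail this check is what ensures an elected leader already holds
    every committed entry."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Note that a shorter log can still win if its last entry has a higher term; only when terms tie does length decide.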

Some other points in the discussion:

  • Andrew Stone motivated the need for RAFT by pointing to all the single-point-of-failure Redis servers in existence.
  • There was a discussion of the lack of clarity in using AppendEntries to mean both HeartBeat and AppendEntries. I agree this is unclear in practice. While non-heartbeat messages can double as a heartbeat, what do you do when no messages are being sent? Sending an AppendEntries when nothing is being appended feels like trickiness without a real gain.
  • Peter Bailis asked if there has been any performance and scalability testing yet. The reply was no, so don't get too excited yet.
  • There was an interesting discussion of whether RAFT could be used for all RAMCloud operations. Should your availability be dependent on core data servers, consensus groups, or some other mechanism?
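On the heartbeat question: in the paper's design a heartbeat is simply an AppendEntries RPC with an empty entries list, sent when the leader has been idle for a heartbeat interval. A hypothetical sketch of that leader-side decision (names and structure are mine, not from any real implementation):

```python
import time
from dataclasses import dataclass, field

@dataclass
class AppendEntries:
    term: int
    entries: list = field(default_factory=list)  # empty list == heartbeat

def next_message(term, pending_entries, last_sent_at,
                 heartbeat_interval, now=None):
    """Hypothetical leader loop step: ship real entries as soon as there
    are any; otherwise send an empty AppendEntries (a pure heartbeat)
    once the interval has elapsed; otherwise send nothing."""
    now = time.monotonic() if now is None else now
    if pending_entries:
        return AppendEntries(term, list(pending_entries))
    if now - last_sent_at >= heartbeat_interval:
        return AppendEntries(term)  # heartbeat: same RPC, no entries
    return None
```

Because followers treat any AppendEntries as proof of leader liveness, the "no messages being sent" case reduces to sending the same RPC with nothing in it, which is exactly the overloading the panel found unclear.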

Reader Comments (2)

1. In practice we view the heartbeat as broadcast information to followers, which can contain liveness and command data. When the server is under high usage, each heartbeat will contain command data. When the server is under low usage, sending several bytes of data every second is not a problem.

2. We have tested Raft performance in our Go implementation, including leader election time and write operation time. For scalability, each Raft cluster should contain fewer than 20 machines in most cases.

We do face some problems when using Raft, such as how to decide timeouts and how to make sure slow machines will eventually catch up with the majority.

August 7, 2013 | Unregistered CommenterXiang Li

Dear Xiang Li,

You can solve your timeout problem "easily" by replacing your current fixed timeout with an adaptive scheme. This would allow you to build resistance against slowloris attacks or, simply put, just add more performance.

August 8, 2013 | Unregistered CommenterFerhat
