If like many humans you've found even Paxos Made Simple a bit difficult to understand, you might enjoy RAFT as described in In Search of an Understandable Consensus Algorithm by Stanford's Diego Ongaro and John Ousterhout. The video presentation of the paper is given by John Ousterhout. Both the paper and the video are delightfully accessible.
mcherm has a good summary of the paper:
A consensus algorithm is: a cluster of servers should record a series of records ("log entries") in response to requests from clients of the cluster. (It may also take action based on those entries.) It does so in a way that guarantees that the responses seen by clients of the cluster will be consistent EVEN in the face of servers crashing in unpredictable ways (but not loosing data that was synched to disk), and networks introducing unpredictable delays or communication blockages.Here's what Raft does. First, it elects a leader, then the leader records the master version of the log, telling other cluster servers what's in that master record and "committing" a log entry then responding to the client of the cluster to acknowledge that entry only when more than half the cluster has recorded a given entry. That works unless the leader crashes or loses communication with too many others; in such a case Raft elects a new leader. The election process is designed to guarantee that any newly elected leader will have (at least) all of the already-committed entries.
We also have a treat in the form of a great roundtable discussion of the topic via a Think Distributed hangout, featuring several folks from Basho, Peter Bailis, and Diego Ongaro.
Perhaps the most interesting part of the talk came late in the discussion when Peter commented that he was astounded that an academic paper already has so many open source implementations. RAFT already has 40 or so different implementations in many different languages.
The key that others can learn from is: understandability. Most academic papers are opaque, to put it generously. Diego talks about this saying:
- Understandability came from the perspective of building the best system possible. Asking what do you need to know if you are building this system?
- RAFT made simplifying assumptions in having a leader, having a log, and how leader election occurs. These simplifications are fine most of the time during operation, but by simplifying they can write down a concrete explanation that people can understand.
- They built the system. Explaining without building leads to a lot of hand waving.
- Academically this paper has been hard to publish. It lacks academic novelty in that it solves the same problem as Paxos. The key difference is if you look at the code they took a systems building approach, which obviously appeals to programmers more than academics. Which is very sad for the industry. Understandability should be valued in the academic world.
Some other points in the discussion:
- Andrew Stone brought up the need for RAFT by using all the single point of failure Redis servers in existence.
- There was a discussion of a lack of clarity of using an AppendEntry to mean both HeartBeat and AppendEntry. I agree this is unclear in practice. While non-heartbeat messages can indicate a heartbeat, what do you do when there are no messages being sent? Sending an AppendEntry when nothing is being appended is trickiness without a real gain.
- Peter Bailis asks if there's been any performance and scalability testing yet? The reply was no, so don't get too excited yet.
- There's an interesting discussion if using RAFT could be used for all RAMCloud operations. Should your availability be dependendent on core dataservers or consensus groups or some other method?