Google's Paxos Made Live – An Engineering Perspective

Todd Hoff's picture

This is an unusually well written and useful paper. It talks in detail about experiences implementing a complex project, something we don't see very often. They shockingly even admit that creating a working implementation of Paxos was more difficult than just translating the pseudo code. Imagine that, programmers aren't merely typists! I particularly like the explanation of the Paxos algorithm and why anyone would care about it, working with disk corruption, using leases to support simultaneous reads, using epoch numbers to indicate a new master election, using snapshots to prevent unbounded logs, using MultiOp to implement database transactions, how they tested the system, and their openness with the various problems they had. A lot to learn here.

From the paper:
We describe our experience building a fault-tolerant data-base using the Paxos consensus algorithm.
Despite the existing literature in the field, building such a database proved to be non-trivial. We describe
selected algorithmic and engineering problems encountered, and the solutions we found for them. Our
measurements indicate that we have built a competitive system.

Introduction
It is well known that fault-tolerance on commodity hardware can be achieved through replication [17, 18]. A
common approach is to use a consensus algorithm [7] to ensure that all replicas are mutually consistent [8,
14, 17]. By repeatedly applying such an algorithm on a sequence of input values, it is possible to build an
identical log of values on each replica. If the values are operations on some data structure, application of
the same log on all replicas may be used to arrive at mutually consistent data structures on all replicas. For
instance, if the log contains a sequence of database operations, and if the same sequence of operations is
applied to the (local) database on each replica, eventually all replicas will end up with the same database
content (provided that they all started with the same initial database state).
This general approach can be used to implement a wide variety of fault-tolerant primitives, of which a
fault-tolerant database is just an example. As a result, the consensus problem has been studied extensively
over the past two decades. There are several well-known consensus algorithms that operate within a multitude
of settings and which tolerate a variety of failures. The Paxos consensus algorithm [8] has been discussed in
the theoretical [16] and applied community [10, 11, 12] for over a decade.
We used the Paxos algorithm (“Paxos”) as the base for a framework that implements a fault-tolerant
log. We then relied on that framework to build a fault-tolerant database. Despite the existing literature on
the subject, building a production system turned out to be a non-trivial task for a variety of reasons:
While Paxos can be described with a page of pseudo-code, our complete implementation contains several
thousand lines of C++ code. The blow-up is not due simply to the fact that we used C++ instead
of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into
a practical, production-ready system involved implementing many features and optimizations – some
published in the literature and some not.
• The fault-tolerant algorithms community is accustomed to proving short algorithms (one page of pseudo
code) correct. This approach does not scale to a system with thousands of lines of code. To gain
confidence in the “correctness” of a real system, different methods had to be used.
• Fault-tolerant algorithms tolerate a limited set of carefully selected faults. However, the real world
exposes software to a wide variety of failure modes, including errors in the algorithm, bugs in its implementation, and operator error. We had to engineer the software and design operational procedures
to robustly handle this wider set of failure modes.
• A real system is rarely specified precisely. Even worse, the specification may change during the im-
plementation phase. Consequently, an implementation should be malleable. Finally, a system might
“fail” due to a misunderstanding that occurred during its specification phase.

This paper discusses a selection of the algorithmic and engineering challenges we encountered in moving
Paxos from theory to practice. This exercise took more R&D efforts than a straightforward translation of
pseudo-code to C++ might suggest.

The rest of this paper is organized as follows. The next two sections expand on the motivation for this
project and describe the general environment into which our system was built. We then provide a quick
refresher on Paxos. We divide our experiences into three categories and discuss each in turn: algorithmic gaps
in the literature, software engineering challenges, and unexpected failures. We conclude with measurements
of our system, and some broader observations on the state of the art in our field.

Related Articles

  • ZooKeeper - A Reliable, Scalable Distributed Coordination System

  • Comments

    dkubb's picture

    Re: Google's Paxos Made Live – An Engineering Perspective

    Just the other day I came across a similar system called Scalaris (formerly Chord). The system is a fault-tolerant distributed data store with ACID compliant transactions. For a great introduction watch Building a transactional distributed data store with Erlang given at the Erlang eXchange conference in June. It was interesting to see how they tested the system by using the Wikipedia data, adapting their system to work with the data, and then subjecting it to an equivalent dynamic request load. Their system was able to handle the same loads with a dramatically lower amount of hardware and overall complexity.

    Todd Hoff's picture

    Re: Google's Paxos Made Live – An Engineering Perspective

    Interestingly Kent Langley at http://www.productionscale.com/home/2008/7/25/scalaris-one-to-watch.html has a nice post on Scalaris. I read a little and it looks interesting. It's just an attribute-value store however. If you want queries and sort orders and other hallmarks of civilization, you won't find it.

    Comment viewing options

    Select your preferred way to display the comments and click "Save settings" to activate your changes.

    Post new comment

    The content of this field is kept private and will not be shown publicly.
    • Web page addresses and e-mail addresses turn into links automatically.
    • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd><div ?=?><p ?=?> <img ?=?><h1 ?=?><h2 ?=?><h3 ?=?>
    • Lines and paragraphs break automatically.
    • Glossary terms will be automatically marked with links to their descriptions
    • You may link to webpages through the weblinks registry

    More information about formatting options

    To combat spam, please enter the code in the image.