This is an unusually well written and useful paper. It talks in detail about experiences implementing a complex project, something we don't see very often. They shockingly even admit that creating a working implementation of Paxos was more difficult than just translating the pseudo code. Imagine that, programmers aren't merely typists! I particularly like the explanation of the Paxos algorithm and why anyone would care about it, working with disk corruption, using leases to support simultaneous reads, using epoch numbers to indicate a new master election, using snapshots to prevent unbounded logs, using MultiOp to implement database transactions, how they tested the system, and their openness with the various problems they had. A lot to learn here. From the paper: We describe our experience building a fault-tolerant data-base using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system. Introduction It is well known that fault-tolerance on commodity hardware can be achieved through replication [17, 18]. A common approach is to use a consensus algorithm  to ensure that all replicas are mutually consistent [8, 14, 17]. By repeatedly applying such an algorithm on a sequence of input values, it is possible to build an identical log of values on each replica. If the values are operations on some data structure, application of the same log on all replicas may be used to arrive at mutually consistent data structures on all replicas. For instance, if the log contains a sequence of database operations, and if the same sequence of operations is applied to the (local) database on each replica, eventually all replicas will end up with the same database content (provided that they all started with the same initial database state). This general approach can be used to implement a wide variety of fault-tolerant primitives, of which a fault-tolerant database is just an example. As a result, the consensus problem has been studied extensively over the past two decades. There are several well-known consensus algorithms that operate within a multitude of settings and which tolerate a variety of failures. The Paxos consensus algorithm  has been discussed in the theoretical  and applied community [10, 11, 12] for over a decade. We used the Paxos algorithm (“Paxos”) as the base for a framework that implements a fault-tolerant log. We then relied on that framework to build a fault-tolerant database. Despite the existing literature on the subject, building a production system turned out to be a non-trivial task for a variety of reasons: While Paxos can be described with a page of pseudo-code, our complete implementation contains several thousand lines of C++ code. The blow-up is not due simply to the fact that we used C++ instead of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into a practical, production-ready system involved implementing many features and optimizations – some published in the literature and some not. • The fault-tolerant algorithms community is accustomed to proving short algorithms (one page of pseudo code) correct. This approach does not scale to a system with thousands of lines of code. To gain confidence in the “correctness” of a real system, different methods had to be used. • Fault-tolerant algorithms tolerate a limited set of carefully selected faults. However, the real world exposes software to a wide variety of failure modes, including errors in the algorithm, bugs in its implementation, and operator error. We had to engineer the software and design operational procedures to robustly handle this wider set of failure modes. • A real system is rarely specified precisely. Even worse, the specification may change during the im- plementation phase. Consequently, an implementation should be malleable. Finally, a system might “fail” due to a misunderstanding that occurred during its specification phase. This paper discusses a selection of the algorithmic and engineering challenges we encountered in moving Paxos from theory to practice. This exercise took more R&D efforts than a straightforward translation of pseudo-code to C++ might suggest. The rest of this paper is organized as follows. The next two sections expand on the motivation for this project and describe the general environment into which our system was built. We then provide a quick refresher on Paxos. We divide our experiences into three categories and discuss each in turn: algorithmic gaps in the literature, software engineering challenges, and unexpected failures. We conclude with measurements of our system, and some broader observations on the state of the art in our field.
Several months ago I attended a Joyent presentation where the spokesman hinted that Joyent had the chops to support a one billion page per month Facebook Ruby on Rails application. Even under a few seconds of merciless grilling he would not give up the name of the application. Now we have the big reveal: it was LinkedIn's Bumper Sticker app. For those not currently sticking things on bumps, Bumper Sticker is quite surprisingly a viral media sharing application that allows users to express their individuality by sticking small virtual stickers on Facebook profiles. At the time I was quite curious how Joyent's cloud approach could be leveraged for this kind of app. Now that they've released a few details, we get to find out.
* Joyent is a scale on demand cloud. Allows customers to have a dynamic data center instead of being stuck using their own rigid infrastructure.
* There's an API if you need one. The service is unmanaged, you get root on all your boxes.
* They consider their infrastructure to be better and more open than Amazon. You get access to a high end load balancer and the capabilities of OpenSolaris (Dtrace, Zones, lower request processing overhead, sub 10 second reboot times).
* Joyent's primary scalability principle is to organize apps around silos built from their powerful Accelerator blocks: put applications on different servers based on the quality of service you want to give them. For example, put static content on their own servers so the static content is always served fast and reliably. This allows you to prioritize based on what's important to you. You could, for example, prioritize the virality of your application by putting the Invite Friends functionality on their own servers, thus assuring the growth of your application through your viral functionality possibly at the expense of less important functionality.
* Has three data centers in the US and are opening a fourth, none in Europe.
* Considers their secret sauce to be their highly sophisticated administration system which allows a few people to easily manage a large infrastructure.
* Has a peering relationship with Facebook. That means there are direct high-speed fiber links between Joyent’s data center in Emeryville and Facebook’s data center in San Francisco.
Update: InfoQ links to a few excellent Eucalyptus updates: Velocity Conference Video by Rich Wolski and a Visualization.com interview Rich Wolski on Eucalyptus: Open Source Cloud Computing. Eucalyptus is generating some excitement on the Cloud Computing group as a potential vendor neutral EC2 compatible cloud platform. Two reasons why Eucalyptus is potentially important: private clouds and cloud portability: Private clouds. Let's say you want a cloud like infrastructure for architectural purposes but you want it to run on your own hardware in your own secure environment. How would you do this today? Hm.... Cloud portability. With the number of cloud offerings increasing how can you maintain some level of vendor neutrality among this "swarm" of different options? Portability is a key capability for cloud customers as the only real power customers have is in where they take their business and the only way you can change suppliers is if there's a ready market of fungible services. And the only way their can be a market is if there's a high degree of standardization. What should you standardize on? The options are usually to form a great committee and take many years to spec out something that doesn't exist, nobody will build, and will never really work. Or have each application create a high enough layer interface that portability is potentially difficult, but possible. Or you can take a popular existing API, make it the general API, and everyone else is accommodated using an adapter layer and the necessary special glue to take advantage of value add features for each cloud. With great foresight Eucalyptus has chosen to create a cloud platform based on Amazon's EC2. As this is the most successful cloud platform it makes a lot of sense to use it as a model. We see something similar with the attempts to port Google AppEngine to EC2 thus making GAE a standard framework for web apps. So developers would see GAE on top of EC2. A lot of code would be portable between clouds using this approach. Even better would be to add ideas in from RightScale, 3Tera, and Mosso to get a higher level view of the cloud, but that's getting ahead of the game. Just what is Eucalyptus? From their website: Overview ¶ Elastic Computing, Utility Computing, and Cloud Computing are (possibly synonymous) terms referring to a popular SLA-based computing paradigm that allows users to "rent" Internet-accessible computing capacity on a for-fee basis. While a number of commercial enterprises currently offer Elastic/Utility/Cloud hosting services and several proprietary software systems exist for deploying and maintaining a computing Cloud, standards-based open-source systems have been few and far between. EUCALYPTUS -- Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems -- is an open-source software infrastructure for implementing Elastic/Utility/Cloud computing using computing clusters and/or workstation farms. The current interface to EUCALYPTUS is interface-compatible with Amazon.com's EC2 (arguably the most commercially successful Cloud computing service), but the infrastructure is designed to be modified and extended so that multiple client-side interfaces can be supported. In addition, EUCALYPTUS is implemented using commonly-available Linux tools and basic web service technology making it easy to install and maintain. Overall, the goal of the EUCALYPTUS project is to foster community research and development of Elastic/Utility/Cloud service implementation technologies, resource allocation strategies, service level agreement (SLA) mechanisms and policies, and usage models. The current release is version 1.0 and it includes the following features: * Interface compatibility with EC2 * Simple installation and deployment using Rocks cluster-management tools * Simple set of extensible cloud allocation policies * Overlay functionality requiring no modification to the target Linux environment * Basic "Cloud Administrator" tools for system management and user accounting * The ability to configure multiple clusters, each with private internal network addresses, into a single Cloud. The initial version of EUCALYPTUS requires Xen to be installed on all nodes that can be allocated, but no modifications to the "dom0" installation or to the hypervisor itself. For more discussion see:
A report from the CloudCamp conference on cloud computing, held in London in July 2008.
Given S3's recent failure (Cloud Status tells the tale) Kevin Burton makes the excellent suggestion of fronting S3 with a caching proxy server. A caching proxy server can reply to service requests without contacting the specified server, by retrieving content saved from a previous request, made by the same client or even other clients. This is called caching. Caching proxies keep local copies of frequently requested resources. In normal operation when an asset (a user's avatar, for example) is requested the cache is tried first. If the asset is found in the cache then it's returned. If the asset is not in the cache it's retrieved from S3 (or wherever) and cached. So when S3 goes down it's likely you can ride out the down time by serving assets out of the cache. This strategy only works when using S3 as a CDN. If you are using S3 for its "real" purpose, as a storage service, then a caching proxy can't help you... Amazon doesn't used S3 as a CDN either Amazon Not Building Out AWS To Compete With CDNs. They use Limelight Networks. Some proxy options are: Squid, Nginx, Varnish. Planaroo shares how a small startup responds to an S3 outage (summarized):
Robert Scoble in an often poignant FriendFeed thread commiserating PodTech's unfortunate end, shared what he learned about creating a successful startup. Here's a summary of a Robert's rules and why Machiavelli just may agree with them:
Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow. Here are a few of the pithier points (summarized):
ZooKeeper is a high available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates key configuration information. ZooKeeper can be used for leader election, group membership, and configuration maintenance. In addition ZooKeeper can be used for event notification, locking, and as a priority queue mechanism. It's a sort of central nervous system for distributed systems where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging. Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decide to build a good one and open source it for everyone to use. The target market for ZooKeeper are multi-host, multi-process C and Java based systems that operate in a data center. ZooKeeper works using distributed processes to coordinate with each other through a shared hierarchical name space that is modeled after a file system. Data is kept in memory and is backed up to a log for reliability. By using memory ZooKeeper is very fast and can handle the high loads typically seen in chatty coordination protocols across huge numbers of processes. Using a memory based system also mean you are limited to the amount of data that can fit in memory, so it's not useful as a general data store. It's meant to store small bits of configuration information rather than large blobs. Replication is used for scalability and reliability which means it prefers applications that are heavily read based. Typical of hierarchical systems you can add nodes at any point of a tree, get a list of entries in a tree, get the value associated with an entry, and get notification of when an entry changes or goes away. Using these primitives and a little elbow grease you can construct the higher level services mentioned above. Why would you ever need a distribute coordination system? It sounds kind of weird. That's more the question I'll be addressing in this post rather than how it works because the slides and the video do a decent job explaining at a high level what ZooKeeper can do. The low level details could use another paper however. Reportedly it uses a version of the famous Paxos Algorithm to keep replicas consistent in the face of the failures most daunting. What's really missing is a motivation showing how you can use a coordination service in your own system and that's what I hope to provide... Kevin Burton wants to use ZooKeeper to to configure external monitoring systems like Munin and Ganglia for his Spinn3r blog indexing web service. He proposes each service register its presence in a cluster with ZooKeeper under the tree "/services/www." A Munin configuration program will add a ZooKeeper Watch on that node so it will be notified when the list of services under /services/www changes. When the Munin configuration program is notified of a change it reads the service list and automatically regenerates a munin.conf file for the service. Why not simply use a database? Because of the guarantees ZooKeeper makes about its service:
Some Fast Facts
In the more cool stuff I've never heard of before department is something called Self Cleansing Intrusion Tolerance (SCIT). Botnets are created when vulnerable computers live long enough to become infected with the will to do the evil bidding of their evil masters. Security is almost always about removing vulnerabilities (a process which to outside observers often looks like a dog chasing its tail). SCIT takes a different approach, it works on the availability angle. Something I never thought of before, but which makes a great deal of sense once I thought about it. With SCIT you stop and restart VM instances every minute (or whatever depending in your desired window vulnerability).... This short exposure window means worms and viri do not have long enough to fully infect a machine and carry out a coordinated attack. A machine is up for a while. Does work. And then is torn down again only to be reborn as a clean VM with no possibility of infection (unless of course the VM mechanisms become infected). It's like curing cancer by constantly moving your consciousness to new blemish free bodies. Hmmm... SCIT is really a genius approach to scalable (I have to work in scalability somewhere) security and and fits perfectly with cloud computing and swarm (cloud of clouds) computing. Clouds provide plenty of VMs so there is a constant ready supply of new hosts. From a software design perspective EC2 has been training us to expect failures and build Crash Only Software. We've gone stateless where we can so load balancing to a new VM is not problem. Where we can't go stateless we use work queues and clusters so again, reincarnating to new VMs is not a problem. So purposefully restarting VMs to starve zombie networks was born for cloud computing. If a wider move could be made to cloud backed thin clients the internet might be a safer place to live, play, and work. Imagine being free(er) from spam blasts and DDOS attacks. Oh what a wonderful world it would be...
Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.