Paper: Consensus Protocols: Paxos  

Update: Barbara Liskov’s Turing Award, and Byzantine Fault Tolerance.

Henry Robinson has created an excellent series of articles on consensus protocols. We already covered his 2 Phase Commit article, and he also has a 3 Phase Commit article showing how to handle 2PC under single node failures. But that is not enough! 3PC works well under node failures, but fails for network failures. So another consensus mechanism is needed that handles both network and node failures. And that's Paxos.

Paxos correctly handles both types of failures, but it does this by becoming unavailable if too many components fail. It stays safe by giving up "liveness", the property that a protocol eventually makes progress. Paxos waits until the faults are fixed. Read queries can be handled, but updates will be blocked until the protocol thinks it can make forward progress. The liveness of Paxos is primarily dependent on network stability. In a distributed heterogeneous environment you are at risk of losing the ability to make updates. Users hate that.

So when companies like Amazon do the seemingly insane thing of creating eventually consistent databases, it should be a little easier to understand now. Partitioning is required for scalability. Partitioning brings up these nasty consensus issues. Not being able to write under partition failures is unacceptable. Therefore, create a system that can always write, and work on consistency when all the downed partitions/networks are repaired.
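As a rough illustration of why Paxos stops accepting updates when too many nodes are unreachable, here is a minimal single-decree sketch. It is not taken from Henry Robinson's article; the `Acceptor` class, the `propose` function, and the `up` flag used to simulate failures are all hypothetical names, and real implementations also deal with messaging, retries, and durable state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    up: bool = True                      # False simulates a crashed or unreachable node
    promised: int = -1                   # highest proposal number promised so far
    accepted_n: int = -1                 # number of the proposal accepted, if any
    accepted_value: Optional[str] = None

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if not self.up or n <= self.promised:
            return None
        self.promised = n
        return (self.accepted_n, self.accepted_value)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if not self.up or n < self.promised:
            return False
        self.promised = n
        self.accepted_n = n
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Return the chosen value, or None if no majority could be reached."""
    majority = len(acceptors) // 2 + 1

    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None  # blocked: safe, but no forward progress

    # Safety rule: adopt the value of the highest-numbered proposal
    # that any responding acceptor has already accepted.
    _, previous_value = max(promises, key=lambda p: p[0])
    if previous_value is not None:
        value = previous_value

    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


if __name__ == "__main__":
    cluster = [Acceptor() for _ in range(5)]
    cluster[0].up = cluster[1].up = False
    print(propose(cluster, 1, "x=1"))  # 3 of 5 reachable: prints x=1
    cluster[2].up = False
    print(propose(cluster, 2, "x=2"))  # 2 of 5 reachable: prints None
    cluster[0].up = True
    print(propose(cluster, 3, "x=3"))  # majority again, earlier choice kept: prints x=1
```

The second call shows the lost update the post complains about: with only a minority reachable, nothing is chosen. The third call shows what that blocking buys: once a majority is back, the earlier decision survives.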

Related Articles

  • Google's Paxos Made Live – An Engineering Perspective
  • ZooKeeper - A Reliable, Scalable Distributed Coordination System
  • Impossibility of Distributed Consensus with One Faulty Process by Fischer, Lynch, and Paterson
  • Consensus, impossibility results and Paxos by Ken Birman
  • Paxos for System Builders by Jonathan Kirsch and Yair Amir

  • Friday

    Cloud Programming Directly Feeds Cost Allocation Back into Software Design

    Update 6: CARS = Cost Aware Runtimes and Services by William Louth.
    Update 5: Damn You Google, Damn You Yahoo! Why D'Ya Do This to Us? Free accounts on a cloud platform are a constant drain of money.
    Update 4: Caching becomes even more important in CPU based billing environments. Avoiding the CPU means saving money.
    Update 3: An interesting simple example of this idea showed up on the Google AppEngine list. With one paging algorithm and one use of AJAX the yearly cost of the site was $1000. By changing those algorithms the site went under quota and became free again. This will make life a lot more interesting for developers.
    Update 2: Business Model Influencing Software Architecture by Brandon Watson. The profitability of your project could disappear overnight on account of code behaving badly.
    Update: Amazon adds Elastic Block Store at $0.10 per 1 million I/O requests. Now I need some cost minimization storage algorithms!

    In the GAE Meetup yesterday a very interesting design rule came up: Design By Explicit Cost Model. A clumsy name I know, but it is explained like this:


    If you are going to be charged for an operation, GAE wants you to explicitly ask for it. This is why some automatic navigation between objects isn't provided: its absence forces you to write an explicit query. Writing an explicit query is a sort of EULA for being charged. Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.

    Usually in programming the costs we talk about are time, space, latency, bandwidth, storage, person hours, etc. Listening to the Google folks talk about how one of their explicit design goals was to require programmers to be mindful of operations that will cost money made me realize that in cloud programming, cost will be another aspect of design we'll have to factor in.

    Instead of asking only for the Big O complexity of an algorithm, we'll also have to ask for the Big $ (or Big Euro) notation so we can judge an algorithm by its cost against a particular cloud profile. Maybe something like $(CPU=1.3, DISK=3, IN-BANDWIDTH=2, OUT-BANDWIDTH=3, DB=10). You could look at the Big $ notation for an algorithm and shake your head, saying that approach will never work for GAE, but it could work for Amazon. Can we find a cheaper Big $?
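    A Big $ comparison could be sketched roughly as below. Everything here is an assumption for illustration: the profile names, the unit prices, the resource names, and the two algorithms' usage figures are made up and are not real GAE or Amazon rates.

```python
# Hypothetical unit prices in dollars per unit of each resource.
PROFILES = {
    "cpu_cheap": {"cpu_hours": 0.10, "storage_gb": 0.15, "bandwidth_gb": 0.20},
    "cpu_scarce": {"cpu_hours": 1.00, "storage_gb": 0.05, "bandwidth_gb": 0.10},
}

# Hypothetical monthly resource usage of two ways to solve the same problem.
ALGORITHMS = {
    "recompute_on_request": {"cpu_hours": 100, "storage_gb": 10, "bandwidth_gb": 50},
    "precompute_and_store": {"cpu_hours": 10, "storage_gb": 200, "bandwidth_gb": 50},
}


def big_dollar(usage, profile):
    """Dollar cost of a usage vector under one cloud price profile."""
    return sum(profile[resource] * amount for resource, amount in usage.items())


def cheapest(algorithms, profile):
    """Name of the algorithm with the lowest cost under the given profile."""
    return min(algorithms, key=lambda name: big_dollar(algorithms[name], profile))


if __name__ == "__main__":
    for profile_name, profile in PROFILES.items():
        for algo_name, usage in ALGORITHMS.items():
            print(f"{profile_name:10s} {algo_name:22s} ${big_dollar(usage, profile):7.2f}")
        print(f"{profile_name:10s} cheapest: {cheapest(ALGORITHMS, profile)}")
```

    With these invented numbers the CPU-heavy algorithm wins where CPU is cheap and loses where CPU is scarce, which is the kind of flip the Big $ idea is meant to expose.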

    Typically infrastructure costs are part of the capital budget. Someone ponies up for the hardware and software is then "free" until more infrastructure is needed. The dollar cost of software design isn't usually an explicit factor considered.

    Now software design decisions are part of the operations budget. Every algorithm decision you make will have a dollar cost associated with it, and it may become more important to craft algorithms that minimize operations cost across a large number of resources (CPU, disk, bandwidth, etc.) than it is to trade off our old friends space and time.

    Different cloud architectures will force very different design decisions. Under Amazon CPU is cheap, whereas under GAE CPU is a scarce commodity. Applications will not be easily ported between the two niches.

    Don't be surprised if soon you go into an interview and they quiz you on Big $ notation and skip the dusty old relic that is Big O notation :-)


    Product: Lightcloud - Key-Value Database

    Lightcloud is a distributed and persistent key-value database. Performance is said to be comparable to memcached. It's different from memcachedb because it scales out horizontally by adding new nodes. It's different from memcached because it persists to disk; it's not just a cache. Now you have one more option in the never-ending quest to ditch the RDBMS. Their website does a nice job explaining the system:

  • Built on Tokyo Tyrant. One of the fastest key-value databases [benchmark]. Tokyo Tyrant has been in development for many years and is used in production by, and (to name a few)...
  • Great performance (comparable to memcached!)
  • Can store millions of keys on very few servers - tested in production
  • Scale out by just adding nodes
  • Nodes are replicated via master-master replication. Automatic failover and load balancing is supported from the start
  • Ability to script and extend using Lua. Included extensions are incr and a fixed list
  • Hot backups and restore: Take backups and restore servers without shutting them down
  • LightCloud manager can control nodes, take backups and give you a status on how your nodes are doing
  • Very small foot print (lightcloud client is around ~500 lines and manager about ~400)
  • Python only, but LightCloud should be easy to port to other languages
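    The list above says nodes can be added to scale out but does not say how keys are assigned to nodes. One common technique for that is consistent hashing; the sketch below illustrates that general idea only and is not presented as LightCloud's implementation. The `HashRing` class, the node names, and the `replicas` parameter are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node, smooths the distribution
        self._ring = []           # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        """Place the node's virtual points on the ring."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def get_node(self, key):
        """Return the node owning the first point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {k: ring.get_node(k) for k in keys}
    ring.add_node("node-d")
    moved = sum(1 for k in keys if ring.get_node(k) != before[k])
    print(f"{moved} of {len(keys)} keys moved after adding a node")
```

    The point of the design is that adding a node only takes keys away from existing nodes and hands them to the new one; keys never shuffle between the old nodes.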

  • Thursday

    Product: Amazon Simple Storage Service

    Update: Amazon S3 Performance Report. How fast is S3? Their own study found 10 to 12 MB/second when storing and receiving files, plus 140 ms per file stored as a fixed overhead cost.

    Update: A Quantitative Comparison of Rackspace and Amazon Cloud Storage Solutions. S3 isn't the only cloud storage service out there. Mosso says they can save you money while offering support. There are a number of scenarios in their paper, but for 5TB of cloud storage Mosso says it will save you 17% over S3 without support and 42% with support. For their CDN on a global test, Mosso says the average response time is 333ms for CloudFront vs. 107ms for Cloud Files, which means that globally Cloud Files is 3.1 times, or 211%, faster than CloudFront.

    Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers. This service allows you to link directly to files at a cost of 15 cents per GB of storage and 20 cents per GB of transfer.
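    The performance figures quoted in the updates above can be turned into a back-of-the-envelope estimate. The sketch below assumes a simple model that the source does not spell out (transfer time = fixed overhead + size / throughput), takes the low end of the 10 to 12 MB/second range, and uses hypothetical function names.

```python
def s3_transfer_seconds(size_mb, throughput_mb_s=10.0, overhead_s=0.140):
    """Estimated time for one file: fixed per-file overhead plus streaming time."""
    return overhead_s + size_mb / throughput_mb_s


def speedup(slow_ms, fast_ms):
    """Return (ratio, percent faster) for two response times."""
    ratio = slow_ms / fast_ms
    return ratio, (ratio - 1) * 100


if __name__ == "__main__":
    one_big = s3_transfer_seconds(10)             # a single 10 MB file
    many_small = 1000 * s3_transfer_seconds(0.01)  # the same 10 MB as 1000 files of 10 KB
    print(f"one 10 MB file:     {one_big:.2f} s")
    print(f"1000 files of 10KB: {many_small:.2f} s")

    ratio, percent = speedup(333, 107)             # CloudFront vs. Cloud Files, per Mosso
    print(f"{ratio:.1f} times, or {percent:.0f}% faster")
```

    Under this model the 140 ms overhead dominates for small files, so batching many small objects into fewer large ones matters far more than raw throughput. The last line reproduces the 3.1 times and 211% figures from the Mosso comparison.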


    Strategy: In Cloud Computing Systematically Drive Load to the CPU

    Update 2: Linear Bloom Filters by Edward Kmett. A Bloom filter is a novel data structure for approximating membership in a set. A Bloom join conserves network bandwidth in exchange for cheaper, more plentiful local CPU utilization and disk IO.

    Update: What are Amazon EC2 Compute Units? Cloud providers charge for CPU time in voodoo units like "compute units" and "core hours." Geva Perry takes on the quest of figuring out what these mean in real life.

    I attended Sebastian Stadil's AWS Training Camp Saturday, and during the class Sebastian brought up a wonderfully counter-intuitive idea: CPU (EC2) costs a lot less than storage (S3, SDB), so you should systematically move as much work as you can to the CPU. This is said to be the Client-Cloud Paradigm. It leverages the well pummeled trend that CPU power follows Moore's Law while storage follows The Great Plains' Law (flat). And what sane computing professional would do battle with Sir Moore and his trusty battle sword of a law?

    Embedded systems often make similar environmental optimizations. CPU rich and memory poor means operate on compressed serialized data structures. Deserialized data structures use a lot of memory, so why use them? It's easy enough to create an object wrapper around a buffer. Programmers shouldn't care how their objects are represented anyway. Yet we waste ginormous amounts of time and memory uselessly transforming XML in and out of different representations. Just transport compressed binary objects around and use them in place. Serialization and deserialization happen only on access (Pimpl Idiom).

    It never occurred to me that in the land of AWS plenty similar "tricks" would make sense. But EC2 is a loss leader in AWS. CPU is plentiful and cheap. It's IO and storage that cost you... The implication is that in your system design you should try to use EC2 as much as possible:

  • Compress data. Saves on bandwidth and storage (the expensive bits) and uses cheaper CPU to compress/decompress.
  • Slurp data. Latency cost is higher than performing operations locally. SDB can take up to 400 msecs between data centers and 200 msecs inside the same data center. This is very slow. It's usually faster, but it can take that long. Following the more traditional serial processing path of "get a record do a record" will take forever and cost more. Slurp up all your records from SDB and farm them out to your CPU nodes to be worked on in parallel.
  • Think parallel. Do multiple operations at once on your cheap CPUs rather than serially performing high latency operations on expensive storage. With enough nodes, total execution time approaches max latency.
  • Client side joins. Pull all data from the relatively expensive SDB and perform client side joins on relatively cheap EC2 nodes.
  • Leverage SQS. It's a relatively cheap part of the ecosystem. Keeping a work queue in SDB would be far more expensive.

    When all the implications are fully explored it's a little different take on designing a system.

    I found some interesting numbers in a Slashdot thread comparing values:

    No persistent storage; not great value: And it's still not a great value. It seems cheap. $72/mo for a 1.7GB RAM server. Well, look at Slicehost and you can get a 2GB RAM Xen instance (same virtualization software as EC2) for $140 WITH persistent storage and 800GB of bandwidth. That doesn't sound like a great deal UNTIL you calculate what EC2 bandwidth costs. 800GB would cost you $144 at $0.18 per GB bringing the total cost to $216 ($76 more than Slicehost). That 18 cents doesn't sound like much, but it adds up. The same situation happens with Joyent. For $250 you get a 2GB RAM server from them (running under Solaris' Zones) with 10TB of bandwidth. That would cost you $1,872 with EC2. Even if you assume that you'll only use 10% of what Joyent is giving you, EC2 still comes in at a cost of $252 - and without persistent storage!
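    Going back to the bullet list above, the compress, slurp, think parallel, and client side join ideas can be sketched together in a few lines. This is only an illustration under stated assumptions: the in-memory `USERS` and `ORDERS` dictionaries stand in for a remote store such as SDB, `fetch` stands in for a network read, and all names are hypothetical rather than any AWS API.

```python
import json
import zlib
from concurrent.futures import ThreadPoolExecutor


def pack(record):
    """Compress a record so storage and bandwidth, the expensive parts, stay small."""
    return zlib.compress(json.dumps(record).encode("utf-8"))


def unpack(blob):
    """Spend cheap local CPU to decompress and deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical stand-ins for two remote tables holding compressed values.
USERS = {f"user:{i}": pack({"id": i, "name": f"user{i}"}) for i in range(4)}
ORDERS = {f"order:{i}": pack({"id": i, "user_id": i % 4, "total": 10 * i}) for i in range(8)}


def fetch(store, key):
    """Pretend remote read; a real client would make a high-latency call here."""
    return store[key]


def slurp(store, workers=8):
    """Fetch every record concurrently instead of 'get a record, do a record'."""
    keys = sorted(store)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blobs = list(pool.map(lambda key: fetch(store, key), keys))
    return [unpack(blob) for blob in blobs]


def client_side_join(users, orders):
    """Join orders to users locally rather than asking the store to do it."""
    users_by_id = {user["id"]: user for user in users}
    return [
        {"order": order["id"], "name": users_by_id[order["user_id"]]["name"], "total": order["total"]}
        for order in orders
    ]


if __name__ == "__main__":
    for row in client_side_join(slurp(USERS), slurp(ORDERS)):
        print(row)
```

    With real network reads in place of `fetch`, the reads overlap in the thread pool, so wall-clock time tends toward the slowest single request rather than the sum of all of them.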

  • Wednesday

    It's time for auto scaling – avoid peak load provisioning for web applications

    Many web applications, including eBanking, Trading, eCommerce and Online Gaming, face large, fluctuating loads. This post describes how to achieve Right Sizing using virtualization and cloud computing. It uses a standard JEE web application to demonstrate how auto-scaling works on AWS Cloud without changing your application code.


    Enterprise Architecture Conference by John Zachman: Johannesburg (25th March), Cape Town (27th March), Dubai (23rd March)

    Why You Need To Attend THIS CONFERENCE

  • Understand the multi-dimensional view of business-technology alignment
  • A sense of urgency for aggressively pursuing Enterprise Architecture
  • A "language" (i.e., a Framework) for improving enterprise communications about architecture issues
  • An understanding of the cultural changes implied by process evolution. How to effectively use the framework to anchor processes and procedures for delivering service and support for applications
  • An understanding of basic Enterprise physics
  • Recommendations for Sr. Managers to understand the political realities and organizational resistance in realizing the EA vision, and some excellent advice for overcoming these barriers
  • A number of practical examples of how to work with people who affect decisions on EA implementation
  • How to create value for your organization by systematically recording assets, processes, connectivity, people, timing and motivation, through a simple framework

    For registrations, group discounts or further details please contact


    Advanced BPM program in USA and India: discount for Group Membership

    One-day Advanced BPM Certified program led by Global Leader Steve Towers. Latest case studies and innovations: hands-on, practical.

    Event locations, USA (www.BESTBPMTRAINING.COM):

  • San Francisco, 16 Mar 09
  • Atlanta, 17 Mar 09
  • New York, 19 Mar 09
  • Chicago, 20 Mar 09

    Event locations, India (www.BPMTRAININGNOW.COM):

  • Mumbai, 23 Mar 09
  • Bangalore, 24 Mar 09
  • Hyderabad, 26 Mar 09
  • Delhi, 27 Mar 09

    For more information please visit For registrations, group discounts or further details please contact


    Relating business, systems & technology during turbulent times, by John Zachman

    If you want to understand Complexity and Contradiction in IT Architecture and are struggling to manage non-adaptive and dysfunctional systems, you don't want to miss this one-day certified conference by John Zachman in Dubai on 23rd March 2009. For more details visit For registrations, group discounts or further details please contact


    Learn how to manage change and complexity by Zachman Live.

    John Zachman (Father of enterprise architecture). Given this renascent interest, who better to explain the principles behind Enterprise Architecture than the man himself, John Zachman, the originator of the "Zachman Framework for Enterprise Architecture"? Join this workshop in Johannesburg on 25th Mar 09 and Cape Town on 27th March 09, and Mr. Zachman will explain how and why Enterprise Architecture provides measure. Such an implementation is a daunting task, with opportunities to fail lurking in many places. For more details visit For registrations, group discounts or further details please contact
