Entries from January 30, 2011 - February 5, 2011


Stuff The Internet Says On Scalability For February 4, 2011

Submitted for your reading pleasure...

  • Super Bowl Prediction: Pittsburgh 27, Green Bay 24. I'll be rooting for Green Bay, but the Pittsburgh defense will eventually win the day, beating back the fleet footed, quick tossing, and sharp shooting Aaron Rodgers. Roethlisberger will make exactly 3 plays that matter, but they'll be the right 3 plays.
  • Reddit is now at 1 billion page views a month. Congratulations!
  • Amazon S3 Cloud Stores 262 Billion Objects.  My god, it's full of stars...
  • Quora’s Technology Examined by Phil Whelan. Excellent detective work answering the question: How Does Quora Work?
  • Quotable Quotes:
    • @timoreilly: When hardware became commoditized, software was valuable. Now that software being commoditized, data is valuable. #strataconf
    • @coldfusionPaul: "Write someone a query, they'll go away for a day. Teach someone to query, they'll just go away." so, I use #NoSQL 555
    • @squarecog: To go *really* fast, you want to get rid of spokes in your wheels, and ditch tires. Also, turning is overrated. #nosql

Click to read more ...


Piccolo - Building Distributed Programs that are 11x Faster than Hadoop

Piccolo (not this or this) is a system for distributed computing. Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Unlike traditional data-centric models (such as Hadoop), which present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simultaneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.
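To make the table idea concrete, here is a minimal single-process sketch of PageRank written against a shared, mutable key-value table. It is only an illustration: the `Table` class, its `get`/`put`/`update` methods, and the `pagerank` function are hypothetical names invented for this example and are not Piccolo's actual API. Each partition is a plain dict standing in for the memory of one machine, and the loop over pages stands in for code that would run in parallel across machines.

```python
class Table:
    """Hypothetical stand-in for a partitioned key-value table (not Piccolo's API)."""

    def __init__(self, num_partitions, accumulate):
        # Each dict plays the role of one partition held in a machine's memory.
        self.partitions = [{} for _ in range(num_partitions)]
        self.accumulate = accumulate

    def _partition(self, key):
        # Keys are assigned to partitions by hash.
        return self.partitions[hash(key) % len(self.partitions)]

    def get(self, key):
        return self._partition(key)[key]

    def put(self, key, value):
        self._partition(key)[key] = value

    def update(self, key, value):
        # Merge a new value into the stored one using the accumulation function.
        part = self._partition(key)
        part[key] = self.accumulate(part[key], value) if key in part else value

    def items(self):
        for part in self.partitions:
            yield from part.items()


def pagerank(graph, iterations=20, damping=0.85, num_partitions=4):
    """Iterate PageRank over `graph`, a dict of page -> list of out-links."""
    n = len(graph)
    curr = Table(num_partitions, accumulate=lambda a, b: a + b)
    for page in graph:
        curr.put(page, 1.0 / n)

    for _ in range(iterations):
        nxt = Table(num_partitions, accumulate=lambda a, b: a + b)
        for page in graph:
            nxt.put(page, (1.0 - damping) / n)
        # Every page reads its own rank and pushes a share to each out-link.
        # Writes to the same key are combined by the accumulation function.
        for page, out_links in graph.items():
            share = damping * curr.get(page) / len(out_links)
            for target in out_links:
                nxt.update(target, share)
        curr = nxt

    return dict(curr.items())


# Tiny example graph; every page has at least one out-link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The point of the sketch is the shape of the program: any part of the computation can read or update any key, and conflicting updates to one key are resolved by summing, so the code reads much like a single-machine program.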

Using an in-memory key-value store is a very different approach from the canonical map-reduce, which is based on using distributed file systems. The results are impressive:

Experiments have shown that Piccolo is fast and provides excellent scaling for many applications. The performance of PageRank and k-means on Piccolo is 11× and 4× faster than that of Hadoop. Computing a PageRank iteration for a 1 billion-page web graph takes only 70 seconds on 100 EC2 instances. Our distributed webcrawler can easily saturate a 100 Mbps internet uplink when running on 12 machines.

Piccolo was presented at OSDI '10. For the details, take a look at the paper, Piccolo: Building Fast, Distributed Programs with Partitioned Tables. There's also a slide deck and a video of the talk (very good).

Click to read more ...


Google Strategy: Tree Distribution of Requests and Responses

If a large number of leaf node machines send requests to a central root node then that root node can become overwhelmed:

  • The CPU becomes a bottleneck, for either processing requests or sending replies, because it can't possibly deal with the flood of requests.
  • The network interface becomes a bottleneck because a wide fan-in causes TCP drops and retransmissions, which causes latency. Then clients start retrying requests which quickly causes a spiral of death in an undisciplined system.

One solution to this problem is a strategy given by Dr. Jeff Dean, Head of Google's School of Infrastructure Wizardry, in this Stanford video presentation: Tree Distribution of Requests and Responses.

Instead of having a root node connected to leaves in a flat topology, the idea is to create a tree of nodes. So a root node talks to a number of parent nodes and the parent nodes talk to a number of leaf nodes. Requests are pushed down the tree through the parents and only hit a subset of the leaf nodes.
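As a rough illustration of the difference in fan-in, here is a small sketch that compares a flat topology with a two-level tree. Everything in it is an assumption made for the example: the `Leaf` and `Parent` classes, the `build_tree` helper, the fanout of 10, and the substring "search" are hypothetical and not taken from the talk. For simplicity every leaf is queried, so the sketch does not model routing a request to only a subset of leaves.

```python
class Leaf:
    """A leaf node holding a few documents (hypothetical)."""

    def __init__(self, docs):
        self.docs = docs

    def handle(self, query):
        # Return every local document that contains the query string.
        return [doc for doc in self.docs if query in doc]


class Parent:
    """A node that forwards a request to its children and merges the replies."""

    def __init__(self, children):
        self.children = children
        self.messages_received = 0

    def handle(self, query):
        results = []
        for child in self.children:
            results.extend(child.handle(query))
            # One reply arrives per direct child, however many leaves sit below it.
            self.messages_received += 1
        return results


def build_tree(leaves, fanout):
    """Group leaves under intermediate parents, then hang the parents off a root."""
    groups = [leaves[i:i + fanout] for i in range(0, len(leaves), fanout)]
    return Parent([Parent(group) for group in groups])


leaves = [Leaf([f"page {i}"]) for i in range(100)]

# Flat topology: the root talks to all 100 leaves directly.
flat_root = Parent(leaves)
flat_results = flat_root.handle("page 7")

# Tree topology: the root talks to 10 parents, each of which talks to 10 leaves.
tree_root = build_tree(leaves, fanout=10)
tree_results = tree_root.handle("page 7")
```

In this toy setup both topologies return the same results, but the flat root handles 100 replies per request while the tree root handles 10, with the remaining work spread across the intermediate parents.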

With this solution:

Click to read more ...


Sponsored Post: Karmasphere, Kabam, Opera Solutions, Percona, Appirio, Newrelic, Cloudkick, Membase, EA, Joyent, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

  • Percona Live to be held in San Francisco February 16th, 2011. A one day event run by the experts behind the MySQL Performance Blog.
  • A new round of Membase meetups has been planned for January 2011 in San Diego, Denver, Seattle, Vancouver and Chicago.

Cool Products and Services

Click to read more ...