« Piccolo - Building Distributed Programs that are 11x Faster than Hadoop | Main | Sponsored Post: Karmasphere, Kabam, Opera Solutions, Percona, Appirio, Newrelic, Cloudkick, Membase, EA, Joyent, CloudSigma, ManageEngine, Site24x7 »

Google Strategy: Tree Distribution of Requests and Responses

If a large number of leaf node machines send requests to a central root node then that root node can become overwhelmed:

  • The CPU becomes a bottleneck, for either processing requests or sending replies, because it can't possibly deal with the flood of requests.
  • The network interface becomes a bottleneck because a wide fan-in causes TCP drops and retransmissions, which causes latency. Then clients start retrying requests which quickly causes a spiral of death in an undisciplined system.

One solution to this problem is a strategy given by Dr. Jeff Dean, Head of Google's School of Infrastructure Wizardry, in this Stanford video presentation: Tree Distribution of Requests and Responses.

Instead of having a root node connected to leaves in a flat topology, the idea is to create a tree of nodes. So a root node talks to a number of parent nodes and the parent nodes talk to a number of leaf nodes. Requests are pushed down the tree through the parents and only hit a subset of the leaf nodes.

With this solution:

  • Fan-in at each level of the tree is manageable. The CPU cost of processing requests and responses is spread out across all the parents, which reduces the CPU and network bottlenecks.
  • Response filtering and data reduction. Ideally the parent can provide a level of response filtering so the root only sees a subset of the response data. This further reduces the network and CPU needed by the root.
  • Collocation. The parent can be collocated with leaves on one rack, which keeps all that traffic off your datacenter networks. 

In Google's search system, for example:

  • Leaves generate their best 10 or 15 responses.
  • Parents return the best 20-30 responses out of the 30 leaves the parent is responsible for.
  • This is a large degree of data reduction compared to the case the root had to process all that data directly.

Reader Comments (3)

Seems to me it was described already by Scimore in 2004. Check this: http://www.scimore.com/doc2/Network_Communications.html.

February 2, 2011 | Unregistered Commentermarius

Not as old as 2004, but this is what we call "fractal scaling" in the case where it is done with CouchDB in HTTP (see the figures 1/2 into the article)


February 7, 2011 | Unregistered CommenterJ Chris Anderson

The provided link to the video seems dead.
I watched the first half of it last week and would be very interessted to see the rest.

February 15, 2011 | Unregistered CommenterHaegar

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>