
Strategy: Google Sends Canary Requests into the Data Mine

Google runs queries against thousands of in-memory index nodes in parallel and then merges the results. One of the interesting problems with this approach, explains Google's Jeff Dean in this lecture at Stanford, is the Query of Death.

A query can cause a program to fail because of bugs or other issues. Because the same query is sent to every node, a single bad query can take down an entire cluster of machines, which hurts both availability and response times, since it takes quite a while for thousands of machines to recover. Hence the name: Query of Death. New queries are always coming into the system, and when you are constantly rolling out new software, it's impossible to get rid of the problem completely.

Two solutions:

  • Test against logs. Google replays a month's worth of logs to see if any of those queries kill anything. That helps, but Queries of Death may still happen.
  • Send a canary request. A request is first sent to one machine. If the request succeeds then it will probably succeed on all machines, so go ahead with the query. If the request fails then only one machine is down, no big deal. Now try the request again on another machine to verify that it really is a Query of Death. If the request fails a certain number of times then the request is rejected and logged for further debugging.
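The canary idea can be sketched in a few lines of Python. This is only an illustration of the steps described above, not Google's implementation: the names `run_with_canary`, `send`, `QueryRejected`, and the `max_canary_failures` threshold are all made up for the example, a crash is modeled as `send` raising an exception, and the fan-out is done sequentially where a real system would run it in parallel. It also assumes that nodes which already failed a canary are simply skipped.

```python
import logging
from typing import Callable, List, Sequence

log = logging.getLogger("canary")


class QueryRejected(Exception):
    """Raised when a query fails on too many canary nodes (hypothetical name)."""


def run_with_canary(
    query: str,
    nodes: Sequence[str],
    send: Callable[[str, str], str],  # assumed: send(node, query) raises if the node dies
    max_canary_failures: int = 2,     # assumed threshold; the talk only says "a certain number"
) -> List[str]:
    failures = 0
    for i, node in enumerate(nodes):
        try:
            # Canary: try the query on a single node before touching the rest.
            canary_result = send(node, query)
        except Exception as exc:
            failures += 1
            log.warning("canary for %r failed on %s: %s", query, node, exc)
            if failures >= max_canary_failures:
                # Looks like a Query of Death: reject it and log it for debugging.
                log.error("rejecting query %r after %d canary failures", query, failures)
                raise QueryRejected(query) from exc
            continue  # verify on another machine
        # Canary succeeded, so fan out to the nodes that have not been tried yet.
        remaining = nodes[i + 1:]
        return [canary_result] + [send(n, query) for n in remaining]
    # Ran out of nodes before any canary succeeded.
    raise QueryRejected(query)
```

With a threshold of two, a query that crashes every node it touches costs two machines rather than the whole cluster, at the price of one extra round trip on every query.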

The result is that only a few servers crash instead of thousands. This is a pretty clever technique, especially given the combined trends of scale-out and continuous deployment. It could also be a useful strategy for others.

Reader Comments (2)

I can imagine rolling out a possibly lethal query across all the servers once, twice, but how many times should a query prove itself before being marked with a 'standard query' tag?

November 22, 2010 | Unregistered CommenterScooletz

Seen this in Hadoop, where a map (somehow) triggers a server failure: Hadoop sees that the machine is down and reschedules the job to other machines, with the same outcome. It's pretty hard to do this (you need to fill the filesystem, take down the tasktracker, or do something else dramatic), but it is possible.

November 22, 2010 | Unregistered CommenterSteveL
