Google on Latency Tolerant Systems: Making a Predictable Whole Out of Unpredictable Parts  
Monday, June 18, 2012 at 9:16AM
Todd Hoff in Strategy, google

In Taming The Long Latency Tail we covered Luiz Barroso’s exploration of the long tail latency (some operations are really slow) problems generated by large fanout architectures (a request is composed of potentially thousands of other requests). You may have noticed there weren’t a lot of solutions. That’s where a talk I attended, Achieving Rapid Response Times in Large Online Services (slide deck), by Jeff Dean, also of Google, comes in:

In this talk, I’ll describe a collection of techniques and practices lowering response times in large distributed systems whose components run on shared clusters of machines, where pieces of these systems are subject to interference by other tasks, and where unpredictable latency hiccups are the norm, not the exception.

The goal is to use software techniques to reduce variability given the increasing variability in underlying hardware, the need to handle dynamic workloads on a shared infrastructure, and the need to use large fanout architectures to operate at scale.

Two forces motivate Google’s work on latency tolerance:

Others try to reduce variability by overprovisioning or running only like workloads on machines. Google makes use of all their resources, but sharing disk, CPU, network, etc, increases variability. Jobs will burst CPU, memory, or network usage, so running a combination of background activities on machines, especially with large fanouts, leads to unpredictable results.

You may think complete and total control would solve the problem, but the more you tighten your grip, the more star systems will slip through your fingers. At a small scale careful control can work, but the disadvantages of large fanout architectures and shared resource execution models must be countered with good design.

Fault Tolerant vs Latency Tolerant Systems

Dean makes a fascinating analogy between creating fault tolerant and latency tolerant systems. Fault tolerant systems make a reliable whole out of unreliable parts by provisioning extra resources. In the same way latency tolerant systems can make a predictable whole out of unpredictable parts. The difference is in the time scales:

Tolerating variability requires having a fine trigger so the response can be immediate. Your scheduling system must be able to make decisions within these real-time time frames.

Mom and Apple Pie Techniques for Managing Latency

These techniques the “general good engineering practices” for managing latency:

Cross Request Adaptation Strategies

The idea behind these strategies is to examine recent behavior and take action to improve latency of future requests within tens of seconds or minutes. The strategies are:

My notes on this section are really bad, so that's all I have for this part of the talk. Hopefully we can fill it in as more details become available.

Within-request Adaptation Strategies

The idea behind these strategies is to fix a slow request as it is happening. The strategies are:

Backup Requests with Cross-Server Cancellation

Backup requests are the idea of sending requests out to multiple replicas, but in a particular way. Here’s the example for a read operation for a distributed file system client:

A request could wait in a queue stuck behind an expensive query or a packet could be dropped, so if a reply is not returned quickly other replicas are tried. Responses come back faster if requests sit in multiple queues.

Cancellation reduces the frequency at which redundant work occurs.

You might think this is just a lot of extra traffic, but remember, the goal is to squeeze down the 99th percentile distribution, so the backup requests, even with what seems like a long wait time really bring down the tail latency and standard deviation. Requests in the 99th percentile take so long that the wait time is short in comparison.

Bigtable, for example, used backup requests after two milliseconds, which dramatically dropped the 99th percentile by 43 percent on an idle system. With a loaded system the reduction was 38 percent with only one percent extra disk seeks. Backups requests with cancellations gives the same distribution as an unloaded cluster.

With these latency tolerance techniques you are taking a loaded cluster with high variability and making it perform like an unloaded cluster.

There are many variations of this strategy. A backup request could be marked with a lower priority so it won't block real work. Send to a third cluster after a longer delay.Wait times could be adjusted so requests are sent when the wait time hits the 90th percentile.

Tainted Results - Proactively Abandon Slow Subsystems

Related Articles

Article originally appeared on (
See website for complete article licensing information.