« Sponsored Post: Nokia, Oracle, Percona Live, AiCache, ElasticHosts, Logic Monitor, Attribution Modeling, New Relic, AppDynamics, CloudSigma, ManageEngine, Site24x | Main | Stuff The Internet Says On Scalability For March 9, 2012 »

Google: Taming the Long Latency Tail - When More Machines Equals Worse Results

Likewise the current belief that, in the case of artificial machines the very large and the very small are equally feasible and lasting is a manifest error. Thus, for example, a small obelisk or column or other solid figure can certainly be laid down or set up without danger of breaking, while the large ones will go to pieces under the slightest provocation, and that purely on account of their own weight. -- Galileo

Galileo observed how things broke if they were naively scaled up. Interestingly, Google noticed a similar pattern when building larger software systems using the same techniques used to build smaller systems. 

Luiz André Barroso, Distinguished Engineer at Google, talks about this fundamental property of scaling systems in his fascinating talk, Warehouse-Scale Computing: Entering the Teenage Decade. Google found the larger the scale the greater the impact of latency variability. When a request is implemented by work done in parallel, as is common with today's service oriented systems, the overall response time is dominated by the long tail distribution of the parallel operations. Every response must have a consistent and low latency or the overall operation response time will be tragically slow. The implication: high performance equals high tolerances, which means your entire system must be designed to exacting standards.

What is forcing a deeper look into latency variability is the advent of interactive real-time computing. Responsiveness becomes key. Good average response times aren't good enough. You simply can't naively scale up techniques to build larger systems. The reason is surprising and has deep implications on how we design service dominated systems:

Google Likes Request Level Parallelism, Which is Easy

Google really likes request level parallelism, where a query is sent to many machines, but the code that runs each machine does not need to be parallelized. This is really nice and simple. To answer one query, for example, Instant Search has to instantly answer 5 or 6 queries. But it's easy to code.

Wimpy Cores Require Parallelization, Which is Hard

If you have wimpy CPUs, that is CPUs that are slow in order to be really energy efficient, then you are going to get to the point where you are going to need to parallelize each piece of code, which is harder. 

It doesn’t matter if you make a CPU infinitely energy efficient if it’s responsible for on 30-40% of your budget. You can at most improve energy efficiency by 30%. There are costs in terms of engineering time and performance for going to wimpy. It's hard to code.

Scale Properties of Warehouse Scale Computing Require Reliably Low Latency

Warehouse scale computing is the term Google invented to capture their idea the datacenter is the computer. Warehouse-scale Computing considers computer resources to be fungible, that is they are interchangeable and location independent.  Individual computers lose identity and become just a part of a service. What’s really important for Google is aggregate performance of the entire system because they do so much work in parallel.

That vision puts novel performance restrictions on your warehouse as a whole, particularly the latency tail, which Luiz illustrates with an example.

Imagine a client making a request of a single web server. Ninety-nine times out of a hundred that request will be returned within an acceptable period of time. But one time out of hundred it may not. Say the disk is slow for some reason. If you look at the distribution of latencies, most of them are small, but there's one out on the tail end that's large. That's not so bad really. All it means is one customer gets a slightly slower response every once in a while.

Lets' change the example, now instead of one server you have 100 servers and a request will require a response from all 100 servers. That changes everything about your system's responsiveness. Suddenly the majority of queries are slow. 63% will take greater than 1 second. That's bad.

Using the same components and scaling them results in a really unexpected outcome. This is a fundamental property of scaling systems: you need to worry not just about not latency, but tail latency, that is the longer events in your system. High performance equals high tolerances. At scale you can’t ignore tail latency.

A Networking Example of Latency Tail

In Microstorm in a teacup: Are you suffering from the long tail effect? there's a good discussion on the impact of long tail problems on trading:

A long tail distribution in a trading network can represent the distribution of latency experienced by trades The majority of trades experience very low latency, less than 5 milliseconds, but a small number can experience trade latencies upwards of 900ms.  Latency can be symptomatic of a long tail anomaly such as a bandwidth saturating microburst.

This latency could come from: RCP Library, DNS lookups, packet loss, microbursts, deep queues, high task response latency, locking, garbage collection, OS stack issues, router/switch overhead, transiting multiple hops, or slow processing code. 

For a good discussion how sensitive TCP is to packet loss and buffering, take a look at SPDY: What I Like About You.

To see how tricky these problems are to degug and correct, StackExchange had a microburst problem the was a devil to find: Per Second Measurements Don’t Cut It.

The implication of all of this is that this is real engineering. Anyone can stand up a web server and make a decent website. Building something large is completely different in kind. It takes real skills and an incredible attention to detail at every level. Every part of system must work reliably within bounded tolerances at all times. Not everyone can make this work.

Flash has High Latency so May Not Save the World

You might expect that flash would enter and save the world. Not so fast. Luiz explains that flash is great, but it is also depressing at the same time.

Great because random reads are 5 orders of magnitude faster for RAM when compared to disk. Flash bridges the gap between RAM and disk, in terms of latency and bandwidth, but especially bandwidth.  

Depressing because flash is really just a glorified EEPROM. You can’t really write to it. You can erase whole chunks of it and begin reprogramming it. Erasing is incredibly slow. You also have to try and minimize the number of erases. This creates the kind of performance idiosyncrasies you don’t have with disk drives.

With flash you can read 4KB of data in 100 microseconds or so, if your read is stuck behind an erase you may have wait 10s of milliseconds. That’s a 100x increase in latency variance for that particular variance that used to be very fast.  

These effects are real because they impact the latency tail, so in a strange way disk is better than flash at scale.

Disks Suck for Random IO and Flash is Good at Random IO

But flash will win in the end because flash has a better random IO story than disk. Disks keep getting better and better in terms of space, a little faster for sequential access, but have stayed constant for random access at 100 random reads/sec.

Disks will continue to get bigger and cheaper on a per GB basis, but that doesn't matter, because the drives are getting so big you can’t actually use that extra capacity effectively.

Software will need to find a way to mitigate the problems of flash.

How Will Latency Improve?

In Paper review: warehouse-scale computing: entering the teenage decade, Andrew Wang has a good summary of the current latency situation:

I/O latency variability right now is terrible, with basically all durable storage displaying a long latency tail. Random accesses to spinning disks are slow, flash writes are slow, and these high-latency events muck up the latency for potentially fast events. Network I/O suffers a similar problem. Using TCP and interrupts adds orders of magnitude of latency to network requests, making fast network hardware slow again in software.

To summarize, there are two big ideas in the talk. First, latency and variation in latency are the key performance metrics for services these days; today's web-based applications demand both to provide a good user experience. This may require reexamination of a lot of fundamental assumptions about IO. Second, increasing the utilization of resources in a cluster is important from an efficiency and performance standpoint. Server hardware should be a fungible resource that can be easily shared among different services.

In accordance with our high tolerance theme, Luiz has an ambitious research and development agenda for squeezing out the latency in the entire stack. More on that in The Three Ages Of Google - Batch, Warehouse, Instant.

I’m really surprised that Google hasn’t simply ripped all those layers out and run their abstractions directly on the hardware, used a simpler networking technology, etc. If the warehouse is the computer, why aren’t they making a custom system?

Some Software Techniques

These weren't in the talk, but I found them interesting and related, so why not toss them in?

Tree of Distribution Responses

This was an article I did while ago on: Google Strategy: Tree Distribution Of Requests And Responses:

The idea is to create a tree of nodes. So a root node talks to a number of parent nodes and the parent nodes talk to a number of leaf nodes. Requests are pushed down the tree through the parents and only hit a subset of the leaf nodes.

With this solution:

  • Fan-in at each level of the tree is manageable. The CPU cost of processing requests and responses is spread out across all the parents, which reduces the CPU and network bottlenecks.
  • Response filtering and data reduction. Ideally the parent can provide a level of response filtering so the root only sees a subset of the response data. This further reduces the network and CPU needed by the root.
  • Collocation. The parent can be collocated with leaves on one rack, which keeps all that traffic off your datacenter networks. 

Focus on the 99%

From Middleware, Fault-Tolerance and the Magical 1% A Study of Unpredictability:

We present an extensive empirical study of unpredictability in 16 distributed systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications, and we show that the inherent unpredictability in these systems arises from a magical 1% of the remote invocations

The magical 1% suggests that, while the emergent behavior of middleware systems is not strictly predictable, enterprise applications could cope with the inherent unpredictability by focusing on statistical performance indicators, such as the 99th percentile of the end-to-end latency.

Latency Tied to Blocking Rather then Queueing

From Moving Network Server Latency Off the Disk Speed Curve:

We find that faster processors improve server capacity, but have little effect on latency. By experimenting with workloads of various sizes, we determine that when disk accesses occur, both mean and median latencies increase, though the median should be unaffected. We trace the roots of this problem to head-of-line blocking within filesystem-related kernel queues. This behavior, in turn, causes batching and burstiness, which has little impact on throughput, but severely degrades latency. By examining individual request latencies, we find that this blocking gives rise to a phenomenon we call service inversion, where requests are served unfairly.

Service inversion, where short requests are often served with much higher latencies than much larger requests. We also find that this phenomenon increases with load, and that it is responsible for most of the growth in server latency.

By addressing the blocking issues both in the Apache and the Flash server, we improve latency by more than an order of magnitude, and demonstrate a qualitatively different change in the latency profiles. The resulting servers also exhibit higher capacity, lower burstiness, and more fair request handling across a wide range of workloads. Finally, our results show that most server-induced latency is tied to blocking effects, rather than queuing

Related Articles

Reader Comments (8)

I'd disagree somewhat on the disks are better than ssd part, disks have really ugly latencies and latency variations, be it when there is some reallocation going on (order of seconds and up to 30 seconds) or even if it is just reordering IOs to improve its bandwidth metrics (latencies may vary from 10ms to 2 seconds). A single disk may be ok but at scale you'll surely have plenty of bad disks around.

March 12, 2012 | Unregistered CommenterBaruch

I believe you're using tolerances incorrectly. I asked a friendly mechanical engineer and she responded:

"If you increase the tolerance, you are increasing the acceptable range for dimensions on parts, meaning that precision is lowered. (ex: 4.50 inches + or - 0.01" versus 4.5 inches + or - 0.1")

I think we tend to use "tight tolerances" for extremely precise work.

It is confusing for sure."

So you should instead write "High Performance equals Low Tolerances"

March 12, 2012 | Unregistered CommenterIan Danforth

Great post. Latency is truly hard to control unless one is willing to sacrifice efficiency (system resource utility, redundancies, etc.) Intuitively, it would seem that a probabilistic approach to distribution should allow us to model systems at scale more effectively.

Now, since mechanics has its foundation in geometry, where mere size cuts no figure, I do not see that the properties of circles, triangles, cylinders, cones and other solid figures will change with their size. If, therefore, a large machine be constructed in such a way that its parts bear to one another the same ratio as in a smaller one, and if the smaller is sufficiently strong for the purpose for which it was designed, I do not see why the larger also should not be able to withstand any severe and destructive tests to which it may be subjected.

Somewhat off topic but the above was a surprise (considering the author). "Size" certainly "cuts a figure" in geometry. Compare the area of a circle of radius r and another at r' = r*k. Known since Euclid.

@Ian: Think Tolerance of measurement error to dispel confusion.

March 13, 2012 | Unregistered Commenteralphazero

When many disks are polled together to form the RAID array, to act collectively to the application, each long response from a single drive slows downs all application requests. In storage world, the receipt is to have a cache layer in front of the RAID array, so that the long tail effect caused by long latency tail can be minimized. What we are seeing in warehouse scale computing is another manifestation of the similar problem.

March 19, 2012 | Unregistered CommenterMQ

I agree re: the misuse of the term tolerance in this article.

A high tolerance has always meant "loose SLA" to me.

Good article otherwise, and crucial for engineers working on large scale web systems to understand.

March 28, 2012 | Unregistered CommenterET

Coiuld you explain how you calculated the 63% of slow requests? Or was it a measurement?

November 20, 2013 | Unregistered CommenterWalter Kriha


The service is fast 99% of the time.

The chance of the service being fast every single time for 100 times is .99^100.

Since just it being slow once makes the whole request slow:
The chance of it being slow is 1 - .99^100 = 63%.

September 11, 2014 | Unregistered CommenterSteve

I am a newbie for this field of discussion here.
But I have a query,
If trading off efficiency to reduce the latency really worked, then why don't we collect every system or hardware parameters and allocate requests based on their capacity. I mean, from the data in hand from every component, we can predict the occurance of microburst kind of disruption beforehand and do necessary path modifications to make the request handled by a different machine in acceptable amount of time.This would also most likely reduces the unpredictability of the system or increases the chances of getting a response all the 100 times within acceptable time frame.

December 27, 2016 | Unregistered CommenterKrishna kumar

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>