At Scale Even Little Wins Pay Off Big - Google and Facebook Examples

There's a popular line of thought that says don't waste time on optimization because developing features is more important than saving money. True, you can always add resources, but at some point, especially in the more mature part of a product's lifecycle, performance equals $$$.

Two great examples of this evolution come from Facebook and Google. The upshot: when you spend time and money optimizing your tool chain, you can get huge wins in performance, control, and cost. Certainly don't bother if you are just starting out, but at some point you may want to shift serious development effort toward improving efficiency.

Facebook and HipHop

The Facebook example is quite well known: HipHop, a static PHP-to-C++ compiler released in 2010 after two years of development. Why PHP? Because Facebook implements its web tier in PHP. They've since developed a dynamic compiler, the HipHop VM, using techniques like JIT compilation, side exits, HipHop bytecode, type prediction, and parallel tracelet linking.
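
To make the tracelet idea a little more concrete, here's a minimal C sketch of type prediction with a side exit: compiled code guards on the types it expects and falls back to a generic path when the guess is wrong. The tagged-value layout and the "add" semantics below are invented for illustration; they are not HHVM's actual internals.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical tagged value; HHVM's real layout differs. */
    typedef enum { T_INT, T_STR } Tag;
    typedef struct { Tag tag; long i; const char *s; } Value;

    /* Generic slow path: what the interpreter would do for any type mix. */
    static long interpret_add(Value a, Value b) {
        long x = (a.tag == T_INT) ? a.i : (long)strlen(a.s);
        long y = (b.tag == T_INT) ? b.i : (long)strlen(b.s);
        return x + y;
    }

    /* A "tracelet" compiled under the prediction that both operands are
     * ints: a cheap type guard, a specialized fast body, and a side exit
     * back to the generic path when the prediction fails. */
    static long traced_add(Value a, Value b) {
        if (a.tag == T_INT && b.tag == T_INT)   /* type guard */
            return a.i + b.i;                   /* specialized fast path */
        return interpret_add(a, b);             /* side exit */
    }

    int main(void) {
        Value a = { T_INT, 2, NULL }, b = { T_INT, 3, NULL };
        printf("%ld\n", traced_add(a, b));      /* hits the fast path */
        return 0;
    }

The payoff of this structure is that the common case runs straight-line machine code, and only mispredicted types pay the cost of the generic path.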

This is an incredible development effort as well as an immense effort in migrating their development and deployment infrastructure. Why is Facebook bothering? Why not just add more resources?

Even the early version of HipHop reduced CPU usage on their web servers by 50%, and less CPU usage means fewer servers. With more development effort, Facebook was able to serve 70% more traffic on the same hardware.

With the huge number of servers Facebook has, that’s serious money.

Control is a win too. The new HipHop VM, for example, will reduce release times, as only thin bytecode deltas need to be shipped.

Google and String Operations

In the no-change-is-too-small category is Automated Locality Optimization Based on the Reuse Distance of String Operations from Google (via Greg Linden).

First, it helps to know that Google runs mixed workloads on its servers, which is why processor cache efficiency matters. They use most of their machines as generalized job execution engines. Not even Google has infinite resources, so they share resources. This also fits with their "the datacenter is the computer" philosophy. Every node runs Linux along with a scheduling daemon that schedules work across the cluster. Jobs of various types (CPU-intensive, MapReduce, etc.) will be running on the same machine that runs a Bigtable tablet server. Jobs are not isolated on servers; jobs impact each other, and those impacts must be managed.

According to the paper, in a function profile across all Google datacenter applications, string operations like memcpy, memset, and memcmp take 2 of the top 10 spots. The problem is:

String operations can cause performance problems because they flush large portions of the processor caches. Consider the case where the memcpy source is read again immediately thereafter, but the destination is not reused for a while. If the destination memory is not in cache, this will have a doubly negative effect. First, writing requires bringing each line to cache. To do so, previously cached data will have to be written back. That makes already two cache-to-memory operations. Moreover, since the cache lines will not be reused for a while, this effectively reduces the usable size of the cache.

Interesting. But is this something most of us would worry about? No way. Just add more resources. Google, however, has an unknown but probably ginormous number of servers, and at that scale these functions are a source of performance problems, so there's a substantial payback for solving the problem.
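
To see why the pattern from the quote hurts, here's a hedged C sketch of the scenario: the source buffer is reused immediately after the copy, while the destination is written once and then left alone, so every destination line the copy drags into the cache is pure pollution. The buffer names and the 8 MB size are made up for the example.

    #include <string.h>

    #define N (8 * 1024 * 1024)   /* 8 MB: much larger than typical caches */

    static char src[N];   /* hot: read again right after the copy     */
    static char dst[N];   /* cold: not touched again for a long while */

    long checksum_after_copy(void) {
        memcpy(dst, src, N);       /* write-allocates every dst line,
                                      evicting useful data (incl. src) */
        long sum = 0;
        for (size_t i = 0; i < N; i++)
            sum += src[i];         /* source reuse now misses in cache */
        return sum;
    }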

The paper describes a rather involved solution to predict string operations with a large reuse distance and replace them with cache-pollution-reducing operations. The "initial improvement numbers, while not earth-shattering, are substantial." Substantial is all you need when you have so many servers.
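
Cache-pollution-reducing operations here amount to non-temporal (streaming) stores, which write the destination to memory without allocating cache lines for it. A minimal x86 sketch using the SSE2 intrinsics _mm_load_si128 and _mm_stream_si128 might look like the following; it assumes 16-byte-aligned buffers whose size is a multiple of 16, and it's an illustration of the technique, not Google's actual implementation.

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */

    /* Copy that avoids polluting the cache with destination lines.
     * Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
    void memcpy_nontemporal(void *dst, const void *src, size_t n) {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_load_si128(s + i);   /* normal cached load */
            _mm_stream_si128(d + i, v);          /* non-temporal store:
                                                    bypasses the cache */
        }
        _mm_sfence();   /* streaming stores are weakly ordered; fence
                           before anyone reads the destination */
    }

In the reuse-distance scenario above, a copy like this never displaces the hot source data with destination lines, which is exactly the effect the paper is after.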