Paper: Heracles: Improving Resource Efficiency at Scale

Underutilization and segregation are the classic strategies for ensuring resources are available when work absolutely must get done. Keep a database on its own server so when the load spikes another VM or high priority thread can't interfere with RAM, power, disk, or CPU access. And when you really need fast and reliable networking you can't rely on QOS, you keep a dedicated line.

Google flips the script in Heracles: Improving Resource Efficiency at Scale, shooting for high resource utilization while combining different load profiles.

I'm assuming the project name Heracles was chosen not simply for his legendary strength, but because when strength failed, Heracles could always depend on his wits. Who can ever forget when Heracles tricked Atlas into taking the sky back onto his shoulders? Good times.

The problem: better utilization of compute resources while complying  service level objectives (SLOs) for latency-critical (LC) and best effort batch (BE) tasks:

Several studies have established that the average server utilization in most datacenters is low, ranging between 10% and 50%...The cost of such underutilization can be significant. For instance, Google websearch servers often have an average idleness of 30% over a 24 hour period. For a hypothetical cluster of 10,000 servers, this idleness translates to a wasted capacity of 3,000 servers.

The solution:

a) coordinated management of multiple isolation mechanisms is key to achieving high utilization without SLO violations; b) carefully separating interference into independent subproblems is effective at reducing the complexity of the dynamic control problem; and c) a local, real-time controller that monitors latency in each server is sufficient...Heracles manages 4 mechanisms to mitigate interference: core isolation, LLC isolation, power isolation, network traffic isolation.

The result:

We evaluate Heracles using production latency critical and batch work loads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated...Heracles also improves throughput/TCO by 15% to 300%.