Snakes in a Facebook Datacenter

What do you do when you find a snake in your datacenter? You might say this. (NSFW)

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. You might think Facebook solved all of its fault tolerance problems long ago, but when a serpent enters the Edenic datacenter realm, even Facebook must consult the Tree of Knowledge.

In this case, it's not good or evil we'll learn about, but Workload Placement: a method of optimally placing work across a set of failure domains.

Here's my gloss of the Fault Tolerance through Optimal Workload Placement talk:

  • A large snake once caused a short in one of Facebook's main switchboards. All the servers connected to the device went down, as did the snake. The loss of the switchboard triggered cascading failures for some major services, which forced all user traffic in that datacenter to be drained to other datacenters.
  • Servers powered by one main switchboard are grouped into a fault domain. When that main switchboard went down, they lost one fault domain. A datacenter contains many fault domains.
  • The server capacity of the downed fault domain was less than 3% of the datacenter's total. Why then did it cause such a big disruption? Unfortunate workload placement. That one fault domain contained a large proportion of the capacity of some major services. Some services lost over 50% of their capacity, which caused cascading failures, which in turn caused user traffic to be drained from the entire datacenter.
  • There can be many other causes of fault domain level failures: fire, lightning, water leaks, routine maintenance. The number of incidents will increase 9x as the number and size of regions increase, so it's important to solve this problem in a better way than draining a datacenter's user traffic.
  • Goal is to be able to lose one fault domain without losing the whole datacenter. Key is the placement of hardware, services, and data. You can't seamlessly lose a fault domain if over 50% of the capacity for a service is in the fault domain.
  • It's not an isolated issue. Services are poorly spread across fault domains because fault domains were not taken into consideration when placing services. Hardware was placed wherever space and power were available, and services were placed on whatever servers were available at the time. A further problem was that they didn't have a common definition of what a fault domain was.
  • Services need to be spread better across fault domains. This is where the push for optimal placement comes in. Optimal placement means hardware, services, and data are well spread across the fault domains within a datacenter, so that if one fault domain goes down, they lose as small a proportion of capacity for each hardware type and service as possible.
  • Capacity loss still means service problems, so they install buffer capacity. Buffer capacity elsewhere in the datacenter can absorb the failed-over traffic.
  • How much buffer to buy depends directly on how workloads are placed. The current imbalance means buying a buffer that's 1.5x the size of a fault domain, which is not acceptable (see the buffer-sizing sketch after this list).
  • By evenly spreading capacity they can significantly reduce the amount of buffer needed. This decreases the cost of the buffer and the amount of power and space dedicated to it.
  • Hardware, for example racks of compute-heavy and storage-heavy machines, needs to be spread across fault domains. Each of these racks contains servers across which services must be spread. Each service has a set of hardware types it can run on, so the spread of services can only be as good as the spread of the hardware types they run on. Each service must also ensure its data chunks are well spread across the servers it has been allocated.
  • Hardware placement is the determination of where racks should be physically placed in a datacenter. Hardware types include compute-, storage-, or flash-heavy machines. Hardware placement must take into consideration the physical constraints of the datacenter, such as power, cooling, and networking. Racks are spread so that each of these resources is well balanced and nothing is overloaded.
  • The main fault domain they plan for is a power domain: all the racks under a main switchboard. Previously compute resources were not well placed. They are still not able to achieve perfect placement, but it's much better than it was.
  • Over the years they've accumulated constraints that are in conflict with good spread. Cooling constraints, for example, dictated that some hardware types were best placed in certain parts of the building. By working with the mechanical team they were able to change the cooling and network domain constraints so they could achieve better hardware spread.
  • Changes take time in the hardware world. It takes several years for a rack to be decommissioned, so they can either wait for racks to age out or buy more buffer. Option three is to physically move racks to reduce the imbalance, which lets them buy the buffer they need much sooner. Not optimal, but better.
  • The next layer that needs optimal placement is the service layer. Without optimal service placement they'd need to buy 15%+ buffer to cover the largest fault domain per service.
  • Existing service layer constraints made it impossible to even get close to the amount of hardware buffer they had bought. The two main constraints were full-rack constraints and service preferences for specific hardware generations. Full rack means a service needed an entire rack to itself; a preference might be for a particular CPU generation, for example. They worked with services to remove these constraints and were able to get below the hardware buffer they had purchased.
  • Service placement is always in flux, so how do they make sure service spread stays correct? Fault domain spread was added to the service placement system. By ensuring all new services go through these systems, they will be allocated correctly (see the placement-check sketch after this list).
  • Service swaps were used to fix existing imbalances. Every time new services or hardware resources are added, swaps are used to ensure fault tolerance constraints are maintained.
  • For stateful services, data placement must also be taken into account. Data chunks must be well spread so that they aren't concentrated in any particular fault domain. Shard Manager is used to achieve data placement (see the replica-spread sketch after this list).
  • Optimal hardware, service, and data placement has saved a huge amount of money in buffer costs.
  • These same capabilities can be used in the future to reduce network bandwidth utilization or reduce power hot spots.
  • Spread is continuously maintained.
  • New datacenters are turned up piece by piece, so the overall capacity and the number of fault domains start out smaller. There might be additional buffer at the beginning. The service mix might change as the datacenter ages.
  • Physical constraints of the datacenters are translated into linear equations, which are fed to an integer programming solver. For example, the sum of all power used by the racks must be less than X (couldn't understand); that's an example of a hard constraint you can't break. On top of that are objectives: minimize the pooling imbalance, minimize imbalance across fault domains, and a combination of the two. The solver then produces a hardware plan that satisfies all those constraints. A similar process is followed for services. A specialized assignment-problem solver was created to solve their specific problems (see the solver sketch after this list).
  • A tradition is signing the first rack in a datacenter.
  • A Disaster Recovery team continuously runs failure tests to test fault domains.
  • When a failure occurs, live workloads are moved to buffer capacity. Sometimes there's a lag as shards and services are moved, and services are built to tolerate this lag. As long as no more than one fault domain's worth of capacity is impacted, services should be able to handle it.
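
Here's a rough sense of the buffer-sizing arithmetic mentioned above. This is a minimal sketch with made-up placement numbers, not Facebook's actual policy: the buffer a service needs is driven by the largest share of its capacity sitting in any single fault domain, which is why even spread shrinks the buffer toward 1/N.

```python
# Minimal sketch of buffer sizing. The placement data is invented for
# illustration; the real placement data and buffer policy are internal.

def required_buffer(placements):
    """Buffer must cover the worst single-fault-domain loss of any service.

    placements maps service -> {fault_domain: fraction of that service's
    capacity living in that domain}. The buffer (as a fraction of each
    service's capacity) is the largest share any one fault domain holds.
    """
    return max(max(shares.values()) for shares in placements.values())

# Badly spread: one fault domain holds 50% of service B's capacity.
uneven = {
    "service_a": {"fd1": 0.30, "fd2": 0.30, "fd3": 0.40},
    "service_b": {"fd1": 0.50, "fd2": 0.25, "fd3": 0.25},
}

# Well spread: every service is close to 1/N per fault domain (N = 10 here).
even = {
    "service_a": {f"fd{i}": 0.10 for i in range(10)},
    "service_b": {f"fd{i}": 0.10 for i in range(10)},
}

print(required_buffer(uneven))  # 0.5 -> must buffer 50% of service B
print(required_buffer(even))    # 0.1 -> buffer shrinks to ~1/N per service
```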
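The talk says fault domain spread was added to the service placement system so new allocations stay well spread. Here's a hypothetical admission check in that spirit; the data shapes, the 1/N-plus-tolerance rule, and the function name are my assumptions, not the real system's API.

```python
# Hypothetical placement check: reject any allocation that would concentrate
# too much of one service in a single fault domain. Tolerance is assumed.
from collections import Counter

def allocation_ok(current, candidate_domain, num_domains, tolerance=0.05):
    """current: Counter of instances per fault domain for one service.
    Returns True if adding one instance to candidate_domain keeps that
    domain's share of the service within 1/num_domains + tolerance."""
    proposed = Counter(current)
    proposed[candidate_domain] += 1
    share = proposed[candidate_domain] / sum(proposed.values())
    return share <= (1.0 / num_domains) + tolerance

service = Counter({"fd1": 12, "fd2": 11, "fd3": 12, "fd4": 5})
print(allocation_ok(service, "fd4", num_domains=4))  # True: fills the light domain
print(allocation_ok(service, "fd1", num_domains=4))  # False: fd1 already holds too much
```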
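For the data layer, one common way to keep shard replicas spread (this greedy rule is my assumption for illustration, not Shard Manager's actual algorithm) is to place each replica in the least-loaded fault domain that doesn't already hold a copy of that shard:

```python
# Greedy replica spreading sketch: each replica of a shard goes to the
# least-loaded fault domain that does not already hold a copy of that shard.

def place_shards(shards, fault_domains, replicas=3):
    load = {fd: 0 for fd in fault_domains}  # replicas already placed per domain
    placement = {}
    for shard in shards:
        chosen = []
        for _ in range(replicas):
            candidates = [fd for fd in fault_domains if fd not in chosen]
            fd = min(candidates, key=lambda d: load[d])
            chosen.append(fd)
            load[fd] += 1
        placement[shard] = chosen
    return placement

domains = ["fd1", "fd2", "fd3", "fd4", "fd5"]
shards = [f"shard-{i}" for i in range(6)]
for shard, where in place_shards(shards, domains).items():
    print(shard, where)
```

Losing any single fault domain then costs at most one replica of any shard, which mirrors the goal described in the talk.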
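And a toy version of the solver step: hard physical constraints (here, a per-fault-domain power cap) plus an objective that minimizes imbalance of each hardware type across fault domains. The numbers, the choice of the PuLP library, and the exact objective are my assumptions; the real formulation and the specialized assignment-problem solver are Facebook-internal.

```python
# Toy hardware-placement formulation: hard constraints plus an imbalance-
# minimizing objective, solved with an off-the-shelf integer programming
# solver (PuLP/CBC). All data here is made up.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

racks = {  # rack -> (hardware type, power draw in kW), invented numbers
    "r1": ("compute", 10), "r2": ("compute", 10), "r3": ("compute", 10),
    "r4": ("storage", 14), "r5": ("storage", 14), "r6": ("flash", 8),
}
domains = ["fd1", "fd2", "fd3"]
power_cap = 30  # hard constraint: max kW per fault domain (assumed)
hw_types = sorted({hw for hw, _ in racks.values()})

prob = LpProblem("rack_placement", LpMinimize)
x = LpVariable.dicts("x", (racks, domains), cat=LpBinary)  # rack r in domain d?
peak = LpVariable.dicts("peak", hw_types, lowBound=0)      # worst per-domain count

for r in racks:                       # every rack lands in exactly one domain
    prob += lpSum(x[r][d] for d in domains) == 1
for d in domains:                     # hard constraint: don't exceed the power cap
    prob += lpSum(racks[r][1] * x[r][d] for r in racks) <= power_cap
for hw in hw_types:                   # peak[hw] bounds each domain's count of hw
    for d in domains:
        prob += lpSum(x[r][d] for r in racks if racks[r][0] == hw) <= peak[hw]

prob += lpSum(peak[hw] for hw in hw_types)  # objective: minimize imbalance
prob.solve()

for r in racks:
    placed = [d for d in domains if x[r][d].value() > 0.5]
    print(r, racks[r][0], "->", placed[0])
```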