YouTube Strategy: Adding Jitter Isn't a Bug

The adding-jitter strategy was one of the most commented-on techniques from 7 Years Of YouTube Scalability Lessons In 30 Minutes on HackerNews. Probably because it’s one of those emergent phenomena that you really can’t predict and that are shocking to see in real life. Here’s the technique:

Add Entropy Back into Your System

  • If your system doesn’t jitter then you get thundering herds. Distributed applications are really weather systems, and debugging them is about as deterministic as predicting the weather. Jitter introduces randomness because, surprisingly, things tend to stack up.
  • For example, cache expirations. For a popular video they cache things as best they can. The most popular videos might be cached for 24 hours. If everything expires at one time, then every machine sees the expiration at the same time and goes to recompute the value. This creates a thundering herd.
  • By jittering you are saying: randomly expire between 18 and 30 hours. That prevents things from stacking up. They use this all over the place. Systems have a tendency to self-synchronize as operations line up and try to destroy themselves, which is fascinating to watch. You get a slow disk system on one machine, everybody is waiting on a request, and all of a sudden all those other requests on all those other machines are completely synchronized. This happens when you have many machines and many events: each one actually removes entropy from the system, so you have to add some back in. A minimal sketch of the cache-expiration version follows below.
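
To make the cache example concrete, here’s a minimal Python sketch of jittered expiration. The cache client and key are hypothetical stand-ins, but the 18-30 hour window is the one from the talk:

```python
import random

BASE_TTL = 24 * 60 * 60  # nominal 24-hour lifetime, in seconds

def jittered_ttl(base=BASE_TTL, low=0.75, high=1.25):
    # 0.75x-1.25x of 24 hours gives the 18-30 hour expiration window,
    # so entries cached at the same moment expire at different times.
    return int(base * random.uniform(low, high))

# Hypothetical usage with whatever cache client you have:
# cache.set("video:123", payload, ttl=jittered_ttl())
```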

Comments from HackerNews really help to fill out the topic with more detail:

There were some other examples of the thundering herd problem:

  • Adblock Plus had a thundering herd problem: Ad blocking lists check for updates every 5 days. Eventually, many users' update schedules would converge on Mondays because office computers did not run over the weekend. Updates that had been scheduled for Saturday or Sunday would spill over to Monday. From cpeterso.
  • Facebook uses jitter a lot in their cache infrastructure. From alexgartrell.
  • The Multicast DNS extension makes extensive use of randomness to reduce network collisions. From makmanalp.
  • Jitter is used in all sorts of applications, from cron jobs to configuration management to memcache key expiration. Any time you have a lot of nodes that all need to do an operation at a specific time, you rely on jitter to keep resources from bottlenecking. It’s probably used anywhere the system or network has a minuscule amount of resource headroom (like cluster nodes that run at 93% utilization). From peterwwillis.
  • I frequently find myself adding a sleep of PID mod some appropriate constant to the beginning of big distributed batch jobs in order to keep shared resources (NFS, database, etc.) from getting hammered all at once. From minimax. A sketch of this trick appears below.
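
A minimal Python sketch of minimax’s trick; the modulus of 60 is an arbitrary stand-in for the “appropriate constant,” here spreading job starts over a one-minute window:

```python
import os
import time

# Stagger startup so N copies of this job launched at the same moment
# don't all hit the shared resource (NFS, database, etc.) at once.
time.sleep(os.getpid() % 60)

# ... proceed with the batch work ...
```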

There are a number of ways of adding jitter:

  • Use exponentially distributed delays: then recurring events form Poisson processes, which are really easy to reason about when composed. For example, if you have N servers firing off events at the same exponentially distributed rate, the event times are distributed the same as if you had one server firing events at N times the rate and distributing jobs uniformly at random to the other servers. From jamii. See the sketch after this list.
  • Use distinct prime numbers for the periodicities in the first place, so that different tasks’ cycles coincide as rarely as possible. From colmmacc.
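
Here’s a rough Python sketch of the exponential-delay approach; the task and the 5-minute mean interval are made up for illustration. random.expovariate draws the gaps, which makes the firing times a Poisson process:

```python
import random
import time

def run_periodic(task, mean_interval=300.0):
    # Exponentially distributed gaps make the event stream a Poisson
    # process: N servers running this loop look statistically like one
    # server firing N times as often.
    while True:
        time.sleep(random.expovariate(1.0 / mean_interval))
        task()

# Hypothetical usage:
# run_periodic(lambda: print("refresh feed"))
```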

Not everyone adds jitter:

  • Jeff Dean at Google says they prefer to have a known hit every once in a while, so they will have all the cron jobs go off at the same time rather than introducing jitter. It perhaps has to do with their emphasis on reducing variability to control the size of the long-tail distribution.
  • The Linux kernel tries to schedule timer events for the same deadline time. That allows the processor to sleep longer because the kernel doesn't need to wake up as often just to handle 1 or 2 timer events. From cpeterso. A toy illustration of this deadline coalescing follows the list.
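
For contrast, here’s a toy Python illustration of that deadline coalescing; it is not the kernel’s actual algorithm, and the 4 ms granularity is an invented value:

```python
import math

GRANULARITY = 0.004  # toy 4 ms coalescing window, in seconds

def coalesced_deadline(requested, granularity=GRANULARITY):
    # Round each requested wakeup up to the next shared boundary so
    # nearby timers fire in a single wakeup instead of several.
    return math.ceil(requested / granularity) * granularity

# Deadlines requested at 10.1 ms and 11.8 ms both land on the 12 ms
# boundary, so one wakeup services both timers.
```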

An interesting paper: