advertise
« Ansible - A Simple Model-Driven Configuration Management and Command Execution Framework | Main | Instagram Architecture Update: What’s new with Instagram? »
Tuesday
Apr172012

YouTube Strategy: Adding Jitter isn't a Bug

The adding jitter strategy was one of the most commented on techniques from 7 Years Of YouTube Scalability Lessons In 30 Minutes on HackerNews. Probably because it’s one of the emergent phenomena that you really can’t predict and is shocking when you see it in real life. Here’s the technique:

Add Entropy Back into Your System

  • If your system doesn’t jitter then you get thundering herds. Distributed applications are really weather systems. Debugging them is as deterministic as predicting the weather. Jitter introduces more randomness because surprisingly, things tend to stack up.
  • For example, cache expirations. For a popular video they cache things as best they can. The most popular video they might cache for 24 hours. If everything expires at one time then every machine will calculate the expiration at the same time. This creates a thundering herd.
  • By jittering you are saying  randomly expire between 18-30 hours. That prevents things from stacking up. They use this all over the place. Systems have a tendency to self synchronize as operations line up and try to destroy themselves. Fascinating to watch. You get slow disk system on one machine and everybody is waiting on a request so all of a sudden all these other requests on all these other machines are completely synchronized. This happens when you have many machines and you have many events. Each one actually removes entropy from the system so you have to add some back in.

Comments from HackerNews really help to fill out the topic with more detail:

There were some other examples of the thundering herd problem:

  • Adblock Plus had a thundering herd problem: Ad blocking lists check for updates every 5 days. Eventually, many users' update schedules would converge on Mondays because office computers did not run over the weekend. Updates that had been scheduled for Saturday or Sunday would spill over to Monday. From cpeterso.
  • Facebook uses jitter a lot in their cache infrastructure. From alexgartrell.
  • The Multicast DNS extension makes extensive use of randomness to reduce network collisions.  From makmanalp.
  • Jitter is used in all sorts of applications, from cron jobs to configuration management to memcache key expiration. Any time you have a lot of nodes that all need to do an operation at a specific time you rely on jitter to keep resources from bottlenecking. Probably used anywhere the systems or network has a miniscule amount of resource headroom (like cluster nodes that run at 93% utilization). From peterwwillis.
  • I frequently find myself adding a sleep for PID mod some appropriate constant to the beginning of big distributed batch jobs in order to keep shared resources (NFS, database, etc) from getting hammered all at once. From minimax.

There are a number of ways of adding jitter:

  • Use exponentially distributed delays: Then reoccurring events form poisson processes which are really easy to reason about when composed eg if you have N servers firing off events at the same exponentially distributed rate the times are distributed the same as if you had one server firing events at N x the rate and distributing jobs uniformly at random to the other servers. From jamii.  
  • Use distinct prime numbers for periodicities in the first place. From colmmacc.

Not everyone adds jitter:

  • Jeff Dean at Google says they prefer to have a known hit every once in awhile, so they will have all the cron jobs go off at the same time versus introducing jitter. It perhaps has to do with their emphasis on reducing variability to control the size of the long tail distribution.
  • The Linux kernel tries to schedule timer events for the same deadline time. That allows the processor to sleep longer because the kernel doesn't need to wake up as often just to handle 1 or 2 timer events. From cpeterso.

An interesting paper:

Reader Comments (4)

Great writing, Todd. I recognize the pattern in one form or another, but doesn't have a name for it.
Just one small suggestion for reading improvement: can you put commas in some of your long sentences or quotes from others? make it easier to read.

April 18, 2012 | Unregistered CommenterTim Pham

http://en.wikipedia.org/wiki/Carrier_sense_multiple_access_with_collision_detection

Makes sense. Collide less...

April 18, 2012 | Unregistered CommenterTommy Becker

Nice post, and good summary! I'm curious about the tradeoffs between introducing jitter and having explicit throttling...

For example, at a previous startup we solved the thundering herd issue in a few situations by routing everything through a job queue and throttling the number of simultaneous workers. This worked well when we could silo jobs by the resources they were likely to consume... but not as well when there were unpredictable resource loads per job, and I can imagine wouldn't help at all for caching.

What do you think? Can we classify when jitter is going to be a better approach?

April 19, 2012 | Unregistered CommenterKevin Ball

Puppet (configuration management system) also provides jitter configuration. Invaluable after power black outs.

May 8, 2012 | Unregistered CommenterDaniel Sobral

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>