
Netflix: Continually Test by Failing Servers with Chaos Monkey

In 5 Lessons We’ve Learned Using AWS, Netflix's John Ciancutti says the best way to avoid failure is to fail constantly. In the cloud, instances are expected to fail at any time, so you always have to be prepared. In the real world we prepare by running drills. Remember all those exciting fire drills? It's not just fire drills, of course. The military, football teams, firefighters, beach rescue: virtually any organization that must react quickly and efficiently to disaster hones its responsiveness by running drills.

Netflix aggressively moves this strategy into the cloud by randomly failing servers using a tool they built called Chaos Monkey. The idea is:

If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
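To make the idea concrete, here is a minimal in-memory sketch of a "monkey" that randomly kills servers in a fleet. It is not Netflix's tool and touches no real cloud API. The names `Instance`, `unleash_monkey`, `service_is_up`, and the `kill_probability` parameter are all made up for illustration.

```python
import random


class Instance:
    """Hypothetical stand-in for a cloud server; not a real AWS object."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def unleash_monkey(instances, rng, kill_probability=0.2):
    """Kill each live instance with the given probability.

    Returns the names of the instances that were terminated in this pass.
    """
    victims = []
    for inst in instances:
        if inst.alive and rng.random() < kill_probability:
            inst.alive = False
            victims.append(inst.name)
    return victims


def service_is_up(instances, min_healthy=1):
    """The service counts as up while at least min_healthy instances survive."""
    return sum(inst.alive for inst in instances) >= min_healthy
```

Passing in a seeded `random.Random` keeps a drill repeatable. The interesting check is whether `service_is_up` still holds after each pass.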

They respond to failures by degrading service, but they always respond:

  • If the recommendations system is down they'll show popular titles instead.
  • If the search system is slow then they'll switch to showing streaming titles.
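The degrade-but-always-respond pattern in the list above can be sketched as a simple fallback wrapper. This is only an illustration under assumptions, not how Netflix implements it. The function names `with_fallback`, `personalized_recommendations`, and `popular_titles` are hypothetical, and the failure is simulated.

```python
def with_fallback(primary, fallback):
    """Wrap primary so that any exception it raises triggers fallback instead."""

    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade the answer rather than fail the whole request.
            return fallback(*args, **kwargs)

    return call


def personalized_recommendations(user_id):
    # Simulated outage of a hypothetical recommendations system.
    raise ConnectionError("recommendations system is down")


def popular_titles(user_id):
    # Placeholder data; a real system would read a cached popular list.
    return ["Popular Title A", "Popular Title B"]


get_titles = with_fallback(personalized_recommendations, popular_titles)
```

Calling `get_titles("some-user")` returns the popular list while the recommendations call is failing, so the user still gets a response.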

FarmVille uses a similar strategy for degrading services when link latency increases or components fail. 

This strategy fits nicely with recent practices like continuous integration, continuous deployment, and continuous testing as part of the build process. Extending this idea into a production system takes some really huge huevos and a lot of careful coding. But they are right. To really prepare, you have to run drills, and there's nothing better than learning from real-life experience.

Reader Comments (1)

I think I read about such an approach a decade or so ago - it's been proven by an area of mathematics neatly called "catastrophe theory".
Also, a certain human analogy can be found - a skill not practiced is a vanishing skill. On the other hand, there is also the danger of overtraining and degrading even further than with not enough training.

December 28, 2010 | Unregistered Commenter Peter [PMD]
