Strategy

Netflix: Continually Test by Failing Servers with Chaos Monkey

High Scalability

Dec 28, 2010 — 1 min read

In 5 Lessons We’ve Learned Using AWS, Netflix's John Ciancutti says the best way to avoid failure is to fail constantly. In the cloud it's expected instances can fail at any time, so you always have to be prepared. In the real world we prepare by running drills. Remember all those exciting fire drills? It's not just fire drills of course. The military, football teams, fire fighters, beach rescue, virtually any entity that must react quickly and efficiently to disaster hones their responsiveness by running drills.

Netflix aggressively moves this strategy into the cloud by randomly failing servers using a tool they built called Chaos Monkey. The idea is:

If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

They respond to failures by degrading service, but they always respond:

If the recommendations system is down they'll show popular titles instead.
If the search system is slow then they'll switch to showing streaming titles.

FarmVille uses a similar strategy for degrading services when link latency increases or components fail.

This strategy fits nicely with recent practices like continuous integration, continuous deployment, and continuous testing as part of the build process. Extending this idea into a production system takes some really huge huevos and a lot of careful coding. But they are right. To really prepare you have to run drills and there's nothing better than learning from real-life experience.

Netflix: Use Less Chatty Protocols In The Cloud - Plus 26 Fixes

Capturing A Billion Emo(j)i-ons

This blog post was written by Dedeepya Bonthu. This is a repost from her Medium article, approved by the author. In stadiums, sports fans love to express themselves by cheering for their favorite teams, holding up placards and team logos. Emoji’s allow fans at home to rapidly express themselves,

Brief History of Scaling Uber

This blog post was written by Josh Clemm, Senior Director of Engineering at Uber Eats. This is a repost from his LinkedIn article, approved by the author. On a cold evening in Paris in 2008, Travis Kalanick and Garrett Camp couldn't get a cab. That's when

Behind AWS S3’s Massive Scale

This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn. AWS S3 is a service every engineer is familiar with. It’s the service that popularized the notion of cold-storage to the

The Swedbank Outage shows that Change Controls don't work

This week I’ve been reading through the recent judgment from the Swedish FSA on the Swedbank outage. If you’re unfamiliar with this story, Swedbank had a major outage in April 2022 that was caused by an unapproved change to their IT systems. It temporarily left nearly a million

Related Articles

Read more

Capturing A Billion Emo(j)i-ons

Brief History of Scaling Uber

Behind AWS S3’s Massive Scale

The Swedbank Outage shows that Change Controls don't work