Netflix: Harden Systems Using a Barrel of Problem Causing Monkeys - Latency, Conformity, Doctor, Janitor, Security, Internationalization, Chaos
Wednesday, July 20, 2011 at 9:21AM
Todd Hoff in Strategy, netflix

With a new Planet of the Apes coming out, this may be a touchy subject with our new overlords, but Netflix is using a whole lot more trouble injecting monkeys to test and iteratively harden their systems. We learned previously how Netflix used Chaos Monkey, a tool to test failover handling by continuously failing EC2 nodes. That was just a start. More monkeys have been added to the barrel. Node failure is just one problem in a system. Imagine a problem and you can imagine creating a monkey to test if your system is handling that problem properly. Yury Izrailevsky talks about just this approach in this very interesting post: The Netflix Simian Army.

I know what you are thinking, if monkeys are so great then why has Netflix been down lately. Dmuino addressed this potential embarrassment, putting all fears of cloud inferiority to rest:

Unfortunately we're not running 100% on the cloud today. We're working on it, and we could use more help. The latest outage was caused by a component that still runs in our legacy infrastructure where we have no monkeys :)

To continuously test the resilience of Netflix's system to failures, they've added a number of new monkeys, and even a gorilla:

Just as important as the test tools is that they've gone to the trouble to identify failure scenarios, implement code to deal with them, and then implement code that can verify that the problems have been dealt with. That's just as important a process as the tools. Rarely do things just fail. They become intermittent, they become slow, they don't failover cleanly, the don't recover cleanly, they lie about what is really happening, they don't collect the data you need to figure out what is going, they corrupt messages, they drop work randomly, they do crazy things that are failures but don't fit in clean crisp failure definition boundaries. Figuring out what can be done in response to a complex set of interacting problems is the real magic here.

Layering is Key

Without an architecture built around APIs, that is one with aggressive layering to isolate components, it would be impossible to test a system in this way. By creating insertion points for normal services and for testing, it becomes possible to expose the entire system to scripting and manipulation. Big ball of mud systems can't do this, neither can monolithic 2-tier type systems.

When do monkeys run? Protecting Against Cascading Failure

The obvious downside of letting monkeys into your system is that you can easily imagine the whole thing melting down through an innocent coding error. There's a little more supervision than that. Dmuino explains:

Most monkeys only run when we have developers who could notice and fix problems. It also happens that our peak usage is while we're home and our quiet time is while we're at the office. That means, in general, the monkeys don't run on a Sunday night.

 

More details in the original post.

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.