Designing for Resiliency will be so 2013
Monday, December 31, 2012 at 10:25AM
Todd Hoff in Strategy

A big part of engineering for a quality experience is bringing in the long tail. An improbable severe failure can ruin your experience of a site, even if your average experience is quite good. That's where building for resilience comes in. Resiliency used to be outside the realm of possibility for the common system. It was simply too complex and too expensive.
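To make the long tail point concrete, here is a minimal sketch using made-up numbers. The latency figures are purely synthetic and the helper names are hypothetical; the only point is that a healthy-looking average can hide a rare but severe experience.

```python
import math

# Synthetic, illustrative data only: 10,000 requests,
# 9,990 of them fast (100 ms) and 10 of them severe (30 seconds).
latencies_ms = [100] * 9_990 + [30_000] * 10


def mean(values):
    """Arithmetic mean of the samples."""
    return sum(values) / len(values)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


average = mean(latencies_ms)              # dominated by the fast requests
p99 = percentile(latencies_ms, 99)        # still looks healthy
p9995 = percentile(latencies_ms, 99.95)   # this is where the tail shows up
worst = max(latencies_ms)

print(f"mean={average:.1f} ms  p99={p99} ms  p99.95={p9995} ms  max={worst} ms")
```

In this toy data set the mean and even the 99th percentile look fine, yet one request in a thousand takes thirty seconds. That handful of requests is the long tail resilience work is trying to bring in.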

An evolution has been underway, making 2013 possibly the first time resiliency is truly on the table as a standard part of system architectures. We are getting the clouds, we are getting the tools, and prices are almost low enough.

Even Netflix, real leaders in the resiliency architecture game, took some heat for relying completely on Amazon's ELB and not having a backup load balancing system, leading to a prolonged Christmas Eve failure. Adrian Cockcroft, Cloud Architect at Netflix, said they've investigated creating their own load balancing service, but that "we try not to invest in undifferentiated heavy lifting."
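As a rough illustration of what a backup load balancing path means, here is a minimal sketch. It is not Netflix's design and uses no real AWS API; the balancer functions and the `route` helper are hypothetical stand-ins, and a real system would also need health checks, timeouts, and DNS or client-side switching.

```python
class BalancerDown(Exception):
    """Raised by a (simulated) load balancer that cannot accept traffic."""


def primary_balancer(request):
    # Hypothetical stand-in for a managed load balancer during an outage.
    raise BalancerDown("primary unavailable")


def backup_balancer(request):
    # Hypothetical stand-in for an independently operated backup path.
    return f"backup handled {request}"


def route(request, balancers):
    """Try each balancer in priority order and return the first success.

    Raises BalancerDown only if every balancer in the list fails.
    """
    errors = []
    for balancer in balancers:
        try:
            return balancer(request)
        except BalancerDown as exc:
            errors.append(str(exc))
    raise BalancerDown("all balancers failed: " + "; ".join(errors))


result = route("GET /titles", [primary_balancer, backup_balancer])
print(result)
```

The code itself is trivial. The cost is in building and operating the second path, which is exactly the undifferentiated heavy lifting the quote refers to.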

So resiliency is still not part of the standard package. There's an ROI calculation that has to be made. While the path Netflix would have to take in creating a hybrid architecture is fairly clear, Netflix prefers to concentrate on features rather than long tail events. That's a big difference. At one time designing for resiliency would have been unthinkable; now it's becoming a choice.

A good New Year's resolution might be to learn more about resilience. It's a new way of thinking compared to straightforward high availability. It's a full stack, full team, full system, environment centric mode of thought.

Fortunately, Dr. Richard Cook, Professor of Healthcare Systems Safety and Chairman of the Department of Patient Safety at the Kungliga Tekniska Högskolan, has been thinking about resilience for a long time. He gave a fascinating talk on resilience, How Complex Systems Fail, that is just detailed enough to be practical and high level enough to inspire new directions.

Here's a gloss of the essentials from his talk:

Why Don’t Systems Fail More Often?

The normal world is not well behaved. The real surprise is not that there are so many accidents but that there are so few. Is this because of or in spite of our system designs? We have all had the sense of barely escaping or just getting by. It seems like we should have crashes all the time. Why don't we? What does that mean for IT design, implementation, and ops?

Summary of 25 years of Research

System as Imagined vs System as Found

What are people doing in these As Found systems? What should operations look like?

Resilience is the combination in systems of these four activities:

These are the terms for what we are trying to describe as resilience.

Reliability is made out of these things at design time:

What we really want is resilience:

How do we design for resilience?

What’s the resilience agenda?

Final Thoughts

For some time DevOps has been leading in the direction of unifying the System as Imagined with the System as Found, so disparate communities aren't formed around a system. Learning is being pushed up to the developers and back down through the code so the System as Found can become wedded to the System as Imagined through the entire stack.

But what Dr. Cook asks for is something developers can’t deliver: such a clear understanding of a complex system that you can hold it in the palm of your hand, turn it, twist it, interrogate it, and make it dance to your tune. Complex systems can only be built incrementally, which means there is only ever an incremental understanding of how the whole thing works, which means it can never be opened to the degree he wishes. A system will always be in large part subconscious, just as in the human brain the conscious mind is only the smallest window on a vast subconscious mind.
