Designing for Resiliency will be so 2013

A big part of engineering for a quality experience is bringing in the long tail. An improbable severe failure can ruin your experience of a site, even if your average experience is quite good. That's where building for resilience comes in. Resiliency used to be outside the realm of possibility for the common system. It was simply too complex and too expensive.

An evolution has been underway, making 2013 possibly the first time resiliency is truly on the table as a standard part of system architectures. We are getting the clouds, we are getting the tools, and prices are almost low enough.

Even Netflix, real leaders in the resiliency architecture game, took some heat for relying completely on Amazon's ELB and not having a backup load balancing system, leading to a prolonged Christmas Eve failure. Adrian Cockcroft, Cloud Architect at Netflix, said they've investigated creating their own load balancing service, but that "we try not to invest in undifferentiated heavy lifting."

So resiliency is still not part of the standard package. There's an ROI calculation that has to be made. The path Netflix would have to take in creating a hybrid architecture is fairly clear, yet Netflix prefers to concentrate on features rather than long tail events. That's a big difference. At one time designing for resiliency would have been unthinkable; now it's becoming a choice.

A good New Year's resolution might be to learn more about resilience. It's a new way of thinking compared to straightforward high availability. It's a full-stack, full-team, full-system, environment-centric mode of thought.

Fortunately, Dr. Richard Cook, Professor of Healthcare Systems Safety and Chairman of the Department of Patient Safety at the Kungliga Tekniska Högskolan, has been thinking about resilience for a long time. He gave a fascinating talk on resilience, How Complex Systems Fail, that is just detailed enough to be practical and high level enough to inspire new directions.

Here's a gloss of the essentials from his talk:

Why Don’t Systems Fail More Often?

