Resiliency is the New Normal - A Deep Look at What It Means and How to Build It

Perhaps it is because the whole world feels as if it’s riding on the edge of a jagged knife that the idea of resilience is becoming a dominant theme across so many domains. Resilience in living things first developed when cells evolved a way of maintaining inner order through homeostatic (stability through constancy) mechanisms. After homeostasis was mastered, allostasis (stability through change) developed as a way of responding to a dynamic world of challenge. In economics we have the idea of Transition Towns, which emphasizes developing local economies as a way of being resilient to global failures. In agriculture we have the idea of permaculture: building a permanent agriculture by embracing diversity, sustainability, and perennial systems, avoiding monocultures, and using edge thinking. There are many more examples, including psychological resilience and the legendary resilience of ecosystems.

To explore the idea of resiliency we’ll look at a few sources:

  • Collapse Dynamics: Phase Transitions in Complex Social Systems by Noah Raford
  • Black Swans, Fragility, and Mistakes by Nassim Taleb
  • Why Cities Keep on Growing, Corporations Always Die, and Life Gets Faster by Geoffrey West
  • How Complex Systems Fail by Dr. Richard Cook

The talk by Dr. Richard Cook was given at Velocity 2012 and is by far the most practical of all the talks, as it directly relates to DevOps, but I think each of the other talks holds its own special fascination as well. I hope you’ll share my conviction that this is incredibly cool stuff that has only really begun to be explored and applied.

Collapse Dynamics: Phase Transitions in Complex Social Systems

Noah Raford has an amazing series of videos deeply related to resilience: Collapse Dynamics: Phase Transitions in Complex Social Systems.

Some key ideas from his talk:

  • Put simply, too much connectivity, too much interactivity and too little resilience means even a tiny change can lead to massive fluctuation and collapse
  • The greater the degree of heterogeneity in an interactive system, the more resistant it is to collapse. Connectivity + conformity = instability
  • Multistable states are common in many systems; it is impossible to predict where the tipping points are until it's too late. This is interactive complexity: we don't know the parameters of the equation, or even the equation itself. Functional diversity builds resistance; management must cope with surprise and uncertainty.
  • Innovation in a super-connected, super-responsive system leads to super-exponential growth, which leads to collapse bubbles. Black swan territory. Everything changes all at once.
  • To survive periods of dynamic change, stay light, experiment, keep options open.
  • Collapse is endemic to many classes of complex adaptive systems, they tend to occur in certain regular ways, and we can use this knowledge to understand and prepare for radical change.
  • Looking at the network structure of ecosystems and how resilient they are to extinctions, researchers have found that having a large food web of many different species decreases the chance of catastrophic failure. This drives home the point that the more heterogeneity in a system, the less likely it is to go through a phase transition and the more likely it is to survive if one occurs.
  • Simple systems have single points of failure that are easy to diagnose and correct. Complex systems have multiple points of failure that interact in unpredictable ways and are very hard to fix. We have created a constellation of complex systems interacting in ways we simply do not understand, and therefore we are subject to cataclysmic, unanticipated breakdowns that are almost impossible to fix.
  • Societies experience increasing returns from complexity up to a certain point. Easy solutions are utilized first and have the greatest reward. Once you exhaust those you have to start going to harder solutions, which lead to greater investment, more organization, and greater complexity. The more complex you become, the harder it is to escape and move to a lower degree of complexity. People don't say, let's undo all our work and start over. Learn to start a fire, get a lot of benefit. Build a steam engine, more complex, get more benefit. It's a curve, and at some point the curve shows decreasing returns. This is the "sunk cost effect" or "Concorde effect": you've already invested so much into an idea that even though everyone knows it's a bad idea, because of the investment you keep doing it. At the highest level of complexity societies build their biggest monuments, then collapse. Becoming less flexible and more rigid leads to collapse.

Taleb on Black Swans, Fragility, and Mistakes

Nassim Taleb of Black Swan fame has resiliency at the heart of many of his ideas. Taleb on Black Swans, Fragility, and Mistakes is a very good talk on the subject.

Some key ideas from his talk:

  • Many religions have had rules against borrowing and debt, and we seem to forget those lessons over time. Overconfidence translates directly into debt, so you leverage up the wazoo without fear of an error rate. Forecasting is impossible, and human error shouldn't penalize the multitude. How do we build a society in which people can make mistakes? We should be able to make a lot of small mistakes; any small mistake is really a discovery. Large mistakes are crippling. His mission is to figure out how to make a society in which mistakes are possible.
  • Forecasting has become much more difficult because random variables have become much more fat-tailed, thanks to the complexification created by the internet. When you have interdependence you no longer have a central limit (see the note after this list). When people in Spain influence people in NY you are no longer independent. So we are more prone to extreme events, the fat tail, and we've seen that rising since the internet. We are moving into an environment in which randomness is becoming freer of natural constraints, so it's impossible to use models to predict fat tails.
  • Redundancy is the key to robustness. Don't spend more than you earn. Have a surplus. Why is this a mystery? Why do we still build systems based on forecasts? An economist would design a system with one kidney, and we would just borrow the other because that's optimized. Debt is the opposite of redundancy. Big guys naturally enter into collusion with the state. Let the big fail so there's feedback for stupid actions; with the bailout there wasn't a feedback loop, and the market doesn't know how to price. He thinks that without government intervention the companies would self-destruct. Large things are more fragile. He wants nature to destroy car companies early on so you don't have to pay their bonuses. If we let financial institutions destroy themselves, the financial stresses would unwind.
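
A quick note on what "you no longer have a central limit" means. The classical central limit theorem guarantees thin Gaussian tails only under assumptions the talk says the modern world violates:

    \frac{S_n - n\mu}{\sigma\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0,1),
    \qquad S_n = \sum_{i=1}^{n} X_i, \quad X_i \ \text{i.i.d. with finite variance } \sigma^2

Both conditions matter: the variables must be independent and have finite variance. When people in Spain influence people in NY, independence fails, aggregates are no longer pulled toward the thin-tailed Gaussian, and the fat tails Taleb warns about persist.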

Why Cities Keep on Growing, Corporations Always Die, and Life Gets Faster

Geoffrey B. West, theoretical physicist at the Santa Fe Institute, gives some really awesome talks. Take a look at Why Cities Keep on Growing, Corporations Always Die, and Life Gets Faster, he’s on TED, and he’s all over YouTube. His talk on Scaling Laws In Biology And Other Complex Systems is definitely worth an investment of time.

Some key ideas from his talks:

  • According to West, there are systematic and fundamental laws underlying biological systems, pertaining to resource distribution and economies of scale. He presented a series of compelling examples drawing on cellular systems and metabolic structures. For example, the base metabolic rate of any animal scales as the 3/4 power of body mass. This holds true for all animals, from single-celled organisms to the largest mammals. Having laid out this biological pattern, West set out to explore to what degree it can also be used to describe cities and corporate structures. His findings show that while the infrastructure patterns of cities - roads, water lines, number of gas stations - tend to follow similar economies of scale to biological ones, the social systems of cities operate very differently. As opposed to biological properties, social benefits and ills both scale with an exponent of roughly 1.15 as the size of a city increases (see the sketch after this list).
  • The simple scaling laws are the same across extremely dissimilar and diverse organisms. Why would that be? Fractal networks. The underlying dynamics are derived from the networks that sustain life at all scales. We function through all sorts of networks: respiratory, circulatory, within cells, within mitochondria. These are all networked together following some common patterns, and it's the mathematics of that networking that gives rise to the scalable quality of life. These networks have fractal-like qualities: no matter what scale you look at, they look the same. These properties are independent of the organism. They are built on; they are not invented. They are space filling and have terminal units. And they are somehow optimizing. Humans share with all mammals a circulatory system that minimizes the work your heart has to do to pump blood. Double the length of the 12th branch and you would have to do more work; halve the length of the 10th branch and you would also have to do more work. The design sits in a basin of optimization. The principles are the same across different systems. Look at a tree or a heart and they work the same.
  • The structure of a forest mimics the individual tree. If you ask anything about a tree, the property will also hold for the forest. They are both systems. How far apart are trees of a given size? How much energy is flowing through a particular branch? How large is the canopy? Any geometrical or dynamical property can be answered, and the answer for the entire forest is mathematically the same as for the individual tree. The reason is that both are controlled by the network and the optimization implied by the network.
  • You can't just solve global warming, or the financial mess, alone. It won't work. It's a system and they all move together. The scale of problems now grows faster than the political will to act.
  • Doubling the size of a city systematically increases income, wealth, number of patents, number of colleges, number of creative people, number of police, crime rate, number of AIDS and flu cases, amount of waste --- all by ~15%, regardless of the city.
  • This scaling law is the opposite of biology: the bigger you are, the more you use per capita, while in biology it's the less you use. The power law: <1, pace of life decreases with size (biology); >1, pace of life increases with size (social). The pace of social life gets faster and faster in a systematic way. Even the speed of walking systematically increases with city size.
  • The growth equation is incredibly rich, and the solution changes completely in character depending on whether b < 1 or b > 1. If you keep growing you hit a singularity where population, revenue, etc. go to infinity, and you collapse. If you just keep going within the same paradigm you'll have fast growth and then collapse. Somewhere along the trajectory you need to make a major innovation that changes the boundary conditions so you effectively start over again (see the equation sketch after this list).
  • For continuous growth you need to continuously make major innovations. Unfortunately, the time between innovations gets systematically shorter. You have to innovate at an accelerating pace. What took a thousand years in the past now takes 20 years. Is that sustainable? If it took 20 years for the last one, it will take 17 years next time, and eventually it will need to take a day. So the system is not sustainable.
  • For a company, on a per capita basis, income, assets, and profits all systematically decrease as the company increases in size, whereas sales remain constant. This isn't sustainable. The ratio of profits to sales decreases with size. There are huge rises as companies form, but they eventually flatten out as they age. They plotted 20,000 companies and they all flatten out at about $5-6 billion in revenue. There's a universality: the triumph of economies of scale over innovation. As a company grows it needs a bureaucracy, and the great innovations that started the company are eventually strangled by the bureaucracy needed to run it. That stranglehold beats open innovation. The ratio of profits to sales gets smaller and smaller, and that leads to the demise of the company. Fluctuations in profits are proportional to the size of the company: double the size of a company and the fluctuations double in size. If you have a profit margin that's always getting smaller and fluctuations that are proportional to size, that leads to vulnerability. Eventually your system degrades, a fluctuation happens, and you are not able to recover.
  • Are corporations more like animals or more like cities? They want to be like cities, with ever increasing productivity as they grow and potentially unbounded lifespans. Unfortunately, West et al.'s research on 22,000 companies shows that as they increase in size from 100 to 1,000,000 employees, their net income and assets (and 23 other metrics) per person increase only at a 4/5 ratio. Like animals and cities they do grow more efficient with size, but unlike cities, their innovation cannot keep pace as their systems gradually decay, requiring ever more costly repair until a fluctuation sinks them. Like animals, companies are sublinear and doomed to die.
  • What is the actual mechanism of difference? Research on that continues. "Cities tolerate crazy people," West observed, "Companies don't."
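
To make the scaling claims concrete, here is a minimal sketch in Python of the power law Y = Y0 * N^b the talks describe, using the exponents quoted above and an arbitrary baseline Y0 = 1:

    def scaled_output(n, beta, y0=1.0):
        """Power-law scaling: total output Y for a system of size N."""
        return y0 * n ** beta

    # Exponents quoted in the talks; y0 = 1 is an arbitrary baseline.
    for name, beta in [("animal metabolism", 0.75),
                       ("city social metrics", 1.15),
                       ("company metrics", 0.80)]:
        total = scaled_output(2, beta) / scaled_output(1, beta)  # effect of doubling N
        print(f"{name} (beta={beta}): doubling multiplies the total "
              f"by {total:.2f}, i.e. {total / 2:.2f}x per capita")

Running it shows the three regimes: sublinear biology (2^0.75 ≈ 1.68, so each cell needs less), superlinear city socioeconomics (2^1.15 ≈ 2.22, so each resident produces, and suffers, more), and sublinear companies (2^0.80 ≈ 1.74, so each employee yields less).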
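
And a sketch of the growth equation West alludes to, under the standard assumption that growth is driven by the same power law (West's full equation also subtracts a maintenance cost term, omitted here for brevity):

    \frac{dN}{dt} = a N^{\beta}
    \quad\Longrightarrow\quad
    N(t) = \left[ N_0^{\,1-\beta} + a(1-\beta)\,t \right]^{1/(1-\beta)}, \qquad \beta \neq 1

For beta < 1 (biology) growth decelerates, and the maintenance term makes it saturate, which is why organisms stop growing. For beta > 1 (cities) the bracket reaches zero at the finite time t_c = N_0^{1-beta} / (a(beta-1)), so N(t) diverges: the finite-time singularity behind grow fast, hit the wall, then innovate or collapse.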

How Complex Systems Fail

There’s a long history of thinking about resiliency in computer systems. Autonomic Computing is one such vision. The current, more pragmatic champion of resilience in the software world is the modern DevOps movement.

On this subject Dr. Richard Cook, Professor of Healthcare Systems Safety and Chairman of the Department of Patient Safety at the Kungliga Tekniska Högskolan, was invited to speak at the Velocity 2012 conference. He gave a fascinating talk, How Complex Systems Fail, that is just detailed enough to be practical and high-level enough to inspire new directions.

Why Don’t Systems Fail More Often?

The normal world is not well behaved. The real surprise is not that there are so many accidents but that there are so few. Is this because of, or in spite of, our system designs? We have all had the sense of barely escaping or just getting by. It seems like we should have crashes all the time. Why don’t we? What does that mean for IT design, implementation, and ops?

Summary of 25 years of Research

  • Story: a hospital replaced all its pumps at once. A year later 20% of the pumps failed at the same time. The cause was a forced software upgrade, set one year into the future, that had to happen or the unit would simply stop accepting new commands. Each individual step seemed to work and seemed fine at the time, but nobody saw the one-year time bomb ticking. (A sketch of this failure mode follows the list.)
  • The real world often produces surprises, things you haven’t seen before. Sometimes these are an existential threat that compromises your core missions.
  • Demand and operational tempo vary. The system is not constantly being beaten on, and things aren’t always going wrong. It’s a real, variable world. You can’t predict it very well.
  • Stuff doesn’t work as advertised. There’s lots of stuff in our systems that we don’t try out for many years, and when we finally do try it, it doesn’t work as expected.
  • Novel conditions are common. Things that you haven’t seen before do occur relatively frequently.
  • There’s lots of adapting and tailoring. Systems aren’t working out of the box. They have to be tweaked and tuned and kissed and cuddled. Services are being started and stopped as parameters are being changed to get the system to work in the way you want.
  • Unexpected uses. People start using your system for things you didn’t expect, like sending very large files over email.
  • Continuous change of technology and people. The world you live in is anything but static.
  • Coping with shallow and deep conflicts. Shallow conflicts: can we make a little more, can this run a little better, is this guy giving me a hard time on the phone. Deep conflicts: we have trade-offs. Are we going to take the system down to fix it or fix it on the fly?
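
The pump story is worth pausing on. Here is a hypothetical sketch in Python (not the actual pump firmware, just the shape of the failure mode) of how a locally reasonable upgrade deadline becomes a fleet-wide, synchronized outage:

    from datetime import date, timedelta

    # All pumps were replaced at once, so every unit's clock starts together.
    INSTALL_DATE = date(2011, 6, 1)                        # hypothetical date
    UPGRADE_DEADLINE = INSTALL_DATE + timedelta(days=365)  # the one-year time bomb

    def accepts_commands(today, upgraded):
        # The policy that seemed fine at the time: past the deadline,
        # an un-upgraded unit simply refuses all new commands.
        return upgraded or today < UPGRADE_DEADLINE

    # One year later, every pump that missed the forced upgrade fails together.
    fleet = [{"id": i, "upgraded": i % 5 != 0} for i in range(100)]  # 20% missed it
    failed = [p for p in fleet
              if not accepts_commands(date(2012, 6, 1), p["upgraded"])]
    print(f"{len(failed)} of {len(fleet)} pumps refuse commands on the same day")

Each line of this code is individually defensible; the system-level failure only appears because the whole fleet shares one install date.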

System as Imagined vs System as Found

  • System as imagined. Few people have contact with systems as found. Systems are imagined in state diagrams, layouts, and other diagrams. It’s how we make stuff. They are static and deterministic. You rarely see pictures of people doing work, sitting in front of screens and making the system run. The System as Imagined is what we encounter during design, development, and reviews of outcomes when a system fails. Its archetype is the state transition diagram or fault tree analysis.
  • Systems as found. They are dynamic and stochastic: constantly changing, predictable only in some statistical way. Performance can’t be deterministically defined. We are always doing some kind of maintenance. Do not touch any of these wires. The System as Found is what we encounter during implementation, operations, maintenance, and recovery from faults. These systems are very hard to draw; one way is a FRAM (functional resonance analysis method) diagram.

What are people doing in these As Found systems? What should operations look like?

Resilience is the combination in systems of these four activities:

  • Monitoring. People look at the system to see what’s going on.
  • Responding. Not in a reactive sense, but understanding what is going on in the system, figuring out where it’s going, and trying to make changes to deal with that.
  • Adapting. Make the system work differently to get it to behave in a better way.
  • Learning. The key component. It’s happening in communities of operators, shift personnel, etc., people that we don’t have contact with.

These are the terms in which we describe resilience.

Reliability is made out of these things at design time:

  • Stiff boundaries
  • Layers, formalisms
  • Defence in depth
  • Redundancy
  • Interference protection, security, abstraction, hiding of details
  • Assurance
  • Accountability

What we really want is resilience:

  • Withstand transients
  • Recover swiftly & smoothly from failures
  • Prioritize to serve high level goals when the system is changing and we have to sacrifice something
  • Recognize and respond to abnormal situations, including situations that were never considered at design time
  • Adapt to change. Find some way to make a system designed with the mindset of 4-7 years ago work for the next 3-5 years, while we are building the next systems that will be obsolete when we install them.

How do we design for resilience?

  • We have two quite separate worlds, the System as Imagined and the System as Found. They look at different things and they work on different sorts of principles. Can you bring them together? Can you make an As Engineered world that includes the quality of resilience? Is it possible to engineer systems ahead of time so that at operational time they have the resilience you want them to have?
  • Support continuous maintenance. Maintenance is not periodic; systems are constantly being maintained in some way, and that includes software, hardware, personnel, and physical plant. Maintenance needs to become part of the design rather than something added on later.
  • Reveal the controls of the system to the operators. When it’s 3 AM the only people who are going to fix the problem are the operators and control people sitting in front of the screens that actually control the system. It’s essential that we trust operators with the controls to our systems. This is subversive, in that typical designs are built around the idea of making it impossible for people to do things as a form of protection. If resilience is the goal then we have to develop some way of trusting people, because we need them to act in situations where they are the only people available to act. Show the lift points. Almost all heavy machinery has markers to show where to lift it. We do that because we know the system is going to be moved. We have to start showing lift points in the system itself. Where can I pick it up and move it? How could I take it out of the current system and replace it with something else? We don’t do this well with code, structures, or organizations. (A minimal sketch of operator-facing controls follows this list.)
  • Support mental simulation. People need clear ideas about how the system is configured and running. Operators need to be able to mentally simulate what will happen if they take some action, and that requires deep knowledge of the system and its status. Designers have to present that information to operators so they can see what’s going on. You need to think about the kinds of situations operators are going to confront and the kinds of work they are going to do, and you have to support the mental simulation they will do as they try to figure out how to make the system work or recover.
  • Open your objects and open your methods. Hiding everything about the details of your system behind levels of abstraction has run its course. The idea that we can make black-box devices of which we know only the shell, with no knowledge of the internal workings, is a deep mistake. It turns out that to reason about how the system is working, we have to have knowledge of what’s inside the black box.
  • Deep-six the don’t-touch-me’s. There are no parts of the system that are closed.
  • Empower operator learning.
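
As a minimal, hypothetical sketch of what "reveal the controls" and "open your objects" could look like for a service, here is live internal state plus one real operator lever exposed over plain HTTP (the endpoint names and state fields are invented for illustration):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Invented example state; a real service would expose its actual internals.
    STATE = {"mode": "normal", "queue_depth": 0, "workers": 4, "last_error": None}

    class OpsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Open your objects: show the 3 AM operator what's inside the box.
            if self.path == "/state":
                body = json.dumps(STATE).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

        def do_POST(self):
            # Reveal the controls: trust the operator with a real lever,
            # here a switch that sheds load instead of letting the system fall over.
            if self.path == "/controls/degrade":
                STATE["mode"] = "degraded"
                self.send_response(204)
                self.end_headers()
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), OpsHandler).serve_forever()

The design choice is the point: the operator endpoint is a first-class part of the system, not a bolt-on, which is what it takes to trust the people who will be the only ones available to act.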

What’s the resilience agenda?

  • Operations are competent to hold the keys to the systems we build. You should be willing to hand over the keys to the operators.
  • Make resilience engineering the first priority of design for next gen systems.
  • Commit resources to discovering, understanding and supporting resilience through the system life-cycle.

Final Thoughts

Unfortunately there’s no easy way to wrap all this up into a tight little TLDR. “Be resilient” just sounds a little silly after experiencing the vast richness of the subject. It’s a pool of infinite depth.

And some of it is outright depressing. The idea that we need to keep increasing the pace of innovation just to stave off the next collapse is sobering. That there are limits to growth. That the same interconnectivity we crave as a way of bringing more richness to the world is also the seed of inevitable collapse.

It confounds me that software is simply not the right tool for creating software. And I find it perplexing, in an all too human way, that diversity and heterogeneity are such key aspects of resilience, yet we continually find ourselves shunted onto platforms of convenience.

Dr. Richard Cook had a more easily recognizable path, one that DevOps is well on its way to following. Leading companies have been successfully unifying the System as Imagined with the System as Found, so that there aren’t these disparate communities formed around a system. Learning is being pushed up to the developers and back down through the code so the System as Found can become wedded to the System as Imagined through the entire stack.

But what Dr. Cook asks for is something developers can’t deliver: such a clear understanding of a complex system that you can hold it in the palm of your hand, turn it, twist it, interrogate it, and make it dance to your tune. Complex systems can only be built incrementally, which means there is only ever an incremental understanding of how the whole thing works, which means it can never be opened to the degree he wishes. A system will always be in large part subconscious, just as in the human brain the conscious mind is only the smallest window on a vast subconscious mind.

If there is a common theme it’s that shit happens, you can’t predict it, you can’t stop it, but you can be prepared for the next transition.