Retrospect on recent AWS outage and Resilient Cloud-Based Architecture

A bit over a month ago Amazon experienced its infamous AWS outage in the US East Region. As a cloud evangelist, I was intrigued by the history of the outage as it occurred. There were great posts during and after the outage from those who went down. But more interestingly for me as architect were the detailed posts of those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix,SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

Reading through the experience of others, I tried to summarize the patterns, principles and best practices that emerged from these posts, as I believe we can learn a lot from them on how to design our business applications to truly leverage on the benefits that the cloud offers in high availability and scalability.

The main principles, patterns and best practices are:

  • Design for failure
  • Stateless and autonomous services
  • Redundant hot copies spread across zones
  • Spread across several public cloud vendors and/or private cloud
  • Automation and monitoring
  • Avoiding ACID services and leveraging on NoSQL solutions
  • Load balancing

Looking at the above principles, patterns and best practices they all make perfect sense, and seem fundamental for any enterprise architect. So I started wondering how come so many modern systems don't apply them (as evident by the systems that failed during the Amazon outage as well as on similar cloud infrastructure failures). As Forrester states:

Few of today's business applications are designed for elastic scaling, and most of those few involve complex coding unfamiliar to most enterprise developers.

It requires an experienced and confident architect in the areas of distributed and scalable systems to design such architectures. The typical public cloud APIs also require developers to perform complex coding and utilize various non-standard APIs that are usually not common knowledge. Similar difficulties are found in testing, operating, monitoring and maintaining such systems. This makes it quite difficult to implement the above patterns to ensure the application’s resilience and scalability, and diverts valuable development time and resources from the application’s business logic that is the core value of the application.

The emerging solution to this complexity is a new class of application servers that offers to take care of the high availability and scalability concerns of your application, allowing you to focus on your business logic. Forrester calls these "Elastic Application Platforms", and defines them as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

You can read more about elastic application platforms and see a reference elastic application platform (GigaSpaces) that implements the above principles, patterns and best practices here.