
A Retrospective on the Recent AWS Outage and Resilient Cloud-Based Architecture

A bit over a month ago Amazon experienced its infamous AWS outage in the US East region. As a cloud evangelist, I was intrigued by the course of the outage as it unfolded. There were great posts during and after the outage from those who went down. But more interesting to me as an architect were the detailed posts of those who survived the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug's CTO, Twilio, Bizo and others.

Reading through these experiences, I tried to summarize the patterns, principles and best practices that emerged from the posts, as I believe we can learn a lot from them about how to design our business applications to truly leverage the high availability and scalability the cloud offers.

The main principles, patterns and best practices are:

  • Design for failure
  • Stateless and autonomous services
  • Redundant hot copies spread across zones
  • Spread across several public cloud vendors and/or private cloud
  • Automation and monitoring
  • Avoiding ACID services and leveraging NoSQL solutions
  • Load balancing
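To make the first principle concrete, here is a minimal sketch of "design for failure": wrap calls to a dependency in retries with exponential backoff, and degrade gracefully to a fallback (say, a stale cached value) rather than crashing when the dependency stays down. The function and service names here are hypothetical, for illustration only:

```python
import random
import time

def call_with_retries(operation, retries=3, base_delay=0.1, fallback=None):
    """Call `operation`, retrying with exponential backoff and jitter.

    If every attempt fails, return `fallback` (e.g. a stale cached
    value) instead of propagating the outage to the caller.
    """
    for attempt in range(retries):
        try:
            return operation()
        except Exception:
            if attempt == retries - 1:
                return fallback  # degrade gracefully rather than crash
            # Exponential backoff with jitter, so a crowd of retrying
            # clients does not hammer a recovering service in lockstep.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Example: a (made-up) service that fails twice, then recovers.
attempts = {"n": 0}
def flaky_service():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("zone unavailable")
    return "fresh value"

print(call_with_retries(flaky_service, retries=3, base_delay=0.01))
```

The same shape applies at every layer: the application assumes any zone, instance or dependency can disappear, and plans its response in advance instead of discovering the failure mode in production.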

Looking at the above principles, patterns and best practices, they all make perfect sense and seem fundamental for any enterprise architect. So I started wondering why so many modern systems don't apply them (as evidenced by the systems that failed during the Amazon outage, as well as during similar cloud infrastructure failures). As Forrester states:

Few of today's business applications are designed for elastic scaling, and most of those few involve complex coding unfamiliar to most enterprise developers. 

Designing such architectures requires an architect experienced and confident in distributed and scalable systems. Typical public cloud APIs also require developers to perform complex coding against various non-standard APIs that are not common knowledge. Similar difficulties arise in testing, operating, monitoring and maintaining such systems. This makes it quite difficult to implement the above patterns to ensure an application's resilience and scalability, and it diverts valuable development time and resources from the application's business logic, which is its core value.

The emerging solution to this complexity is a new class of application servers that offers to take care of the high availability and scalability concerns of your application, allowing you to focus on your business logic. Forrester calls these "Elastic Application Platforms", and defines them as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

You can read more about elastic application platforms and see a reference elastic application platform (GigaSpaces) that implements the above principles, patterns and best practices here.

Reader Comments (3)

Best practice includes avoiding ACID? And then we wonder why enterprise products don't support these idioms? Are the writers assuming that cloud is only good for writing software where it really doesn't matter if some data is lost or duplicated?

"Enterprise" (gosh I hate that word) solutions often involve data with healthcare status, invoicing, payments, etc. These do require quite strict solutions where data should be trusted.

June 9, 2011 | Unregistered Commenterburmanm

@burmanm - I agree. I don't see any reason why correctly architected SQL databases with redundancy shouldn't be resilient. They've been used for much longer than NoSQL alternatives, so the patterns should be better understood with them if anything. Let's not throw the baby out with the bath water now.

June 13, 2011 | Unregistered CommenterJackson

Relational databases have indeed been around for a long time, and ACID provides a good guarantee of data integrity under distributed transactions. However, many of today's high-end systems face huge volumes of data, very high throughput and/or very low latency requirements (think Facebook, Twitter, etc.). RDBMS fail to meet these challenges (would you hold a lock under a distributed ACID transaction while pushing a tweet to all your followers across the globe?). On the other hand, many of these systems do not need full ACID and are willing to trade some of it for performance (e.g. if your tweet reaches some of your followers with a few seconds' delay rather than atomically, that is an acceptable relaxation of the constraint). This was the motivation for the NoSQL movement.
If you read my second post, you saw that I have nothing against SQL as a language, nor even against relational databases. For batch calculations of aggregated statistics, for instance, a relational database and full SQL capabilities are appropriate. But for real-time analytics calculations, a more scalable and efficient solution is called for. Different flows in your system can leverage different technologies based on the required SLA; that is where the art and science of architecture come into play.
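To make the tweet example concrete, here is a minimal sketch of the eventual-consistency tradeoff: the write is accepted immediately and propagated to followers by an asynchronous fan-out worker, instead of inside one distributed ACID transaction. All names are hypothetical, and an in-memory queue stands in for a real message broker:

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for a message queue and per-follower
# timelines; in a real system these would be a broker and a data store.
fanout_queue = deque()
timelines = defaultdict(list)
followers = {"alice": ["bob", "carol"]}

def post_tweet(author, text):
    """Accept the write immediately; followers see it only after fan-out."""
    for follower in followers.get(author, []):
        fanout_queue.append((follower, text))  # enqueue, don't block the writer

def run_fanout_worker():
    """Drain the queue. In production this runs asynchronously, so
    followers may see the tweet a few seconds after it was posted."""
    while fanout_queue:
        follower, text = fanout_queue.popleft()
        timelines[follower].append(text)

post_tweet("alice", "hello")
# Between these two calls the system is temporarily inconsistent:
# bob and carol have not seen the tweet yet, by design.
run_fanout_worker()
print(timelines["bob"])   # ['hello']
```

The window of inconsistency between the write and the fan-out is exactly the "few seconds' delay" relaxation described above, and it is what lets the writer avoid holding a global lock.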

June 20, 2011 | Registered CommenterDotan Horovits
