The Big List of Articles on the Amazon Outage

High Scalability

25 Apr 2011

Please see The Updated Big List Of Articles On The Amazon Outage for a new improved list.

So many great articles have been written on the Amazon Outage. Some aim to be helpful, some chastise developers for being so stupid, some chastise Amazon for being so incompetent, some talk about the pain they and their companies have experienced, and some even predict the downfall of the cloud. Still others say we have seen a sea change in the future of the cloud, a prediction that's hard to disagree with, though the shape of the change remains...cloudy.

I'll try to keep this list updated as more information comes out. There will be a lot for developers to consider going forward. If there's a resource you think should be added, just let me know.

Amazon's Explanation of What Happened

Experiences from Specific Companies, Both Good and Bad

Lessons Netflix Learned from the AWS Outage by several Netflixians on the Netflix Tech Blog
How Heroku Survived the Amazon Outage on the Heroku status page
How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
How Bizo survived the Great AWS Outage of 2011 relatively unscathed... by Someone at Bizo
Joe Stump's explanation of how SimpleGeo survived
How Netflix Survived the Outage
Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering's Blog (Hacker News thread)
On reddit's outage
What caused the Quora problems/outage in April 2011?
Recovering from Amazon cloud outage by Drew Engelson of PBS
- PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability-zones, we weren’t easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being. From Comment

Amazon Web Services Discussion Forum

A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.

There were also many, many instances of support and help in the log.

In Summary

Amazon EC2 outage: summary and lessons learned by RightScale
AWS outage timeline & downtimes by recovery strategy by Eric Kidd
The Aftermath of Amazon’s Cloud Outage by Rich Miller

Taking Sides: It's the Customer's Fault

So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
The AWS Outage: The Cloud's Shining Moment by George Reese (Hacker News discussion)
Failing to Plan is Planning to Fail by Ted Theodoropoulos
Get a life and build redundancy/resiliency in your apps on the Cloud Computing group

Taking Sides: It's Amazon's Fault

Stop Blaming the Customers - the Fault is on Amazon Web Services by Klint Finley
AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
Amazon Web Services are down - Huge Hacker News thread

Lessons Learned and Other Insight Articles

Amazon’s EBS outage by Robin Harris of StorageMojo
People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
Basic scalability principles to avert downtime by Ronald Bradford
Amazon crash reveals 'cloud' computing actually based on data centers by Kevin Fogarty
Seven lessons to learn from Amazon's outage by Phil Wainewright
The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
Some thoughts on outages by Till Klampaeckel
Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
Amazon EC2 has gone down -what would a prefered hosting platform be? on Focus
Single Points of Failure by Mat
Coping with Cloud Downtime with Puppet
Amazon Outage Concerns Are Overblown by Tim Crawford
Where There Are Clouds, It Sometimes Rains by Clay Loveless
Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
Complex Systems Have Complex Failures. That’s Cloud Computing by Greg Ferro
Amazon Web Services, Hosting in the Cloud and Configuration Management by Ian Chilton
Lessons learned from deploying a production database in EC2 by Grig Gheorghiu of Agile Testing
Bezos on Amazon as a technology and invention company by John Gruber on Daring Fireball

Vendor's Vent

Amazon Outage Proves Value of Riak’s Vision by Basho
Magical Block Store: When Abstractions Fail Us by Mark Joyent (Hacker News discussion)
On Cascading Failures and Amazon’s Elastic Block Store by Jason
An unofficial EC2 outage postmortem - the sky is not falling from CloudHarmony
Cloudfail: Lessons Learned from AWS Outage by Jyoti Bansal
