amazon

The Updated Big List of Articles on the Amazon Outage

High Scalability

02 May 2011 — 5 min read

Since The Big List Of Articles On The Amazon Outage was published we've a had few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netlix shared their Lessons Learned from the AWS Outage as did Heroku (How Heroku Survived the Amazon Outage), Smug Mug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).

The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion. There's been almost none that I've seen. My guess is very few people understand what Amazon was talking about enough to comment whereas almost everyone feels qualified to talk about the event itself.

Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.

Amazon's Explanation of What Happened

Experiences from Specific Companies, Both Good and Bad

Lessons Netflix Learned from the AWS Outage by several Netflixians on the Netflix Tech Blog
How Heroku Survived the Amazon Outage on the Heroku status page
How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
How Bizo survived the Great AWS Outage of 2011 relatively unscathed... by Someone at Bizo
Joe Stump's explanation of how SimpleGeo survived
How Netflix Survived the Outage
Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering's Blog (Hacker News thread)
On reddit's outage
What caused the Quora problems/outage in April 2011?
Availability, redundancy, failover and data backups at LearnBoost
How our small startup survived the Amazon EC2 Cloud-pocalypse from mobile app developer
Recovering from Amazon cloud outage by Drew Engelson of PBS.
- PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability-zones, we weren’t easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being. From Comment

Amazon Web Services Discussion Forum

A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.

There were also many many instances of support and help in the log.

In Summary

Amazon EC2 outage: summary and lessons learned by RightScale
AWS outage timeline & downtimes by recovery strategy by Eric Kidd
The Aftermath of Amazon’s Cloud Outage by Rich Miller

Taking Sides: It's the Customer's Fault

So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
The AWS Outage: The Cloud's Shining Moment by George Reese (Hacker News discussion)
Failing to Plan is Planning to Fail by Ted Theodoropoulos
Get a life and build redundancy/resiliency in your apps on the Cloud Computing group

Taking Sides: It's Amazon's Fault

Stop Blaming the Customers - the Fault is on Amazon Web Services by Klint Finley
AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
Amazon Web Services are down - Huge Hacker News thread
The EC2/EBS outage: What Amazon didn’t tell you by Jeremy Gaddis

Lessons Learned and Other Insight Articles

Amazon’s EBS outage by Robin Harris of StorageMojo
People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
Basic scalability principles to avert downtime by Ronald Bradford
Amazon crash reveals 'cloud' computing actually based on data centers by Kevin Fogarty
Seven lessons to learn from Amazon's outage By Phil Wainewright
The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
Some thoughts on outages by Till Klampaeckel
Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
Amazon EC2 has gone down -what would a prefered hosting platform be? on Focus
Single Points of Failure by Mat
Coping with Cloud Downtime with Puppet
Amazon Outage Concerns Are Overblown by Tim Crawford
Where There Are Clouds, It Sometimes Rains by Clay Loveless
Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
Complex Systems Have Complex Failures. That’s Cloud Computing by Greg Ferro
Amazon Web Services, Hosting in the Cloud and Configuration Management by Ian Chilton
Lessons learned from deploying a production database in EC2 by by Grig Gheorghiu of Agile Testing
Bezos on Amazon as a technology and invention company by John Gruber on Daring Fireball.
On Importance of Planning for Failure by Dmitriy Samovskiy

Vendor's Vent

Amazon Outage Proves Value of Riak’s Vision by Basho
Magical Block Store: When Abstractions Fail Us by Mark Joyent (Hacker News discussion)
On Cascading Failures and Amazon’s Elastic Block Store by Jason
An unofficial EC2 outage postmortem - the sky is not falling from CloudHarmony
Cloudfail: Lessons Learned from AWS Outage by Jyoti Bansal

Read more

Kafka 101

Kafka 101

This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn. Originally developed in LinkedIn during 2011, Apache Kafka is one of the most popular open-source Apache projects out there. So far it

Capturing A Billion Emo(j)i-ons

Capturing A Billion Emo(j)i-ons

This blog post was written by Dedeepya Bonthu. This is a repost from her Medium article, approved by the author. In stadiums, sports fans love to express themselves by cheering for their favorite teams, holding up placards and team logos. Emoji’s allow fans at home to rapidly express themselves,

Brief History of Scaling Uber

Brief History of Scaling Uber

This blog post was written by Josh Clemm, Senior Director of Engineering at Uber Eats. This is a repost from his LinkedIn article, approved by the author. On a cold evening in Paris in 2008, Travis Kalanick and Garrett Camp couldn't get a cab. That's when

Behind AWS S3’s Massive Scale

Behind AWS S3’s Massive Scale

This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn. AWS S3 is a service every engineer is familiar with. It’s the service that popularized the notion of cold-storage to the