Heroku Emergency Strategy: Incident Command System and 8 Hour Ops Rotations for Fresh Minds
Wednesday, April 27, 2011 at 8:35AM
Todd Hoff in Strategy, amazon

In Resolved: Widespread Application Outage, Heroku tells the story of how they dealt with the Amazon outage. While taking 100% responsibility for the downtime, they also shared a number of the strategies they used to bring their service back to full working order.

One of Heroku's most interesting strategies wasn't a technical hack at all, but how they consciously went about deploying their Ops personnel in response to the emergency. An outline of their strategy is:

 

Incident Command System

Chrishenn, in a comment on the Heroku post on Hacker News, thinks Heroku was using an emergency response system based on the Incident Command System model: a systematic tool used for the command, control, and coordination of emergency response. A picture of the model from Wikipedia:

I had never heard of ICS before, but it seems worth investigating if you are looking for a proven structure. Chrishenn says it works:

"I've experienced it first hand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability---it will work for teams of nearly any size. I'd be interested in seeing if any other technology companies/backend teams are using it."

Lessons Learned

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.