Major Websites Down: Or Why You Want to Run in Two or More Data Centers.
A lot of sites hosted in San Francisco are down because of at least 6 back-to-back power outages. More details at laughingsquid.
Sites like SecondLife, Craigslist, Technorati, Yelp and all Six Apart properties (TypePad, LiveJournal and Vox) are down. The cause was an underground explosion in a transformer vault under a manhole at 560 Mission Street. Flames shot 6 feet out from the manhole cover. Over 30,000 PG&E customers are without power.
What's perplexing is that the UPS backup and diesel generators didn't kick in to bring the datacenter back online. I've never toured that datacenter, but they usually have massive backup systems. It's probably one of those multiple simultaneous failure situations that you hope never happen in real life, but too often do. Or maybe the infrastructure wasn't rolled out completely.
Update: the cause was a cascade of failures in a tightly coupled system that could never happen :-) Details at Failure Happens: A summary of the power outage at 365 Main.
It's just these sorts of emergencies that make us think. How would you handle a similar failure? How can you make your website span more than one data center? Is the cost of putting in all this infrastructure worth it compared to your website being down for a day? All good questions, hard to answer and even harder to implement.
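One rough way to frame the cost question is a back-of-the-envelope comparison between expected outage losses and the yearly cost of a second site. The sketch below is only an illustration: the function names and every number in it are hypothetical, it assumes a second data center would fully absorb the outage, and it ignores harder-to-price losses such as reputation and customer churn.

```python
def expected_annual_outage_cost(revenue_per_hour, outage_hours, outages_per_year):
    """Expected yearly revenue lost to data-center-wide outages."""
    return revenue_per_hour * outage_hours * outages_per_year


def second_site_pays_off(revenue_per_hour, outage_hours, outages_per_year,
                         second_site_cost_per_year):
    """True when expected outage losses exceed the yearly cost of a second site."""
    loss = expected_annual_outage_cost(revenue_per_hour, outage_hours, outages_per_year)
    return loss > second_site_cost_per_year


# Hypothetical inputs: $2,000/hour in revenue, a day-long outage,
# and one such outage expected every two years.
loss = expected_annual_outage_cost(revenue_per_hour=2_000,
                                   outage_hours=24,
                                   outages_per_year=0.5)
print(loss)  # 24000.0

# With a hypothetical $60,000/year second site, the numbers alone say "no".
print(second_site_pays_off(2_000, 24, 0.5, second_site_cost_per_year=60_000))  # False
```

With these made-up inputs the second site does not pay for itself on lost revenue alone, which is part of why most companies skip it. Raise the hourly revenue or the outage frequency and the answer flips.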
Geo-distributed clustering is not easy, which is why most companies don't do it, but for some help distributing your website take a look at /tags/geo-distributed-clusters.