Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.
The paper has too many details to cover here, but the big sections are:
In the recommendations we see some of our old favorites:
Personally, I'm still trying to figure out how to make something simple.
Next are some good thoughts on how to design operations friendly software:
And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.
You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can failover to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.
The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.
My favorite part of the whole paper is:
We have long believed that 80% of operations issues originate in design and development, so this section
on overall service design is the largest and most important. When systems fail, there is a natural tendency
to look first to operations since that is where the problem actually took place. Most operations issues,
however, either have their genesis in design and development are best solved there.
Understand this and I think much of the rest follows naturally.