Kim Nash, in an interview with Jonathan Heiliger, Facebook's VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently, as it is usually an ontogeny-recapitulates-phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers, you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?
It sounds like a relatively fun environment for pushing software live. Getting software moved into production is often harder than the original coding and testing. Now I know what you are thinking. You somehow managed to procure the SSH login, so just log in remotely and do the install yourself! Nobody will know. Oh so tempting. But it's not really good corporate citizenship. And you just might screw up, and then there will be some esplaining to do.
Emphasizing frequent releases and gutsy release policies actually makes it seem like someone is supporting developers instead of treating them like their software carries the plague. Data centers are often treated like quarantine stations, and developers like asymptomatic carriers of some unknown virulent disease. To be safe, nothing should ever change, but that's not an attitude that makes things better. Nice to see that recognized.
To set up or not to set up a separate operations group? Facebook says "to be" and creates a separate group. Amazon says "not to be" and has developers support their own software. Secretly, I think Amazon gets better results by requiring developers to support their own software. Knowing it may be you getting the "It's Down!" call gives one proper perspective. But I like not being on call, and I think most developers agree. Plus, "following the sun" to get 24-hour support is a smart idea.