Dream with me a little bit. Your startup becomes wildly successful. Hard work and random chance have smiled on you. To keep flirting with Lady Luck, your system must scale. But how much stuff (space, hardware, software, etc.) will you need to handle the growth, when will you need it, and when will you need more?
That's what Flickr's John Allspaw helps you figure out in his groundbreaking new book on capacity planning: The Art of Capacity Planning: Scaling Web Resources.
When I read statements about The Art of Capacity Planning like "capacity planning is a term that to me means paying attention," "all the information you need to make an educated forecast is in your historical metrics," and "startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach," I get the same sea-change feeling I felt when the industry ran from waterfall design and embraced agile design. Current capacity planning is heavy. All up-front. Too analytical and too divorced from real life.
Other capacity planning books assault you with models, math, and simulations. Who has the time? John has developed a common-sense, low-math approach to capacity planning that works using the system you already have. John's goal is to have you say: "Oh, right, duh. That's common sense, not voodoo."
Here's my email interview with John Allspaw on The Art of Capacity Planning. Enjoy.
I'm John Allspaw. I manage the Operations team at Flickr.com, and I've written a book (The Art of Capacity Planning: Scaling Web Resources) about capacity planning for growing websites.
This book is basically a guide to adaptive capacity planning for growing websites. It's an approach that relies much less on benchmarking and simulation, than on the close observation of production loads to guide future decisions. It's not rocket science, and I'm hoping people can use it to justify the what, why, and when of getting more resources to allow them to grow as fast as they need to. It's worked really well for me at Flickr and other organizations.
Capacity planning is a term that to me means paying attention. Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis. Things like: my database can do X queries per second before it keels over. Or my cache can only keep Y minutes worth of changing objects. You're not going to predict every failure mode of the whole system, but knowing the failure modes of individual pieces should be considered mandatory. Armed with that, you can make decent forecasts about the future.
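As a rough illustration of forecasting from historical metrics, here is a minimal Python sketch. It assumes you already know a component's ceiling (the "X queries per second" above) and have a daily series of observed peaks. The function names, the sample numbers, and the choice of a straight-line fit are all assumptions made for this example, not something taken from the book.

```python
from typing import Optional, Sequence, Tuple


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * day + intercept, with day = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_ceiling(ys: Sequence[float], ceiling: float) -> Optional[float]:
    """Days after the last observation until the fitted trend reaches `ceiling`.

    Returns None when the trend is flat or falling, so no crossing is predicted.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    last_day = len(ys) - 1
    day_hit = (ceiling - intercept) / slope
    return max(0.0, day_hit - last_day)


# Hypothetical data: daily peak queries per second on one database,
# which is assumed to keel over at 600 queries per second.
peak_qps = [400, 420, 440, 460, 480]
print(days_until_ceiling(peak_qps, ceiling=600))  # → 6.0
```

A straight line is the simplest possible trend. Real traffic is often seasonal or bursty, so treat the result as a prompt to go look at the graphs rather than as a purchase date.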
Func is used to manage a large network using bash or Python scripts. It targets easy and simple remote scripting and one-off tasks by creating a secure (SSL certificates) XML-RPC API for communication, so you don't have to script everything over SSH. Any kind of application can be written on top of it. Other configuration management tools specialize in mass configuration. They say "here's what the machine should look like" and keep it that way. Func allows you to program your cluster. If you've ever tried to securely remote script a gang of machines using SSH keys, you know what a total nightmare that can be.
Some example commands:
Using the command line:
func "*.example.org" call yumcmd update
Using the Python API:
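import func.overlord.client as fc

# Target every host whose name matches either glob (globs are separated by ";")
client = fc.Client("*.example.org;*.example.com")
client.yumcmd.update()                # run a yum update on each matched host
client.service.start("acme-server")   # start the acme-server service on each matched host
print(client.hardware.info())         # print the hardware details reported by the matched hosts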
Func certainly overlaps in functionality with other tools like Puppet and cfengine, but as programmers we always need more than one way to do it, and I can definitely see how I could have used Func on a few projects.
Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?
Greg Linden links to a lesson-laden LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how-to-configure-a-web-server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in-the-trenches experience. Many non-obvious topics are covered, and there's a lot to learn from them.
The paper has too many details to cover here, but the big sections are:
In the recommendations we see some of our old favorites:
From their website:
SystemImager is software that makes the installation of Linux to masses of similar machines relatively easy. It makes software distribution, configuration, and operating system updates easy, and can also be used for content distribution.
SystemImager makes it easy to do automated installs (clones), software distribution, content or data distribution, configuration changes, and operating system updates to your network of Linux machines. You can even update from one Linux release version to another!
From Wikipedia:
SmartFrog is an open-source software framework, written in Java, that manages the configuration, deployment and coordination of a software system broken into components. These components may be distributed across several network hosts. The configuration of components is described using a domain-specific language, whose syntax resembles that of Java. It is a prototype-based object-oriented language, and may thus be compared to Self. The framework is used internally in a variety of HP products. Also, it is being used by HP Labs partners like CERN.
Puppet implements a declarative (what, not how) configuration language for automating common administration tasks. It's the system every large site writes for itself, and it's already made for you! iLike was able to "easily" scale from 0 to hundreds of servers using Puppet. I can't believe I've never seen this before. It looks really cool. What is Puppet and how can it help you scale your website operations?
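To make the "what, not how" idea concrete, here is a small Python sketch of the general declarative pattern: you state the desired end state, and a tool works out which changes are needed. This is not Puppet's language or its implementation. The function name, the service names, and the states are all hypothetical and exist only for illustration.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the actions needed to move `actual` toward `desired`.

    Items already in the desired state produce no action, so running the
    plan against a converged machine yields an empty list.
    """
    actions = []
    for name, state in sorted(desired.items()):
        if actual.get(name) != state:
            actions.append(f"set {name} -> {state}")
    return actions


# Hypothetical desired state ("what") and the state observed on one machine.
desired = {"ntp": "running", "sshd": "running", "telnetd": "stopped"}
actual = {"ntp": "stopped", "sshd": "running"}

for action in plan_changes(desired, actual):
    print(action)
# prints:
# set ntp -> running
# set telnetd -> stopped
```

The point of the pattern is that the description never changes with the machine's current condition. Only the computed list of actions does, which is what makes it safe to apply the same description to hundreds of servers.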
Jesse Robbins at O'Reilly Radar has a nice post on how spending a little up-front time figuring out how to scale your operations process saves money on ops people and saves time when adding and upgrading servers. Adding, monitoring, and upgrading servers can get so incredibly screwed up that a herd of squirrels has to work overtime just to put out a release. Or it can be one-button simple, from your automated build system out to your servers. This is one area where "do the simplest thing that could possibly work" is a dumb idea, and Jesse does a good job capturing the advantages of doing it right.