How to get started with sizing and capacity planning, assuming you don't know the software behavior?
Here's a common situation and question from the mechanical-sympathy Google group by Avinash Agrawal on the black art of capacity planning:
How to get started with sizing and capacity planning, assuming we don't know the software behavior and it's a completely new product to deal with?
Start with requirements. I see way too many "capacity planning" exercises that go off spending weeks measuring some irrelevant metrics about a system (like how many widgets per hour can this thing do) without knowing what they actually need it to do.
There are two key sets of metrics to state here: the "how much" set and the "how bad" set:
In the "How Much" part, you need to establish, based on expected business needs, numbers for the things (like connections, users, streams, transactions, or messages per second) that you expect to interact with at the peak time of normal operations, and the growth rates of those metrics that you need to be able to keep up with. Also state expected things like data set size, data interaction rates, and data set growth rates.
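To make this concrete, the "how much" metrics can be written down as plain data before any testing starts. The numbers below are purely hypothetical, stand-ins for whatever your business actually expects:

```python
# Sketch: stating "how much" requirements as data, using hypothetical
# business-driven numbers for an imaginary messaging system.
how_much = {
    "peak_messages_per_sec": 50_000,       # at peak time of normal operations
    "peak_concurrent_connections": 10_000,
    "growth_rate_per_year": 0.30,          # 30% expected yearly growth
    "data_set_size_gb": 500,
    "data_set_growth_gb_per_month": 20,
}

# The capacity the system must keep up with one year out:
next_year_peak = how_much["peak_messages_per_sec"] * (
    1 + how_much["growth_rate_per_year"]
)
```

Writing the numbers down this way forces the "based on expected business needs" conversation to actually happen, and gives the later experiments a concrete target.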
For the "How Bad" part, you need to make sure your metrics include a description of what acceptable behavior is, remembering that without describing what is not acceptable, you have not described what acceptable is. Be specific: saying "always fast" is not nearly as useful as "never slower than X", and saying "mostly" (or "on average") is usually a way to avoid facing (or even considering) the potential consequences of non-typical behavior. The best approach here is to think of how often it is ok to have certain levels of bad things happen. (Don't get too greedy and ask for "perfect" here, or you'll get a big bill at the end.) So consider things like how often it is ok for the system to be out of commission for longer than X (for multiple values of X, like a year, a week, a day, an hour, a minute, etc.). Also consider how often it is ok for the system to react in longer than T (for multiple values of T, like an hour, a minute, a second, 50msec, etc.). Both of these are usually best stated as levels at percentiles, with availability stated at percentiles of time, and responsiveness stated at percentiles of actual interactions. Don't forget to state the worst acceptable case for each.
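A percentile-based "how bad" requirement can be checked mechanically against measured interactions. This is a minimal sketch; the function name and the thresholds in `reqs` are hypothetical examples, not prescribed values:

```python
# Sketch: checking measured per-interaction response times against
# percentile-based "how bad" requirements, including the worst case.
def check_responsiveness(latencies_ms, requirements):
    """latencies_ms: response time of each actual interaction, in ms.
    requirements: list of (percentile, max_latency_ms) pairs; include a
    (100.0, ...) entry to state the worst acceptable case.
    Returns a list of (percentile, observed, limit) failures."""
    ordered = sorted(latencies_ms)
    failures = []
    for pct, limit in requirements:
        if pct >= 100.0:
            observed = ordered[-1]  # the single worst interaction seen
        else:
            # nearest-rank value at the requested percentile of interactions
            idx = min(len(ordered) - 1, int(len(ordered) * pct / 100.0))
            observed = ordered[idx]
        if observed > limit:
            failures.append((pct, observed, limit))
    return failures

# Hypothetical requirement: 99% of interactions under 50ms, 99.9% under
# 200ms, and never slower than 1000ms (the worst acceptable case).
reqs = [(99.0, 50.0), (99.9, 200.0), (100.0, 1000.0)]
```

Note that the worst-case entry is checked against the single slowest interaction, which is exactly why "never slower than X" is a testable statement while "always fast" is not.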
Once you have a feel for business-driven requirements stated with "how much" and "how bad" metrics, design a set of experiments to test "how much" (measured in whatever capacity metrics your requirements use) the system can handle without EVER showing even the slightest hint of failing your "how bad" availability and responsiveness requirements. This will invariably include repeated testing under a wide range of "how much" levels to see how far things go before they start to fail.
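The ramp described above can be sketched as a simple loop: step through increasing "how much" levels and keep the highest one that never hints at failing the "how bad" requirements. Here `run_at_load` and `meets_requirements` are hypothetical stand-ins for your actual load generator and requirement checks:

```python
# Sketch: find the highest "how much" level the system handles without
# ever failing the "how bad" availability/responsiveness requirements.
def find_capacity(load_levels, run_at_load, meets_requirements):
    """load_levels: increasing "how much" levels to test (e.g. msgs/sec).
    run_at_load: runs the experiment at a level, returns measurements.
    meets_requirements: True only if ALL "how bad" requirements passed."""
    capacity = None
    for level in load_levels:
        results = run_at_load(level)
        if not meets_requirements(results):
            break                 # even a hint of failure ends the ramp
        capacity = level          # highest level that fully passed so far
    return capacity
```

In practice each level would be run repeatedly and for long enough to expose non-typical behavior, not just once.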
Then run your experiments...
The rest, like padding for business requirements that underestimate reality, and for measurements that turn out to be optimistically wrong in various ways, is a relatively easy exercise of arm wrestling between wanting to sleep well at night and wanting to have more beans left to count.
An important note: Before you run the actual experiments and start considering your results, validate that the experimental setup can actually measure what you want. The best way to do that is by artificially introducing certain conditions and verifying that the setup correctly reports on what you know to have actually happened. My favorite tools for this step are physically pulling out network cables, pulling power cords, and using ^Z (or equivalent signals). You may find yourself spending a good amount of time calibrating the experimental setup so that you can actually trust the results, but that is time well spent, as wasting your time (and risking your business) by analyzing and relying on badly measured data is a very expensive proposition.
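The same calibration idea applies in software: inject a condition of known size and check that the measurement harness reports it. A minimal sketch, using an artificial stall as the known condition:

```python
import time

# Sketch: calibrating a measurement harness by injecting a known
# condition (an artificial ~100ms stall) and verifying the harness
# actually observes it.
def measure_ms(fn):
    """Time a call the same way the real harness would, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0

def injected_stall():
    time.sleep(0.1)  # a condition we KNOW happened: a ~100ms pause

observed_ms = measure_ms(injected_stall)

# If the harness cannot even see an injected 100ms stall, none of its
# other numbers can be trusted.
setup_trustworthy = observed_ms >= 100.0
```

This is the ^Z trick in miniature: you know exactly what happened to the system, so any measurement setup that reports otherwise is lying to you.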