Are long VM instance spin-up times in the cloud costing you money?
That's the question that immediately came to mind when James Urquhart, in an interview at the Strata Conference, made this thought-provoking comment: the faster you can get the resources into the hands of the people who use them, the more money you save overall.
One of the many superpowers of the cloud is elasticity, the ability to dynamically acquire and release resources in response to demand. But as with any good superhero, that strength must also form the basis of a not quite fatal flaw. Years and years of angsty episodes are usually required to explore this contradiction.
In the case of the cloud, the weakness reveals itself in slow VM spin-up times. Spinning up a VM in EC2 can take as little as 1-3 minutes, can average 5-10 minutes, or can take much longer if there's heavy usage in your availability zone. EC2 is not alone. A common complaint about Google App Engine is the cold-start problem. When a request comes in, an application must be initialized to handle it, which takes time, which means the end user experiences increased latency.
This means that with VM-oriented systems, whether IaaS or PaaS, your ability to deal with bursty traffic is much more limited in the cloud than you might have expected. You could of course reserve capacity, but that kind of defeats the point of elasticity, and really just moves the point where the problem occurs further down the curve. App Engine will have a feature to keep warm instances around, but you will pay for those too, which again defeats the point of on-demand, pay-for-what-you-use elasticity.
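To put rough numbers on that, here is a minimal Python sketch of a traffic burst hitting a service that has to wait for extra capacity to spin up. Everything in it is an assumption made for illustration: the function name `unserved_requests`, the traffic shape, and the idea that extra capacity is requested the first minute demand exceeds the baseline and arrives `spin_up_minutes` later.

```python
def unserved_requests(demand, base_capacity, burst_capacity, spin_up_minutes):
    """Count requests dropped while waiting for extra capacity (toy model).

    demand: requests per minute, one entry per minute.
    Extra capacity is requested the first minute demand exceeds base_capacity
    and becomes usable spin_up_minutes later.
    """
    unserved = 0
    trigger = None  # minute at which the scale-up was requested
    for minute, load in enumerate(demand):
        if trigger is None and load > base_capacity:
            trigger = minute
        ready = trigger is not None and minute >= trigger + spin_up_minutes
        capacity = burst_capacity if ready else base_capacity
        unserved += max(0, load - capacity)
    return unserved

# Hypothetical burst: 5 quiet minutes, 20 busy minutes, 5 quiet minutes.
demand = [100] * 5 + [300] * 20 + [100] * 5
fast = unserved_requests(demand, 100, 300, spin_up_minutes=2)
slow = unserved_requests(demand, 100, 300, spin_up_minutes=10)
print(fast, slow)
```

In this toy model the number of dropped requests scales directly with the spin-up time, which is the intuition the rest of the post builds on.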
All this might not matter if it weren't for the idea that those spin-up times could cost you. Joe Weinman has taken a more formal look at this problem in his paper Time is Money: The Value of “On-Demand,” and this is the paper James was referring to when he made his observation.
Joe Weinman is the founder of Cloudonomics, a rigorous analytical approach that uses mathematics and Monte Carlo simulation to characterize the sometimes counterintuitive, multi-dimensional business of cloud computing and pay-per-use business models. He has written a string of interesting papers, available on his website. Some of the titles include: Smooth Operator: The Value of Demand Aggregation (PDF); Cloud Computing is NP-Complete (PDF); Mathematical Proof of the Inevitability of Cloud Computing (PDF). At the core of these papers are many, many pages of rigorous mathematical analysis, but fortunately this creamy goodness is bookended with chocolatey cookies explaining what it all means.
From the abstract of Time is Money: The Value of “On-Demand”:
Cloud computing and related services offer resources and services "on demand." Examples include access to "video on demand" via IPTV or over-the-top streaming; servers and storage allocated on demand in "infrastructure as a service;" or "software as a service" such as customer relationship management or sales force automation. Services delivered "on demand" certainly sound better than ones provided "after an interminable wait," but how can we quantify the value of on-demand, and the scenarios in which it creates compelling value?
We show that the benefits of on-demand provisioning depend on the interplay of demand with forecasting, monitoring, and resource provisioning and de-provisioning processes and intervals, as well as likely asymmetries between excess capacity and unserved demand.
In any environment with constant demand or demand which may be accurately forecasted to an interval greater than the provisioning interval, on-demand provisioning has no value. However, in most cases, time is money. For linear demand, loss is proportional to demand monitoring and resource provisioning intervals. However, linear demand functions are easy to forecast, so this benefit may not arise empirically.
For exponential growth, such as found in social networks and games, any non-zero provisioning interval leads to an exponentially growing loss, underscoring the critical importance of on-demand in such environments.
For environments with randomly varying demand where the value at a given time is independent of the prior interval—similar to repeated rolls of a die—on-demand is essential, and generates clear value relative to a strategy of fixed resources, which in turn are best overprovisioned.
For demand where the value is a random delta from the prior interval—similar to a Random Walk—there is a moderate benefit from time compression. Specifically, reducing process intervals by a factor of n results in loss being reduced to a level of 1/square root of n of its prior value. Thus, a two-fold reduction in cost requires a four-fold reduction in time.
Finally, behavioral economic factors and cognitive biases such as hyperbolic discounting, perception of wait times, neglect of probability, and normalcy and other biases modulate the hard dollar costs addressed here.
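The square-root rule for random-walk demand is the least intuitive claim in the abstract, so here is a minimal Python sketch that checks it on a toy model. The model is an assumption for illustration, not the paper's own: demand moves up or down by one unit per time step, capacity is reset to match demand once per provisioning interval, and loss is taken to be the expected gap between the two at the end of the interval. The name `expected_gap` is made up.

```python
from math import comb

def expected_gap(interval):
    """Exact expected |demand - capacity| after `interval` steps of a +/-1 random walk."""
    # With j up-moves out of `interval` steps, the walk ends at 2*j - interval.
    # Each such outcome has probability comb(interval, j) / 2**interval.
    total = sum(abs(2 * j - interval) * comb(interval, j) for j in range(interval + 1))
    return total / 2 ** interval

slow = expected_gap(400)   # provisioning every 400 steps
fast = expected_gap(100)   # provisioning four times as often
ratio = slow / fast
print(f"gap at 400 steps: {slow:.2f}, at 100 steps: {fast:.2f}, ratio: {ratio:.3f}")
```

Under these assumptions the ratio comes out close to 2: cutting the interval to a quarter only halves the expected gap, which matches the abstract's four-fold reduction in time for a two-fold reduction in cost.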
The degree of effect is related to traffic patterns:
We have seen that not only is there a time value of money, there is a money value of time, specifically, increased agility and responsiveness lead to reduced loss, including a reduction in missed opportunities. Time is money.
From a business perspective, one has to ask whether the reduction in monitoring or provisioning time that potentially results in reduced loss due to unserved demand or unused resources is worth it. I believe in most cases the answer is yes. The reason is that the costs of implementing such on-demand strategies are largely fixed, are a relatively minor portion of the total cost, or are already incorporated, say, into a cloud provider's offerings. For example, the cost for an enterprise or cloud provider to acquire and deploy dynamic provisioning software compared to the losses associated with unserved demand or unutilized capacity make it an attractive proposition.
For linearly growing or declining demand, a reduction in time (monitoring cycle or resource provisioning) offers a proportional reduction in cost.
For exponential demand, the loss associated with even fixed interval provisioning grows exponentially, so on-demand provisioning is essential.
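The difference between the linear and exponential cases can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than the paper's derivation: capacity is assumed to track demand as it was `lag` time units ago, the shortfall is the difference between current demand and that stale capacity, and the two demand curves and their growth rates are hypothetical.

```python
import math

def shortfall(demand, t, lag):
    """Unserved demand at time t when capacity equals demand as it was `lag` earlier."""
    return demand(t) - demand(t - lag)

def linear(t):
    return 3.0 * t            # hypothetical: demand grows by 3 units per time step

def exponential(t):
    return math.exp(0.1 * t)  # hypothetical: 10% continuous growth per time step

# Linear demand: the shortfall depends only on the lag, and halving the lag halves it.
linear_full = shortfall(linear, t=100, lag=10)
linear_half = shortfall(linear, t=100, lag=5)

# Exponential demand: with the same fixed lag, the shortfall keeps growing over time.
exp_early = shortfall(exponential, t=10, lag=5)
exp_late = shortfall(exponential, t=20, lag=5)

print(linear_full, linear_half)
print(exp_late / exp_early)
```

With linear demand a fixed lag costs a fixed amount, so shrinking the lag pays back proportionally. With exponential demand the same fixed lag costs more every time you look, growing at the same rate as demand itself.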
The VM spin-up interval is your period of lost opportunity. If your traffic is bursty and/or growing exponentially, then you may be losing out on more profitable opportunities than you thought, because cloud elasticity doesn't match demand elasticity. While not quite the cloud's kryptonite, it is a flaw worth considering in your architecture.