At Monday's Cloud Computing Meetup, Paco Nathan gave an excellent Getting Started on Hadoop talk (slides). I found one of Paco's strategies particularly interesting: when calculating costs, consider the point at which a service starts charging. Depending on your use case, it may be cheaper to go with a nominally more expensive service that charges only for work accomplished than with one that charges for both work and startup time.
The example compares the cost of running Hadoop on AWS yourself with the cost of using Amazon's prepackaged Hadoop service, Elastic MapReduce (EMR). The thought may have gone through your mind, as it did mine, that it doesn't necessarily make sense to use Amazon's Hadoop service. Why pay a premium for EMR when Hadoop will run directly on AWS?
One reason is that Amazon has made significant changes to Hadoop to make it run more efficiently and easily on AWS. The other, more surprising reason is cost.
When starting a 500-node Hadoop cluster, for example, you have to wait for all the nodes to start and join the cluster before computation can begin. Starting a large number of nodes can take a considerable amount of time. Depending on the machine type and the time of day, it may be difficult to get all the requested machines, which only makes the node startup time longer. Hedge funds, for example, periodically run large jobs, and if you start your run during one of theirs it may take quite a while before all your machines can start.
This matters because on AWS you pay by the hour. You are paying from the time the first node starts until the last node finally starts, even though no productive work has been performed in that window. With EMR you are charged only from the time the first task begins; you are not charged for startup time.
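To make the comparison concrete, here is a rough sketch of the arithmetic in Python. Everything in it is an assumption for illustration: the function names, the hourly rates, the premium, and the startup and job durations are made up, not actual AWS or EMR prices. It also simplifies by billing every node for the whole startup window and rounding up to whole hours.

```python
import math

def self_managed_cost(nodes, startup_hours, job_hours, rate_per_node_hour):
    """Cost when the meter runs during startup as well as during the job.

    Simplification (assumption): every node is billed for the full
    startup window, and billing rounds up to whole hours.
    """
    billed_hours = math.ceil(startup_hours + job_hours)
    return nodes * billed_hours * rate_per_node_hour

def work_only_cost(nodes, job_hours, rate_per_node_hour, premium_per_node_hour):
    """Cost when billing starts with the first task, at a higher hourly rate."""
    billed_hours = math.ceil(job_hours)
    return nodes * billed_hours * (rate_per_node_hour + premium_per_node_hour)

# Hypothetical numbers, chosen only to show the shape of the trade-off.
NODES = 500
STARTUP_HOURS = 1.0   # assumed time for all nodes to join the cluster
JOB_HOURS = 4.0       # assumed time spent on real work
RATE = 0.10           # assumed base price per node-hour
PREMIUM = 0.015       # assumed surcharge per node-hour for the managed service

diy = self_managed_cost(NODES, STARTUP_HOURS, JOB_HOURS, RATE)
managed = work_only_cost(NODES, JOB_HOURS, RATE, PREMIUM)

print(f"self-managed: ${diy:.2f}")
print(f"work-only billing: ${managed:.2f}")
```

With these made-up inputs the service with the higher hourly rate comes out cheaper, because the self-managed cluster pays for an hour in which nothing runs. Shrink the startup time or raise the premium and the result flips, which is why it is worth plugging in your own numbers.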
Paco estimated that startup costs could amount to a double-digit percentage of the total cost, so they are not inconsequential. In certain cases using EMR could be more cost-effective, which I found quite surprising and interesting. It's something to consider, at least in your own calculations and when using other cloud services down the road.