
The Implications of Punctuated Scalabilium for Website Architecture

Update: How do you design and handle peak load on the Cloud? by Cloudiquity gives a formula for predicting and planning for peak load, and talks about how GigaSpaces XAP, Scalr, RightScale and FreedomOSS can be used to handle peak load within EC2.

Theo Schlossnagle, with his usual insight, talks in Dissecting today's surges about how the nature of internet traffic has evolved over time. Traffic now spikes like a heart attack, larger and more quickly than ever, driven by inflow sources like Digg and The New York Times. Theo relates that at least eight times in the past month his team has seen sudden traffic increases of 100% to 1000% across many of their clients, and that those spikes can happen in as little as 60 seconds. To me this sounds a lot like punctuated equilibrium in evolution, a force that accounts for much creative growth in species...

VMs don't spin up in less than 60 seconds, so your ability to respond to such massive, quick spikes is limited. This assumes, of course, that you've created an architecture that can automatically scale by adding VMs. Such elastic demand is usually met with a reservoir: you keep more VMs in reserve to soak up temporary spikes. But who would do this in reality? Money would be going to nonproductive VMs, so you are likely to have already put those VMs into production.

Interestingly, Theo ties handling sudden unexpected spikes back to performance. We are always told performance and scalability are separate issues. And while I accept this notionally, in my heart of hearts I think they have more in common than not, and I think Theo nails why. A well-performing system acts as a kind of reservoir for handling spikes before you can ever notice there's a spike. That gives you some time to add more resources to your site if a spike continues. Without that reservoir you are just crushed.
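To make the reservoir idea concrete, here is a rough back-of-the-envelope sketch. The function name, the fleet size, and every traffic number are made up for illustration; none of them come from Theo's post. It compares the same fleet at two per-server throughputs against a sudden 200% increase (3x baseline).

```python
import math


def spike_headroom(servers, capacity_per_server, baseline_rps, spike_factor):
    """Compare what a fleet can serve against a sudden traffic spike.

    All inputs are hypothetical: capacity and traffic are in requests/second,
    and spike_factor is the multiple of baseline (3.0 means a 200% increase).
    """
    total_capacity = servers * capacity_per_server
    spike_rps = baseline_rps * spike_factor
    # Traffic that cannot be served until more VMs finish booting.
    shortfall = max(0.0, spike_rps - total_capacity)
    return {
        "total_capacity": total_capacity,
        "spike_rps": spike_rps,
        "absorbed": shortfall == 0,
        "extra_servers": math.ceil(shortfall / capacity_per_server),
    }


# Same fleet size and same spike, different per-server performance.
slow = spike_headroom(servers=10, capacity_per_server=100, baseline_rps=500, spike_factor=3.0)
fast = spike_headroom(servers=10, capacity_per_server=250, baseline_rps=500, spike_factor=3.0)

print(slow)  # the slower fleet has a shortfall and must wait on new VMs
print(fast)  # the faster fleet soaks up the spike with capacity it already has
```

In this toy setup the slower fleet needs five more servers, none of which can arrive inside a 60 second spike, while the faster fleet needs none. That gap is the reservoir.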

Theo gives four rules for handling spikes: Be alert, Be prepared, Perform triage, and Be calm. Please see his site for more discussion of these rules.

A few things that might help:
  • Create fast booting VMs. It's easy to create VMs that boot glacially (intentional irony). The more you leave to run-time like software downloads and configuration, the slower your VMs boot and the slower you can react to spikes.
  • Cloud vendors offer a service to maintain an image cache. It would be useful if a service were offered that could guarantee faster provisioning of VMs and quicker download of images.
  • Would an in-cloud service to offer stem cell VMs make sense? This is a VM that could quickly become any one of a number of different images on demand. So a service could keep a reservoir of stem cell VMs up and running, shared by a number of customers, and an application could request the low latency spin up of one of the reserved VMs.
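As a thought experiment, the stem cell idea in the last bullet could look something like the toy simulation below. Nothing here is a real cloud API. The class, the role names, and both timing constants are assumptions chosen only to show the shape of the trade-off between specializing a warm generic VM and cold-booting a new one.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed timings, purely illustrative.
COLD_BOOT_SECONDS = 90.0   # provision and boot a brand-new VM
SPECIALIZE_SECONDS = 5.0   # turn an already-running generic VM into a given role


@dataclass
class VM:
    vm_id: int
    role: Optional[str] = None  # None means still an unspecialized "stem cell"


class StemCellPool:
    """Hypothetical reservoir of pre-booted generic VMs."""

    def __init__(self, warm_count: int) -> None:
        self._next_id = 0
        self._warm = deque(self._new_vm() for _ in range(warm_count))

    def _new_vm(self) -> VM:
        vm = VM(self._next_id)
        self._next_id += 1
        return vm

    def warm_remaining(self) -> int:
        return len(self._warm)

    def acquire(self, role: str) -> Tuple[VM, float]:
        """Return a VM set up for `role` plus the simulated latency in seconds."""
        if self._warm:
            # Fast path: specialize a VM that is already running.
            vm = self._warm.popleft()
            latency = SPECIALIZE_SECONDS
        else:
            # Reservoir is dry: fall back to a full cold boot.
            vm = self._new_vm()
            latency = COLD_BOOT_SECONDS + SPECIALIZE_SECONDS
        vm.role = role
        return vm, latency


pool = StemCellPool(warm_count=2)
latencies = [pool.acquire(role)[1] for role in ("web", "web", "cache")]
print(latencies)
```

The first two requests are served from the warm pool and the third pays the full boot cost, which is why the size of the shared reservoir would matter so much to a service like this.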

The idea that internet traffic patterns have evolved such that even our cloud architectures can't easily cope is an interesting one. I find it ironic that many of the techniques needed to build real-time systems are helpful in handling this new world too, when at first glance the problems look nothing alike. Sometimes piling on more resources isn't enough; efficiency matters too.

Reader Comments (5)

    Typically only small(ish) sites need to worry about unexpected peaks of 1000%, or even 100%. Once your site is large enough it becomes almost statistically impossible for traffic to double even day-over-day, not to mention in 60 seconds. Topical sites (à la superbowl.com) are somewhat of a special case, although arguably they can be considered small sites except at peak periods.

    November 29, 1990 | Unregistered Commenter addy

    addy: Agreed. Just because only small-ish sites have to worry about it doesn't make it any less significant. Excellent post by Theo.

    November 29, 1990 | Unregistered Commenter Anonymous

    Thanks for the reference and comments. One of the things that inspired me to write the post is how we've seen a growing trend of this on "non-small" sites. Sites with 10 million+ regular users. I agree that it is easy to see a phenomenal spike on small sites (as the base is so low). The really surprising thing is seeing this on sites with already established and significant traffic patterns.

    Again, thanks for the commentary.

    November 29, 1990 | Unregistered Commenter Theo Schlossnagle

    One of the clear benefits of running on shared hardware is that you can level demand with the other sites you're sharing hardware with. If you are able to share hardware with other sites or services which have a non-correlated usage pattern to your own, then when you get a sudden spike in traffic, chances are the other sites will be running at average load, and you'll be able to make use of the spare CPU cycles.

    And no, it's not just the small sites that have to worry about sudden spikes in load. Take, for example, news sites like the BBC on 9/11 (http://news.bbc.co.uk/2/hi/science/nature/1540441.stm). They normally handle a good amount of load and have plenty of spare capacity, but the sudden increase in traffic brought their site down. If the BBC shared hardware with, say, Expedia.com, then the decrease in traffic on the one site would allow for more processing on the other.

    With services like EC2, the service provider can run clever algorithms to put highly non-correlated VMs on the same hardware.

    If you have an architecture where you can spin up and spin down VMs as needed, I can't imagine the 60-second boot-up being a limiting factor. If, for example, your site lands on the front page of Digg.com, how long do you have before you start receiving a mountain of traffic?

    November 29, 1990 | Unregistered Commenter Brian Egge

    This could be more properly named as equilibrium for website

    November 29, 1990 | Unregistered Commenter farhaj
