Example

FaceStat's Rousing Tale of Scaling Woe and Wisdom Won

Lukas Biewald shares a fascinating slam by slam recount of how his FaceStat (upload your picture and be judged by the masses) site was battered by a link on Yahoo's main page that caused an almost instantaneous 650,000 page view jump on their site. Yahoo spends considerable effort making sure its own properties can handle the truly massive flow from the main page. Turning the Great Eye of the Internet towards an unsuspecting newborn site must be quite the diaper ready experience. Theo Schlossnagle eerily prophesized about such events in The Implications of Punctuated Scalabilium for Website Architecture: massive, unexpected and sudden traffic spikes will become more common as a fickle internet seeks ever for new entertainments (my summary). Exactly FaceStat's situation.

This is also one of our first exposures to an application written on Merb, a popular Ruby on Rails competitor. For those who think Ruby is the problem, their architecture now serves 100 times the original load.

How did our fine FaceStat fellowship fair against Yahoo’s onslaught?

Not a lot of details of FaceStat’s architecture are available, so it’s not that kind of post. What interested me is that it’s a timely example of Theo’s traffic spike phenomena and I was also taken with how well the team handled the challenge. Few could do better so quickly.

In fact, let’s apply Theo’s rubric for how to handle these situations to FaceStat:

Be Alert: build automated systems to detect and pinpoint the cause of these issues quickly (in less than 60 seconds). None initially, but they are building in more monitoring to better handle future situations. Better monitoring would have alerted them to the problems long before they actually were alerted. Perhaps many more potential customers might have been converted to actual customers. You can never have enough monitoring!

Be Prepared: understand the bottlenecks of your service systemically. As the system was relatively simple, new, and quickly changed, my assumption is they were fully aware of their system’s shortcomings, they were just busy with adding features rather than worrying about performance and scalability.

Perform Triage: understand the importance of the various services that make up your site. Definitely. They “started ripping out every database intensive feature” in response to the load.

Be Calm: any action that is not analytically driven is a waste of time and energy. They stayed amazingly calm as can be seen from the following quote: “It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two.” I’m not sure how analytically driven they were however 

All-in-all an impressive response to the Great Eye’s undivided attention.

But not everyone was impressed as I. A commenter named Bernard said: Sorry, but this is a really dumb story. Given how dirt cheap things like slicehost and linode are, it is crazy that you launched a web app and had not already prepared a redundant, highly-scalable architecture… I’d say you were damn lucky that the disappointed users came back at all.

Commenter Will thought it was a “Nice problem to be having!” Which it is, of course, being noticed is better than being ignored. But Lukas was spot on when he lamented about being noticed too soon has a downside: After working so hard to get users to come to your site, it’s amazingly frustrating to see hundreds of thousands of people suddenly locked out.

Clearly we still don’t have the ability for developers to create scalable systems as simply as they create exploratory systems. Ed from Rackspace posted that they could help with their Auto Scale of Arrays feature. And Rackspace would be an excellent solution, but the cost would be $500/month and a $2500 setup fee. No “let’s put on a show” startup can afford those costs. The mode FaceStat was in is typical: We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became.

A pay as you grow model is essential for scalability because that’s the only way you can bake scalability in from the start. And even with all the impressive advances in the industry we still don’t have the software infrastructure to make scaling second nature.

Information Sources

Scaling Fast by Lukas Biewald

FaceStat scales! on Dlores BLog

Platform

Merb. Ruby based MVC framework that is ORM-agnostic.

Thin. A fast and very simple Ruby web server.

Slicehost. Hosting service. Able to quickly provision servers as needed.

Amazon’s S3. Image server. Latency is high but it handles the load.

Capistrano. Automated deployment.

Git with github. Source code control system. Supports efficient simultaneous development, quick merging and deployment.

God. Server monitoring and management.

Memcached. Application caching layer.

PostgreSQL

The Stats

Six app servers.

One big database machine.

The Architecture

FaceStat is a write heavy application and performs involved calculations on data.

S3 is used to offload the responsibility for storing images. This freed them from the massive bandwidth requirements and complexity of managing their own images.

Memcached offloads reads from the database to allow the database to have more time for writes.

Lessons Learned

Monitor the site. The sooner you know about a problem the faster it can be fixed. Don't rely on user email or email from exception handlers or you'll never get ahead of problems.

Communicate with your users with an error page. A meaningful error pages shows you care and that you are working on the problem. That's enough for a second chance with most people.

Use a cached statically generated homepage. Hard to beat that for performance.

Big sites might want to give a heads up when they mention smaller sites. Just a short polite email saying how your world will soon turn upside down would do.

High-level platform really doesn’t matter compared to overall architecture. How you handle writes, reads, caching, deployment, monitoring, etc are relatively framework independent and it's how you solve those problems that matter.

Ruby and Merb supported rapid prototyping to experiment and create a radically different system form the one they intended.