
Ask HS: What's Wrong with Twitter, Why Isn't One Machine Enough?

Can anyone convincingly explain why properties sporting traffic statistics that seem in line with the capabilities of a single big-iron machine need so many machines in their architecture?

This is a common reaction to architecture profiles on High Scalability: I could do all that on a few machines so they must be doing something really stupid. 

Lo and behold, the same reaction greeted the article The Architecture Twitter Uses to Deal with 150M Active Users. On Hacker News, papsosouid voiced what a lot of people may have been thinking:

I really question the current trend of creating big, complex, fragile architectures to "be able to scale". These numbers are a great example of why, the entire thing could run on a single server, in a very straight forward setup. When you are creating a cluster for scalability, and it has less CPU, RAM and IO than a single server, what are you gaining? They are only doing 6k writes a second for crying out loud.

This is a surprisingly hard reaction to counter convincingly, but nostrademons gave a great three-part response:

They create big, complex, fragile architectures because they started with simple, off-the-shelf architectures that completely fell over at scale.


I dunno how long you've been on HN, but around 2007-2008 there were a bunch of HighScalability articles about Twitter's architecture. Back then it was a pretty standard Rails app where when a Tweet came in, it would do an insert into a (replicated) MySQL database, then at read time it would look up your followers (which I think was cached in memcached) and issue a SELECT for each of their recent tweets (possibly also with some caching). Twitter was down about half the time with the Fail Whale, and there was a continuous stream of armchair architects asking "Why can't they just do this simple solution and fix it?" The simple solution most often proposed was write-time fanout, basically what this article describes.

Do the math on what a single-server Twitter would require. 150M active users * 800 tweets saved/user * 300 bytes for a tweet = 36T of tweet data. Then you have 300K QPS for timelines, and let's estimate the average user follows 100 people. Say that you represent a user as a pointer to their tweet queue. So when a pageview comes in, you do 100 random-access reads. It's 100 ns per read, you're doing 300K * 100 = 30M reads, and so already you're falling behind by a factor of 3:1. And that's without any computation spent on business logic, generating HTML, sending SMSes, pushing to the firehose, archiving tweets, preventing DOSses, logging, mixing in sponsored tweets, or any of the other activities that Twitter does.
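nostrademons' arithmetic is easy to verify. Here is the same back-of-the-envelope calculation as a sketch; every figure is his estimate from the paragraph above, not a measured value:

```python
# Back-of-envelope check of the single-server Twitter numbers above.
active_users = 150_000_000
tweets_per_timeline = 800
bytes_per_tweet = 300

# 150M users * 800 tweets * 300 bytes = 36 TB of timeline data
tweet_data_bytes = active_users * tweets_per_timeline * bytes_per_tweet
print(tweet_data_bytes / 1e12)  # 36.0 (TB)

timeline_qps = 300_000
follows_per_user = 100      # generous average, per the comment
read_latency_s = 100e-9     # one random-access memory read

reads_per_second = timeline_qps * follows_per_user  # 30M reads needed/s
work_per_second = reads_per_second * read_latency_s
print(round(work_per_second, 3))  # 3.0 -> falling behind 3:1
```

Three seconds of memory reads needed per wall-clock second, before any business logic, is the whole argument in one number.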

(BTW, estimation interview questions like "How many gas stations are in the U.S?" are routinely mocked on HN, but this comment is a great example why they're important. I just spent 15 minutes taking some numbers from an article and then making reasonable-but-generous estimates of numbers I don't know, to show that a proposed architectural solution won't work. That's opposed to maybe 15 man-months building it. That sort of problem shows up all the time in actual software engineering.)

And the thread goes on with a lot of enlightening details. (Just as an aside, in an interview the question "How many gas stations are in the US?" is worse than useless. But if someone asked for a Twitter back-of-the-napkin analysis like the one nostrademons produced, now we'd be getting somewhere.)

Do you have an answer? Are these kinds of architectures evidence of incompetence, or is there a method to the madness?

Reader Comments (12)

There are a few reasons why more machines are better. Google is probably a better example.

Consider two companies, one that buys a $12000 server that is ultra-reliable, with swappable RAM/CPUs/storage. It does x mFlops. Google purchases six $2000 machines without swappable storage that each do 0.6x mFlops. Google has 3.6x the CPU capability if all machines are running, but only requires two of its six to be running to have similar capacity. While all machines are running, they can do 'expensive' tasks that other companies cannot do. And you can't just run on one server and hope for reliability. In this case, Google's 'plan to fail' architecture hands them CPU time that other entities don't have.
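The commenter's cost/capacity trade-off, as arithmetic (the prices and mFlops figures are his illustrative numbers, not real hardware pricing):

```python
import math

# One $12000 big-iron server doing x mFlops, versus six $2000 boxes
# doing 0.6x each.
big_iron_cost, big_iron_flops = 12_000, 1.0
cheap_cost, cheap_flops, cheap_count = 2_000, 0.6, 6

cluster_cost = cheap_cost * cheap_count    # same $12000 budget
cluster_flops = cheap_flops * cheap_count  # 3.6x capacity when all are up

# Only two of the six cheap machines need to survive to match big iron:
machines_needed = math.ceil(big_iron_flops / cheap_flops)
print(cluster_cost, round(cluster_flops, 2), machines_needed)  # 12000 3.6 2
```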

In order to deal with IO saturation and maintain prompt delivery of tweets, one must fan out those writes widely while minimizing depth. Consider how deep a binary tree is to represent a large set of data compared to an Nary(32) tree. If each machine has 32 machines below it that it needs to update, the request fans out much more quickly. When dealing with the scale that Twitter must handle, that fanout requires a large number of machines.
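The depth difference the commenter describes is just a change of logarithm base. A small sketch, with `depth` as a hypothetical helper counting the hops a fanout tree needs to reach n machines:

```python
import math

# Levels a fanout tree needs so one root reaches n machines:
# binary fanout versus 32-way fanout, as in the comment above.
def depth(n: int, fanout: int) -> int:
    return math.ceil(math.log(n, fanout))

n = 1_000_000
print(depth(n, 2))   # 20 hops with a binary tree
print(depth(n, 32))  # 4 hops with 32-way fanout
```

Each extra level is another network round trip before the last follower sees the tweet, which is why wide fanout wins at Twitter's scale.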

To my knowledge based on articles posted here and around the net, Twitter was always RAM based and used MySQL only for point in time backup/archival.

The difference is that Google and other entities plan and expect hardware to fail while some people design hoping nothing fails.

The gas station question can be valid. What would your analysis be to consider how to come up with the number?

Take the population of a reasonably sized city, count the gas stations in that city, and figure out how many people there are per gas station. Then take the population of the USA, divide by the number of people each gas station serves, and you have your number.

The city I live in has a population of about 57,000 people. We have 17 gas stations, giving us roughly 3,350 people per gas station. Take the population of the United States at 310 million, divide by 3,350, and we come up with 92,537. A quick Google search shows that there were 121,446 gas stations in the USA in 2012.
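The estimate above, as a short sketch (the inputs are the commenter's numbers; integer division lands slightly off his rounded 3,350, but the conclusion is the same):

```python
# Sample city: 57,000 people, 17 stations (the commenter's figures).
city_population = 57_000
city_stations = 17
people_per_station = city_population // city_stations  # 3352, ~3350

us_population = 310_000_000
estimate = us_population // people_per_station
print(estimate)  # 92482 -- same ballpark as the real 121,446
```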

While the answer isn't completely accurate based on my sample size, it is the analysis in coming up with an answer that recruiters are looking for. How do you break down that problem, how do you figure out what the magnitude of the problem is that you're solving? You break it into pieces and solve each piece.

July 15, 2013 | Unregistered Commentercd34

The response is broken because it combines the worst of both models: it assumes write-fanout AND read-fanout. Also, you don't need to store 300 bytes per fan-out write, just 8 bytes for the tweet id.

Hyperbolic, not useful.
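For scale, tobi's 8-bytes-per-id point can be quantified against the article's numbers (assuming 64-bit tweet ids; a sketch, not Twitter's actual layout):

```python
# 800-entry materialized timelines: full 300-byte tweets versus
# 8-byte (64-bit) tweet ids only.
active_users = 150_000_000
timeline_len = 800

full_tweets = active_users * timeline_len * 300  # 36 TB
ids_only = active_users * timeline_len * 8       # 0.96 TB
print(full_tweets / 1e12, ids_only / 1e12)       # 36.0 0.96
```

Storing ids and resolving tweet bodies from a shared store cuts the fanout storage by roughly 37x, at the cost of an extra lookup per timeline entry.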

July 15, 2013 | Unregistered Commentertobi

^^ The solution only proposes read-fanout.

150M active users * 800 tweets saved/user * 300 bytes for a tweet = 36T of tweet data

Each tweet saved once, a SELECT for each follower to build the timeline. As we know, this is not what Twitter does. But the proposition does not take the worst of both, as you say.

July 15, 2013 | Unregistered Commenterme

The 800 number is for the combined/materialized timeline. Those are tweets from other users.

You don't need to have 800 of your own tweets in fast storage. Maybe you need 50; the rest can go into some archival cluster, because it has low read-throughput requirements. That way you get to save all tweets, not just 800.

Also, it's probably more like 150 bytes per tweet if you apply data compression.
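Putting tobi's two points together, a rough sketch of the hot-storage footprint (50 recent tweets per user and 150 compressed bytes per tweet are his assumed figures, not measured ones):

```python
# 50 hot tweets per user at 150 compressed bytes each.
active_users = 150_000_000
hot_tweets_per_user = 50
bytes_per_tweet = 150

hot_storage = active_users * hot_tweets_per_user * bytes_per_tweet
print(hot_storage / 1e12)  # 1.125 TB -- the hot set fits on a few boxes
```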

July 15, 2013 | Unregistered Commentertobi

I want to add one thing: I'm not thinking of MySQL here. I'd develop a small, custom C++ server that has custom data structures. Squeezing out bytes is a major win here.

This gets complex, but it would be less complex, more powerful, and cheaper than what they have right now.

July 15, 2013 | Unregistered Commentertobi

I think it is all about balance. Every large website has to scale out in one way or another. It's just that some have taken scaling out to an extreme, and that is the real question.

SSDs are getting both faster and cheaper; there are mobile PCI-E SSDs this year running at 1.4 GB/s. IOPS are also getting higher, as is IOPS consistency. While memory prices have gone up in recent months, RAM is still very affordable.

Combining in-memory caching and SSD IOPS, the bottleneck is now less of a problem. Unless you are at Google or Facebook scale, most should really hit the sweet spot of scaling up and out rather than simply relying on wimpy nodes and scale-out alone.

July 15, 2013 | Unregistered CommenterEd


It gets complex with replication and sharding.

July 16, 2013 | Unregistered Commenterriksi

I would also point out other more practical aspects of having off-the-shelf hardware, such as:
- Having a big-ass $1M computer may yield the same performance, but you end up with vendor lock-in, where you effectively can't shop around that much to get the hardware costs down.
- Similarly, consider your upgrade path. If you need ten percent more performance than the hardware is physically capable of giving you, you need to spend $2M for a double-big-ass computer. Or you need to buy another one, in which case, congratulations, you ended up with a complex distributed system -- if not now, then the next iteration, or the one after that.

July 16, 2013 | Unregistered CommenterFooBar

I think the answer to this blog's question is simple: poor software. We have machines with 1 TB of RAM and 32-1600 CPUs, jam-packed with SSDs and petabytes of storage - yet most of the time all I see are machines sitting idle or doing pointless busy work, like random-access reads to TBs of XML files.

Listening to Martin Thompson, I was blown away by what can be done with systems running in complete "Mechanical Sympathy", where the performance of the system matches what the hardware can deliver. Instead of being engineers and designing systems based on solid performance data, we more often than not base programming approaches on the current fashion (e.g. Ruby). We then take our fashionable software and fashionably spread it across lots of machines. A great career move if that is your job. It is perhaps less impressive to show high performance off a Raspberry Pi rather than racks of computers.

What we are not seeing is analysis of how an application was first optimised for the hardware, with the additional complexity of large clusters considered and introduced only then.

July 17, 2013 | Unregistered CommenterPhilipH

Philip: While at face value that might seem true, in practice you should bear in mind the cost analysis. How much time would you spend optimizing your application for the given hardware? At least the most significant hotspots?

And, extremely importantly, how much will it mean you need to stick with that particular piece of code, and that particular platform, because any change of either would require further time-intensive fine-tuning? The answer to this question is not generic -- embedded devices in robots or satellites will have a wildly different answer than a website that can turn into just another Internet fad.

July 17, 2013 | Unregistered CommenterFooBar

Hi FooBar - Agree that over-optimising for a specific piece of hardware can often be a folly. However, my argument is predominantly about commodity servers, where a level of target stability exists.

My feeling is that the required design changes aren't just a few tweaks, but rather a systematic architectural redesign to begin to fully utilise modern hardware - predominantly (imho) using more streaming rather than request-reply style methods.

July 20, 2013 | Unregistered CommenterPhilipH

As well, consider this common scenario: a programmer releases software that has a slow memory leak.

In the small-footprint server, the leak consumes all memory rather quickly. A couple of rounds of that and, as the (ops) dude would say, 'this aggression will not stand!'

But in the 2TB-RAM world, that mellow, festering stink lingers until Christmas vacation, when only Jr. is on pager.

July 25, 2013 | Unregistered Commenterswapoff
