Hot New Trend: Linking Clouds Through Cheap IP VPNs Instead of Private Lines 

You might think major Internet companies have a latency, availability, and bandwidth advantage because they can afford expensive dedicated point-to-point private line networks between their data centers. And you would be right. It's a great advantage. Or at least it was a great advantage. Cost is the great equalizer, and companies are now scrambling for ways to cut costs. Many of the most recognizable Internet companies are moving to IP VPNs (Virtual Private Networks) as a much cheaper alternative to private lines. This is a strategy you can effectively use too.

This trend has historical precedent in the data center. In the same way leading edge companies moved early to virtualize their data centers, leading edge companies are now virtualizing their networks, using IP VPNs to build inexpensive private networks over a shared public network. In kindergarten we learned sharing was polite; it turns out sharing can also save a lot of money in both the data center and on the network.

The line of reasoning for adopting IP VPNs goes something like this:

  • Major companies are saving 1/4 to 1/2 of their networking costs by moving from private lines to IP VPNs. This does not even include the benefits of lower equipment costs (GigE ports are basically free) and more flexible provisioning (any-to-any connectivity, easy bandwidth dialup).
  • Cheaper comes with a cost. Private lines are reliable. The Internet is inherently unreliable, especially when two endpoints are linked by potentially dozens of routers in between. In particular Internet connections suffer from: 1) dropped packets 2) out of order packets. Statistically this may happen for only 1% of packets, but when it does the user experience plummets. To get a feel for the impact imagine you have a 200ms latency link to Europe and you're trying to do something interactive. Lose a packet and you'll have to wait for a retransmission which will take at least 1 second. So IP VPNs can provide an order of magnitude more bandwidth for less money, but they often have less actual throughput and reliability.
  • Since latency and quality are so important to Internet companies, how can they possibly afford to use IP VPNs? They cheat. They fix the IP connection by using WAN accelerators.
  • WAN accelerators are typically thought to be mostly about caching, but they can also trick TCP into giving a better connection even over unreliable networks. It's like wearing corrective lenses for your network. And that's what you need when dropping dedicated lines for Internet connections.
  • Relatively inexpensive WAN accelerators can turn somewhat unreliable Internet connections into very reliable, cost-effective connection options. Your customers won't believe it's not butter.
  • The result: lots of money saved and a quality customer experience.

    We take TCP for granted, so learning that it has this unsightly packet loss/delay problem is a bit unsettling. But here's the impact packet loss has on throughput:
  • Latency: 100ms, Loss: 1%, Throughput: 1.2 Mbps
  • Latency: 200ms, Loss: 1%, Throughput: 0.6 Mbps
  • Latency: 100ms, Loss: 0.5%, Throughput: 1.7 Mbps

    These numbers are independent of your WAN link capacity. You could have a 100Mbps link with 1% loss and 100ms latency and you'd still be limited to about 1 Mbps!
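
    A back-of-the-envelope way to check these numbers is the well-known Mathis approximation: TCP throughput is roughly MSS / (RTT * sqrt(loss)). Here's a minimal sketch, assuming a standard 1460-byte MSS, that reproduces the figures above:

    // Rough TCP throughput using the Mathis approximation:
    //   throughput (bps) <= MSS / (RTT * sqrt(loss))
    // The 1460-byte MSS is an assumption (typical Ethernet payload); real links vary.
    public class TcpThroughputEstimate {
        static double throughputMbps(double rttSeconds, double lossRate) {
            double mssBits = 1460 * 8;                          // segment size in bits
            double bps = mssBits / (rttSeconds * Math.sqrt(lossRate));
            return bps / 1_000_000;                             // convert to Mbps
        }

        public static void main(String[] args) {
            System.out.printf("100ms, 1%% loss   -> %.1f Mbps%n", throughputMbps(0.100, 0.01));  // ~1.2
            System.out.printf("200ms, 1%% loss   -> %.1f Mbps%n", throughputMbps(0.200, 0.01));  // ~0.6
            System.out.printf("100ms, 0.5%% loss -> %.1f Mbps%n", throughputMbps(0.100, 0.005)); // ~1.7
        }
    }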

    We have this bandwidth-robbing state of affairs because when TCP was designed, packet loss meant network congestion, and TCP's answer to congestion is to back off and send less data. That drops throughput drastically for a long time. Over long-distance WAN connections a delayed packet can look like a lost one, which kicks in congestion avoidance; a single genuinely dropped packet kicks it in too.

    The trick is convincing TCP that everything is cool so the full connection bandwidth can be used. WAN accelerators have a number of complex features to keep TCP happy. Damon Ennis, VP Product Management and Customer Support for Silver Peak, a WAN accelerator company, talks about why clouds, IP VPNs, and WAN accelerators are a perfect match:
    Moving applications into the cloud offers substantial cost savings for enterprises. Unfortunately those savings come at the cost of application performance. Often performance is so hampered that users’ productivity is severely limited. In extreme cases, users refuse to utilize the cloud-based application altogether and resort to old habits like saving files locally without centralized backup or returning to their old “thick” applications.

    The cloud limits performance because the applications must be accessed over the WAN. WANs are different from LANs in three ways – WAN bandwidth is a fraction of LAN bandwidth, WAN latency is orders of magnitude higher than LAN latency, and packet loss exists on the WAN where none existed on the LAN. Most IT professionals are familiar with the impacts of bandwidth on transfer times – a 100MB file takes approximately 1 second to transfer on a Gbps LAN and approximately 10 seconds to transfer on a 100Mbps LAN. They then extrapolate this thinking to the WAN and assume that it will take 10 seconds to transfer the same file on a 100Mbps WAN. Unfortunately, this isn’t the case. Introduce 100ms of latency and this transfer now takes almost 3 minutes. Introduce just 1 % packet loss and this transfer now takes over 10 minutes.

    There’s a calculator available that will let you figure out the effective throughput of your own WAN if you know its bandwidth, latency, and loss. Once you know your effective throughput simply divide 800Mb (100MB) by your effective throughput to determine how long it would take to transfer the same example file over your WAN.

    Latency and loss don’t just impact file transfer times, they also have a dramatic impact on any applications that need to be accessed in real-time over the WAN. In this context a real-time application is one that requires real-time response to users’ keystrokes – think of any application that is served over a thin-client infrastructure or Virtual Desktop Infrastructure (VDI). Not only is the server 100 ms away but any lost packet will result in delays of up to half a second waiting for the loss to be detected and the retransmission to occur. This is the root cause of the frustrated user banging on the enter key looking for a response.
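
    To see where those "almost 3 minutes" and "over 10 minutes" figures come from, divide the 800 megabits in a 100MB file by the effective throughput. Here's a minimal sketch, assuming a default 64KB TCP window for the latency-only case and the roughly 1.2 Mbps loss-limited estimate from the earlier sketch:

    // Transfer time for a 100MB (800 Mb) file at various effective throughputs.
    public class TransferTimeEstimate {
        public static void main(String[] args) {
            double fileMegabits = 800;                          // a 100MB file is 800 Mb

            double secondsGigLan = fileMegabits / 1000;         // ~0.8 s on a 1 Gbps LAN
            double seconds100Lan = fileMegabits / 100;          // ~8 s on a 100 Mbps LAN

            // With 100ms of latency and no loss, an assumed default 64KB window caps
            // throughput at window / RTT, regardless of the 100 Mbps link speed.
            double windowLimitedMbps = (64 * 1024 * 8) / 0.100 / 1_000_000;
            double secondsWanNoLoss = fileMegabits / windowLimitedMbps;   // ~153 s, "almost 3 minutes"

            // With 1% loss the ~1.2 Mbps Mathis estimate is the cap.
            double secondsWanLoss = fileMegabits / 1.2;                   // ~667 s, "over 10 minutes"

            System.out.printf("1 Gbps LAN: %.1f s%n", secondsGigLan);
            System.out.printf("100 Mbps LAN: %.1f s%n", seconds100Lan);
            System.out.printf("100 Mbps WAN, 100ms RTT, no loss: %.0f s%n", secondsWanNoLoss);
            System.out.printf("100 Mbps WAN, 100ms RTT, 1%% loss: %.0f s%n", secondsWanLoss);
        }
    }
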
    This all seems like a lot of effort, doesn't it? Why not just dump TCP and move to a better protocol? Sounds good, but everything runs on TCP, so changing now would be monumental. And as strange as it seems, TCP is doing its job. It's a protocol, so there's a separation between what's above it and what's below it, which allows innovation at the TCP level without breaking anything. And that's what layering is all about.

    The upshot is with a little planning you can take advantage of much cheaper IP VPN costs, improve latency, and maximize bandwidth usage. Just like the big guys.

    Related Articles

  • Cloud Computing Requires Infrastructure 2.0 by Gregory Ness
  • Myth of Bandwidth and Application Performance by Ameet Dhillon
  • How Does WAN Optimization Work? by Paul Rubens
  • SilverPeak Technology Overview

    Describes How They Use Amazon EC2 and MapReduce

    This slide show presents the experience of one of the biggest dating sites out there in using Amazon EC2 and MapReduce to scale their service.

    Go to the Slideshare presentation


    Google App Engine plus Amazon AWS: Best of both worlds

    Google App Engine (GAE) is focused on making development easy, but limits your options. Amazon Web Services is focused on making development flexible, but complicates the development process. Real enterprise applications require both of these paradigms to achieve success… What we really want is the flexibility of AWS and the simplicity of GAE.

    For the rest of the post see


    HighScalability Rated #3 Blog for Developers

    Hey we're moving up in the world, jumping from 19th place to 3rd place. In case you aren't sure what I'm talking about, Jurgen Appelo goes through this massive effort of ranking blogs according to Google PageRank, Technorati Authority, Alexa Rank, Google links, and Twitter Grader Rank.

    Through some obviously mistaken calculations HighScalability comes out #3. Given all the superb competition I'm not exactly sure how that can be. Well, thanks for all the excellent people who contribute and all the even more excellent people that read. Now at least I have something worthy to put on my tombstone :-)


    How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book

    Update 2: Velocity 09: John Allspaw, 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. Insightful talk. Some highlights: Change is good if you can build tools and culture to lower the risk of change. Operations and developers need to become of one mind and respect each other. An automated infrastructure is the one tool you need most. Common source control. One step build. One step deploy. Don't be a pussy, deploy. Always ship trunk. Feature flags - don't branch code, make features runtime configurable in code. Dark launch - release data paths early without UI component. Shared metrics. Adaptive feedback to prioritize important features. IRC for communication for human context. Best solutions occur when dev and op work together and trust each other. Trust is earned by helping each other solve their problems. Look at what new features imply for operations, what can go wrong, and how to recover. Provide knobs and levers to help operations. Devs should have access to production machines. Fire drills to train. No finger pointing - fix stuff first. Design like you'll get woken up first when there's a problem. Say you're sorry. Not easy - like any relationship.
    Update: Operational Efficiency Hacks Web20 Expo2009 by John Allspaw. 131 picture perfect slides on operations porn. If you're interested in that kind of thing.

    Dream with me a little bit. Your startup becomes wildly successful. Hard work and random chance have smiled on you. To keep flirting with lady luck your system must scale. But how much stuff (space, hardware, software, etc) will you need to handle the growth, when will you need it and when will you need more?

    That's what Flickr's John Allspaw helps you figure out in his ground breaking new book on capacity planning: The Art of Capacity Planning: Scaling Web Resources.

    When I read statements about The Art of Capacity Planning like "capacity planning is a term that to me means paying attention," "all the information you need to make an educated forecast is in your historical metrics," and "startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach," I get the same sea change feeling I felt when the industry ran from waterfall design and embraced agile design. Current capacity planning is heavy. All up-front. Too analytical and too divorced from real life.

    Other capacity planning books assault you with models, math, and simulations. Who has the time? John has developed a common sense, low math approach to capacity planning that works using the system you already have. John's goal is to have you say: Oh, right, duh. That's common sense, not voodoo.

    Here's my email interview with John Allspaw on The Art of Capacity Planning. Enjoy.

    Please tell us who you are and what you've brought to show and tell today?

    I'm John Allspaw. I manage the Operations team at Flickr, and I've written a book (The Art of Capacity Planning: Scaling Web Resources) about capacity planning for growing websites.

    After spending a good chunk of your life writing this book, can you summarize it in just a few sentences so people will know why it should matter to them?

    This book is basically a guide to adaptive capacity planning for growing websites. It's an approach that relies much less on benchmarking and simulation, than on the close observation of production loads to guide future decisions. It's not rocket science, and I'm hoping people can use it to justify the what, why, and when of getting more resources to allow them to grow as fast as they need to. It's worked really well for me at Flickr and other organizations.

    Give me your ripple of evil. What happens without capacity planning?

    Capacity planning is a term that to me means paying attention. Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis. Things like: my database can do X queries per second before it keels over. Or my cache can only keep Y minutes worth of changing objects. You're not going to predict every failure mode of the whole system, but knowing the failure modes of individual pieces should be considered mandatory. Armed with that, you can make decent forecasts about the future.

    I'm a guy or gal at a startup who's freaking out because my boss asked me how much hardware we need for the next quarter/year? What do I do now?

    Buy my book? ☺ All the information you need to make an educated forecast is in your historical metrics. You do have system and application-level statistics, right? Tying your system level stats (CPU, memory, network, storage, etc.) to application-level metrics (users, posts, photos, videos, page views, widgets sold, etc.) is key, because then you have history to back up the guesstimates. Your business, product or marketing team also has their own guesses for some of those application-level metrics, so you should get the two forecasts together and see where/how they match up or differ. Capacity should enable the business, not hinder it.

    So your book, is it the best book on Capacity Planning or the greatest book on Capacity Planning?

    My book is the best book on capacity planning. If it were the 'greatest' book, it'd be a lot bigger than it is. It's good for scaling your hardware. It's not good for flattening out posters.

    My site is gonna explode and I don't know at what point it's going to die. What do I do now?

    Argh! Panic! Handle the current explosion, and make it priority one to find out the limits of your capacity when the emergency is over.

    Try to panic gracefully. If it's dying right now, and you can't easily add any more capacity, then try some of the tried and trusted WebOps 101 tricks mentioned everywhere, including this blog:
    - disable features (preferably the heavier load-causing ones)
    - cache previously dynamic content into static bits
    - avoid loaded backend calls by serving stale content
    Of course it's easier to do those things when you have easy config flags to turn things on or off, and a list to run through of what things are acceptable to serve stale and static. We currently have about 195 'features' we can turn off at Flickr in dire circumstances. And we've used those flags when we needed to.
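
    Those config flags are just runtime switches, not code branches. Here's a minimal sketch of the idea; the flag names and the properties file are hypothetical, not Flickr's actual implementation:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Minimal runtime feature-flag check: flip a property to shed load without
    // deploying code. The flag names and properties file are hypothetical.
    public class FeatureFlags {
        private final Properties flags = new Properties();

        public FeatureFlags(String path) throws IOException {
            try (FileInputStream in = new FileInputStream(path)) {
                flags.load(in);
            }
        }

        public boolean isEnabled(String feature) {
            return Boolean.parseBoolean(flags.getProperty(feature, "true"));
        }
    }

    // Usage at a call site: skip a heavy feature when it has been switched off.
    //   FeatureFlags flags = new FeatureFlags("/etc/app/features.properties");
    //   if (flags.isEnabled("related_photos")) {
    //       renderRelatedPhotos();     // heavy, load-causing feature
    //   } else {
    //       renderStaticFallback();    // cheap stale/static content instead
    //   }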

    Having said all of that, knowing when your resources are going to die should be mandatory, not optional. Know how many qps your databases, webservers, caching systems, and storage can handle before degradation. Test that stuff. With production traffic. If at all possible, with live production traffic, not just recorded and replayed production loads.

    How do you compare your approach with well known approaches like Guerrilla Capacity Planning? Isn't focusing on queuing theory, Little's Law, and Poisson arrival rates misguided? What will really help people on the ground?

    My approach is a bit different from Mr. Gunther's, although his book was an inspiration for mine. I do think that having a general understanding of queuing theory and the mathematics of open and closed systems can be important, don't get me wrong. But many of the startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach, and done right, I think that approach will serve them well. I think some people recognize this, but in my experience, I'd say the development timelines are even tighter than most people realize.

    Almost any time and effort in constructing and running a simulation, model, or benchmark that involves the main moving parts of a web site's back-end is pretty much wasted due to how quickly application logic, use cases, and even hardware configurations can change. For example, by the time I could construct a useful model to capture the webserver-database interactions for Flickr, the results won't resemble production load, since the development cycle we have is so tight. As I write this, in the last week there were 50 code deploys of 550 changes by 19 Flickr staff. I realize that's a lot more than a lot of organizations, but that rate of change requires that the capacity planning process is easily adjustable.

    I also found the existing books on the topic to be pretty dry with the math. The foundations of queuing theory are interesting, but a proof of Little's Law isn't going to quickly justify to my finance guy that we need 11 more webservers.

    I have a cloud, do I really need to capacity plan anymore? That's so old school. Can't I decrease my administrator-to-server ratio and reduce the cost of managing systems by removing capacity planning resources?

    Nope, just trust The Cloud™ that everything will be all right. No need to pay any attention at all. Oh wait, that's a magic unicorn talking. Cloud services are a resource just like any in-house capacity: they have limits that should be paid attention to. The best use cases of cloud computing both recognize the benefits (shrinking 'procurement' times, for example) and the limitations. And those limitations are going to vary from application to application, organization to organization.

    You can build a real brand around a name like "Guerrilla Capacity Planning." Don't you think you could have come up with a better name for your approach?

    Yeah, I guess I should have a name for it. How about "Paying Attention Capacity Planning"?

    You recommend testing the limits of your hardware and software on a production site. Are you crazy?

    It's very possible that I'm crazy. But using production traffic to define your resource ceilings in a controlled setting allows you to see firsthand what would happen when you run out of capacity in a particular resource. Of course I'm not suggesting that you run your site into the ground, but it's better to know what your real (not simulated) loads are while you're watching than to find out the hard way. In addition, a lot of unexpected systemic things can happen when load increases in a particular cluster or resource, and playing "find the butterfly effect" is a worthwhile exercise.

    Where does capacity planning sit in the stack? It seems to impact the entire application stack, but capacity planning is more of an operations thing and that may not have much impact in engineering.

    Capacity planning is a process, and should sit in the stack along with the other things that happen on a regular basis, like bug scrubs, like weekly meetings, etc. I'm spoiled as an Ops manager because we've got developers at Flickr who very much think like operations people. We're all addicted to graphs, and we're all pretty aware of how close each piece of the infrastructure is to becoming too 'hot', and we act accordingly. Planning, procurement, and measurement of capacity might lie on operations' shoulders, but intelligent use of it is everyone's business. I'm also blessed to have product management and customer care teams who are also aware of the state of capacity. As projects pop up that warrant capacity to be part of the considerations, it is. You simply can't launch things like FlickrVideo and the Yahoo Photos migration without capacity being part of the requirements, so we've got a good feedback loop going with respect to operations and capacity.

    Studies say 4/5 of people like a trip to the dentist more than they like capacity planning. Why does capacity planning have such a bad rep?

    Because no one wants to guess and be wrong? Because most of the cap planning literature out there is filled with boring math?

    Does the move to Web 2.0 make traditional approaches to capacity planning more difficult?

    I would say yes, but of course I'm biased. In many ways, use of the site is simply out of our hands. We gather metrics on as many aspects of the site as we can get our hands on, and add more all the time. Having an open API makes things interesting, because the use cases can vary wildly. Out of nowhere, a simple application can reveal an edge case that we hadn't foreseen, so we might have to adjust forecasts quickly. As to it being more difficult: I don't know. When I'm wrong, people can't see photos. If the guy at is wrong, then people can't get their money. What's more annoying? :)

    I really like your dashboard. Most dashboards say what happened, yours has an estimated days left before capacity runs out, which makes it actionable. How accurate have those numbers been?

    Well that particular dashboard is still being built here, but the general design is to capture those metrics on a regular basis, and to use it as a guide. The numbers for some clusters have been pretty trustworthy. But they do fluctuate in how accurate they are, since "days left" are obviously affected by things outside of the forecasting process. Sometimes, a new feature is planned that requires pounding the database shards, or bumps CPU on the webservers by a noticeable amount. It's because of these things that the dashboard is only a piece of the puzzle. Awareness and communications with product management and development have to inform planning decisions, as well as the historic metrics. Again: no crystal balls here.

    One of the equations you use is a "UR LIMITZ = Ceiling * Factor of Safety". Theo Schlossnagle has noted the evolution of phenomenal spikes in traffic, even for large sites. How do you have enough capacity as a safety factor if traffic is so spikey? What are the implications for forecasting?

    Start with the assumption that no one can accurately predict the future. Capacity planning isn't the only part of successful web operations, and not everything is a capacity issue. Theo nails what happens in the real-world, for sure. The answer is: you estimate the best you can with the history that you have. Obviously, one mistake is to make forecasts ignoring the spikes you've experienced in the past. Something important to consider is how expected (or not) your spikes are. If you sell things, you might have a seasonal holiday spike or plateau in traffic. If you're a content site that frequently gains traction with news-related sites (like Theo's example) and from time to time, you get massive unexpected spikes, then your factor of safety should include considerations for those spikes. But of course the flipside of that might mean that you have a boatload of servers doing nothing except wasting money while waiting for massive spikes that may or may not come. So it's a balance. Be reasonable with planning, follow the four guidelines that Theo points out in his post, and you've got yourself a strategy to deal with those unexpected spikes.
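
    Pulling the last two answers together, here is a minimal sketch of the arithmetic behind a "days left" number: take a measured ceiling, apply a factor of safety to leave headroom for spikes, and divide the remaining headroom by the observed growth rate. All the numbers and metric names are made up for illustration:

    // "UR LIMITZ = Ceiling * Factor of Safety", then extrapolate days left from
    // the observed growth rate. All numbers are illustrative, not Flickr's.
    public class DaysLeftForecast {
        public static void main(String[] args) {
            double measuredCeiling = 3000;   // qps the database handles before degrading (from production tests)
            double factorOfSafety = 0.75;    // keep 25% headroom for spikes and failures
            double effectiveLimit = measuredCeiling * factorOfSafety;   // 2250 qps usable

            double currentPeak = 1800;       // today's peak qps from historical metrics
            double dailyGrowth = 15;         // average growth in peak qps per day

            double daysLeft = (effectiveLimit - currentPeak) / dailyGrowth;
            System.out.printf("Effective limit: %.0f qps, days of capacity left: %.0f%n",
                    effectiveLimit, daysLeft);   // 30 days at this growth rate
        }
    }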

    At what point does it make sense for me to buy my own equipment?

    It's a difficult question. I've seen groups go from self-run colocation to managed hosting and cloud services, and others go the opposite way, for all legit reasons. Just as there are limitations with managed hosting, those limitations might not matter until you're massive. I do believe that there is a point at which owning your own equipment makes a lot of sense, because you've blown past the average needs of a managed hosting or cloud customer. Your TCO might be a lot different than mine, so to state that there's a single point for everyone would just be dumb.

    Quick fire round. What's your quick reaction to:

    1. Let's just go with a working prototype for now. We can change it when we grow big.

    Fine, for some definitions of "working", "change", and "big". I can't complain too much about that approach in some sense because that's how a lot of Flickr's backend evolved. But on the other hand, all you need is to fail a couple of times with prototypes to figure out what sort of homework needs to be done before launching something that is potentially explosive. So again, there's a balance. Don't be lazy, but don't be rigid and too fearful.

    2. Our VC told us that we're worrying about scalability too early. They don't want us to blow our scarce resources on preparing for success.

    Your VC is smart. If they're really smart, they'll also suggest worrying about scalability before it's too late. This answer isn't a cop-out, it's just reality. Realize that being scalable means being able to easily add capacity wherever you need it, whenever you need it. Buying too much equipment too soon is just as insane as hiring people that sit around and do nothing. The trick is knowing what "too much" and "too soon" is, and it's going to differ from company to company, from product to product.

    3. Premature optimization is the root of all evil.

    I've not made it a secret that I think this quote has given many an engineer (both dev and ops) reasoning to make dumb decisions. I actually agree with the idea behind the quote, but I also think that it's another one of those balancing acts. Knuth (or Hoare) also said "We should forget about small efficiencies, say about 97% of the time…" Another good question might be: which 3% are you going to pay attention to?

    4. We plan to throw hardware at the problem.

    Ok. When does that hardware need to come, how much of it, and what will you do when you realize more hardware won't fix a particular problem? The now classic limitations of the single-write-master, many-read-slaves database architecture are a great example of this. Experienced devs and ops people consider the possibility that hardware can't solve all problems.

    Note: these questions were adapted from How Important Is Being Scalable?.

    Now that you have some free time again, what are you going to do with your life?

    Eat. Sleep. Pay attention to my family. ☺

    Related Articles

  • Capacity Management for Web Operations by John Allspaw


    Google Voice Architecture

    Hi High Scalability community!

    Do you have any information on the architecture behind Google Voice, the new service by Google that offers one Google Number for all your calls and SMS? It is based on GrandCentral, which was acquired by Google two years ago.



    Scaling Twitter: Making Twitter 10000 Percent Faster

    Update 6: Some interesting changes from Twitter's Evan Weaver: everything in RAM now, database is a backup; peaks at 300 tweets/second; every tweet followed by average 126 people; vector cache of tweet IDs; row cache; fragment cache; page cache; keep separate caches; GC makes Ruby optimization resistant so went with Scala; Thrift and HTTP are used internally; 100s of internal requests for every external request; rewrote MQ but kept interface the same; 3 queues are used to load balance requests; extensive A/B testing for backwards compatibility; switched to C memcached client for speed; optimize critical path; faster to get the cached results from the network memory than recompute them locally.
    Update 5: Twitter on Scala. A Conversation with Steve Jenson, Alex Payne, and Robey Pointer by Bill Venners. A fascinating discussion of why Twitter moved to the Java JVM for their server infrastructure (long lived processes) and why they moved to Scala to program against it (high level language, static typing, functional). Ruby is used on the front-end but wasn't performant or reliable enough for the back-end.
    Update 4: Improving Running Components at Twitter by Evan Weaver. Tells how Twitter changed their infrastructure to go from handling 3 requests to 139 requests a second. They moved to a messaging model, asynchronous processing, 3 levels of cache, and moved their middleware to a mixture of C and Scala/JVM.
    Update 3: Upgrading Twitter without service disruptions by Gojko Adzic. Lots of good updates on the new Twitter architecture.
    Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
    Update: Twitter releases Starling - light-weight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.

    Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:

    For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.

    If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?

    Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz


    Information Sources

  • Scaling Twitter Video by Blaine Cook.
  • Scaling Twitter Slides
  • Good News blog post by Rick Denatale
  • Scaling Twitter blog post by Patrick Joyce.
  • Twitter API Traffic is 10x Twitter’s Site.
  • A Small Talk on Getting Big. Scaling a Rails App & all that Jazz - really cute dog pics

    The Platform

  • Ruby on Rails
  • Erlang
  • MySQL
  • Mongrel - hybrid Ruby/C HTTP server designed to be small, fast, and secure
  • Munin
  • Nagios
  • Google Analytics
  • AWStats - real-time logfile analyzer to get advanced statistics
  • Memcached

    The Stats

  • Over 350,000 users. The actual numbers are, as always, very super super top secret.
  • 600 requests per second.
  • Average 200-300 connections per second. Spiking to 800 connections per second.
  • MySQL handled 2,400 requests per second.
  • 180 Rails instances. Uses Mongrel as the "web" server.
  • 1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
  • 30+ processes for handling odd jobs.
  • 8 Sun X4100s.
  • Process a request in 200 milliseconds in Rails.
  • Average time spent in the database is 50-100 milliseconds.
  • Over 16 GB of memcached.

    The Architecture

  • Ran into very public scaling problems. The little bird of failure popped up a lot for a while.
  • Originally they had no monitoring, no graphs, no statistics, which makes it hard to pinpoint and solve problems. Added Munin and Nagios. There were difficulties using tools on Solaris. Had Google analytics but the pages weren't loading so it wasn't that helpful :-)
  • Use caching with memcached a lot.
    - For example, if getting a count is slow, you can memoize the count into memcache in a millisecond.
    - Getting your friends' status is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in cache instead. It never touches the database. This gives a predictable response time frame (upper bound 20 msecs). A sketch of this caching pattern appears after this list.
    - ActiveRecord objects are huge so that's why they aren't cached. So they want to store critical attributes in a hash and lazy load the other attributes on access.
    - 90% of requests are API requests. So don't do any page/fragment caching on the front-end. The pages are so time sensitive it doesn't do any good. But they cache API requests.
  • Messaging
    - Use messaging a lot. Producers produce messages, which are queued, and then are distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc).
    - Send message to invalidate friend's cache in the background instead of doing all individually, synchronously.
    - Started with DRb, which stands for distributed Ruby. A library that allows you to send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and a single point of failure.
    - Moved to Rinda, which is a shared queue that uses a tuplespace model, along the lines of Linda. But the queues aren't persistent, so messages are lost on failure.
    - Tried Erlang. Problem: How do you get a broken server running on a Sunday night with 20,000 users waiting? The developer didn't know. Not a lot of documentation. So it violates the use what you know rule.
    - Moved to Starling, a distributed queue written in Ruby.
    - Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well.
  • SMS is handled using an API supplied by third party gateways. It's very expensive.
  • Deployment
    - They do a review and push out new mongrel servers. No graceful way yet.
    - An internal server error is given to the user if their mongrel server is replaced.
    - All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels and a rolling approach would cause all the queues in the remaining mongrels to fill up.
  • Abuse
    - A lot of down time because people crawl the site and add everyone as friends. 9000 friends in 24 hours. It would take down the site.
    - Build tools to detect these problems so you can pinpoint when and where they are happening.
    - Be ruthless. Delete them as users.
  • Partitioning
    - Plan to partition in the future. Currently they don't. These changes have been enough so far.
    - The partition scheme will be based on time, not users, because most requests are very temporally local.
    - Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. May write to a read-only slave, which is really bad.
  • Twitter's API Traffic is 10x Twitter’s Site
    - Their API is the most important thing Twitter has done.
    - Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create.
  • Monit is used to kill processes if they get too big.
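
    As a concrete illustration of the caching items above (memoizing a slow count and pushing a friend's status straight into cache on write so reads never touch the database), here is a minimal sketch. A plain in-memory map stands in for memcached, and the class and method names are hypothetical, not Twitter's code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // A process-local map stands in for memcached here; the pattern is the point.
    // Write the status into the cache when it changes, and memoize expensive
    // counts, so the read path never has to touch the database.
    public class StatusCache {
        private final Map<String, Object> cache = new ConcurrentHashMap<String, Object>();

        // Called on the write path: a user posts a new status.
        public void onStatusUpdate(long userId, String status) {
            cache.put("status:" + userId, status);            // readers hit the cache, not the DB
        }

        public String getStatus(long userId) {
            return (String) cache.get("status:" + userId);    // predictable, bounded latency
        }

        // Memoize a slow count: compute it once, serve it from cache afterwards.
        public long getFollowerCount(long userId) {
            String key = "followers:" + userId;
            Long cached = (Long) cache.get(key);
            if (cached == null) {
                cached = slowCountFromDatabase(userId);       // the expensive query, done once
                cache.put(key, cached);
            }
            return cached;
        }

        private long slowCountFromDatabase(long userId) {
            return 0;   // placeholder for the real query
        }
    }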

    Lessons Learned

  • Talk to the community. Don't hide and try to solve all problems yourself. Many brilliant people are willing to help if you ask.
  • Treat your scaling plan like a business plan. Assemble a board of advisers to help you.
  • Build it yourself. Twitter spent a lot of time trying other people's solutions that just almost seemed to work, but not quite. It's better to build some things yourself so you at least have some control and you can build in the features you need.
  • Build in user limits. People will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed.
  • Don't make the database the central bottleneck of doom. Not everything needs to require a gigantic join. Cache data. Think of other creative ways to get the same result. A good example is talked about in Twitter, Rails, Hammers, and 11,000 Nails per Second.
  • Make your application easily partitionable from the start. Then you always have a way to scale your system.
  • Realize your site is slow. Immediately add reporting to track problems.
  • Optimize the database.
    - Index everything. Rails won't do this for you.
    - Use EXPLAIN to see how your queries are running. Indexes may not be being used as you expect.
    - Denormalize a lot. Single handedly saved them. For example, they store all of a user's friend IDs together, which prevented a lot of costly joins (a small sketch of this follows the list).
    - Avoid complex joins.
    - Avoid scanning large sets of data.
  • Cache the hell out of everything. Individual active records are not cached, yet. The queries are fast enough for now.
  • Test everything.
    - You want to know when you deploy an application that it will render correctly.
    - They have a full test suite now. So when the caching broke they were able to find the problem before going live.
  • Long running processes should be abstracted to daemons.
  • Use exception notifier and exception logger to get immediate notification of problems so you can address them right away.
  • Don't do stupid things.
    - Scale changes what can be stupid.
    - Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it works great.
  • Most performance comes not from the language, but from application design.
  • Turn your website into an open service by creating an API. Their API is a huge reason for Twitter's success. It allows users to create an ever expanding ecosystem around Twitter that is difficult to compete with. You can never do all the work your users can do and you probably won't be as creative. So open your application up and make it easy for others to integrate your application with theirs.
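
    The denormalization tip above (storing all of a user's friend IDs together) is worth seeing concretely. Here's a minimal sketch; the comma-separated storage format and key name are illustrative assumptions, not Twitter's actual schema:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Denormalization sketch: instead of joining a friendships table on every read,
    // store all of a user's friend IDs as one value keyed by user. The comma-separated
    // format and the "friends:<userId>" key are illustrative, not Twitter's schema.
    public class FriendList {
        public static String serialize(List<Long> friendIds) {
            StringBuilder sb = new StringBuilder();
            for (Long id : friendIds) {
                if (sb.length() > 0) sb.append(',');
                sb.append(id);
            }
            return sb.toString();
        }

        public static List<Long> deserialize(String value) {
            List<Long> ids = new ArrayList<Long>();
            if (value == null || value.isEmpty()) return ids;
            for (String part : value.split(",")) {
                ids.add(Long.parseLong(part));
            }
            return ids;
        }

        public static void main(String[] args) {
            String stored = serialize(Arrays.asList(42L, 77L, 1001L));
            System.out.println(stored);                 // 42,77,1001
            System.out.println(deserialize(stored));    // [42, 77, 1001]
        }
    }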

    Related Articles

  • For a discussion of partitioning take a look at Amazon Architecture, An Unorthodox Approach to Database Design : The Coming of the Shard, Flickr Architecture
  • The Mailinator Architecture has good strategies for abuse protection.
  • GoogleTalk Architecture addresses some interesting issues when scaling social networking sites.

    PlentyOfFish Architecture

    Update 5: PlentyOfFish Update - 6 Billion Pageviews And 32 Billion Images A Month
    Update 4: Jeff Atwood costs out Markus' scale up approach against a scale out approach and finds scale up wanting. The discussion in the comments is as interesting as the article. My guess is Markus doesn't want to rewrite his software to work across a scale out cluster so even if it's more expensive scale up works better for his needs.
    Update 3: POF now has 200 million images and serves 10,000 images per second. They'll be moving to a 250,000 IOPS RamSan to handle the load. Also upgraded to a core database machine with 512 GB of RAM, 32 CPU’s, SQLServer 2008 and Windows 2008.
    Update 2: This seems to be a POF Peer1 love fest infomercial. It's pretty content free, but the production values are high. Lots of quirky sounds and fish swimming on the screen.
    Update: by Facebook standards Read/WriteWeb says POF is worth a cool one billion dollars. It helps to talk like Dr. Evil when saying it out loud.

    PlentyOfFish is a hugely popular on-line dating system slammed by over 45 million visitors a month and 30+ million hits a day (500 - 600 pages per second). But that's not the most interesting part of the story. All this is handled by one person, using a handful of servers, working a few hours a day, while making $6 million a year from Google ads. Jealous? I know I am. How are all these love connections made using so few resources?


    Information Sources

  • Channel9 Interview with Markus Frind
  • Blog of Markus Frind
  • Plentyoffish: 1-Man Company May Be Worth $1Billion

    The Platform

  • Microsoft Windows
  • IIS
  • Akamai CDN
  • Foundry ServerIron Load Balancer

    The Stats

  • PlentyOfFish (POF) gets 1.2 billion page views/month, and 500,000 average unique logins per day. The peak season is January, when it will grow 30 percent.
  • POF has one single employee: the founder and CEO Markus Frind.
  • Makes up to $10 million a year on Google ads working only two hours a day.
  • 30+ Million Hits a Day (500 - 600 pages per second).
  • 1.1 billion page views and 45 million visitors a month.
  • Has 5-10 times the click through rate of Facebook.
  • A top 30 site in the US based on Compete's Attention metric, top 10 in Canada and top 30 in the UK.
  • 2 load balanced web servers with 2 Quad Core Intel Xeon X5355 @ 2.66Ghz processors, 8 GB of RAM (using about 800 MBs), 2 hard drives, running Windows x64 Server 2003.
  • 3 DB servers. No data on their configuration.
  • Approaching 64,000 simultaneous connections and 2 million page views per hour.
  • Internet connection is a 1Gbps line of which 200Mbps is used.
  • 1 TB/day serving 171 million images through Akamai.
  • 6TB storage array to handle millions of full sized images being uploaded every month to the site.

    What's Inside

  • Revenue model has been to use Google ads. A major competitor, in comparison, generates $300 million a year, primarily from subscriptions. POF's revenue model is about to change so it can capture more revenue from all those users. The plan is to hire more employees, hire sales people, and sell ads directly instead of relying solely on AdSense.
  • With 30 million page views a day you can make good money on advertising, even at 5 - 10 cents CPM.
  • Akamai is used to serve 100 million plus image requests a day. If you have 8 images and each takes 100 msecs you are talking a second load just for the images. So distributing the images makes sense.
  • 10’s of millions of image requests are served directly from their servers, but the majority of these images are less than 2KB and are mostly cached in RAM.
  • Everything is dynamic. Nothing is static.
  • All outbound Data is Gzipped at a cost of only 30% CPU usage. This implies a lot of processing power on those servers, but it really cuts bandwidth usage.
  • No caching functionality in ASP.NET is used. It is not used because as soon as the data is put in the cache it's already expired.
  • No built in components from ASP are used. Everything is written from scratch. Nothing is more complex than simple if-thens and for loops. Keep it simple.
  • Load balancing
    - IIS arbitrarily limits the total connections to 64,000 so a load balancer was added to handle the large number of simultaneous connections. Adding a second IP address and then using a round robin DNS was considered, but the load balancer was considered more redundant and allowed easier swap in of more web servers. And using ServerIron allowed advanced functionality like bot blocking and load balancing based on passed on cookies, session data, and IP data.
    - The Windows Network Load Balancing (NLB) feature was not used because it doesn't do sticky sessions. A way around this would be to store session state in a database or in a shared file system.
    - 8-12 NLB servers can be put in a farm and there can be an unlimited number of farms. A DNS round-robin scheme can be used between farms. Such an architecture has been used to enable 70 front end web servers to support over 300,000 concurrent users.
    - NLB has an affinity option so a user always maps to a certain server, thus no external storage is used for session state and if the server fails the user loses their state and must relogin. If this state includes a shopping cart or other important data, this solution may be poor, but for a dating site it seems reasonable.
    - It was thought that the cost of storing and fetching session data in software was too expensive. Hardware load balancing is simpler. Just map users to specific servers and if a server fails have the user log in again (see the sketch after this list).
    - The cost of a ServerIron was cheaper and simpler than using NLB. Many major sites use them for TCP connection pooling, automated bot detection, etc. ServerIron can do a lot more than load balancing and these features are attractive for the cost.
  • Has a big problem picking an ad server. Ad server firms want several hundred thousand a year plus they want multi-year contracts.
  • In the process of getting rid of ASP.NET repeaters and instead uses the append string thing or response.write. If you are doing over a million page views a day just write out the code to spit it out to the screen.
  • Most of the build out costs went towards a SAN. Redundancy at any cost.
  • Growth was through word of mouth. Went nuts in Canada, spread to UK, Australia, and then to the US.
  • Database
    - One database is the main database.
    - Two databases are for search. Load balanced between search servers based on the type of search performed.
    - Monitors performance using task manager. When spikes show up he investigates. Problems were usually blocking in the database. It's always database issues. Rarely any problems in .net. Because POF doesn't use the .net library it's relatively easy to track down performance problems. When you are using many layers of frameworks finding out where problems are hiding is frustrating and hard.
    - If you call the database 20 times per page view you are screwed no matter what you do.
    - Separate database reads from writes. If you don't have a lot of RAM and you do reads and writes you get paging involved which can hang your system for seconds.
    - Try and make a read only database if you can.
    - Denormalize data. If you have to fetch stuff from 20 different tables try and make one table that is just used for reading.
    - One day it will work, but when your database doubles in size it won't work anymore.
    - If you only do one thing in a system it will do it really really well. Just do writes and that's good. Just do reads and that's good. Mix them up and it messes things up. You run into locking and blocking issues.
    - If you are maxing the CPU you've either done something wrong or it's really really optimized. If you can fit the database in RAM do it.
  • The development process is: come up with an idea. Throw it up within 24 hours. It kind of half works. See what user response is by looking at what they actually do on the site. Do messages per user increase? Do session times increase? If people don't like it then take it down.
  • System failures are rare and short lived. Biggest issues are DNS issues where some ISP says POF doesn't exist anymore. But because the site is free, people accept a little down time. People often don't notice sites down because they think it's their problem.
  • Going from one million to 12 million users was a big jump. He could scale to 60 million users with two web servers.
  • Will often look at competitors for ideas for new features.
  • Will consider something like S3 when it becomes geographically load balanced.
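
    The "map users to specific servers" idea in the load balancing notes boils down to a deterministic hash. Here's a minimal sketch; the server list and the log-in-again failure behavior are illustrative assumptions, not POF's actual configuration:

    import java.util.Arrays;
    import java.util.List;

    // Deterministic user-to-server affinity: the same user always lands on the same
    // web server, so session state can live in that server's memory. If the server
    // dies, the user simply logs in again. Server names are illustrative.
    public class StickyMapper {
        private final List<String> servers;

        public StickyMapper(List<String> servers) {
            this.servers = servers;
        }

        public String serverFor(String userId) {
            int bucket = Math.floorMod(userId.hashCode(), servers.size());
            return servers.get(bucket);
        }

        public static void main(String[] args) {
            StickyMapper mapper = new StickyMapper(Arrays.asList("web1", "web2"));
            System.out.println(mapper.serverFor("markus"));   // always the same server for this user
        }
    }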

    Lessons Learned

  • You don't need millions in funding, a sprawling infrastructure, and a building full of employees to create a world class website that handles a torrent of users while making good money. All you need is an idea that appeals to a lot of people, a site that takes off by word of mouth, and the experience and vision to build a site without falling into the typical traps of the trade. That's all you need :-)
  • Necessity is the mother of all change.
  • When you grow quickly, but not too quickly, you have a chance to grow, modify, and adapt.
  • RAM solves all problems. After that it's just growing using bigger machines.
  • When starting out keep everything as simple as possible. Nearly everyone gives this same advice and Markus makes a noticeable point of saying everything he does is just obvious common sense. But clearly what is simple isn't merely common sense. Creating simple things is the result of years of practical experience.
  • Keep database access fast and you have no issues.
  • A big reason POF can get away with so few people and so little equipment is they use a CDN for serving large heavily used content. Using a CDN may be the secret sauce in a lot of large websites. Markus thinks there isn't a single site in the top 100 that doesn’t use a CDN. Without a CDN he thinks load time in Australia would go to 3 or 4 seconds because of all the images.
  • Advertising on Facebook yielded poor results. With 2000 clicks only 1 signed up. With a CTR of 0.04% Facebook delivers just 0.4 clicks per 1000 ad impressions, so at 5 cents/CPM a click costs 12.5 cents, at 50 cents/CPM $1.25 a click, at $1.00/CPM $2.50 a click, and at $15.00/CPM $37.50 a click (the arithmetic is spelled out in the sketch after this list).
  • It's easy to sell a few million page views at high CPM’s. It's a LOT harder to sell billions of page views at high CPM’s, as shown by Myspace and Facebook.
  • The ad-supported model limits your revenues. You have to go to a paid model to grow larger. To generate 100 million a year as a free site is virtually impossible as you need too big a market.
  • Growing page views via Facebook for a dating site won't work. Having a visitor on your site is much more profitable. Most of Facebook's page views are outside the US and you have to split 5 cent CPM’s with Facebook.
  • Co-req is a potential large source of income. This is where you offer in your site's sign up to send the user more information about mortgages or some other product.
  • You can't always listen to user responses. Some users will always love new features and others will hate it. Only a fraction will complain. Instead, look at what features people are actually using by watching your site.
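
    The ad arithmetic above is easy to get wrong, so here it is spelled out. A minimal sketch using the CTR and CPM figures quoted in the list:

    // Cost per click = CPM / clicks per 1000 impressions, where clicks per 1000
    // impressions = CTR * 1000. Uses the 0.04% CTR quoted above.
    public class AdMath {
        public static void main(String[] args) {
            double ctr = 0.0004;                        // 0.04% click-through rate
            double clicksPerThousand = ctr * 1000;      // 0.4 clicks per 1000 impressions

            double[] cpms = {0.05, 0.50, 1.00, 15.00};  // dollars per 1000 impressions
            for (double cpm : cpms) {
                double costPerClick = cpm / clicksPerThousand;
                System.out.printf("CPM $%.2f -> $%.3f per click%n", cpm, costPerClick);
            }
            // $0.05 -> $0.125, $0.50 -> $1.25, $1.00 -> $2.50, $15.00 -> $37.50
        }
    }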

    Related Articles

  • MySpace also uses Windows to run their site.
  • Markus Frind's posts on Webmaster World.
  • And the Money Comes Rolling In by Max Chafkin
  • How I started A Dating Empire by Markus Frind

    Thanks to Erik Osterman for recommending profiling PlentyOfFish.

    Habits of Highly Scalable Web Applications 

    Nick Belhomme wrote up an excellent summary of a talk given by Eli White on building scalable web applications. Eli worked at and is now the PHP Community Manager & DevZone Editor-in-Chief at Zend Technologies. Eli takes us on a grand tour through various proven scaling strategies. On the trip you'll visit:

  • What is scalable application design
  • Tip 1: load balancing the webserver
  • Tip 2: scaling from a single DB server to a Master-Slave setup
  • Tip 3: Partitioning, Vertical DB Scaling
  • Tip 4: Partitioning, horizontal DB Scaling
  • Tip 5: Application Level Partitioning
  • Tip 6: Caching to get around your database
  • Resources

    Learn How to Exploit Multiple Cores for Better Performance and Scalability

    InfoQ has this excellent talk by Brian Goetz on the new features being added to Java SE 7 that will allow programmers to fully exploit our massively multi-processor future. While the talk is about Java it's really more general than that and there's a lot to learn here for everyone.

    Brian starts with a short, coherent, and compelling explanation of why programmers can't expect to be saved by ever faster CPUs and why we must learn to exploit the strengths of multiple core computers to make our software go faster.

    Some techniques for exploiting multiple cores are given in an equally short, coherent, and compelling explanation covering divide and conquer as the secret to multi-core bliss, fork-join, how the Java approach differs from map-reduce, and lots of other juicy topics.

    The multi-core "problem" is only going to get worse. Tilera founder Anant Agarwal estimates by 2017 embedded processors could have 4,096 cores, server CPUs might have 512 cores and desktop chips could use 128 cores. Some disagree saying this is too optimistic, but Agarwal maintains the number of cores will double every 18 months.

    An abstract of the talk follows though I would highly recommend watching the whole thing. Brian does a great job.

    Why is Parallelism More Important Now?

  • Coarse grain concurrency was all the rage for Java 5. The hardware reality has changed. The number of cores is increasing so applications must now search for fine grain parallelism (fork-join)
  • As hardware becomes more parallel, more and more cores, software has to look for techniques to find more and more parallelism to keep the hardware busy.
  • Clock rates have been increasing exponentially over the last 30 years or so. Allowed programmers to be lazy because a faster processor would be released that saved your butt. There wasn't a need to tune programs.
  • That wait-for-a-faster-processor game is up. Around 2003 clock rates stopped increasing. Hit the power wall. Faster processors require more power. Thinner chip conductor lines were required and the thinner lines can't dissipate the increased power without overheating, which affects the resistance characteristics of the conductors. So you can't keep increasing clock rate.
  • Fastest Intel CPU 4 or 5 years ago was 3.2 Ghz. Today it's about the same or even slower.
  • Easier to build 2.6 Ghz or 2.8 Ghz chips. Moore's law wasn't repealed so we can cram more transistors on each wafer. So more processing power could be put on a chip which leads to putting more and more processing cores on a chip. This is multicore.
  • Multicore systems are the trend. The number of cores will grow at exponential rate for the next 10 years. 4 cores at the low end. The high end 256 (Sun) and 800 (Azul) core systems.
  • More cores per chip instead of faster chips. Moore's law has been redirected to multicore.
  • The problem is it's harder to make a program go faster on a multicore system. A faster chip will run your program faster. If you have 100 cores your program won't go faster unless you explicitly design it to take advantage of those cores.
  • No free lunch anymore. Must now be able to partition your program so it can run faster by running on multiple cores. And you must be able to keep doing that as the number of cores keeps increasing.
  • We need a way to specify programs so they can be made parallel as topologies change by adding more cores.
  • As hardware evolves platforms must evolve to take advantage of the new hardware. Started off with coarse grain tasks which was sufficient given the number of cores. This approach won't work as the number of cores increases.
  • Must find finer-grained parallelism. Example: sorting and searching data. Opportunities are around the data. The data for sorting can be chunked, sorted, and then brought together with a merge sort. Searching can be done in parallel by searching subregions of the data and merging the results.
  • Parallel solutions use more CPU in aggregate because of the coordination needed and that data needs to be handled more than once (merge). But the result is faster because it's done in parallel. This adds business value. Faster is better for humans.

    What has Java 7 Added to Support Parallelism?

  • Example problem is to find the max number from a list (a fork-join sketch of this example follows the list).
  • The coarse grained threading approach is to use a thread pool, divide up the numbers, and let the task pool compute the sub problems. A shared task pool is slow as the number of tasks increases, which forces the work to be more coarse grained. No way to load balance. Code is ugly. Doesn't match the problem well. The runtime is dominated by how long it takes the longest subtask to run. Had to decide up front how many pieces to divide the problem into.
  • Solution using divide and conquer. Divide the set into pieces recursively until the problem is so small the sequential solution is more efficient. Sort the pieces. Merge the results. O(n log n), but the problem is parallelizable. Scales well and can keep many CPUs busy.
  • Divide and conquer uses fork-join to fork off subtasks and wait for them to complete and then join the results. A typical thread pool solution is not efficient. Creates too many threads and creating threads are expensive and use a lot of memory.
  • This approach is portable because it's abstract. It doesn't know how many processors are available; it's independent of the topology.
  • The fork-join pool is optimized for fine grained operations whereas the thread pool is optimized for coarse grained operations. Best used for problems without IO. Just computations using CPU that tend to fork off sub problems. Allows data to be shared read-only and used across different computations without copying.
  • This approach scales nearly linearly with the number of hardware threads.
  • The goal for fork-join: Avoid context switches; Have as many threads as hardware threads and keep them all busy; Minimize queue lock contention for data structures. Avoid common task queue.
  • Implementation uses work-stealing. Each thread has a work queue that is a double ended queue. Each thread pulls work from the head of its queue and processes it. When there's nothing to do it steals work from the tail of another queue. No contention for the head because only one thread accesses it. Rare contention on the tail because stealing is infrequent and the stolen work is large, which takes time to process. The process starts with one task. It breaks up the work. Other tasks steal work and start the same process. Load balances without central coordination, few context switches, little coordination.
  • The same approach also works for graph traversal, matrix operations, linear algebra, modeling, generate moves and evaluate the result. Latent parallelism can be found in a lot of places once you start looking.
  • Support higher level operations like ParallelArray. Can specify filtering, transformation, and aggregation options. Not a generalized in-memory database, but has a very transparent cost model. It's clear how many parallel operations are happening. Can look at the code and quickly know what's a parallel operation so you will know the cost.
  • Looks like map reduce except this is scaling across a multicore system, one single JVM, whereas map reduce is across a cluster. The strategy is the same: divide and conquer.
  • Idea is to make specifying parallel operations so easy you wouldn't even think of the serial approach.
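
    Here is a minimal sketch of the divide-and-conquer "find the max" example using Java 7's fork-join framework; the threshold value is an arbitrary illustration, not a recommendation from the talk:

    import java.util.Random;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Divide-and-conquer "find the max" with Java 7 fork-join: split the range
    // recursively until it is small enough to scan sequentially, then join results.
    public class MaxFinder extends RecursiveTask<Integer> {
        private static final int THRESHOLD = 10_000;   // arbitrary cutoff for the sequential case
        private final int[] data;
        private final int lo, hi;

        public MaxFinder(int[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Integer compute() {
            if (hi - lo <= THRESHOLD) {                // small enough: plain sequential scan
                int max = Integer.MIN_VALUE;
                for (int i = lo; i < hi; i++) {
                    max = Math.max(max, data[i]);
                }
                return max;
            }
            int mid = (lo + hi) >>> 1;                 // split the range in half
            MaxFinder left = new MaxFinder(data, lo, mid);
            MaxFinder right = new MaxFinder(data, mid, hi);
            left.fork();                               // run the left half asynchronously
            int rightMax = right.compute();            // compute the right half in this thread
            return Math.max(left.join(), rightMax);    // wait for the left half and combine
        }

        public static void main(String[] args) {
            Random random = new Random(42);
            int[] numbers = new int[1_000_000];
            for (int i = 0; i < numbers.length; i++) {
                numbers[i] = random.nextInt();
            }
            int max = new ForkJoinPool().invoke(new MaxFinder(numbers, 0, numbers.length));
            System.out.println("max = " + max);
        }
    }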

    Related Articles

  • The Free Lunch Is Over - A Fundamental Turn Toward Concurrency in Software By Herb Sutter
  • Intuition, Performance, and Scale by Dan Pritchett
  • "Multi-core Mania": A Rebuttal by Ted Neward
  • CPU designers debate multi-core future by Rick Merritt
  • Multicore puts screws to parallel-programming models by Rick Merritt
  • Challenges in Multi-Core Era – Part 1 and Part 2 by Gaston Hillar.
  • Running multiple processes to understand multicore CPUs power.
  • Learning to Program all Over Again by Vineet Gupta