advertise
« Handle 1 Billion Events Per Day Using a Memory Grid | Main | MySpace Architecture »
Saturday
Feb142009

Scaling Digg and Other Web Applications

Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn't find the actual presentation, but fortunately Kris Jordan took some great notes. That's how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call.

In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me however was how Joe thought MemcacheDB Will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I've never thought of it in that way. Well learn why Joe is so excited by MemcacheDB at the end of the post.

Impressive Stats

  • 80th-100th largest site in the world
  • 26 million uniques a month
  • 30 million users.
  • Uniques are only half that traffic. Traffic = unique web visitors + APIs + Digg buttons.
  • 2 billion requests a month
  • 13,000 requests a second, peak at 27,000 requests a second.
  • 3 Sys Admins, 2 DBAs, 1 Network Admin, 15 coders, QA team
  • Lots of servers.

    Scaling Strategies

  • Scaling is specialization. When off the shelf solutions no longer work at a certain scale you have to create systems that work for your particular needs.
  • Lesson of web 2.0: people love making crap and sharing it with the world.
  • Web 2.0 sucks for scalability. Web 1.0 was flat with a lot of static files. Additional load is handled by adding more hardware. Web 2.0 is heavily interactive. Content can be created at a crushing rate.
  • Languages don't scale. 100% of the time bottlenecks are in
    IO. Bottlenecks aren't in the language when you are handling so many simultaneous requests. Making PHP 300% faster won't matter. Don't optimize PHP by using single quotes instead of double quotes when
    the database is pegged.
  • Don’t share state. Decentralize. Partitioning is required to process a high number of requests in parallel.
  • Scale out instead of up. Expect failures. Just add boxes to scale and avoid the fail.
  • Database-driven sites need to be partitioned to scale both horizontally and vertically. Horizontal partitioning means store a subset of rows on a different machines. It is used when there's more data than will fit on one machine. Vertical partitioning means putting some columns in one table and some columns in another table. This allows you to add data to the system without downtime.
  • Data are separated into separate clusters: User Actions, Users, Comments, Items, etc.
  • Build a data access layer so partitioning is hidden behind an API.
  • With partitioning comes the CAP Theorem: you can only pick two of the following three: Strong Consistency, High Availability, Partition Tolerance.
  • Partitioned solutions require denormalization and has become a big problem at Digg. Denormalization means data is copied in multiple objects and must be kept synchronized.
  • MySQL replication is used to scale out reads.
  • Use an asynchronous queuing architecture for near-term processing.
    - This approach pushes chunks of processing to another service and let's that service schedule the processing on a grid of processors.
    - It's faster and more responsive than cron and only slightly less responsive than real-time.
    - For example, issuing 5 synchronous database requests slows you down. Do them in parallel.
    - Digg uses Gearman. An example use is to get a permalink. Three operations are done parallel: get the current logged, get the permalink, and grab the comments. All three are then combined to return a combined single answer to the client. It's also used for site crawling and logging. It's a different way of thinking.
    - See Flickr - Do the Essential Work Up-front and Queue the Rest and The Canonical Cloud Architecture for more information.
  • Bottlenecks are in IO so you have tune the database. When the database is bigger than RAM the disk is hit all the time which kills performance. As the database gets larger the table can't be scanned anymore. So you have to:
    - denormalize
    - avoid joins
    - avoid large scans across databases by partitioning
    - cache
    - add read slaves
    - don't use NFS
  • Run numbers before you try and fix a problem to make sure things actually will work.
  • Files like for icons and photos are handled by using MogileFS, a distributed file system. DFSs support high request rates because files are distributed and replicated around a network.
  • Cache forever and explicitly expire.
  • Cache fairly static content in a file based cache.
  • Cache changeable items in memcached
  • Cache rarely changed items in APC. APC is a local cache. It's not distributed so no other program have access to the values.
  • For caching use the Chain of Responsibility pattern. Cache in MySQL, memcached APC, and PHP globals. First check PHP globals as the fastest cache. If not present check APC, memcached and on up the chain.
  • Digg's recommendation engine is a custom graph database that is eventually consistent. Eventually consistent means that writes to one partition will eventually make it to all the other partitions. After a write reads made one after another don't have to return the same value as they could be handled by different partitions. This is a more relaxed constraint than strict consistency which means changes must be visible at all partitions simultaneously. Reads made one after another would always return the same value.
  • Assume 1 million people a day will bang on any new feature so make it scalable from the start. Example: the About page on Digg did a live query against the master database to show all employees. Just did a quick hack to get out. Then a spider went crazy and took the site down.

    Miscellaneous

  • Digg buttons were a major key to generating traffic.
  • Uses Debian Linux, Apache, PHP, MySQL.
  • Pick a language you enjoy developing in, pick a coding standard, add inline documentation that's extractable, use a code repository, and a bug tracker. Likes PHP, Track, and SVN.
  • You are only as good as your people. Have to trust guy next to you that he's doing his job. To cultivate trust empower people to make
    decisions. Trust that people have it handled and they'll take care of it. Cuts down on meetings because you know people will do the job right.
  • Completely a Mac shop.
  • Almost all developers are local. Some people are remote to offer 24 hour support.
  • Joe's approach is pragmatic. He doesn't have a language fetish. People went from PHP, to Python/Ruby, to Erlang. Uses vim. Develops from the command line. Has no idea how people constantly change tool sets all the time. It's not very productive.
  • Services (SOA) decoupling is a big win. Digg uses REST. Internal services return a vanilla structure that's mapped to JSON, XML, etc. Version in URL because it costs you nothing, for example:
    /1.0/service/id/xml. Version both internal and external services.
  • People don't understand how many moving parts are in a website. Something is going to happen and it will go down.

    MemcacheDB: Evolutionary Step for Code, Revolutionary Step for Performance

    Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear.

    You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that.

    The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that’s 300 million diggs day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers.

    With such a heavy write load MySQL wasn’t going to work for Digg. That’s where MemcacheDB comes in. In Initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge.

    What is MemcacheDB? It's a distributed key-value storage system designed for persistent. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to memcache protocol(not completed, see below), so any memcached client can have connectivity with it. MemcacheDB uses Berkeley DB as a storing backend, so lots of features including transaction and replication are supported.

    Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast.

    Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top.

    I asked Joe why he didn't turn to one of the in-memory data grid solutions? Some of the reasons were:
  • This data is generated from many different databases and takes a long time to generate. So they want it in a persistent store.
  • MemcacheDB uses the memcache protocol. Digg already uses memcache so it's a no-brainer to start using MemcacheDB. It's easy to use and easy to setup.
  • Operations is happy with deploying it into the datacenter as it's not a new setup.
  • They already have memcached high availability and failover code so that stuff already works.
  • Using a new system would require more ramp-up time.
  • If there are any problems with the code you can take a look. It's all open source.
  • Not sure those other products are stable enough.

    So it's an evolutionary step for code and a revolutionary step for performance. Digg is looking at using MemcacheDB across the board.

    Related Articles

  • Scaling Digg and Other Web Applications by Kris Jordan.
  • MemcacheDB
  • Joe Stump's Blog
  • MemcachedRelated Tags on HighScalability
  • Caching Related Tags on HighScalability
  • BigTable
  • SimpleDB
  • Anti-RDBMS: A list of distributed key-value stores
  • An Unorthodox Approach to Database Design : The Coming of the Shard
  • Flickr Architecture
  • Episode 4: Scaling Large Web Sites with Joe Stump, Lead Architect at DIGG
  • Reader Comments (17)

    Unless they want to spend a lot of money on RAM, they're in for a big surprise. Memcachedb/Berkeley DB (or MySQL, PostgreSQL or other traditional DB stores based on B-Tree) performance will take a nose dive when the RAM to DB size ratio drop under 0.8 or so.

    The solution for writes currently is Hypertable, which can sustain 10k writes/s per client, even when RAM to DB size ratio is under 0.1.

    November 29, 1990 | Unregistered Commentervicaya

    What is the architecture behind "search". I am assuming some sort of full-text database is being used. Can someone point out the best way to scale full-text search in IIS/ASP.Net environment. I know SQL Server 2005 is not good enough.

    I am looking for something that uses a non-sql based backend such as berkley db.

    Any pointer will be appreciated?

    November 29, 1990 | Unregistered CommenterAsh

    Great recomposition of the notes from the Web 2.0 Expo. Joe's talk was great! As a heads up the link back to my notes seems to have a
    in the href that breaks it link. Best! -Kris

    November 29, 1990 | Unregistered CommenterKris Jordan

    Wow, looks to me like they are breaking out the big guns dude!

    RT
    www.anon-tools.us.tc

    November 29, 1990 | Unregistered CommenterJessie Wilder

    You should take a peek at SOLR (search server based on Lucene). Runs well on Tomcat and Jetty...both run on Windows. http://lucene.apache.org/solr/

    November 29, 1990 | Unregistered Commenterfcreyes

    XML storage for search/data is your best bet for a ASP.net application, using LINQ with .net 3.5 and XML is much faster then SQL. and you dont have to worry about some asshat spamming your ms-sql-s ports all day long.

    http://www.keyoti.com/products/search/dotNetWeb/index.html

    these guys have a pretty good search engine, but it only searches via crawl of page, but you can always tweak their source code, for simple sites it should hold up also is runs via web controls

    November 29, 1990 | Unregistered CommenterAnonymous

    Cool. Thanks. Interesting that this didn't come up in my search.

    November 29, 1990 | Unregistered CommenterTodd Hoff

    "Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized."

    Here's a brilliant idea: don't denormalize!!

    November 29, 1990 | Unregistered CommenterSeun Osewa

    Amazing articles, I am greate lover of memcache and had implement it in my http://www.kiran.org.in'>site

    November 29, 1990 | Unregistered CommenterKiran

    Todd
    Thanks again for a great summary.

    After reading careful through the details it sounds to me that Digg had to invest quite a bit in their infrastructure to get to this level of scalability.

    By looking at the rational behind *not using data grid* over memcacheDB i get the impression that there was not too much thought invested in evaluating this option as some of the comments doesn't make sense to me. For example changing the application to use mecacheDB as appose to MySQL required significant change. With data-grid the change could be lower as the database could be kept unchanged. Now i can't argue with success however if i would need to scale my application today i would be very cautious going in that route especially under the current market economy.

    There are more off the shelf solutions today that does most of the things that where mentioned in this study. This include the ability to break an application into distributed services, scaling the data-tier, integrating with Load-balancer to enable dynamic scaling (See recent post on that regard http://highscalability.com/handle-1-billion-events-day-using-memory-grid">Handle 1 Billion Events Per Day Using a Memory Grid (1009)). Interestingly enough just last in the past few weeks we were involved in few projects that showed how you can push more then 100k writes/sec. We used data-grid as the front-end and kept the update to the database asynchronously. We used MySQL as the underlying database which bring the advantage of using standard SQL database from a maintenance perspective. Cloud computing can also provide a very cost effective environment for achieving this level of scaling without too much hassle.

    I'm not sure that all those off the shelf alternatives was available when Digg built their solution so it made sense for Joe Stump to come-up with this architecture.
    It would be interesting to know if Joe would have chosen that route today.

    See http://natishalom.typepad.com/nati_shaloms_blog/2008/11/the-impact-of-cloud-computing-on-buy-vs-build-behaviour.html">The impact of cloud computing on build vs. buy behavior which is based on Fred Brooks excellent writeup "No Silver Bullets". As Fred noted, in many cases we tend to believe that we have specialized requirements and end up building and investing in infrastructure work that is not entirely core to our business. Economy pressure forces re-evaluate of those assumptions.

    November 29, 1990 | Unregistered Commenternatis

    Seun,

    Sorry you are very wrong. You are showing your inexperience. When you deal with large data, you have to denormalize to minimize locks on your datasets. You also improve performance by reducing joins. Remember read volume is much more that write volume.

    November 29, 1990 | Unregistered CommenterAnonymous

    I'm not sure that is relevant. Assume you are Digg, and have 1 TB of data (it's probably more, but go with me here). You can spread that across 40 servers with 24GB of RAM, and a 24GB store per node.

    According to HP's site, which I just checked, upgrading a DL 360 from 1GB of RAM to 24 GB costs $998. Let's round that up to $1,000 for simplicity. (the total server cost was under $7K, including 120GB of RAID 1 SAS storage)

    What you end up with is $40,000 of RAM totaling 1 TB, spread across 40 servers. Not a bad deal for a company that is spending $280,000 on servers. That works out to 14% of the server cost being RAM.

    So, respectfully, falling under 1.0 is only due to poor planning, or extremely large datasets. (Please note that I am smart enough to realize that 1TB is probably a rounding error for a company like Facebook or Flickr, but their biudgets are commensurately bigger, too.)

    November 29, 1990 | Unregistered CommenterAnonymous

    I see that podcast last year, very interesting guy that knows his stuff.

    -- http://www.onlinebingoclub.co.uk/foxy-bingo/">foxy

    November 29, 1990 | Unregistered Commenterfoxy

    Hi,
    We are dealing with a project that implemented MemcacheDB as a solution for persistence cache. Unfortunately it seems that not everything is so bright with this product:
    1. Get times vary with 50ms on average get and with standard deviation of 150ms (the size of the key and data is under 100bytes)
    2. You have no information how much data is stored in the MemcacheDB, what data was "lost" there, and you cannot delete information if you don't have the key (and no, enumeration is not supported)
    3. It seems that the previous two facts are connected (you cannot remove abandoned items, so performance is poor).
    4. The product seems to be abandoned according to it's website, and in its current condition is far from being manageable.

    Best Regards,
    Moshe Kaplan. Performacne Expert.

    December 2, 2009 | Unregistered CommenterMoshe Kaplan

    Thanks for another great article, seeing real world examples from some of the busiest sites on the net gives us really valuable information on what to prepare for.

    I just quoted highscalability.com in my latest article on <http://www.mrkirkland.com/prepare-for-web-application-scalability/>preparing for scalability.

    Thanks!
    Mr Kirkland
    CEO artweb.com

    May 12, 2010 | Unregistered CommenterMr Kirkland

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Post:
     
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>