Scaling Memcached: 500,000+ Operations/Second with a Single-Socket UltraSPARC T2

A software-based distributed caching system such as memcached is an important piece of today's largest Internet sites that support millions of concurrent users and deliver user-friendly response times. The distributed nature of memcached design transforms 1000s of servers into one large caching pool with gigabytes of memory per node. This blog entry explores single-instance memcached scalability for a few usage patterns. Table below shows out-of-the-box (no custom OS rewrites or networking tuning required) performance with 10G networking hardware and one single-socket UltraSPARC T2-based server with 8 cores and 8 threads per core (64 threads on a chip)... Object Size / Ops/Sec / Bandwidth 100 bytes / 530,000 / 1.2 Gb/s 2048 bytes / 370,000 / 6.9 Gb/s 4096 bytes / 255,000 / 9.2 Gb/s Check out the link for more details!

Click to read more ...


Scaling Django Web Apps by Mike Malone

Film buffs will recognize Django as a classic 1966 spaghetti western that spawned hundreds of imitators. Web heads will certainly first think of Django as the classic Python based Web framework that has also spawned hundreds of imitators and has become the gold standard framework for the web. Mike Malone, who worked on Pownce, a blogging tool now owned by Six Apart, tells in this very informative EuroDjangoCon presentation how Pownce scaled using Django in the real world. I was surprised to learn how large Pounce was: hundreds of requests/sec, thousands of DB operations/sec, millions of user relationships, millions of notes, and terabytes of static data. Django has a lot of functionality in the box to help you scale, but if you want to scale large it turns out Django has some limitations and Mike tells you what these are and also provides some code to get around them. Mike's talk-although Django specific--will really help anyone creating applications on the web. There's a lot of useful Django specific advice and a lot of general good design ideas as well. The topics covered in the talk are:

  • Django uses a shared nothing architecture. * The database is responsible for scaling state. * Application servers are horizontally scalable because they are stateless.
  • Scalability vs Performance. Performance is not the same as scalability. Scalability is A scalable system doesn’t need to change when the size of the problem changes.
  • Type of scalability: * Vertical - buy bigger hardware * Horizontal - the ability to increase a system’s capacity by adding more processing units (servers)
  • Cache to remove load from the database server.
  • Built-in Django Caching: Per-site caching, per-view cache, template fragment cache - not so effective on heavily personalized pages
  • Low-level Cache API is used to cache at any level of granularity.
  • Pounce cached individual objects and lists of object IDs.
  • The hard part of caching is invalidation. How do you know when a value changes such that the cache should be up updates so readers see valid values? * Invalidate when a model is saved or deleted. * Invalidate post_save, not pre_save. * This leaves a small race condition so: ** Instead of deleting, set the cache key to None for a short period of time ** Instead of using set to cache objects, use add, which fails if there’s already something stored for the key
  • Pounce ran memcached on their web servers * Their servers were not CPU bound, they were IO and memory bound so they compressed objects before caching.
  • Work is spread between multiple application servers using a load balancer.
  • Best way to reduce load on your app servers: don’t use them to do hard stuff.
  • Pounce used software load balancing * Hardware load balancers are expensive ($35K) and you need two for redunancy. * Software load balancers are cheap and easy. * Some options: Perlbal, Pound, HAProxy, Varnish, Nginx * Chose a single Perlbal server. This was a Single Point of Failure but they didn't have the money for hardware. Liked Perlbal's reproxying feature.
  • Used a ghetto queuing solution (MySQL + cron) to process work asynchronously in the background.
  • At scale their system needed to have high availability and be partitionable. * The RDBMS’s consistency requirements get in our way * Most sharding / federation schemes are kludges that trade consistency * There are many non relational databases (CouchDB, Cassandra, Tokyo Cabinet) but they aren't easy to use with Django.
  • Rules for denormalization: * Start with a normalized database * Selectively denormalize things as they become bottlenecks * Denormalized counts, copied fields, etc. can be updated in signal handlers
  • Joins are evil and Django makes it really easy to do joins.
  • Database Read Performance * Since your typical web app is 80% to 80% reads adding MySQL master-slave replication can solve a lot of problems. * Django doesn't support multiple database connections, but there's a library, linked to at the end of this document to help. * A big problem is slave lag. When you write to the primary it takes time for the state to be transferred to the read slaves so readers may see an old value on the read.
  • Database Write Performance * Federate. Split tables across different servers. Not well supported by Django. * Vertical Partitioning: split tables that aren’t joined across database servers. * Horizontal Partitioning: split a single table across databases (e.g., user table). Problem is autoincrement now doesn't work and Django uses autoincrement for primary keys.
  • Monitoring - You can't improve what you don't measure * Products: Ganglia and Munin
  • Measure * Server load, CPU usage, I/O * Database QPS * Memcache QPS, hit rate, evictions * Queue lengths * Anything else interesting

    Related Articles

  • Interview with Leah Culver: The Making of Pownce
  • Django Caching Code
  • Django Multidb Code
  • EuroDjangoCon Presentations

    Click to read more ...

  • Sunday

    Product: Hadoop

    Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph. and Sherpa: Hosted Data Serving. Update: Kevin Burton points out Hadoop now has a blog and an introductory video staring Beyonce. Well, the Beyonce part isn't quite true. Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders: For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy It's hard work. And it needs to be commoditized, just like the hardware has been... Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters. The obvious question of the day is: should you build your website around Hadoop? I have no idea. There seems to be a few types of things you do with lots of data: process, transform, and serve. Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on. That is: figure out affinity, categorization, popularity, click throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does Google. But does it make sense for your website? If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store. YouTube would seem like it could use a distributed job mechanism, like you can build with Amazon's services. With that you could create thumbnails, previews, transcode media files, and so on. When they have Hbase up and running that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, highly performing data store. That's an exciting prospect for me. I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?

    See Also

  • Open Source Distributed Computing: Yahoo's Hadoop Support by Jeremy Zawodny
  • Yahoo!'s bet on Hadoop by Tim O'Reilly
  • Hadoop Presentations
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3

    Click to read more ...

  • Friday

    Wolfram|Alpha Architecture

    Making the world's knowledge computable

    Today's Wolfram|Alpha is the first step in an ambitious, long-term project to make all systematic knowledge immediately computable by anyone. You enter your question or calculation, and Wolfram|Alpha uses its built-in algorithms and growing collection of data to compute the answer.

    Answer Engine vs Search Engine

    When Wolfram|Alpha launches later today, it will be one of the most computationally intensive websites on the internet. The Wolfram|Alpha computational knowledge engine is an "answer engine" that is able to produce answers to various questions such as
    • What is the GDP of France?
    • Weather is Springfield when David Ortiz was born
    • 33 g of gold
    • LDL vs. serum potassium 150 smoker male age 40
    • life expectancy male age 40 finland
    • highschool teacher median wage
    Wolfram|Alpha excels at different areas like mathematics, statistics, physics, engineering, astronomy, chemistry, life sciences, geology, business and finance as demonstrated by Steven Wolfram in his Introduction screencast.

    The Stats

    • Abour 10,000 CPU cores at launch
    • 10+ trillion of pieces of data
    • 50,000+ types of algorithms
    • Able to handle about 175 million queries per day
    • 5+ million lines of symbolic Mathematica code

    The Computers Powering Computable Knowledge

    There is no way to know exactly how much traffic to expect, especially during the initial period immediately following the launch, but the Wolfram|Alpha team is working hard to put reasonable capacity in place. As Stephen writes in the Wolfram|Alpha blog Alpha will run in 5 distributed colocation facilities. What computing power have they gathered in these facilities for launch day? Two supercomputers, just about 10,000 processor cores, hundreds of terabytes of disks, a heck of a lot of bandwidth, and what seems like enough air conditioning for the Sahara to host a ski resort. One of their launch partners, R Systems, created the world’s 44th largest supercomputer (per the June 2008 TOP500 list - it is listed as 66th per the latest Top500 list). They call it the R Smarr. It will be running Wolfram|Alpha on launch day! R Smarr has a Sum Rmax of 39580 GFlops using Dell DCS CS23-SH, QC HT 2.8 GHz computers, 4608 cores, 65536 GB of RAM and Infiniband interconnect. Dell is another of the launch partners with a data center full of quad-board, dual-processor, quad-core Harpertown servers. What does it all add up to? The ability to handle 175 million queries (yielding maybe a billion) per day—over 5 billion queries (encompassing around 30 billion calculations) per month.

    The Launch of Wolfram|Alpha

    Watch a live webcast of the Wolfram|Alpha system being brought online for the first time on
    • Friday, May 15, beginning at 7pm CST

    The First Killer App of The New Kind of Science

    The Genius behind Wolfram|Alpha is Stephen Wolfram. He is best know for his ambitious projects: Mathematica and A New Kind of Science (NKS). May 14, 2009 marks the 7th anniversary of the publication of his book A New Kind of Science. Stephen explains is his blog post: But for me the biggest thing that’s happened this year is the emergence of Wolfram|Alpha. Wolfram|Alpha is, I believe, going to be the first killer app of NKS.


    That it should be possible to build Wolfram|Alpha as it exists today in the first decade of the 21st century was far from obvious. And yet there is much more to come. As of now, Wolfram|Alpha contains 10+ trillion of pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains. Built with Mathematica—which is itself the result of more than 20 years of development at Wolfram Research—Wolfram|Alpha's core code base now exceeds 5 million lines of symbolic Mathematica code. Running on supercomputer-class compute clusters, Wolfram|Alpha makes extensive use of the latest generation of web and parallel computing technologies, including webMathematica and gridMathematica.

    How Mathematica Made Wolfram|Alpha Possible?

    Wolfram|Alpha is a major software engineering development to make all systematic knowledge immediately computable by anyone. It is developed and deployed entirely with Mathematica—in fact, Mathematica has uniquely made Wolfram|Alpha possible. Here's why.
    • Computational knowledge and intelligence
    • High-performance enterprise deployment
    • One coherent architecture
    • Smart method selection
    • Dynamic report generation
    • Database connectivity
    • Built-in, computable data
    • High-level programming language
    • Efficient text processing and linguistic analysis
    • Wide-ranging, automated visualization capabilities
    • Automated importing
    • Development environment

    Information Sources

    Congratulations Stephen!

    Click to read more ...


    Who Has the Most Web Servers?

    An interesting post on DataCenterKnowledge!

    • 1&1 Internet: 55,000 servers
    • Rackspace: 50,038 servers
    • The Planet: 48,500 servers
    • Akamai Technologies: 48,000 servers
    • OVH: 40,000 servers
    • SBC Communications: 29,193 servers
    • Verizon: 25,788 servers
    • Time Warner Cable: 24,817 servers
    • SoftLayer: 21,000 servers
    • AT&T: 20,268 servers
    • iWeb: 10,000 servers
    • How about Google, Microsoft, Amazon, eBay, Yahoo, GoDaddy, Facebook? Check out the post on DataCenterKnowledge and of course here on!



    P2P server technology?

    Is there any type of server technology that allows visitors to a website to become part of the server? Like with bittorrent, users share some of their bandwidth, so would this be possible with web servers where a person goes to a website, downloads and runs the software which makes their internet connection and cpu and hdd become part of the web server?

    Click to read more ...


    GemStone Unveils GemFire Enterprise 6.0

    GemFire Enterprise is in-memory distributed data management platform that pools memory (and CPU, network and optionally local disk) across multiple processes to manage application objects and behavior. With the 6.0 release, GemFire has reached a stage of maturity in its evolution. GemStone touts this version as the true 'best of breed' distributed caching technology, solving scalability issues in all industries.

    Click to read more ...


    Facebook, Hadoop, and Hive

    Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run applications that process vast amounts of data), Yahoo being the first.

    Learn how they do it and what are the challenges on DBMS2 blog, which is a blog for people who care about database and analytic technologies.


    Publish/subscribe model does not scale?

    on Wiki someone posted "...For relatively small installations, pub/sub provides the opportunity for better scalability than traditional client-server, through parallel operation, message caching, tree-based or network-based routing, etc. However, as systems scale up to become datacenters with thousands of servers sharing the pub/sub infrastructure, this benefit is often lost; in fact, scalability for pub/sub products under high load in large deployments is very much a research challenge." Does anyone have something to say regarding scaling Publish/subscribe models?

    Click to read more ...


    Eight Best Practices for Building Scalable Systems

    Wille Faler has created an excellent list of best practices for building scalable and high performance systems. Here's a short summary of his points:

  • Offload the database - Avoid hitting the database, and avoid opening transactions or connections unless you absolutely need to use them.
  • What a difference a cache makes - For read heavy applications caching is the easiest way offload the database.
  • Cache as coarse-grained objects as possible - Coarse-grained objects save CPU and time by requiring fewer reads to assemble objects.
  • Don’t store transient state permanently - Is it really necessary to store your transient data in the database?
  • Location, Location - put things close to where they are supposed to be delivered.
  • Constrain concurrent access to limited resource - it's quicker to let a single thread do work and finish rather than flooding finite resources with 200 client threads.
  • Staged, asynchronous processing - separate a process using asynchronicity into separate steps mediated by queues and executed by a limited number of workers in each step.
  • Minimize network chatter - Avoid remote communication if you can as it's slower and less reliable than local computation.

    Click to read more ...