Memcached

Todd Hoff's picture

Scaling Twitter: Making Twitter 10000 Percent Faster

Update 6: Some interesting changes from Twitter's Evan Weaver: everything in RAM now, database is a backup; peaks at 300 tweets/second; every tweet followed by average 126 people; vector cache of tweet IDs; row cache; fragment cache; page cache; keep separate caches; GC makes Ruby optimization resistant so went with Scala; Thrift and HTTP are used internally; 100s internal requests for every external request; rewrote MQ but kept interface the same; 3 queues are used to load balance requests; extensive A/B testing for backwards capability; switched to C memcached client for speed; optimize critical path; faster to get the cached results from the network memory than recompute them locally.
Update 5: Twitter on Scala. A Conversation with Steve Jenson, Alex Payne, and Robey Pointer by Bill Venners. A fascinating discussion of why Twitter moved to the Java JVM for their server infrastructure (long lived processes) and why they moved to Scala to program against it (high level language, static typing, functional). Ruby is used on the front-end but wasn't performant or reliable enough for the back-end.
Update 4: Improving Running Components at Twitter by Evan Weaver. Tells how Twitter changed their infrastructure to go from handling 3 requests to 139 requests a second. They moved to a messaging model, asynchronous process, 3 levels of cache, and moved their middleware to a mixture C and Scala/JVM.
Update 3: Upgrading Twitter without service disruptions by Gojko Adzic. Lots of good updates on the new Twitter architecture.
Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
Update: Twitter releases Starling - light-weight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.

Scaling Memcached: 500,000+ Operations/Second with a Single-Socket UltraSPARC T2

A software-based distributed caching system such as memcached is an important piece of today's largest Internet sites that support millions of concurrent users and deliver user-friendly response times. The distributed nature of memcached design transforms 1000s of servers into one large caching pool with gigabytes of memory per node. This blog entry explores single-instance memcached scalability for a few usage patterns.

Table below shows out-of-the-box (no custom OS rewrites or networking tuning required) performance with 10G networking hardware and one single-socket UltraSPARC T2-based server with 8 cores and 8 threads per core (64 threads on a chip)...

Object Size / Ops/Sec / Bandwidth
100 bytes / 530,000 / 1.2 Gb/s
2048 bytes / 370,000 / 6.9 Gb/s
4096 bytes / 255,000 / 9.2 Gb/s

Check out the link for more details!

Presentations: MySQL Conference & Expo 2009

The Presentations of the MySQL Conference & Expo 2009 held April 20-23 in Santa Clara is available on the above link.

They include:

  • Beginner's Guide to Website Performance with MySQL and memcached by Adam Donnison
  • Calpont: Open Source Columnar Storage Engine for Scalable MySQL DW by Jim Tommaney
  • Creating Quick and Powerful Web Applications with MySQL, GlassFish, and NetBeans by Arun Gupta
  • Deep-inspecting MySQL with DTrace by Domas Mituzas
  • Distributed Innodb Caching with memcached by Matthew Yonkovit and Yves Trudeau
  • Improving Performance by Running MySQL Multiple Times by MC Brown
  • Introduction to Using DTrace with MySQL by Vince Carbone
  • MySQL Cluster 7.0 - New Features by Johan Andersson
  • Optimizing MySQL Performance with ZFS by Allan Packer
  • SAN Performance on a Internal Disk Budget: The Coming Solid State Disk Revolution by Matthew Yonkovit
  • This is Not a Web App: The Evolution of a MySQL Deployment at Google by Mark Callaghan
inquiry's picture

Some Questions from a newbie

Hello highscalability world. I just discovered this site yesterday in a search for a scalability resource and was very pleased to find such useful information. I have some questions regarding distributed caching that I was hoping the scalability intelligentsia trafficking this forum could answer. I apologize for my lack of technical knowledge; I'm hoping this site will increase said knowledge! Feel free to answer all or as much as you want. Thank you in advance for your responses and thank you for a great resource!

1.) What are the standard benchmarks used to measure the performance of memcached or mySQL/memcached working together (from web 2.0 companies etc)?

2.) The little research I've conducted on this site suggests that most web 2.0 companies use a combination of mySQL and a hacked memcached (and potentially sharding). Does anyone know if any of these companies use an enterprise vendor for their distributed caching layer? (At this point in time I've only heard of Jive software using Coherence).

3.) In terms of a web 2.0 oriented startup, what are the database/distributed caching requirements typically needed to get off the ground and grow at a fairly rapid pace?

4.) Given the major players in the web 2.0 industry (facebook, twitter, myspace, PoF, Flickr etc, I'm ignoring google/amazon here because they have a proprietary caching layer) what is the most common, scalable back-end setup (mySQL/memcached/sharding etc)? What are its limitations/problems? What features does said setup lack that it really needs?

Thank you so much for your insight!

Some things about Memcached from a Twitter software developer

Memcached is generally treated as a black box. But what if you really need to know what's in there? Not for runtime purposes, but for optimization and capacity planning?

Read more on Evan Weaver, a software developer working for Twitter (a contributor for Rails core and Mongrel).

QCon London 2009: Upgrading Twitter without service disruptions

Evan Weaver from Twitter presented a talk on Twitter software upgrades, titled Improving running components as part of the Systems that never stop track at QCon London 2009 conference last Friday. The talk focused on several upgrades performed since last May, while Twitter was experiencing serious performance problems.

Todd Hoff's picture

Database Sharding at Netlog, with MySQL and PHP

Jurriaan Persyn is a Lead Web Developer at Netlog, a social portal site that gets 50 million unique visitors and 5+ billion page views per month. In this paper Jurriaan goes into a lot of excellent nuts and bolts details about how they used sharding to scale their system. If you are pondering sharding as a solution to your scaling problems you'll want to read this paper. As the paper is quite well organized there's no reason to write a summary, but I especially liked this part from the conclusion:


If you can do with simpler solutions (better hardware, more hardware, server tweaking and tuning, vertical partitioning, sql query optimization, ...) that require less development cost, why invest lots of effort in sharding? On the other hand, when your visitor statistics really start blowing through the roof, it is a good direction to go. After all, it worked for us.

Todd Hoff's picture

Strategy: Facebook Tweaks to Handle 6 Time as Many Memcached Requests

Our latest strategy is taken from a great post by Paul Saab of Facebook, detailing how with changes Facebook has made to memcached they have:


...been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.

To scale Facebook has hundreds of thousands of TCP connections open to their memcached processes. First, this is still amazing. It's not so long ago you could have never done this. Optimizing connection use was always a priority because the OS simply couldn't handle large numbers of connections or large numbers of threads or large numbers of CPUs. To get to this point is a big accomplishment. Still, at that scale there are problems that are often solved.

Some of the problem Facebook faced and fixed:

  • Per connection consumption of resources. What works well at low number of inputs can totally kill a system as inputs grow. Memcached uses a per-connection buffer which adds up to a lot of memory that could be used to store data. Nothing wrong with this design choice, but Facebook made changes to use a per-thread shared connection buffer and reclaimed gigabytes of RAM on each server.
  • Kernel lock contention. Facebook discovered under load there was lock contention when transmitting through a single UDP socket from multiple threads. Sockets are data structures too and they are subject to the usual lock contention issues. Facebook got around this issue by maintaining separate reply sockets in different threads so they would not contend with the receive sockets. They found another bottleneck in Linux’s “netdevice” layer that sits in-between IP and device drivers. They changed the dequeue algorithm to batch dequeues so more work was done when they had the CPU.
  • Application lock contention. Nothing brings out lock issues like moving to more cores. Facebook found when they moved to 8 core machines a global lock protecting stats collection used 20-30% of CPU usage. In application that require little processing per request, as does memcached, this is not unexpected, but doing real work with your CPU is a better idea. So they collected stats on a per thread basis and then calculated a global view on demand.
  • Interrupt floods and starvation. With so much traffic directed at a single server the hardware can flood the CPU(s) with interrupts and keep the CPU from doing "real" work. To get around this problem Facebook implements some complicated strategies to load balance IO across all the cores. As I am less clever I might try more network cards with a TCP Offload engine.

    When you read Paul's article keep in mind all the incredible number of man hours that went into profiling the system, not just their application, but the entire software hardware stack. Then add in the research, planning, and trying different solutions to see if anything changed for the better. It's a lot of work. Notice using a nifty new parallel language or moving to a cloud wouldn't have made a bit difference. It's complete mastery of their system that made the difference.

    A summary of potential strategies:

  • Profile everything. Problems are always specific. The understanding of the problem must be specific. The fix must be specific.
  • Burn profiling into your regression tests. Detect when and where performance tanks as a regular part of your build.
  • Use resources in proportion to what grows slowest. This requires multiplexing, but at least your resource usage is more predictable and bounded.
  • Batch work. When you have the CPU do all the work you possibly can in the quantum or the whole system grinds to a halt in processing overhead.
  • Do work and maintain resources per task. Otherwise locking for shared resources takes more and more time when there's less and less time to do the work that needs to be done.
  • Change algorithms. Sometimes you simply need to do things differently. Tweaking will only get you so far.

    You can find their changes on github, the hub that says "git."

  • Todd Hoff's picture

    Strategy: How to Manage Sessions Using Memcached

    Dormando shows an enlightened middle way for storing sessions in cache and the database. Sessions are a perfect cache candidate because they are transient, smallish, and since they are usually accessed on every page access removing all that load from the database is a good thing. But as Dormando points out session caches have problems. If you remove expiration times from the cache and you run out of memory then no more logins. If a cache server fails or needs to be upgrade then you just logged out a bunch of potentially angry users.

    The middle ground Dormando proposes is using both the cache and the database:

  • Reads: read from the cache first, then the database. Typical cache logic.
  • Writes: write to memcached every time, write to the database every N seconds (assuming the data has changed).

    There's a small chance of data loss, but you've still greatly reduced the database load while providing reliability. Nice solution.

  • Todd Hoff's picture

    37signals Architecture

    Update 6: Things We’ve Learned at 37Signals. Themes: less is more; don't worry be happy.
    Update 5: Nuts & Bolts: HAproxy . Nice explanation (post, screencast) by Mark Imbriaco of why HAProxy (load balancing proxy server) is their favorite (fast, efficient, graceful configuration, queues requests when Mongrels are busy) for spreading dynamic content between Apache web servers and Mongrel application servers.
    Update 4: O'Rielly's Tim O'Brien interviews David Hansson, Rails creator and 37signals partner. Says BaseCamp scales horizontally on the application and web tier. Scales up for the database, using one "big ass" 128GB machine. Says: As technology moves on, hardware gets cheaper and cheaper. In my mind, you don't want to shard unless you positively have to, sort of a last resort approach.
    Update 3: The need for speed: Making Basecamp faster. Pages now load twice as fast, cut CPU usage by a third and database time by about half. Results achieved by: Analysis, Caching, MySQL optimizations, Hardware upgrades.
    Update 2: customer support is handled in real-time using Campfire.
    Update: highly useful information on creating a customer billing system.

    In the giving spirit of Christmas the folks at 37signals have shared a bit about how their system works. 37signals is most famous for loosing Ruby on Rails into the world and they've use RoR to make their very popular Basecamp, Highrise, Backpack, and Campfire products. RoR takes a lot of heat for being a performance dog, but 37signals seems to handle a lot of traffic with relatively normal sounding resources. This is just an initial data dump, they promise to add more details later. As they add more I'll update it here.

    Syndicate content