Memcached

Todd Hoff's picture

Strategy: Facebook Tweaks to Handle 6 Time as Many Memcached Requests

Our latest strategy is taken from a great post by Paul Saab of Facebook, detailing how with changes Facebook has made to memcached they have:


...been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.

To scale Facebook has hundreds of thousands of TCP connections open to their memcached processes. First, this is still amazing. It's not so long ago you could have never done this. Optimizing connection use was always a priority because the OS simply couldn't handle large numbers of connections or large numbers of threads or large numbers of CPUs. To get to this point is a big accomplishment. Still, at that scale there are problems that are often solved.

Some of the problem Facebook faced and fixed:

  • Per connection consumption of resources. What works well at low number of inputs can totally kill a system as inputs grow. Memcached uses a per-connection buffer which adds up to a lot of memory that could be used to store data. Nothing wrong with this design choice, but Facebook made changes to use a per-thread shared connection buffer and reclaimed gigabytes of RAM on each server.
  • Kernel lock contention. Facebook discovered under load there was lock contention when transmitting through a single UDP socket from multiple threads. Sockets are data structures too and they are subject to the usual lock contention issues. Facebook got around this issue by maintaining separate reply sockets in different threads so they would not contend with the receive sockets. They found another bottleneck in Linux’s “netdevice” layer that sits in-between IP and device drivers. They changed the dequeue algorithm to batch dequeues so more work was done when they had the CPU.
  • Application lock contention. Nothing brings out lock issues like moving to more cores. Facebook found when they moved to 8 core machines a global lock protecting stats collection used 20-30% of CPU usage. In application that require little processing per request, as does memcached, this is not unexpected, but doing real work with your CPU is a better idea. So they collected stats on a per thread basis and then calculated a global view on demand.
  • Interrupt floods and starvation. With so much traffic directed at a single server the hardware can flood the CPU(s) with interrupts and keep the CPU from doing "real" work. To get around this problem Facebook implements some complicated strategies to load balance IO across all the cores. As I am less clever I might try more network cards with a TCP Offload engine.

    When you read Paul's article keep in mind all the incredible number of man hours that went into profiling the system, not just their application, but the entire software hardware stack. Then add in the research, planning, and trying different solutions to see if anything changed for the better. It's a lot of work. Notice using a nifty new parallel language or moving to a cloud wouldn't have made a bit difference. It's complete mastery of their system that made the difference.

    A summary of potential strategies:

  • Profile everything. Problems are always specific. The understanding of the problem must be specific. The fix must be specific.
  • Burn profiling into your regression tests. Detect when and where performance tanks as a regular part of your build.
  • Use resources in proportion to what grows slowest. This requires multiplexing, but at least your resource usage is more predictable and bounded.
  • Batch work. When you have the CPU do all the work you possibly can in the quantum or the whole system grinds to a halt in processing overhead.
  • Do work and maintain resources per task. Otherwise locking for shared resources takes more and more time when there's less and less time to do the work that needs to be done.
  • Change algorithms. Sometimes you simply need to do things differently. Tweaking will only get you so far.

    You can find their changes on github, the hub that says "git."

  • Todd Hoff's picture

    Strategy: How to Manage Sessions Using Memcached

    Dormando shows an enlightened middle way for storing sessions in cache and the database. Sessions are a perfect cache candidate because they are transient, smallish, and since they are usually accessed on every page access removing all that load from the database is a good thing. But as Dormando points out session caches have problems. If you remove expiration times from the cache and you run out of memory then no more logins. If a cache server fails or needs to be upgrade then you just logged out a bunch of potentially angry users.

    The middle ground Dormando proposes is using both the cache and the database:

  • Reads: read from the cache first, then the database. Typical cache logic.
  • Writes: write to memcached every time, write to the database every N seconds (assuming the data has changed).

    There's a small chance of data loss, but you've still greatly reduced the database load while providing reliability. Nice solution.

  • Todd Hoff's picture

    37signals Architecture

    Update 6: Things We’ve Learned at 37Signals. Themes: less is more; don't worry be happy.
    Update 5: Nuts & Bolts: HAproxy . Nice explanation (post, screencast) by Mark Imbriaco of why HAProxy (load balancing proxy server) is their favorite (fast, efficient, graceful configuration, queues requests when Mongrels are busy) for spreading dynamic content between Apache web servers and Mongrel application servers.
    Update 4: O'Rielly's Tim O'Brien interviews David Hansson, Rails creator and 37signals partner. Says BaseCamp scales horizontally on the application and web tier. Scales up for the database, using one "big ass" 128GB machine. Says: As technology moves on, hardware gets cheaper and cheaper. In my mind, you don't want to shard unless you positively have to, sort of a last resort approach.
    Update 3: The need for speed: Making Basecamp faster. Pages now load twice as fast, cut CPU usage by a third and database time by about half. Results achieved by: Analysis, Caching, MySQL optimizations, Hardware upgrades.
    Update 2: customer support is handled in real-time using Campfire.
    Update: highly useful information on creating a customer billing system.

    In the giving spirit of Christmas the folks at 37signals have shared a bit about how their system works. 37signals is most famous for loosing Ruby on Rails into the world and they've use RoR to make their very popular Basecamp, Highrise, Backpack, and Campfire products. RoR takes a lot of heat for being a performance dog, but 37signals seems to handle a lot of traffic with relatively normal sounding resources. This is just an initial data dump, they promise to add more details later. As they add more I'll update it here.

    Todd Hoff's picture

    Strategy: Drop Memcached, Add More MySQL Servers

    Update 2: Michael Galpin in Cache Money and Cache Discussions likes memcached for it's expiry policy, complex graph data, process data, but says MySQL has many advantages: SQL, Uniform Data Access, Write-through, Read-through, Replication, Management, Cold starts, LRU eviction.
    Update: Dormando asks Should you use memcached? Should you just shard mysql more?. The idea of caching is the most important part of caching as it transports you beyond a simple CRUD worldview. Plan for caching and sharding by properly abstracting data access methods. Brace for change. Be ready to shard, be ready to cache. React and change to what you push out which is actually popular, vs over planning and wasting valuable time.

    Feedster's François Schiettecatte wonders if Fotolog's 21 memcached servers wouldn't be better used to further shard data by adding more MySQL servers? He mentions Feedster was able to drop memcached once they partitioned their data across more servers. The algorithm: partition until all data resides in memory and then you may not need an additional memcached layer.

    Parvesh Garg goes a step further and asks why people think they should be using MySQL at all?

    Related Articles


  • The Death of Read Replication by Brian Aker. Caching layers have replaced read replication. Cache can't fix a broken database layer. Partition the data that feeds the cache tier: "Keep your front end working through the cache. Keep all of your data generation behind it."
  • Read replication with MySQL by François Schiettecatte. Read replication is dead and it should be used only for backup purposes. Take the memory used for caching and give it to your database servers.
  • Replication++, Replication 2.0, Replication.Next by Ronald Bradford. What should read replication be used for?
  • Replication, caching, and partitioning by Greg Linden. Caching overdone because it adds complexity, latency on a cache miss, and inefficiently uses cluster resources. Hitting disk is the problem. Shard more and get your data in memory.

  • Todd Hoff's picture

    A Bunch of Great Strategies for Using Memcached and MySQL Better Together

    The primero recommendation for speeding up a website is almost always to add cache and more cache. And after that add a little more cache just in case. Memcached is almost always given as the recommended cache to use. What we don't often hear is how to effectively use a cache in our own products. MySQL hosted two excellent webinars (referenced below) on the subject of how to deploy and use memcached. The star of the show, other than MySQL of course, is Farhan Mashraqi of Fotolog. You may recall we did an earlier article on Fotolog in Secrets to Fotolog's Scaling Success, which was one of my personal favorites.

    Fotolog, as they themselves point out, is probably the largest site nobody has ever heard of, pulling in more page views than even Flickr. Fotolog has 51 instances of memcached on 21 servers with 175G in use and 254G available. As a large successful photo-blogging site they have very demanding performance and scaling requirements. To meet those requirements they've developed a sophisticated approach to using memcached that others can learn from and emulate. We'll cover some of the highlightable strategies from the webinar down below the fold.

    Todd Hoff's picture

    FaceStat's Rousing Tale of Scaling Woe and Wisdom Won

    Lukas Biewald shares a fascinating slam by slam recount of how his FaceStat (upload your picture and be judged by the masses) site was battered by a link on Yahoo's main page that caused an almost instantaneous 650,000 page view jump on their site. Yahoo spends considerable effort making sure its own properties can handle the truly massive flow from the main page. Turning the Great Eye of the Internet towards an unsuspecting newborn site must be quite the diaper ready experience. Theo Schlossnagle eerily prophesized about such events in The Implications of Punctuated Scalabilium for Website Architecture: massive, unexpected and sudden traffic spikes will become more common as a fickle internet seeks ever for new entertainments (my summary). Exactly FaceStat's situation.

    This is also one of our first exposures to an application written on Merb, a popular Ruby on Rails competitor. For those who think Ruby is the problem, their architecture now serves 100 times the original load.

    How did our fine FaceStat fellowship fair against Yahoo’s onslaught?

    FredMeyers's picture

    memcached and Storage of Friend list

    My first post, please be gentle. I know it is long. You are all like doctors - the more info, the better the diagnosis.

    -----------
    What is the best way to store a list of all of your friends in the memcached cache (a simple boolean saying “yes this user is your friend”, or “no”)? Think Robert Scoble (26,000+ “friends”) on Twitter.com. He views a list of ALL existing users, and in this list, his friends are highlighted.

    I came up with 4 possible methods:
    --store in memcache as an array, a list of all the "yes" friend ID's
    --store your friend ID's as individual elements.
    --store as a hash of arrays based on last 3 digits of friend's ID -- so have up to 1000 arrays for you.
    --comma-delimited string of ID's as one element

    I'm using the second one because I think it is faster to update. The single array or hash of arrays feels like too much overhead calculating and updating – and even just loading – to check for existence of a friend.

    The key is FRIEND[small ID#]_[big ID#]. The value is 1.
    This way there are no dupes. (I add u as friend, it always adds me as ur friend...I remove u, u remove me).
    Store with it 2 additional flags: One denotes start of entries. One denotes end of entries.
    As friends are added, the end flag position relative to new friends will become meaningless, but that is ok (I think).
    To see if someone is your friend, the system checks if both start and end flags exist.
    If both exist, it can check for existence of friend ID - if exists, then friend.
    Start flag is required. If start flag is pushed out of cache, we must assume some friends were also pushed out.

    Currently, the system loads from DB in a daemon in the background after you log in (if two flags are not already set).
    Until the two flags are set, it does db lookups.
    There is no timeout on the data in cache.
    Adding/removing friends to your account adds/removes to/from memcache - so, theoretically, it might never have to pre-load anything.

    Downside of my method is if the elements span multiple servers and one dies, you loose some of your friends (that's the upside of using arrays).
    I don't know how to resolve if the lost box didn't contain either of the flags -- in that case, the users' info will NEVER get refreshed. This is my concern.

    Any ideas?

    Thanks so much!!!

    Webinar: Designing and Implementing Scalable Applications with Memcached and MySQL

    The following technical Webinar could be of interest to the community.

    WHO:

    • Farhan "Frank" Mashraqi, Director of Business Operations and Technical Strategy, Fotolog Inc
    • Monty Taylor, Senior Consultant, Sun Microsystems
    • Jimmy Guerrero, Sr Product Marketing Manager, Sun Microsystems - Database Group

    WHAT:

    • Designing and Implementing Scalable Applications with Memcached and MySQL web presentation.

    WHEN:

    • Thursday, May 29, 2008, 10:00 am PST, 1:00 pm EST, 18:00 GMT
    • The presentation will be approximately 45 minutes long followed by Q&A.

    Check out the details here!

    Todd Hoff's picture

    Strategy: Break Up the Memcache Dog Pile

    Caching is like aspirin for headaches. Head hurts: pop a 'sprin. Slow site: add caching. Facebook must have a lot of headaches because they popped 805 memcached servers between 10,000 web servers and 1,800 MySQL servers and they reportedly have a 99% cache hit rate. But what's the best way for you to cache for your application? It's a remarkably complex and rich topic. Alexey Kovyrin talks about one common caching problem called the Dog Pile Effect in Dog-pile Effect and How to Avoid it with Ruby on Rails. Glenn Franxman also has a Django solution in MintCache.

    Data is usually cached because it's too expensive to calculate for every hit. Maybe it's a gnarly SQL query you want to avoid and a little stale data is OK. Or maybe the amount of data you have is simply larger than physical memory on any one machine. Or maybe you have the temerity to write to your database and cause its cache to flush so database caching isn't sufficient at a certain level of scale.

    Typical examples are for caching article vote counts, comment threads, and event streams. One familiar example that bit me hard is displaying the the top N blog articles. Do you want to scan through your entire access log table for every page display? Absolutely not. Especially when the nightly backups are going on and the network is very slow. Not good :-) Yet you still want to update the results every X minutes so the stats stay fresh.

    Data freshness requires a refrigeration truck or an expiry time on your cache entry that causes stats to be periodically recalculated. Now, what happens when your cached data expires and a 1000 requests simultaneously try to recalculate the expensive to calculate data? Database load spikes and the world nearly ends. And since memcached operations are not atomic it's possible stale data could be cached and you'll serve stale data. Which kind of defeats of the purpose of taking load off the data while providing accurate data. So, how do you unpile the dogs?

    Todd Hoff's picture

    Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App

    Update: Jake in Does Django really scale better than Rails? thinks apps like FFS shouldn't need so much hardware to scale.

    In a short three months Friends for Sale (think Hot-or-Not with a market economy) grew to become a top 10 Facebook application handling 200 gorgeous requests per second and a stunning 300 million page views a month. They did all this using Ruby on Rails, two part time developers, a cluster of a dozen machines, and a fairly standard architecture. How did Friends for Sale scale to sell all those beautiful people? And how much do you think your friends are worth on the open market?

    Scalr - Open Source Auto-scaling Hosting on Amazon EC2

    Scalr is a fully redundant, self-curing and self-scaling hosting environment utilizing Amazon's EC2. It has been recently open sourced on Google Code.

    Scalr allows you to create server farms through a web-based interface using prebuilt AMI's for load balancers (pound or nginx), app servers (apache, others), databases (mysql master-slave, others), and a generic AMI to build on top of.
    Scalr promises automatic high-availability and scaling for developers by health and load monitoring.

    The health of the farm is continuously monitored and maintained. When the Load Average on a type of node goes above a configurable threshold a new node is inserted into the farm to spread the load and the cluster is reconfigured. When a node crashes a new machine of that type is inserted into the farm to replace it.

    Scaling Out MySQL

    This post covers two main options for scaling-out MySql and compare between them. The first is based on data-base clustering and the second is based on In Memory clustering a.k.a Data Grid. A special emphasis is given to a pattern which shows how to scale our existing data base without changing it through a combination of Data Grid and data base as a background service. This pattern is referred to as Persistency as a Service (PaaS). It also address many of the fequently asked question related to how performance, reliability and scalability is achieved with this pattern.

    Todd Hoff's picture

    Scaling Twitter: Making Twitter 10000 Percent Faster

    Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
    Update: Twitter releases Starling - light-weight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.

    Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:


    For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.

    If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?

    Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz

    Todd Hoff's picture

    Slashdot Architecture - How the Old Man of the Internet Learned to Scale

    Slashdot effect: overwhelming unprepared sites with an avalanche of reader's clicks after being mentioned on Slashdot. Sure, we now have the "Digg effect" and other hot new stars, but Slashdot was the original. And like many stars from generations past, Slashdot plays the elder statesman's role with with class, dignity, and restraint. Yet with millions and millions of users Slashdot is still box office gold and more than keeps up with the young'ins. And with age comes the wisdom of learning how to handle all those users. Just how does Slashdot scale and what can you learn by going old school?

    Todd Hoff's picture

    Lessons from Pownce - The Early Years

    Pownce is a new social messaging application competing micromessage to micromessage with the likes of Twitter and Jaiku. Still in closed beta, Pownce has generously shared some of what they've learned so far. Like going to a barrel tasting of a young wine and then tasting the same wine after some aging, I think what will be really interesting is to follow Pownce and compare the Pownce of today with the Pownce of tomorrow, after a few years spent in the barrel. What lessons lie in wait for Pownce as they grow?

    Todd Hoff's picture

    Secrets to Fotolog's Scaling Success

    Fotolog, a social blogging site centered around photos, grew from about 300 thousand users in 2004 to over 11 million users in 2007. Though they initially experienced the inevitable pains of rapid growth, they overcame their problems and now manage over 300 million photos and 800,000 new photos are added each day. Generating all that fabulous content are 20 million unique monthly visitors and a volunteer army of 30,000 new users each day. They did so well a very impressed suitor bought them out for a cool $90 million. That's scale meets success by anyone standards. How did they do it?

    Todd Hoff's picture

    TypePad Architecture

    TypePad is considered the largest paid blogging service in the world. After experience problems because of their meteoric growth, they eventually transitioned to an architecture patterned after their sister company, LiveJournal.