More Troubles with Caching

As a tasty pairing with Facebook And Site Failures Caused By Complex, Weakly Interacting, Layered Systems, here is another excellent tale of caching gone wrong by Peter Zaitsev, in an exciting twin billing: Cache Miss Storm and More on dangers of the caches. This is a fascinating case where the cause turned out to be a software upgrade that ran long because it had to be rolled back. During the long recovery time many of the cache entries timed out. When the database came back, slam, all the clients queried the database to repopulate the cache, and bad things happened to the database. The solution was equally interesting:

So the immediate solution to bring the system up was surprisingly simple. We just had to get traffic on the system in stages allowing Memcache to be warmed up. There were no code which would allow to do it on application side so we did it on MySQL side instead. “SET GLOBAL max_connections=20” to limit number of connections to MySQL and so let application to err when it tries to put too much load on MySQL as MySQL load stabilizes increasing number of connections higher until you finally can serve all traffic without problems.
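
The staged ramp Peter describes could be scripted roughly as follows. This is only a sketch, not what his team ran: the doubling schedule, the target of 500 connections, and the `execute_sql`, `load_is_stable`, and `wait` callables are all assumptions made for illustration. Only the `SET GLOBAL max_connections` statement and the starting value of 20 come from the quote.

```python
from typing import Callable, Iterator, List


def ramp_limits(start: int, target: int, factor: float = 2.0) -> Iterator[int]:
    """Yield increasing connection limits from start up to target (inclusive)."""
    if start < 1 or target < start or factor <= 1.0:
        raise ValueError("need 1 <= start <= target and factor > 1")
    limit = start
    while limit < target:
        yield limit
        # Grow by the factor, always by at least 1, and never past the target.
        limit = min(target, max(limit + 1, int(limit * factor)))
    yield target


def ramp_up(
    execute_sql: Callable[[str], None],   # hypothetical: runs a statement on MySQL
    load_is_stable: Callable[[], bool],   # hypothetical: your own health check
    wait: Callable[[], None],             # hypothetical: e.g. sleep a few seconds
    start: int = 20,
    target: int = 500,                    # assumed normal limit, not from the post
) -> List[int]:
    """Raise max_connections in stages, pausing until load settles at each stage."""
    applied = []
    for limit in ramp_limits(start, target):
        execute_sql(f"SET GLOBAL max_connections = {limit}")
        applied.append(limit)
        while not load_is_stable():
            wait()
    return applied
```

With the defaults the limits applied are 20, 40, 80, 160, 320 and 500. How to decide that load has stabilized is left to the caller, since the post does not say which signal was used.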

Peter also suggested a few other helpful strategies:

  • Watch frequently accessed cache items as well as cache items which take long to generate in case of cache miss 
  • Optimize Queries 
  • Use a Smarter Cache
  • Include Rollback in Maintenance Window
  • Know your Cold Cache Performance and Behavior 
  • Have a way to increase traffic gradually 
  • Consider Pre-Priming Caches 
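
To make "Use a Smarter Cache" and "Consider Pre-Priming Caches" a little more concrete, here is one possible reading, sketched as a small in-process cache. It is an assumption about what "smarter" could mean, not Peter's design: on a miss, only one caller runs the expensive loader while concurrent callers for the same key wait for its result, so a cold key costs the database one query instead of hundreds. The `CoalescingCache` class, its `loader` callable, and the `prime` method are hypothetical names; expiry and eviction are left out to keep the sketch short.

```python
import threading
from typing import Any, Callable, Dict, Iterable


class CoalescingCache:
    """On a miss, one caller loads the value; concurrent callers for that key wait."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader                      # hypothetical: e.g. a DB query
        self._values: Dict[str, Any] = {}
        self._in_flight: Dict[str, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Any:
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    # First caller to miss on this key becomes the loader.
                    event = threading.Event()
                    self._in_flight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake the waiters whether the load succeeded or failed.
                    with self._lock:
                        del self._in_flight[key]
                    event.set()
            # Follower: wait for the leader, then re-check the cache. If the
            # leader's load failed, one of the waiters takes over as leader.
            event.wait()

    def prime(self, keys: Iterable[str]) -> None:
        """Pre-populate the cache, e.g. with a list of known hot keys."""
        for key in keys:
            self.get(key)
```

Calling `prime` with the most frequently used keys before opening the site to traffic is the pre-priming idea; where that key list comes from is up to the application.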

Reader Comments (6)

Am I crazy for thinking that a database should not livelock under heavy query load? If the kernel and network layer become unresponsive that is one thing, but new requests should not prevent existing requests from making progress, nor should they cut overall throughput or response time for earlier requests.

This is something VoltDB was forced to deal with early on when we switched to an asynchronous wire protocol. Benchmark clients could queue an unbounded amount of work, so governors had to be implemented or the server and client would promptly run out of memory queuing requests. I'll go so far as to say that you can't livelock VoltDB, not to the extent that a request that comes later can impact the progress of a request that came earlier, because requests are executed serially.

I am not saying you can't have a query pattern that causes problems (imagine what happens when all queries are routed to the same partition/shard in any distributed database), but livelock shouldn't be one of them. We should be designing databases accordingly, and not demanding workarounds at the application level.

October 1, 2010 | Unregistered Commenter Ariel Weisberg

Ariel, I think it's a general issue of server design. It takes a lot of design and coding to accept connections as fast as possible and prioritize work as it comes in. If you didn't start out that way it's a hard switch to make. Adopting an async model gets you a long way there.

October 1, 2010 | Registered Commenter Todd Hoff


When you really can't run your site without lots of cached data keeping the miss rate low, it's a whole different creature. It's like half an in-memory database -- doesn't have to be complete, persistent (obviously), fully reliable, or always up to date depending on the use case, but it has to have a certain amount of usable data and you can't casually assume blowing out lots of data or skipping the cache will have no consequences. Seems like that could affect a lot of your design decisions.

October 2, 2010 | Unregistered Commenter Randall

Wouldn't an 'eventually consistent' application resolve this problem easily? Queuing commands while showing old cached data from Solr or another NoSQL, read-optimized store? :>

October 4, 2010 | Unregistered Commenter Scooletz

We saw this problem at Squarespace because we run a large Coherence data grid cluster and keep a ton of stuff in memory. We built an explicit "cache warming" solution that primes a cold cache with the most used data when the system starts up. To figure out the most used data we have a service that runs periodically cataloging this. This forms the "recipe" the cache warming system follows to prime the cache.

In practice we haven't had to use it because we never take the system down completely and utilize rolling restarts.

October 6, 2010 | Unregistered Commenter Davin Chew

I agree with Ariel's comment earlier. The core problem here, which the author does not appear to recognize, is the database. Rather than dumping resources, additional complexity, and more machines into the caching layer, the solution likely is to figure out why the database is so slow and fix that.

A cache is supposed to provide a modest performance boost under normal operation, not to be used to hide underlying problems. If you can't run without your caching layer, if your caching layer has more machines than your database, if cache misses have very high latency, you need to take a serious look at your architecture design.

October 18, 2010 | Unregistered Commenter Greg Linden
