Update 2: Michael Galpin in Cache Money and Cache Discussions likes memcached for its expiry policy and for holding complex graph data and process data, but says MySQL has many advantages: SQL, Uniform Data Access, Write-through, Read-through, Replication, Management, Cold starts, LRU eviction. Update: Dormando asks Should you use memcached? Should you just shard mysql more? Thinking in terms of caching is the most important part of caching, as it transports you beyond a simple CRUD worldview. Plan for caching and sharding by properly abstracting data access methods. Brace for change. Be ready to shard, be ready to cache. React and change based on what you push out that actually turns out to be popular, rather than over-planning and wasting valuable time. Feedster's François Schiettecatte wonders if Fotolog's 21 memcached servers wouldn't be better used to further shard data by adding more MySQL servers? He mentions Feedster was able to drop memcached once they partitioned their data across more servers. The algorithm: partition until all data resides in memory, and then you may not need an additional memcached layer. Parvesh Garg goes a step further and asks why people think they should be using MySQL at all?
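To illustrate the "properly abstract your data access" advice, here's a hedged sketch in Java (all names hypothetical): callers depend only on an interface, so a cached or sharded implementation can be dropped in later without touching call sites.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical domain type and store interface; callers only see UserStore.
class User {
    final long id;
    final String name;
    User(long id, String name) { this.id = id; this.name = name; }
}

interface UserStore {
    User findById(long id);
    void save(User user);
}

// Day one: a plain database-backed implementation (queries elided).
class DbUserStore implements UserStore {
    public User findById(long id) { /* SELECT ... WHERE id = ? */ return new User(id, "..."); }
    public void save(User user)   { /* INSERT ... ON DUPLICATE KEY UPDATE ... */ }
}

// Later: the same interface fronted by a cache (or a sharded router),
// with no changes to calling code.
class CachedUserStore implements UserStore {
    private final UserStore backing;
    private final Map<Long, User> cache = new ConcurrentHashMap<>();

    CachedUserStore(UserStore backing) { this.backing = backing; }

    public User findById(long id) {
        // Serve from cache; fall through to the backing store on a miss.
        return cache.computeIfAbsent(id, backing::findById);
    }

    public void save(User user) {
        backing.save(user);       // write through to the system of record
        cache.put(user.id, user); // keep the cache coherent
    }
}
```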
Pre-generating static files is an oldie but a goody, and as Thomas Brox Røst says, it's probably an underused strategy today. At one time this was the dominant technique for structuring a web site. Then the age of dynamic web sites arrived and we spent all our time worrying about how to make the database faster and add more caching to recover the speed we had lost in the transition from static to dynamic. Static files have the advantage of being very fast to serve: read from disk and display. Simple and fast, especially when caching proxies are used. The issues are how you bulk generate the initial files, how you serve the files, and how you keep the changed files up to date. This is the process Thomas covers in his excellent article Serving static files with Django and AWS - going fast on a budget, where he explains how he converted 600K previously dynamic pages to static pages for his site Eventseer.net, a service for tracking academic events.

Eventseer.net was experiencing performance problems as search engines crawled their 600K dynamic pages. As a solution you could imagine scaling up, adding more servers, adding sharding, and so on, all somewhat complicated approaches. Their solution was to convert the dynamic pages to static pages in order to keep search engines from killing the site. As an added bonus, non-logged-in users experienced a much faster site and were more likely to sign up for the service. The article does a good job explaining what they did, so I won't regurgitate it all here, but I will cover the highlights and comment on some additional potential features and alternate implementations.

They estimated it would take 7 days on a single server to generate the initial 600K pages. Ouch. So what they did was use EC2 for what it's good for: spin up a lot of boxes to process data. Their data is backed up on S3, so the EC2 instances could read the data from S3, generate the static pages, and write them to their deployment area. It took 5 hours, 25 EC2 instances, and a meager $12.50 to perform the initial bulk conversion. Pretty slick.

The next trick is figuring out how to regenerate static pages when changes occur. When a new event is added to their system hundreds of pages could be impacted, which would require the affected static pages to be regenerated. Since it's not important to update pages immediately, they queued updates for processing later. An excellent technique. A local queue of changes is maintained and replicated to an AWS SQS queue. The local queue is used in case SQS is down. Twice a day EC2 instances are started to regenerate pages. Instances read work requests from SQS, access data from S3, regenerate the pages, and shut down when the SQS queue is empty. In addition, they use AWS for all their background processing jobs.
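To make the flow concrete, here's a rough sketch of such a regeneration worker in Java. It uses the AWS SDK for Java (which postdates the article); the queue name, rendering call, and deployment call are placeholders, and Eventseer's actual workers surely look different.

```java
import java.util.List;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class RegenWorker {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = sqs.getQueueUrl("page-regen").getQueueUrl(); // hypothetical queue name

        while (true) {
            // Pull regeneration requests; stop when the queue looks empty.
            // (A real worker would use long polling and a few retries
            // before deciding the queue is actually drained.)
            List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
            if (messages.isEmpty()) break;

            for (Message m : messages) {
                String pageId = m.getBody();
                String html = renderPage(pageId);    // read source data (e.g. from S3) and render
                deploy(pageId, html);                // write the static file to the deployment area
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
        // The EC2 instance can shut itself down here once the queue is drained.
    }

    static String renderPage(String pageId) { return "<html>...</html>"; } // placeholder
    static void deploy(String pageId, String html) { /* e.g. S3 put or local write */ }
}
```

The nice property of this design is that because instances exit when the queue drains, you only pay for compute while there is actual work to do.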
Comments

I like their approach a lot. It's a very pragmatic solution and rock solid in operation. For very little money they offloaded the database by moving work to AWS. If they grow to millions of users (knock on wood), nothing much will have to change in their architecture. The same process will still work, and it still won't cost very much. Far better than trying to add machines locally to handle the load or moving to a more complicated architecture. Using the backups on S3 as a source for the pages rather than hitting the database is inspired. Your data is backed up and the database is protected. Nice. Using batched asynchronous work queues rather than synchronously loading the web servers and the database for each change is a good strategy too.

As I was reading I originally thought you could optimize the system so that a page only needed to be generated once, maybe by analyzing the events or some other magic. Then it hit me that this was old-style thinking. Don't be fancy. Just keep regenerating each page as needed. If a page is regenerated a thousand times versus only once, who cares? There's plenty of cheap CPU available.

The local queue of changes still bothers me a little because it adds a complication into the system. The local queue and the AWS SQS queue must be kept synced. Missing a change would be a disaster because the dependent pages would never be regenerated and nobody would ever know; the page would only be regenerated the next time an event happened to impact it. If pages are regenerated frequently this isn't a serious problem, but seldom-touched pages may never be regenerated again. Personally, I would drop the local queue. SQS goes down infrequently. When it does go down I would record that fact and regenerate all the pages when SQS comes back up. This is a simpler and more robust architecture, assuming SQS is mostly reliable.

Another feature I have implemented in similar situations is to set up a rolling page regeneration schedule where a subset of pages is periodically regenerated, even if no event was detected that would cause a page to be regenerated. This protects against any event drops that may cause data to be undetectably stale. Over a few days, weeks, or whatever, every page is regenerated. It's a relatively cheap way to make a robust system resilient to failures. A sketch of the idea follows.
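A minimal sketch of that rolling schedule, with hypothetical names: each scheduled run refreshes a different slice of all pages, so over SLICES runs every page gets regenerated even if a change notification was silently lost.

```java
import java.util.List;

public class RollingRegen {
    static final int SLICES = 14; // e.g. daily runs -> full coverage every two weeks

    // Called once per scheduled run; runNumber increments each run.
    static void regenerateSlice(List<String> allPageIds, int runNumber) {
        for (String pageId : allPageIds) {
            // Hash each page id into one of SLICES buckets and only
            // regenerate the bucket whose turn it is this run.
            if (Math.floorMod(pageId.hashCode(), SLICES) == Math.floorMod(runNumber, SLICES)) {
                regenerate(pageId); // same code path as the event-driven regeneration
            }
        }
    }

    static void regenerate(String pageId) { /* render and write the static file */ }
}
```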
Update: Evaluating Terracotta by Piotr Woloszyn. Nice writeup that covers resilience, failover, DB persistence, distributed caching implementation, OS/platform restrictions, ease of implementation, hardware requirements, performance, support package, code stability, partitioning, transactions, replication, and consistency. Terracotta is Network Attached Memory (NAM) for Java VMs. It provides up to a terabyte of virtual heap for Java applications that spans hundreds of connected JVMs. NAM is best suited for storing what they call scratch data. Scratch data is defined as object-oriented data that is critical to the execution of a series of Java operations inside the JVM, but may not be critical once a business transaction is complete. The Terracotta Architecture has three components (a small usage sketch follows the list):
- Client Nodes - each client node corresponds to a client JVM in the cluster and runs on a standard JVM
- Server Cluster - a Java process that provides the clustering intelligence. The current Terracotta implementation operates in an Active/Passive mode
- Storage - used as:
- Virtual Heap storage - objects are paged out of the client nodes into the server; if the server heap fills up, objects are paged onto disk
- Lock Arbiter - To ensure that there is no possibility of the classic "split-brain" problem, Terracotta relies on the disk infrastructure to provide a lock.
- Shared Storage - to transmit the object state from the active to the passive server, objects are persisted to disk, which then shares the state with the passive server(s).
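What makes Terracotta unusual is that the application code stays plain Java. A hedged sketch of the programming model: an ordinary object graph declared as a "root" in tc-config.xml is transparently shared across all client JVMs, and ordinary synchronized blocks become cluster-wide locks (the configuration itself is not shown here).

```java
import java.util.HashMap;
import java.util.Map;

// Plain Java scratch data. Under Terracotta DSO this map, declared as a
// root in tc-config.xml, would be shared by every client JVM in the
// cluster; the clustering behavior comes entirely from configuration.
public class SharedScratchPad {
    private static final Map<String, Object> scratch = new HashMap<String, Object>();

    public static void put(String key, Object value) {
        synchronized (scratch) {   // becomes a distributed lock under Terracotta
            scratch.put(key, value);
        }
    }

    public static Object get(String key) {
        synchronized (scratch) {
            return scratch.get(key);
        }
    }
}
```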
One of the most popular and effective scalability strategies is to impose limits (GAE quotas, Fotolog, Facebook) as a means of protecting a website against service-destroying traffic spikes. Twitter will reportedly limit the number of followers to 2,000 in order to thwart follow spam. This may also allow Twitter to make some bank by going freemium and charging for adding more followers. Agree or disagree with Twitter's strategies, the more interesting aspect for me is: how do you introduce new policies into an already established ecosystem?

One approach is the big bang: introduce all changes at once and let everyone adjust. If users don't like it they can move on. The hope is, however, that most users won't be impacted by the changes and that those who are will understand it's all for the greater good of their beloved service. Casualties are assumed, but the damage will probably be minor. Now in Twitter's case the people with the most followers tend to be opinion leaders who shape much of the blognet echo chamber. Pissing these people off may not be your best bet. What to do?

Shegeeks.net makes a great proposal: Limit The New, Not The Old. The idea is to impose the limits only on new accounts, not the old ones. Existing users stay happy and new users understand what they're getting into. The reason I like this suggestion so much is that it has deep historical roots, going all the way back to the fall of the Roman Republic and the rise of the Empire following the agrarian reform laws passed in 133 BC.

In ancient Rome property and power, as they tend to do, became concentrated in the hands of a few wealthy landowners. Let's call them the nobility. The greatness that was Rome was founded on an agrarian society. People made modest livings on small farms. As power concentrated, small farmers were kicked off the land and forced to move to the city. Slaves worked the land while citizens remained unemployed. And cities were no place to make a life. Civil strife broke out. Pliny said that "it was the large estates which destroyed Italy." To redress the imbalance and reestablish traditional virtuous Roman character, "In 133 BC, Tiberius Sempronius Gracchus, the plebeian tribune, passed a series of laws attempting to reform the agrarian land laws; the laws limited the amount of public land one person could control, reclaimed public lands held in excess of this, and attempted to redistribute the land, for a small rent, to farmers now living in the cities."

At this time Rome was a republic. It had a mixed constitution with two consuls at the top, a senate, and tribunes below. Tribunes represented the people. The agrarian laws passed by Gracchus set off a war between the nobility and the people. The nobles obviously didn't want to lose their land and set in motion a struggle eventually leading to the defeat of the people, the fall of the Republic, and the rise of the Empire. Machiavelli, despite his unpopular press, was a big fan of the republican form of government, and pinpointed the retroactive and punitive nature of the agrarian laws as the reason the Republic fell. He didn't say reform wasn't needed, but he thought the mistake was in how the reform was carried out. By making the laws retroactive they challenged the existing order. And by making the laws so punitive they made the nobility willing to fight to keep what they already had. This divided the people and caused a class war.
In fact, this is the origin of the ex post facto clause in the US Constitution, which says you can't pass a law with retroactive effect. Machiavelli suggested the reforms should have been carried out so they only impacted the future. For example, in newly conquered lands small farmers would be given more land. In this way the nobility wouldn't be directly challenged, reform could happen slowly over time, and the republic would have been preserved. Cool parallel, isn't it? Hey, who says there's nothing to learn from history! Let's just hope history doesn't repeat itself yet again.
A couple of videos about distributed computing with direct reference to Google infrastructure. You will get acquainted with:
- MapReduce, the software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on commodity hardware
- GFS and the way it stores its data in 64MB chunks
- Bigtable, Google's simple implementation of a non-relational database
Cluster Computing and MapReduce Lectures 1-5.
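If you've never seen the MapReduce programming model, here's a toy, single-process illustration in Java: map emits (word, 1) pairs, the framework groups by key, and reduce sums the counts. Google's framework distributes exactly this pattern across thousands of machines and handles the scheduling, failures, and I/O for you.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    public static void main(String[] args) {
        List<String> documents = List.of("the cat sat", "the cat ran");

        Map<String, Long> counts = documents.stream()
            .flatMap(doc -> Arrays.stream(doc.split("\\s+")))              // map: emit each word
            .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // shuffle + reduce: sum per word

        System.out.println(counts); // e.g. {cat=2, ran=1, sat=1, the=2}
    }
}
```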
At least in the articles on Plenty of Fish and Slashdot it was mentioned that one can achieve higher performance by creating read-only and write-only databases where possible. I have read the comments and tried unsuccessfully to find more information on the net about this. I still do not understand the concept. Can someone explain it in more detail, as well as recommend resources for further investigation? (Are there books written specifically about this technique?) I think it is a very important issue, because databases are oftentimes the bottleneck.
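The usual meaning is read/write splitting: all writes go to a master database, while reads go to one or more replicas that the master (asynchronously) replicates to, so read load scales with the number of replicas. A hedged JDBC sketch, with placeholder hostnames, schema, and credentials:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ReadWriteRouter {
    private final Connection master;   // receives all writes
    private final Connection replica;  // serves reads; kept up to date by MySQL replication

    ReadWriteRouter() throws SQLException {
        master  = DriverManager.getConnection("jdbc:mysql://master-db/app",  "app", "secret");
        replica = DriverManager.getConnection("jdbc:mysql://replica-db/app", "app", "secret");
    }

    void saveComment(long userId, String text) throws SQLException {
        try (PreparedStatement ps =
                 master.prepareStatement("INSERT INTO comments (user_id, body) VALUES (?, ?)")) {
            ps.setLong(1, userId);
            ps.setString(2, text);
            ps.executeUpdate();
        }
    }

    List<String> recentComments(long userId) throws SQLException {
        List<String> out = new ArrayList<>();
        try (PreparedStatement ps = replica.prepareStatement(
                "SELECT body FROM comments WHERE user_id = ? ORDER BY id DESC LIMIT 10")) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) out.add(rs.getString(1));
            }
        }
        return out; // may slightly lag the master: replication is asynchronous
    }
}
```

The main caveat is replication lag: a read issued right after a write may not see it, so this works best for read-mostly data that tolerates slight staleness.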
The number one recommendation for speeding up a website is almost always to add cache and more cache, and after that add a little more cache just in case. Memcached is almost always given as the recommended cache to use. What we don't often hear is how to effectively use a cache in our own products. MySQL hosted two excellent webinars (referenced below) on the subject of how to deploy and use memcached. The star of the show, other than MySQL of course, is Farhan Mashraqi of Fotolog. You may recall we did an earlier article on Fotolog in Secrets to Fotolog's Scaling Success, which was one of my personal favorites. Fotolog, as they themselves point out, is probably the largest site nobody has ever heard of, pulling in more page views than even Flickr. Fotolog has 51 instances of memcached on 21 servers with 175GB in use and 254GB available. As a large successful photo-blogging site they have very demanding performance and scaling requirements. To meet those requirements they've developed a sophisticated approach to using memcached that others can learn from and emulate. We'll cover some of the highlights from the webinars below the fold.
What is Memcached?

The first part of the first webinar gives a good overview of memcached. Just in case the rock you've been hiding under recently disintegrated, you may not have heard about memcached (as if). Memcached is: a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Memcached essentially creates an in-memory shard on top of a pool of servers from which an application can easily get and set up to 1MB of unstructured data. Memcached is two hash tables: one in the client that maps a key to a server, and another inside the server that maps the key to the data. The magic is that none of the memcached servers need know about each other. To scale up you just add more servers and the key hashing algorithm makes it all work out right. Memcached is not redundant, has no failover, and has no authentication. It's a simple server for storing and getting data; the complex bits must be implemented by applications. The rest of the first webinar is Farhan explaining in wonderful detail how they use memcached at Fotolog. The beginning of the second webinar covers much of the same ground as the first. In the last half of the second webinar Farhan gives a number of excellent code examples showing how memcached is installed, used, and managed. If you've never used memcached before there's a lot of good stuff presented.
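To make that concrete, here's a minimal get/set sketch using the spymemcached Java client (one of several clients; the webinar's own examples may use a different one). Hostnames are placeholders. Note the client hashes each key to pick a server; the servers never talk to each other.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedHello {
    public static void main(String[] args) throws Exception {
        // The client is given the whole pool; key hashing picks the server.
        MemcachedClient client = new MemcachedClient(
            new InetSocketAddress("cache1.example.com", 11211),
            new InetSocketAddress("cache2.example.com", 11211));

        client.set("user:42:name", 3600, "farhan"); // key, TTL in seconds, value (< 1MB)
        Object name = client.get("user:42:name");   // null if evicted, expired, or never set
        System.out.println(name);

        client.shutdown();
    }
}
```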
Memcached and MySQL Go Better Together

There's a little embrace and extend in the webinar, as MySQL Cluster is presented several times as doing much the same job as memcached, but more reliably. However, the recommended approach for using memcached and MySQL is:
Cache everything that is slow to query, fetch, or calculate.

This is Fotolog's rule for deciding what to cache. What is considered slow depends on your requirements, but when something becomes slow it's a candidate for caching.
Fotolog's Caching Typology

Fotolog has come up with an interesting typology of their different caching strategies:
Non-Deterministic Cache

This is the typical way of using memcached. I think the non-determinism comes in because an application can't depend on data being in the cache. The data may have been evicted because a slab is full or because the data simply hasn't been added yet. Other Fotolog strategies are deterministic, which means the application can assume the data is always present.
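In code, the non-deterministic pattern is the familiar read-through loop: never assume the value is cached; check the cache, fall back to the database on a miss, then repopulate. A sketch using the spymemcached client (the DB helper is a placeholder):

```java
import net.spy.memcached.MemcachedClient;

public class NonDeterministicCache {
    private final MemcachedClient cache;
    NonDeterministicCache(MemcachedClient cache) { this.cache = cache; }

    String userName(long userId) {
        String key = "user:" + userId + ":name";
        String name = (String) cache.get(key);
        if (name == null) {                 // miss: evicted, expired, or never loaded
            name = getUserFromDb(userId);   // slow path: hit the database
            cache.set(key, 3600, name);     // repopulate so the next read is fast
        }
        return name;
    }

    String getUserFromDb(long userId) { return "..."; } // placeholder DB query
}
```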
State Cache

Keeps the current state of an application in cache.
Usage Steps

It's a form of non-deterministic caching, so the same steps apply.
Deterministic Cache

Keep all data for particular database tables in the cache. Applications always assume that the data they need is in the cache (deterministic) and never go to the database for it. Applications don't have to check memcached before accessing the data. The data will exist in the cache because the database is essentially loaded into the cache and is always kept in sync.
Proactive Caching

Data magically shows up in the cache. As updates happen to the database, the cache is populated based on the database change. Since the cache is updated as the database is updated, the chances of the data being in the cache are high. It's non-deterministic caching with a twist.
Usage Steps

There are typically three implementation approaches.
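The webinar's three approaches aren't reproduced here, but one common way to implement proactive caching is to push every database change into the cache from the application's write path. A hedged sketch (spymemcached client, placeholder DB call); other variants push the update from the database side instead, for example via triggers or by tailing replication:

```java
import net.spy.memcached.MemcachedClient;

public class ProactiveCache {
    private final MemcachedClient cache;
    ProactiveCache(MemcachedClient cache) { this.cache = cache; }

    void updateUserName(long userId, String newName) {
        updateDb(userId, newName);                          // write the system of record first
        cache.set("user:" + userId + ":name", 0, newName);  // then push the change into the cache
                                                            // (expiry 0 = no TTL in memcached)
    }

    void updateDb(long userId, String newName) { /* UPDATE users SET name = ? WHERE id = ? */ }
}
```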
File System Caching

NFS has significant overhead when used with a large number of servers. Fotolog originally stored XML files on a SAN and exposed them using NFS. They saw contention on these files, so they put the files in memcached. Big performance improvements were seen, and it kept NFS mounts open for other requests. Smaller media can also be stored in the cache.
Partial Page Caching

Cache directly displayable page elements. The other caching strategies cache data used to create pages, but some things are still compute intensive and require a lot of work. So instead of just caching objects, prepare and cache entire page elements for reuse. For page elements that are accessed many times per second this can be a big win. For example: calculating top users in a region, the popular photo list, and the featured photo list. Especially when using sharding it can take some time to calculate these lists, so caching the resulting page elements makes a lot of sense.
Application Based Replication

Write data to the cache through your own API. The API hides the implementation details from the application.
Miscellaneous

There were a few suggestions on using memcached that didn't fit in any other section, so they're gathered here for posterity.
Final Thoughts

Fotolog has obviously put a great deal of thought and effort into creating sophisticated scaling strategies using memcached and MySQL. What I'm struck by is the enormous amount of effort that goes into syncing rows and objects back and forth between the cache and the database. Shouldn't it be easier? What role is the database playing when the application makes such constant use of the object cache? Wouldn't more memory make the disk-based storage unnecessary?
Ehcache is a pure Java cache with the following features: fast, simple, small footprint, minimal dependencies, provides memory and disk stores for scalability into gigabytes, scalable to hundreds of caches, is a pluggable cache for Hibernate, tuned for high concurrent load on large multi-CPU servers, provides LRU, LFU and FIFO cache eviction policies, and is production tested. Ehcache is used by LinkedIn to cache member profiles. The user guide says it's possible to get a 2.5 times system speedup for persistent Object Relational Caching, a 1000 times system speedup for Web Page Caching, and a 1.6 times system speedup for Web Page Fragment Caching. From the website:

Introduction

Ehcache is a cache library. Before getting into ehcache, it is worth stepping back and thinking about caching generally.

About Caches

Wiktionary defines a cache as "a store of things that will be required in future, and can be retrieved rapidly." That is the nub of it. In computer science terms, a cache is a collection of temporary data which either duplicates data located elsewhere or is the result of a computation. Once in the cache, the data can be repeatedly accessed inexpensively.

Why Caching Works: Locality of Reference

While ehcache concerns itself with Java objects, caching is used throughout computing, from CPU caches to the DNS system. Why? Because many computer systems exhibit locality of reference: data that is near other data or has just been used is more likely to be used again.

The Long Tail

Chris Anderson, of Wired Magazine, coined the term The Long Tail to refer to ecommerce systems: the idea that a small number of items may make up the bulk of sales, a small number of blogs might get the most hits, and so on. While there is a small list of popular items, there is a long tail of less popular ones. The Long Tail is itself a vernacular term for a Power Law probability distribution. These don't just appear in ecommerce, but throughout nature. One form of a Power Law distribution is the Pareto distribution, commonly known as the 80:20 rule. This phenomenon is useful for caching: if 20% of objects are used 80% of the time and a way can be found to reduce the cost of obtaining that 20%, then the system performance will improve.

Will an Application Benefit from Caching?

The short answer is that it often does, due to the effects noted above. The medium answer is that it often depends on whether the application is CPU bound or I/O bound. If an application is I/O bound, then the time taken to complete a computation depends principally on the rate at which data can be obtained. If it is CPU bound, then the time taken principally depends on the speed of the CPU and main memory. While the focus of caching is on improving performance, it is also worth realizing that it reduces load. The time it takes something to complete is usually related to the expense of it, so caching often reduces load on scarce resources.

Speeding up CPU Bound Applications

CPU bound applications are often sped up by:
* improving algorithm performance
* parallelizing the computations across multiple CPUs (SMP) or multiple machines (clusters)
* upgrading the CPU speed
The role of caching, if there is one, is to temporarily store computations that may be reused again. An example from ehcache would be large web pages that have a high rendering cost. Another is caching of authentication status, where authentication requires cryptographic transforms.
Speeding up I/O Bound Applications

Many applications are I/O bound, either by disk or network operations. In the case of databases they can be limited by both. There is no Moore's law for hard disks: a 10,000 RPM disk was fast 10 years ago and is still fast. Hard disks are speeding up by using their own caching of blocks into memory. Network operations can be bound by a number of factors:
* time to set up and tear down connections
* latency, or the minimum round trip time
* throughput limits
* marshalling and unmarshalling overhead
The caching of data can often help a lot with I/O bound applications. Some examples of ehcache uses are:
* Data Access Object caching for Hibernate
* Web page caching, for pages generated from databases

Increased Application Scalability

The flip side of increased performance is increased scalability. Say you have a database which can do 100 expensive queries per second. After that it backs up, and if more connections are added it slowly dies. In this case, caching may be able to reduce the workload required. If caching can cause 90 of those 100 queries to be cache hits that never even reach the database, then the database can scale 10 times higher than otherwise.

How Much Will an Application Speed up with Caching?

The short answer is that it depends on a multitude of factors, namely:
* how many times a cached piece of data can be and is reused by the application
* the proportion of the response time that is alleviated by caching
In applications that are I/O bound, which is most business applications, most of the response time is spent getting data from a database. Therefore the speedup mostly depends on how much reuse a piece of data gets. In a system where each piece of data is used just once, it is zero. In a system where data is reused a lot, the speedup is large. The long answer, unfortunately, is complicated and mathematical, and is considered next.
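To ground the discussion, here is a minimal usage sketch of the classic Ehcache API: look up an element, compute it on a miss, and cache it for later readers. The cache name and the page-rendering call are placeholders; the "pages" cache itself would be defined in ehcache.xml.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheExample {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create(); // reads ehcache.xml from the classpath
        Cache pages = manager.getCache("pages");      // cache configured in ehcache.xml

        Element hit = pages.get("/events/123");
        if (hit == null) {
            // Miss: do the expensive work once, then cache the result.
            pages.put(new Element("/events/123", renderExpensivePage()));
            hit = pages.get("/events/123");
        }
        String html = (String) hit.getObjectValue();
        System.out.println(html);

        manager.shutdown();
    }

    static String renderExpensivePage() { return "<html>...</html>"; } // placeholder
}
```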
Update: A very nice JavaWorld podcast interview with Google engineer Max Ross on Hibernate Shards. Max defines Hibernate Shards (horizontal partitioning), how it works (pretty well), virtual shards (don't ask), what they need to do in the future (query, replication, operational tools), and how it relates to Google AppEngine (not much). To scale you are supposed to partition your data. Sounds good, but how do you do it? When you actually sit down to work out all the details it's not that easy. Hibernate Shards to the rescue! Hibernate Shards is: an extension to the core Hibernate product that adds facilities for horizontal partitioning. If you know the core Hibernate API, you know the Shards API. No learning curve at all. Here is what a few members of the core group had to say about the Hibernate Shards open source project. Although there are some limitations, from the sound of it they are doing useful stuff in the right way and it's very much worth looking at, especially if you use Hibernate or some other ORM layer.
What is Hibernate Shards?
Schema Design for Shards
The Sharding Code’s Relationship to Hibernate
Pluggable Strategies Determine How Data Are Split Across Shards
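To make the pluggable strategies concrete, here is a configuration sketch adapted from the Hibernate Shards documentation as best I recall it; treat class and package names as approximate rather than authoritative. The three strategies decide where new objects go (selection), where existing objects live (resolution), and how queries fan out across shards (access).

```java
import java.util.List;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;
import org.hibernate.shards.ShardId;
import org.hibernate.shards.cfg.ShardConfiguration;
import org.hibernate.shards.cfg.ShardedConfiguration;
import org.hibernate.shards.loadbalance.RoundRobinShardLoadBalancer;
import org.hibernate.shards.strategy.ShardStrategy;
import org.hibernate.shards.strategy.ShardStrategyFactory;
import org.hibernate.shards.strategy.ShardStrategyImpl;
import org.hibernate.shards.strategy.access.SequentialShardAccessStrategy;
import org.hibernate.shards.strategy.resolution.AllShardsShardResolutionStrategy;
import org.hibernate.shards.strategy.selection.RoundRobinShardSelectionStrategy;

public class ShardedSessionFactoryBuilder {

    // Build one SessionFactory that spans all shards. The prototype config
    // and the per-shard configurations come from ordinary Hibernate config files.
    public SessionFactory createSessionFactory(Configuration prototype,
                                               List<ShardConfiguration> shardConfigs) {
        ShardedConfiguration shardedConfig =
            new ShardedConfiguration(prototype, shardConfigs, buildShardStrategyFactory());
        return shardedConfig.buildShardedSessionFactory();
    }

    ShardStrategyFactory buildShardStrategyFactory() {
        return new ShardStrategyFactory() {
            public ShardStrategy newShardStrategy(List<ShardId> shardIds) {
                // Selection: round-robin new objects across shards.
                // Resolution: look on all shards for an object's id.
                // Access: run cross-shard queries one shard at a time.
                RoundRobinShardLoadBalancer loadBalancer =
                    new RoundRobinShardLoadBalancer(shardIds);
                return new ShardStrategyImpl(
                    new RoundRobinShardSelectionStrategy(loadBalancer),
                    new AllShardsShardResolutionStrategy(shardIds),
                    new SequentialShardAccessStrategy());
            }
        };
    }
}
```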
This is an unusually well written and useful paper. It talks in detail about experiences implementing a complex project, something we don't see very often. They shockingly even admit that creating a working implementation of Paxos was more difficult than just translating the pseudo code. Imagine that, programmers aren't merely typists! I particularly like the explanation of the Paxos algorithm and why anyone would care about it, working with disk corruption, using leases to support simultaneous reads, using epoch numbers to indicate a new master election, using snapshots to prevent unbounded logs, using MultiOp to implement database transactions, how they tested the system, and their openness with the various problems they had. A lot to learn here. From the paper:

We describe our experience building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.

Introduction

It is well known that fault-tolerance on commodity hardware can be achieved through replication [17, 18]. A common approach is to use a consensus algorithm to ensure that all replicas are mutually consistent [8, 14, 17]. By repeatedly applying such an algorithm on a sequence of input values, it is possible to build an identical log of values on each replica. If the values are operations on some data structure, application of the same log on all replicas may be used to arrive at mutually consistent data structures on all replicas. For instance, if the log contains a sequence of database operations, and if the same sequence of operations is applied to the (local) database on each replica, eventually all replicas will end up with the same database content (provided that they all started with the same initial database state). This general approach can be used to implement a wide variety of fault-tolerant primitives, of which a fault-tolerant database is just an example. As a result, the consensus problem has been studied extensively over the past two decades. There are several well-known consensus algorithms that operate within a multitude of settings and which tolerate a variety of failures. The Paxos consensus algorithm has been discussed in the theoretical and applied community [10, 11, 12] for over a decade. We used the Paxos algorithm ("Paxos") as the base for a framework that implements a fault-tolerant log. We then relied on that framework to build a fault-tolerant database. Despite the existing literature on the subject, building a production system turned out to be a non-trivial task for a variety of reasons:

• While Paxos can be described with a page of pseudo-code, our complete implementation contains several thousand lines of C++ code. The blow-up is not due simply to the fact that we used C++ instead of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into a practical, production-ready system involved implementing many features and optimizations – some published in the literature and some not.

• The fault-tolerant algorithms community is accustomed to proving short algorithms (one page of pseudo code) correct. This approach does not scale to a system with thousands of lines of code. To gain confidence in the "correctness" of a real system, different methods had to be used.
• Fault-tolerant algorithms tolerate a limited set of carefully selected faults. However, the real world exposes software to a wide variety of failure modes, including errors in the algorithm, bugs in its implementation, and operator error. We had to engineer the software and design operational procedures to robustly handle this wider set of failure modes.

• A real system is rarely specified precisely. Even worse, the specification may change during the implementation phase. Consequently, an implementation should be malleable. Finally, a system might "fail" due to a misunderstanding that occurred during its specification phase.

This paper discusses a selection of the algorithmic and engineering challenges we encountered in moving Paxos from theory to practice. This exercise took more R&D effort than a straightforward translation of pseudo-code to C++ might suggest. The rest of this paper is organized as follows. The next two sections expand on the motivation for this project and describe the general environment into which our system was built. We then provide a quick refresher on Paxos. We divide our experiences into three categories and discuss each in turn: algorithmic gaps in the literature, software engineering challenges, and unexpected failures. We conclude with measurements of our system, and some broader observations on the state of the art in our field.
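The replicated-log idea at the heart of the paper is easy to sketch, if not to build: once Paxos has produced an agreed, ordered log of operations, every replica applies it deterministically to the same initial state and arrives at the same final state. A toy illustration in Java, with the hard part (agreement on the log in the presence of failures) assumed away:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReplicatedStateMachine {
    // The replica's local "database": a simple key-value store here.
    private final Map<String, String> db = new TreeMap<>();

    // An operation in the replicated log, e.g. a single put.
    record Op(String key, String value) {}

    // Apply the agreed log in order. Because application is deterministic,
    // every replica that applies the same log reaches the same state.
    void applyLog(List<Op> agreedLog) {
        for (Op op : agreedLog) {
            db.put(op.key(), op.value());
        }
    }
}
```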