Friday, February 26, 2010

MySQL and Memcached: End of an Era?

If you look at the early days of this blog, when web scalability was still in its heady bloom of youth, many of the articles had to do with leveraging MySQL and memcached. Exciting times. Shard MySQL to handle high write loads, cache objects in memcached to handle high read loads, and then write a lot of glue code to make it all work together. That was state of the art, that was how it was done. The architecture of many major sites still follows this pattern today, largely because with enough elbow grease, it works.
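
For readers who never wrote that glue code, here is a minimal sketch of the pattern in Python. Everything in it is an assumption made for illustration: plain dicts stand in for the MySQL shards and for memcached, the shard count is arbitrary, and the function names are hypothetical. A real deployment would use actual client libraries and a more careful sharding scheme.

```python
import zlib

# Hypothetical stand-ins: plain dicts play the role of MySQL shards and memcached.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one MySQL shard
cache = {}                                    # stands in for a memcached cluster


def shard_for(key: str) -> dict:
    """Pick a shard deterministically from the key (simple hash-mod sharding)."""
    return shards[zlib.crc32(key.encode("utf-8")) % NUM_SHARDS]


def write_object(key: str, value: dict) -> None:
    """Write to the owning shard, then invalidate any cached copy."""
    shard_for(key)[key] = value
    cache.pop(key, None)


def read_object(key: str):
    """Cache-aside read: try the cache, fall back to the shard, then fill the cache."""
    if key in cache:
        return cache[key]
    value = shard_for(key).get(key)
    if value is not None:
        cache[key] = value
    return value
```

Even in this toy form, the application has to know about shard routing and cache invalidation on every read and write, which is the "glue" the paragraph above refers to.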

This was a pre-cloud, relational-database-dominated world, built from parts scrounged from the remnants of enterprises and datacenters past. Twitter and Digg started in this era, but are evolving into something different as scaling pressures increase and new purpose-built technologies pop into being.

With a little perspective, it's clear the MySQL+memcached era is passing. It will stick around for a while. Old technologies seldom fade away completely. Some still ride horses. Some still use CDs. And the Internet will not completely replace that archaic electro-magnetic broadcast technology called TV, but the majority will move on into a new era.

LinkedIn has moved on with their Project Voldemort. Amazon went there a while ago.

Digg declared their entrance into a new era in a post on their blog titled Looking to the future with Cassandra, saying:

The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical. It’s made much worse by the serial nature of most applications. Each component of the page blocks on reads from the data store, as well as the completion of the operations that come before it. Non-relational data stores reverse this model completely, because they don’t have the complex read operations of SQL.
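
One way to picture "burden of computation on reads rather than writes" is a timeline feature. The sketch below is only an illustration under stated assumptions, not Digg's implementation: in-memory dicts stand in for the data store, and the function names are hypothetical. The first read function assembles the timeline at request time, much like a join; the second reads a list that was built up when each post was written.

```python
from collections import defaultdict

follows = defaultdict(set)            # follower -> set of authors they follow
posts_by_author = defaultdict(list)   # author -> list of (timestamp, text)
timelines = defaultdict(list)         # follower -> precomputed (timestamp, text) list


def follow(follower, author):
    follows[follower].add(author)


def post(author, timestamp, text):
    posts_by_author[author].append((timestamp, text))
    # Write-time work: copy the post into the timeline of every current follower.
    for follower, authors in follows.items():
        if author in authors:
            timelines[follower].append((timestamp, text))


def timeline_computed_on_read(follower):
    """Read-time work: gather posts from every followed author, then sort."""
    items = [p for author in follows[follower] for p in posts_by_author[author]]
    return sorted(items, reverse=True)


def timeline_precomputed(follower):
    """Read touches a single key; the gathering already happened at write time."""
    return sorted(timelines[follower], reverse=True)
```

The precomputed version duplicates data and makes writes more expensive, and in this sketch it only fans out to followers that exist at the time of the post. That is the trade the quote describes: more work per write in exchange for cheap, predictable reads.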

Twitter has also declared their move in the article Cassandra @ Twitter: An Interview with Ryan King. Their reason for changing is:

We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating. We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.

It's clear that many of the ideas behind MySQL+memcached were on the mark. We see them preserved in the new systems; it's just that the implementation was a bit clunky. Developers have moved in, filled the gaps, sanded the corners, and made a new sturdy platform which will itself form the basis for a new ecosystem and a new era.

It's always a bit sad to see an era pass, but it's not all that often we get to notice as it's happening. We can enjoy what has gone before, but we can also get pumped to jump in with both feet and create the future. And excitingly, that's what many leading edge companies are doing today.

Reader Comments (17)

This is a great summary of the "state of data". MySQL + Memcached was a great first step -- but as we've seen, sharding SQL still leaves you with all of the problems of SQL, and none of the benefits of a relational model. It simply doesn't work without exponentially more elbow grease, headaches, and anger from management :)

That's why we built our new real-time data platform at Drawn to Scale. We noticed that customers needed to do more than just process data with things like Hadoop (which we use and love) -- a growing number of companies need to search, query, and serve any handful of a billion records instantaneously. We built the scalable replacement for what people typically do with an RDBMS.

February 26, 2010 | Unregistered Commenter Bradford

While I agree that there are lots of fun, interesting, and powerful new patterns and components on the current bleeding edge, I would not say that memcached and the things that memcached is evolving into will soon be as dead-end as CDs and horseback riding. The high-performance key-value store is going to be a useful pattern and component of real systems, from small to huge, for a long time to come, even as these other evolving patterns and components show up as well.

February 26, 2010 | Unregistered Commenter Mark Atwood

The Chisimba project (http://avoir.uwc.ac.za) has started looking at Cassandra as well for some modules. So far the experiments that I have done look excellent, and it will complement our already rock-solid scalable architecture.

I see two different ways that high scalability can be achieved:

1. Use a NoSQL option to scale out the database and continually add more physical and virtual machines as nodes to scale your app ad infinitum

2. Use linked processors and processes in "web threads" http://www.paulscott.za.net/index.php?module=jabberblog&postid=ps123_5432_1251745446&action=viewsingle to scale your app in a much more distributed pattern

There is, of course, no reason not to do both! I think that we need to start putting the "web" back into the web, with data, processes, people, and sites.

February 26, 2010 | Unregistered Commenter Paul Scott

So nosql databases might replace sql databases in the area of large datasets. I guess they won't replace traditional OLTP databases, e. g. for billing applications (where people like me _WANT_ ACID and referential integrity). And you didn't say a word on memcached. What's the replacement for memcached? Even Google uses (some kind of) memcached (at least the appengine) cause on-the-fly mapreduce does not replace indices/caches (well Google could do that as well but they had to build some more power plants and datacenters ...)

February 27, 2010 | Unregistered Commenter s0enke

As always, you have good commentary, Todd. :) But I have to agree with Mark: the emergence of NoSQL/!SQL/anti-SQL/KeyValueStores/whatever doesn't mark the death of the scalability pattern of having a persistent datastore (i.e. MySQL) and caching results to offload unnecessary computation and results-gathering (i.e. memcached).

I'll also go so far as to say that these new tools won't obviate the tools that came before it. Voldemort, Cassandra, MongoDB, CouchDB, etc...they all aim to be good for various scenarios, and some of those are absolutely places where MySQL was traditionally used. But that doesn't mean that all of MySQL (or memcached's) usefulness, availability, or performance is somehow done for. We should all worship at the Church of WhatWorks™ for our infrastructures and cultures, and I know more than a handful of large-scale websites that could (and will) continue quite fine with the MySQL/memcached approach. :)

Now, if you want to speculate that scalability patterns and tools that include completely normalized and monolithic datastores are going away, then I'll agree with you. I'll agree with you in 2003, even.

February 27, 2010 | Unregistered Commenter John Allspaw

John, by "end of an era" I didn't mean to imply that MySQL or memcached sucked in any way. Not at all. After the post it hit me I could come off as negative to technologies that have helped build the web, and that was not my intention.

That phrasing is supposed to be evocative of the sense of when you have a decision to make on which platform to select, M&M or something else, I think the trend will be to move on to something else. And there are good reasons for choosing something else. You first find those decisions being made by people on the leading edge that have no choice but to invent something different. Playing it safe isn't an option. Then that group grows, the technologies mature, and at some point it's just as easy or even easier to pick something else and that's when the era will have changed. It hasn't changed yet. But from where I sit it's a pattern I saw emerging and thought worth commenting on. I would never try to tell you which church to worship at :-)

February 27, 2010 | Registered Commenter Todd Hoff

Todd, (and others commenting)
When you (and others) say these NoSQL solutions will replace MySQL and Memcached, I am still unclear about one thing. I understand that it is going to make the performance hit for fetching records from the datastore a lot faster. But one of the things I often use Memcached for is caching entire chunks (in rails terms, partials) of html so that that part of the page doesn't have to be dynamically generated on every request. It seems to me that no matter how fast you can pull from your data-store, nothing will be able to keep up with just fetching a string of html from memcached.
So when you say that Cassandra will replace MySQL and Memcached, do you also mean this usage of memcached?

February 27, 2010 | Unregistered Commenter Ryan Shaw

@Ryan: hello, I'm developing Redis with your reasoning in mind. This is why Redis supports EXPIRE, for instance. What I'm trying to build with Redis is something between a DB (and with the Virtual Memory that 2.0 will feature, a DB that can hold more data than RAM), a cache, and a messaging system. These three components are very important for many high performance applications, and with the Redis data model they are very "related", so it's possible to unify them into a unique interface.
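
As a rough illustration of why per-key expiry matters for the HTML-fragment use case Ryan describes, here is a small Python sketch. It is not a Redis client and does not use Redis's API: `ExpiringCache`, `render_sidebar`, the key name, and the 60-second lifetime are all hypothetical, and an in-memory dict stands in for the store.

```python
import time


class ExpiringCache:
    """Tiny in-memory stand-in for a store with per-key expiry (not a Redis client)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has outlived its time-to-live: drop it and report a miss.
            del self._items[key]
            return None
        return value


def render_sidebar(cache, render):
    """Serve a cached HTML fragment, regenerating it only after it expires."""
    html = cache.get("fragment:sidebar")
    if html is None:
        html = render()
        cache.set("fragment:sidebar", html, ttl_seconds=60)
    return html
```

With expiry built into the store, the same system can hold both durable records and short-lived rendered fragments, which is the unification the comment is pointing at.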

February 27, 2010 | Unregistered Commenter antirez

Great commentary! I'm glad this discussion is happening. I think a lot of the NoSQL fervor came about as a reaction to many frameworks and seminal web development books saying (in so many words) everything goes in MySQL... normalize and it'll be alright. In that world, memcached is one of the few approaches that will alleviate pressure.

Like Cal Henderson said (way back in 2004), Normalized Data is for Sissies (pdf link). Very often, a key to scaling is duplicating data, either for better locality, higher throughput or more balance. So what are some examples of duplicating data? Memcached is one. So is de-normalizing your data - having a piece of data appear more than once in your datastore.

Is this the end of an era? The end of scaling by using MySql and Memcached? I think it's a bit early to say that - and here's why. At each stage of development / scaling, you have to optimize between certain tradeoffs:
* Development effort
* Operational effort
* "Scalability"

As long as MySql is the easiest way for someone to get up and running with their idea, it will be exceedingly popular. One day, the default will be something new, but that won't happen until there is ONE clear contender. And then frameworks that use it by default. And books / articles that walk you through getting it up and running. MySql is a great balance of the above variables -- it's like ketchup, it hits everything, without being too much of any one thing.

If you knew the app you were going to build from the moment you started, you could pick an appropriate data store (like a graph database for a social network) that will enable you to build the thing you want -- but that's not reality. You can't know what direction your app will head, so in the beginning (at the very least), development and operational effort are more important than scalability that may never be necessary (premature optimization).

Many big LAMP sites using "M&M" are using MySql in interesting and non-relational ways:
FriendFeed,
Flickr,
Facebook

So then what about these new tools? There will always be new ways of storing and accessing data. As the big sites continue to get bigger, and have a commitment to open source, we'll see more and more interesting projects. Let's not forget where Memcached came from - LiveJournal was having trouble, so they made an in-memory cache. Facebook was having trouble, so they made Cassandra (based on what is publicly known about BigTable and Dynamo).

February 27, 2010 | Unregistered Commenter Erik Kastner

You are perfectly correct Ryan. I wasn't trying to say caching will go away or that MySQL will go away. I'm a big believer in the whole memory is the new disk proposition. And MySQL still meets many needs. That's why in a previous article I went out of my way to show how well StackOverflow can scale using a relational database. This isn't an either or situation.

What has passed is MySQL and memcached, which complement each other so well, as the default platform on which to develop scalable systems. New products roll up all the techniques you would use to scale using M&M in a better and easier-to-use package. They shard and scale out of the box. They handle schemaless data models while not leaving the data opaque. They manage shards, have indexes, support queries, keep data consistent, span data centers, support map reduce, are highly available, and have all the other bulletproof features developers need to feel comfortable with before they deploy to production. All that can be done with M&M, but once you start denormalizing, sharding, and playing with consistency models to scale, it all becomes custom. Why do that when there are alternatives?

February 27, 2010 | Registered Commenter Todd Hoff

Couldn't agree with you more Todd.

I recently watched a PyCon 2010 presentation about scaling at Reddit (http://pycon.blip.tv/file/3257303/). They had to write a custom ORM to shard their data that basically turned PostgreSQL into a key/value store. While watching, I couldn't help but think that if they were building Reddit in 2010 there is just no way they would do that. They would go with Cassandra or Voldemort or MongoDB or whatever and they would save a man-year or more of development. But when they had to scale (in 2007 or 2008 or whatever) a NoSQL DB that worked out of the box just wasn't an option.

The pattern I'm seeing is that lots of startups are building the initial version of their app on MySQL/Memcache. Then when it's time to think about scaling, instead of pondering MySQL Master/Multiple Read Slave setups or, heaven forbid, sharding, people are first thinking: "Well, how could I use a complementary DB to offload things from MySQL that have no business being in a relational database?" "What write-heavy tables would be better suited to Cassandra?" "What table is really a queue that could live in Redis?" and so on.

Now some startups might have an app where they can skip the whole MySQL step altogether and go straight to a non-relational DB pattern. Most, though, still have at their core some transaction processing needs that are happy and healthy on MySQL.

February 27, 2010 | Unregistered Commenter Ara Anjargolian

It's not the end of an era. You're talking about an alternative method of data storage for the biggest internet players out there. MySQL/memcached still works fine for the rest of us. I mean really, do you think a blog like this needs to make the move to Cassandra?

February 28, 2010 | Unregistered Commenter john

John, this blog is hosted by Squarespace which uses an in-memory data grid, so I don't think it's moving :-)

A technological era in my mind is a span of time characterized by a strongly related set of practices and techniques that lay out a clear path for achieving goals. According to that definition MySQL and memcached definitely defined an era. My previous Drupal blog would not have moved to Cassandra because it was MySQL based. That's how strong the MySQL era is.

The dead ball era in baseball ended when the balls became alive again. The steroid era in baseball ended when they tested for steroids. The end of the M&M era can be marked by a shift in energy, early adopter decisions, and thought leadership.

I haven't had to make a new M&M post for quite some time because that practice is well established. M&M works fine because of a mass of work that has gone before. Folks like LiveJournal when faced with challenges invented incredible technologies like memcached and in the process laid out a brilliant path of practices that we've all been walking for a while. They went beyond the era of vertical scalability by spending more on ever larger centralized servers. People today are in the same position, facing similar challenges, just one era removed. In response they are inventing new solutions based on what they learned in the previous era.

The energy is shifting there. The code is shifting there. Very leading edge groups created these new technologies. Leading edge companies are making decisions to adopt those technologies based on a very careful consideration of what will get the best results. These people are not frivolous; they have very real problems that need solving. And soon, through all those efforts, that path will be just as easy to walk and more people will start walking it simply because it will become the best path to take. Just like before.

So when I read the criticism that if you aren't using MySQL that you are just a delusional fan boy trying to pretend you are Amazon, I don't think people are considering the process of how tool chains are adopted. It's not about developers trying to be Amazon, it's about developers trying to go from point A to B as quickly as possible. That's how tool chains change.

February 28, 2010 | Unregistered Commenter Todd Hoff

Regardless of which side of the MySQL+Memcache era passing debate you're on, it's exciting to see so much DBMS innovation and startup activity. It was focused on the data warehousing/analytics during the early-mid 2000s (I was at Vertica during that time)...and now the innovation seems to be shifting towards handling more transactional (OLTP) workloads, with the KV stores/NoSQL and some newer (mostly unlaunched) OLTP dbms architectures.

Here's some good reference material from Dr. Mike Stonebraker (founder of Ingres, Postgres, et al) on NoSQL and OLTP database architectures. It's not the SQL that's the problem...it's the legacy transaction processing overhead in MySQL, Oracle and other transactional DBMS that limits their scalability.

FYI...Mike Stonebraker has co-founded another DBMS company called VoltDB, which addresses the scalability issues discussed in this thread--without sacrificing SQL or ACID transactions. VoltDB is in beta at the moment.

Anyway, here's the additional reading on the topic:
- The NoSQL Debate has nothing to do with SQL
- OLTP through the Looking Glass and what we Found There

Thanks Todd; good stuff!

February 28, 2010 | Unregistered Commenter Andy Ellicott

Ian's analysis in the piece you quote was spot on: "The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical."

Indexing was not performance critical in an age where bulk loading overnight was fine. But today indexing performance is a serious problem for all applications that must query live (i.e., continuously updated) data. Webcrawlers and real-time logfile analysis, for example.

We think we've got a better mousetrap: our TokuDB storage engine indexes 10x-50x faster than InnoDB and will go up from there in future releases. As Mark Callaghan observes in The NoSQL Debate has nothing to do with SQL, "Write-optimization has finally arrived for MySQL with the availability of TokuDB."

Whether our customers will ask for our indexing technology in a MySQL friendly form or for some other platform only time will tell but I'd hesitate to say the whole relational model is no longer useful.

March 1, 2010 | Unregistered Commenter John Partridge

The 'issues' with SQL that people are moving away from with NoSQL are severalfold.

Performance/scalability is an often-cited one, but as some have mentioned in other comments, the issue there isn't so much with SQL - which is just a *protocol*, really - as with how traditional databases (which, coincidentally, happen to be SQL ones, so what we think of when we say SQL) have implemented it. ACID transactions are all well and good, but difficult to do in a distributed environment.

Now, things like being able to say "UPDATE foo SET bar = bar + 1 WHERE baz = 2" can be awkward to handle correctly in a distributed environment, as they kind of imply some level of ACIDity to get consistent results no matter what other queries are updating foo.bar - but the SELECT statement, which is the best bit of SQL, in no way hampers highly replicated operation; you can just run that SELECT against a nearby replica.
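
To make the difficulty with `bar = bar + 1` concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not how any particular database works: a single in-process `Counter` object stands in for the stored value, and the interleaving of two clients is written out by hand so the outcome is deterministic.

```python
import threading


class Counter:
    """Stand-in for one stored value (foo.bar in the UPDATE example)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def read(self):
        return self.value

    def write(self, new_value):
        self.value = new_value

    def increment(self):
        # Read, add, and write happen as one indivisible step.
        with self._lock:
            self.value += 1
            return self.value


def interleaved_read_modify_write(counter):
    """Two clients each do 'read, add 1, write' with their steps interleaved."""
    seen_by_a = counter.read()
    seen_by_b = counter.read()    # B reads before A has written
    counter.write(seen_by_a + 1)
    counter.write(seen_by_b + 1)  # overwrites A's update: one increment is lost
```

After `interleaved_read_modify_write` the counter holds 1 even though two increments were attempted, while two calls to `increment` leave it at 2. On a single machine a lock closes the gap; across replicas the equivalent guarantee needs coordination between nodes, which is the cost the comment is alluding to. A plain read has no such requirement.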

SQL is more limiting in the area of data models, though. There's an implicit assumption that each table has a fixed set of columns, and those columns are of non-compound types, and some applications suffer from that (a case from my own history was a hosted CRM app that let different users add different user-defined fields to tables, to reflect their particular needs). Many of the NoSQL offerings provide more flexible data models, which makes this stuff easier (and removes a lot of the pain that ORMs entail). However, again, SELECT is relatively immune to much of this; one can have an underlying data model with no schema, and have SELECT just deal with the fields it actually finds in the records, and so on.

My company (GenieDB) has been working on this theme - we've written an innovative "NoSQL" database, then written a MySQL pluggable table storage engine that allows tables in our system to be viewed as MySQL tables. MySQL forces us to list all the columns in a CREATE TABLE statement, while our underlying DB has no schemas - so if a record lacks a field MySQL expects, we return a NULL, and if it has a field MySQL doesn't expect, we just ignore that field. We therefore support things like the UPDATE statement above, but warn users that if they run more than one such update at once, the results might not be what they expect; however, for the kinds of queries you find in most applications, we've taken pains to make sure things work as MySQL users will expect.

This means that one can fire up a MySQL server against a GenieDB installation to run some ad-hoc SELECT statements, or one can build an app on top of GenieDB+MySQL, then rewrite the bottlenecks to use the native API for better performance as required, and various other exciting possibilities.
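
The column-mapping rule described above (a missing field reads as NULL, an unexpected field is ignored) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the comment, not GenieDB's code; the `project` function and the sample field names are hypothetical, and `None` stands in for SQL NULL.

```python
def project(record: dict, columns: list) -> tuple:
    """Map a schemaless record onto a fixed column list.

    Fields missing from the record come back as None (standing in for NULL);
    fields the column list does not mention are simply not read.
    """
    return tuple(record.get(column) for column in columns)


# Hypothetical table definition and two records with different shapes.
columns = ["id", "name", "phone"]
rows = [
    project({"id": 1, "name": "Ada", "favourite_colour": "green"}, columns),
    project({"id": 2, "name": "Bob", "phone": "555-0100"}, columns),
]
```

Here the first row gets `None` in the `phone` position and its `favourite_colour` field never surfaces, while the second row maps cleanly onto all three columns.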

March 2, 2010 | Unregistered Commenter Alaric Snell-Pym

I think a lot of databases will start to migrate over to Cassandra. It's a fantastic database and is powering a lot of the heavy hitters in the industry. My only problem with it is the lack of tools that exist right now for it. MySQL has a vast number of tools that help developers build for it and in it. It's just a matter of time, though, before tools become readily available and, more importantly, frameworks (CakePHP, CodeIgniter, etc.) pick it up as well so that it can be accessed alongside MySQL and the other databases out there.

We thought of using it for http://theeasyapi.com but ended up not using it because CakePHP doesn't have any help on how to implement it. We are working on creating our own DB classes for CakePHP that would allow us to use that database instead of MySQL because of the flexibility and resources that it offers.

This is a great article, well written.

March 9, 2010 | Unregistered Commenter Chad R. Smith
