Monday
Aug042008

A Bunch of Great Strategies for Using Memcached and MySQL Better Together

The first recommendation for speeding up a website is almost always to add cache and more cache. And after that add a little more cache just in case. Memcached is almost always given as the recommended cache to use. What we don't often hear is how to effectively use a cache in our own products. MySQL hosted two excellent webinars (referenced below) on the subject of how to deploy and use memcached. The star of the show, other than MySQL of course, is Farhan Mashraqi of Fotolog. You may recall we did an earlier article on Fotolog in Secrets to Fotolog's Scaling Success, which was one of my personal favorites. Fotolog, as they themselves point out, is probably the largest site nobody has ever heard of, pulling in more page views than even Flickr. Fotolog has 51 instances of memcached on 21 servers, with 175GB in use and 254GB available. As a large, successful photo-blogging site they have very demanding performance and scaling requirements. To meet those requirements they've developed a sophisticated approach to using memcached that others can learn from and emulate. We'll cover some of the highlights from the webinars below the fold.

What is Memcached?

The first part of the first webinar gives a good overview of memcached. Just in case the rock you've been hiding under recently disintegrated, you may not have heard about memcached (as if). Memcached is: a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Memcached essentially creates an in-memory shard on top of a pool of servers from which an application can easily get and set up to 1MB of unstructured data per key. Memcached is two hash tables: the client hashes each key to pick a server, and each server keeps its own internal hash table of keys to values. The magic is that none of the memcached servers need know about each other. To scale up you just add more servers and the key hashing algorithm makes it all work out right. Memcached is not redundant, has no failover, and has no authentication. It's a simple server for storing and getting data; the complex bits must be implemented by applications. The rest of the first webinar is Farhan explaining in wonderful detail how they use memcached at Fotolog. The beginning of the second seminar covers much of the same ground as the first. In the last half of the second seminar Farhan gives a number of excellent code examples showing how memcached is installed, used, and managed. If you've never used memcached before there's a lot of good stuff presented.
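    To make the "two hash tables" idea concrete, here's a minimal sketch (plain Java, no client library) of how a memcached client typically picks a server: hash the key, map it onto the server list, and talk only to that server. This is illustration only; real clients use stronger hashes and usually consistent hashing so adding a server remaps fewer keys, and the class and method names here are made up.

      import java.util.Arrays;
      import java.util.List;

      // Illustrative only: maps a cache key to one server in the pool.
      public class NaiveKeyRouter {
          private final List<String> servers;

          public NaiveKeyRouter(List<String> servers) {
              this.servers = servers;
          }

          public String serverFor(String key) {
              int h = key.hashCode() & 0x7fffffff;    // first hash: client side, key -> server
              return servers.get(h % servers.size()); // the chosen server keeps its own hash table internally
          }

          public static void main(String[] args) {
              NaiveKeyRouter router = new NaiveKeyRouter(
                  Arrays.asList("cache1:11211", "cache2:11211", "cache3:11211"));
              System.out.println(router.serverFor("user:profile:42"));
          }
      }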

Memcached and MySQL Go Better Together

There's a little embrace and extend in the webinar as MySQL cluster is presented several times as doing much the same job as memcached, but more reliably. However, the recommended approach for using memcached and MySQL is:
  • Write scale the database by sharding. Partition data across multiple servers so more data can be written in parallel. This avoids a single server becoming the bottleneck.
  • Front MySQL with a memcached farm to scale reads. Applications access memcached first for data and if the data is not in memcached then the application tries the database. This removes a great deal of the load on the database so it can continue to perform its transactional duties for writes. In this architecture the database is still the system of record for the true value of data.
  • Use MySQL replication for reliability and read query scaling. There's an effective limit to the number of slaves that can be supported, so just adding slaves won't work as a scaling strategy for larger sites. Using this approach you get scalable reads and writes along with high availability.

    Given that MySQL has a query cache, why is memcached needed at all?
  • The MySQL cache is associated with just one instance. This limits the cache to the memory addressable by one server. If your data set is larger than the memory on one server then the MySQL cache alone won't work. And if the same object is read from another instance it's not cached.
  • The query cache invalidates on writes. You build up all that cache and it goes away whenever the underlying table is written to. Depending on your usage patterns the query cache may not be much of a cache at all.
  • The query cache is row based. Memcached can cache any type of data you want and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.

    Cache everything that is slow to query, fetch, or calculate.

    This is Fotolog's rule of deciding what to cache. What is considered slow depends on your requirements. But when something becomes slow it's a candidate for caching.

    Fotolog's Caching Typology

    Fotolog has come up with an interesting typology of their different caching strategies:
  • Non-Deterministic Cache - the classic memcached model of reading through the cache and writing to the database.
  • State Cache - maintain current application state in cache.
  • Deterministic Cache - load particular database tables entirely into the cache and keep them there; applications assume the data is always present.
  • Proactive Cache - push changes from the database directly to the cache.
  • File System Cache - save NFS load by serving files from the cache instead of the file system.
  • Partial Page Cache - cache displayable page elements, not just data.
  • Application Based Replication - use a client side API to hide all the low level details of interacting with the cache.

    Non-Deterministic Cache

    This is the typical way of using memcached. I think the non-determinism comes in because an application can't depend on data being in the cache. The data may have been evicted because a slab is full or because the data simply hasn't been added yet. Other Fotolog strategies are deterministic, which means the application can assume the data is always present.

    Useful For

  • Ideal for complex objects that are read several times. Especially for sharded environments where you need to collect data from multiple shards.
  • Good replacement for MySQL query cache
  • Caching relationships and other lists
  • Slow data that’s used across many pages
  • Don’t cache if it’s more taxing to cache than you’ll save
  • Tag clouds and auto-suggest lists. For example, when a photo is uploaded it appears on the page of every friend. These lists are taxing to calculate so they are cached.

    Usage Steps

  • Check memcached for your data key.
  • If the data doesn't exist in the cache then check the database.
  • If the data exists in the database then populate memcached (a sketch of these steps follows).
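    Here is a minimal read-through sketch of those steps using the spymemcached Java client. The client library choice and the loadProfileFromDatabase helper are assumptions for illustration; Fotolog's own client isn't shown in the excerpts here.

      import java.net.InetSocketAddress;
      import net.spy.memcached.MemcachedClient;

      public class ReadThroughExample {
          private static final int ONE_HOUR = 3600; // expiration in seconds

          private final MemcachedClient cache;

          public ReadThroughExample(MemcachedClient cache) {
              this.cache = cache;
          }

          // 1. check memcached, 2. fall back to the database, 3. populate memcached
          public String getProfile(long userId) {
              String key = "profile:" + userId;
              String profile = (String) cache.get(key);          // step 1
              if (profile == null) {
                  profile = loadProfileFromDatabase(userId);      // step 2 (assumed helper)
                  if (profile != null) {
                      cache.set(key, ONE_HOUR, profile);          // step 3
                  }
              }
              return profile;
          }

          private String loadProfileFromDatabase(long userId) {
              // placeholder for a JDBC / ORM lookup
              return "profile-for-" + userId;
          }

          public static void main(String[] args) throws Exception {
              MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
              System.out.println(new ReadThroughExample(client).getProfile(42));
              client.shutdown();
          }
      }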

    Stats

  • 45 memcached instances dedicated to nondeterministic cache
  • Each instance (on average):
    * ~440 gets per second
    * ~40 sets per second
    * ~11 gets per set
  • Fotolog likes to characterize their caching policy in terms of the ratio of gets to sets.

    Potential Problems

  • While memcached is usually a light CPU user, Fotolog ran their cache on their application servers and found some problems:
    * 90% CPU usage
    * Memory garbage collected nearly once a minute
    * Blocking on memcached calls on the app servers
  • Operations are not transactional. One workaround is to set expirations so stale data doesn't stay around long.

    State Cache

    Keeps the current state of an application in cache.

    Useful For

  • Expensive operations.
  • Sessions. If a memcached server goes down just make people log in again.
  • Keep track of who's online and their current status, especially for IM applications. Use memcached or the load would cripple the database. A sketch of this case follows the list.
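    A sketch of the "who's online" case, again with spymemcached (an assumption): each page hit refreshes a per-user presence key with a short expiration, so users drop out of the "online" set automatically when they go idle. Losing the cache only means presence resets, which is acceptable for this kind of state. Key names and the 5-minute window are made up.

      import java.net.InetSocketAddress;
      import net.spy.memcached.MemcachedClient;

      public class PresenceTracker {
          private static final int ONLINE_WINDOW_SECONDS = 300; // considered online for 5 minutes

          private final MemcachedClient cache;

          public PresenceTracker(MemcachedClient cache) {
              this.cache = cache;
          }

          // Call on every request; the key silently expires if the user goes idle.
          public void touch(long userId) {
              cache.set("online:" + userId, ONLINE_WINDOW_SECONDS, System.currentTimeMillis());
          }

          public boolean isOnline(long userId) {
              return cache.get("online:" + userId) != null;
          }

          public static void main(String[] args) throws Exception {
              MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
              PresenceTracker presence = new PresenceTracker(client);
              presence.touch(42);
              System.out.println("user 42 online? " + presence.isOnline(42));
              client.shutdown();
          }
      }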

    Usage Steps

    It's a form of non-deterministic caching so the same steps apply.

    Stats

  • 9G dedicated. Depending on the number of users, all of the users can be kept in cache.

    Deterministic Cache

    Keep all data for particular database tables in the cache. An application always assumes that the data it needs is in the cache (deterministic); applications never go to the database for data. Applications don't have to check memcached before accessing the data. The data will exist in the cache because the database is essentially loaded into the cache and is always kept in sync.

    Useful For

  • Read scalability. All reads go through the cache and the database is completely offloaded.
  • Ideal for caching things that have no expiration.
  • Heavily accessed data/objects/lists
  • User credentials
  • User profiles
  • User preferences
  • Active media belonging to users, like photo lists.
  • Outsourcing logins to memcached. Don't hit the database on login; all logins are processed through the cache.

    Usage Steps

  • Multiple dedicated cache pools are maintained. Instead of one large pool, make separate standalone pools. Multiple pools are necessary for high availability. If a cache pool goes down then applications switch over to the next pool. If multiple pools did not exist and a pool did go down then the database would be swamped with the load.
  • All cache pools are maintained by the application. A write to the database means writing to multiple memcached pools. When an update to a user profile happens, for example, the update has to be replicated to each cache from that point onwards (see the sketch after this list).
  • When the site starts after a shutdown, the deterministic cache is populated before the site comes up. The site is rarely rebooted so this is a rare occurrence.
  • Reads could also be load balanced against the cache pools for better performance and higher scalability.
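    A minimal sketch of that write path: the application treats several independent memcached pools as replicas, writes every change to all of them, and can read from (or fail over to) any one of them. The pool abstraction here is hypothetical; Fotolog wrote their own Java/Hibernate client for this.

      import java.net.InetSocketAddress;
      import java.util.Arrays;
      import java.util.List;
      import net.spy.memcached.MemcachedClient;

      public class ReplicatedCachePools {
          private final List<MemcachedClient> pools; // one client per standalone pool

          public ReplicatedCachePools(List<MemcachedClient> pools) {
              this.pools = pools;
          }

          // Writes go to every pool so each pool holds the full data set.
          public void put(String key, int expirySeconds, Object value) {
              for (MemcachedClient pool : pools) {
                  pool.set(key, expirySeconds, value);
              }
          }

          // Reads can hit any pool; move on to the next pool if one is down or missing the key.
          public Object get(String key) {
              for (MemcachedClient pool : pools) {
                  try {
                      Object value = pool.get(key);
                      if (value != null) {
                          return value;
                      }
                  } catch (RuntimeException poolDown) {
                      // fall through to the next pool
                  }
              }
              return null; // with a deterministic cache this "shouldn't" happen
          }

          public static void main(String[] args) throws Exception {
              MemcachedClient poolA = new MemcachedClient(new InetSocketAddress("cache-a", 11211));
              MemcachedClient poolB = new MemcachedClient(new InetSocketAddress("cache-b", 11211));
              ReplicatedCachePools cache = new ReplicatedCachePools(Arrays.asList(poolA, poolB));
              cache.put("user:42:profile", 0, "serialized-profile"); // 0 = no expiration
              System.out.println(cache.get("user:42:profile"));
          }
      }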

    Stats

  • ~ 90,000 gets / second across cache cluster
  • ~ 300 sets / second
  • get/set ratio of ~ 300

    Potential Problems

  • Must have enough memory to hold everything.
  • The database has rows and the cache may have objects. The database caching logic must know enough to create objects from the database schema.
  • Maintaining the multiple caches seems complex. Fotolog uses Java and Hibernate. They wrote their own client to handle rotation through pools.
  • Maintaining multiple caches adds a lot of overhead to the application. In practice there's little replication overhead compared to the benefit.

    Proactive Caching

    Data magically shows up in the cache. As updates happen to the database the cache is populated based on the database change. Since the cache is updated as the database is updated the chances of data being in cache are high. It's non-deterministic caching with a twist.

    Useful For

  • Pre-populating the cache: keeping memcached updated minimizes calls to the database when an object isn't present.
  • “Warm up” cache in cases of cross data-center replication.

    Usage Steps

    There are typically three implementation approaches:
  • Parse binary log for updates. When an update is found perform the same operation on the cache.
  • Implement user defined functions. Set up triggers that call a UDF to update the cache (the cache-side push looks like the sketch after this list). See http://tangent.org/586/Memcached_Functions_for_MySQL.html for more details.
  • Use the Blackhole gambit. Facebook is rumored to use the Blackhole storage engine to populate cache. Data written to a Blackhole table is replicated to cache. Facebook uses it more to invalidate data and for cross country replication. The advantage of this approach is the data is not replicated through MySQL which means there are no binary logs for the data and it's not CPU intensive.
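    Whatever the change-capture mechanism (binlog parser, trigger/UDF, or Blackhole replication), the push into memcached itself is simple: apply each row change to the corresponding cache key. A hedged sketch, with the change event represented by plain parameters rather than any particular binlog library, and a "table:pk" key convention that is purely an assumption:

      import java.net.InetSocketAddress;
      import net.spy.memcached.MemcachedClient;

      // Applies database change events to the cache so data "magically" shows up there.
      // How the events are captured (binlog tailer, trigger UDF, Blackhole slave) is out
      // of scope; this only shows the cache-side effect of each event.
      public class ProactiveCacheUpdater {
          private final MemcachedClient cache;

          public ProactiveCacheUpdater(MemcachedClient cache) {
              this.cache = cache;
          }

          public void onInsertOrUpdate(String table, String primaryKey, Object newRow) {
              cache.set(table + ":" + primaryKey, 0, newRow); // 0 = no expiration
          }

          public void onDelete(String table, String primaryKey) {
              cache.delete(table + ":" + primaryKey);
          }

          public static void main(String[] args) throws Exception {
              MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
              ProactiveCacheUpdater updater = new ProactiveCacheUpdater(client);
              updater.onInsertOrUpdate("user", "42", "serialized-user-row");
              updater.onDelete("user", "43");
              client.shutdown();
          }
      }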

    File System Caching

    NFS has significant overhead when used with a large number of servers. Fotolog originally stored XML files on a SAN and exposed them using NFS. They saw contention on these files so they put files in memcached. Big performance improvements were seen and it kept NFS mounts open for other requests. Smaller media can also be stored in the cache.
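    A sketch of the file-system case: try memcached first and fall back to reading the file from the NFS/SAN mount only on a miss, which is what relieves the mount contention. The paths, key names, and TTL here are illustrative, and values must stay under memcached's 1MB limit.

      import java.net.InetSocketAddress;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import net.spy.memcached.MemcachedClient;

      public class CachedFileReader {
          private static final int ONE_DAY = 86400; // seconds

          private final MemcachedClient cache;

          public CachedFileReader(MemcachedClient cache) {
              this.cache = cache;
          }

          // Serve file contents from memcached, touching NFS only on a cache miss.
          public byte[] read(String path) throws Exception {
              String key = "file:" + path;
              byte[] data = (byte[]) cache.get(key);
              if (data == null) {
                  data = Files.readAllBytes(Paths.get(path)); // the slow NFS/SAN read
                  if (data.length < 1024 * 1024) {            // memcached values are capped at 1MB
                      cache.set(key, ONE_DAY, data);
                  }
              }
              return data;
          }

          public static void main(String[] args) throws Exception {
              MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
              byte[] xml = new CachedFileReader(client).read("/mnt/media/photo-42.xml");
              System.out.println(xml.length + " bytes");
              client.shutdown();
          }
      }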

    Partial Page Caching

    Cache directly displayable page elements. The other caching strategies cache data used to create pages, but some things are still compute intensive and require a lot of work. So instead of just caching objects, prepare and cache entire page elements for reuse. For page elements that are accessed many times per second this can be a big win. For example: calculating top users in a region, popular photo list, and featured photo list. Especially when using sharding it can take some time to calculate these lists, so caching the resulting page elements makes a lot of sense.

    Application Based Replication

    Write data to the cache through your own API. The API hides implementation details like:
  • An application writes to one memcached client which writes to multiple memcached instances.
  • Where the cache pools are and how many there are.
  • Writing to multiple pools at the same time.
  • Rotating to another pool on a pool failure until another pool is brought up and updated with data. This approach is very fast if not network bound and very cost effective.

    Miscellaneous

    There were a few suggestions on using memcached that didn't fit in any other section, so they're gathered here for posterity:
  • Have a lot of nodes to handle loss. Losing a node when you only have a few nodes will cause a spike on the database as everything reloads. Having more servers means less database load on failure.
  • Use a warm standby that takes over the IP of a memcached server that fails. This means your clients will not have to update their cache lists.
  • Memcached can operate over UDP and TCP. Persistent connections are better because there's less overhead. The cache is designed to handle thousands of connections.
  • Use separate memcached servers to reduce contention with applications.
  • Check that your slab sizes match the size of the data you are allocating or you could be wasting a lot of memory.

    Here are some additional strategies from the Memcached and MySQL tutorial:
  • Don't think row-level (database) caching, think complex objects.
  • Don't run memcached on your database server, give your database all the memory it can get.
  • Don't obsess about TCP latency - localhost TCP/IP is optimized down to an in-memory copy.
  • Think multi-get - run things in parallel whenever you can.
  • Not all memcached client libraries are made equal, do some research on yours.
  • Instead of invalidating your data, expire it whenever you can - memcached will do all the work
  • Generate smart keys - e.g. on update, increment a version number, which will become part of the key (see the sketch after this list)
  • For bonus points, store the version number in memcached - call it the generation
  • The latter will be added to Memcached soon - as soon as Brian gets around to it
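    A sketch of the smart-key / generation idea: store a version number ("generation") per object in memcached, fold it into the data key, and bump it on update. Old entries are never read again and simply age out, so nothing has to be explicitly invalidated. The key names are assumptions.

      import java.net.InetSocketAddress;
      import net.spy.memcached.MemcachedClient;

      public class GenerationKeys {
          private final MemcachedClient cache;

          public GenerationKeys(MemcachedClient cache) {
              this.cache = cache;
          }

          // The generation lives in memcached itself; a missing key means generation 1.
          private long generation(long userId) {
              Object g = cache.get("gen:user:" + userId);
              return g == null ? 1L : Long.parseLong(g.toString());
          }

          private String dataKey(long userId) {
              return "user:" + userId + ":v" + generation(userId);
          }

          public Object getProfile(long userId) {
              return cache.get(dataKey(userId));
          }

          public void putProfile(long userId, Object profile) {
              cache.set(dataKey(userId), 3600, profile);
          }

          // On update: bump the generation instead of deleting the old entry.
          public void invalidate(long userId) {
              cache.set("gen:user:" + userId, 0, Long.toString(generation(userId) + 1));
          }

          public static void main(String[] args) throws Exception {
              MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
              GenerationKeys keys = new GenerationKeys(client);
              keys.putProfile(42, "v1-profile");
              keys.invalidate(42);                      // future reads use a new key
              System.out.println(keys.getProfile(42));  // null until repopulated
              client.shutdown();
          }
      }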

    Final Thoughts

    Fotolog has obviously put a great deal of thought and effort into creating sophisticated scaling strategies using memcached and MySQL. What I'm struck with is the enormous amount of effort that goes into syncing rows and objects back and forth between the cache and the database. Shouldn't it be easier? What role is the database playing when the application makes such constant use of the object cache? Wouldn't more memory make the disk based storage unnecessary?

    Related Articles

  • Designing and Implementing Scalable Applications with Memcached and MySQL by Farhan Mashraqi from Fotolog, Monty Taylor from Sun, and Jimmy Guerrero from Sun
  • Memcached for Mysql Advanced Use Cases by Farhan Mashraqi of Fotolog
  • Memcached and MySQL tutorial by Brian Aker, Alan Kasindorf - Overview with examples of how a few companies use memcached. Good presentation notes by Colin Charles.
  • Strategy: Break Up the Memcache Dog Pile
  • Secrets to Fotolog's Scaling Success
  • Memcached for MySQL


  • Tuesday
    Jul292008

    Ehcache - A Java Distributed Cache 

    Ehcache is a pure Java cache with the following features: fast, simple, small footprint, minimal dependencies, provides memory and disk stores for scalability into gigabytes, scalable to hundreds of caches, is a pluggable cache for Hibernate, tuned for high concurrent load on large multi-CPU servers, provides LRU, LFU and FIFO cache eviction policies, and is production tested. Ehcache is used by LinkedIn to cache member profiles. The user guide says it's possible to get a 2.5 times system speedup for persistent Object Relational Caching, a 1,000 times system speedup for Web Page Caching, and a 1.6 times system speedup for Web Page Fragment Caching. From the website:

    Introduction

    Ehcache is a cache library. Before getting into ehcache, it is worth stepping back and thinking about caching generally.

    About Caches

    Wiktionary defines a cache as "a store of things that will be required in future, and can be retrieved rapidly". That is the nub of it. In computer science terms, a cache is a collection of temporary data which either duplicates data located elsewhere or is the result of a computation. Once in the cache, the data can be repeatedly accessed inexpensively.

    Why caching works: Locality of Reference

    While ehcache concerns itself with Java objects, caching is used throughout computing, from CPU caches to the DNS system. Why? Because many computer systems exhibit locality of reference: data that is near other data or has just been used is more likely to be used again.

    The Long Tail

    Chris Anderson, of Wired Magazine, coined the term The Long Tail to refer to ecommerce systems: the idea that a small number of items may make up the bulk of sales, a small number of blogs might get the most hits, and so on. While there is a small list of popular items, there is a long tail of less popular ones. The Long Tail is itself a vernacular term for a Power Law probability distribution. These don't just appear in ecommerce, but throughout nature. One form of a Power Law distribution is the Pareto distribution, commonly known as the 80:20 rule. This phenomenon is useful for caching: if 20% of objects are used 80% of the time and a way can be found to reduce the cost of obtaining that 20%, then the system performance will improve.

    Will an Application Benefit from Caching?

    The short answer is that it often does, due to the effects noted above. The medium answer is that it often depends on whether the application is CPU bound or I/O bound. If an application is I/O bound then the time taken to complete a computation depends principally on the rate at which data can be obtained. If it is CPU bound, then the time taken principally depends on the speed of the CPU and main memory. While the focus for caching is on improving performance, it is also worth realizing that it reduces load. The time it takes something to complete is usually related to the expense of it. So, caching often reduces load on scarce resources.

    Speeding up CPU bound Applications

    CPU bound applications are often sped up by:
    * improving algorithm performance
    * parallelizing the computations across multiple CPUs (SMP) or multiple machines (clusters)
    * upgrading the CPU speed
    The role of caching, if there is one, is to temporarily store computations that may be reused again. An example from ehcache would be large web pages that have a high rendering cost. Another is caching of authentication status, where authentication requires cryptographic transforms.

    Speeding up I/O bound Applications

    Many applications are I/O bound, either by disk or network operations. In the case of databases they can be limited by both. There is no Moore's law for hard disks: a 10,000 RPM disk was fast 10 years ago and is still fast. Hard disks are speeding up by using their own caching of blocks into memory. Network operations can be bound by a number of factors:
    * time to set up and tear down connections
    * latency, or the minimum round trip time
    * throughput limits
    * marshalling and unmarshalling overhead
    The caching of data can often help a lot with I/O bound applications. Some examples of ehcache uses are:
    * Data Access Object caching for Hibernate
    * Web page caching, for pages generated from databases

    Increased Application Scalability

    The flip side of increased performance is increased scalability. Say you have a database which can do 100 expensive queries per second. After that it backs up and if connections are added to it it slowly dies. In this case, caching may be able to reduce the workload required. If caching can cause 90 of those 100 queries to be cache hits and not even reach the database, then the database can scale 10 times higher than otherwise.

    How much will an application speed up with Caching?

    The short answer is that it depends on a multitude of factors:
    * how many times a cached piece of data can be and is reused by the application
    * the proportion of the response time that is alleviated by caching
    In applications that are I/O bound, which is most business applications, most of the response time is getting data from a database. Therefore the speedup mostly depends on how much reuse a piece of data gets. In a system where each piece of data is used just once, it is zero. In a system where data is reused a lot, the speedup is large. The long answer, unfortunately, is complicated and mathematical. It is considered next.
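    For readers who haven't seen the API, here is a small example of the classic net.sf.ehcache usage described above (an in-memory store with disk overflow and time-based eviction). In practice the cache is normally declared in ehcache.xml rather than with this programmatic constructor; treat this as a sketch.

      import net.sf.ehcache.Cache;
      import net.sf.ehcache.CacheManager;
      import net.sf.ehcache.Element;

      public class EhcacheExample {
          public static void main(String[] args) {
              CacheManager manager = CacheManager.create();

              // name, max in-memory elements, overflow to disk, eternal, TTL seconds, TTI seconds
              Cache profiles = new Cache("memberProfiles", 10000, true, false, 3600, 1800);
              manager.addCache(profiles);

              profiles.put(new Element("member:42", "serialized-profile"));

              Element hit = profiles.get("member:42");
              System.out.println(hit == null ? "miss" : hit.getObjectValue());

              manager.shutdown();
          }
      }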

    Related Articles

  • Caching Category on High Scalability
  • Product: Memcached
  • Manage a Cache System with EHCache


  • Saturday
    Jul262008

    Sharding the Hibernate Way

    Update: A very nice JavaWorld podcast interview with Google engineer Max Ross on Hibernate Shards. Max defines Hibernate Shards (horizontal partitioning), how it works (pretty well), virtual shards (don't ask), what they need to do in the future (query, replication, operational tools), and how it relates to Google AppEngine (not much). To scale you are supposed to partition your data. Sounds good, but how do you do it? When you actually sit down to work out all the details it’s not that easy. Hibernate Shards to the rescue! Hibernate shards is: an extension to the core Hibernate product that adds facilities for horizontal partitioning. If you know the core Hibernate API you know the shards API. No learning curve at all. Here is what a few members of the core group had to say about the Hibernate Shards open source project. Although there are some limitations, from the sound of it they are doing useful stuff in the right way and it’s very much worth looking at, especially if you use Hibernate or some other ORM layer.

    Information Sources

  • Google Developer Podcast Episode Six: The Hibernate Shards Open Source Project. This is the document summarized here.
  • Hibernate Shards Project Page
  • Hibernate Shards Dev Discussion Group.
  • Ryan Barrett’s Scaling on the Cheap presentation. Many of the lessons from here are in Hibernate Shards.
  • JavaWorld podcast interview: Sharding with Max Ross - Hibernate Shards - Max Ross is the Google engineer who spends his days working on the Google App Engine data store. On the side he works on Hibernate Shards, another scalability-obsessed project that is open source.

    What is Hibernate Shards?

  • Shard: splitting up data sets. If data doesn't fit on one machine then split it up into pieces, each piece is called a shard.
  • Sharding: the process of splitting up data. For example, putting employees 1-10,000 on shard1 and employees 10,001-20,000 on shard2.
  • Sharding is used when you have too much data to fit in one single relational database. If your database has a JDBC adapter that means Hibernate can talk to it and if Hibernate can talk to it that means Hibernate Shards can talk to it.
  • Most people don't want to shard because it makes everything complex. But when you have too much data, when you fill your database up, you need another solution, which can be to shard the data across multiple relational databases. The complexity arises because your application has to have the smarts to access multiple databases and that's where Hibernate Shards tries to help.
  • Structure of the data is identical from server to server. The same schema is used across all databases (MySQL, etc).
  • Hibernate was chosen because it's a good ORM tool used internally at Google, but to Google Scale (really really big), sharding needed to be added because Hibernate didn’t support that sort of scale out of the box.
  • The learning curve for a Hibernate user is zero because the Hibernate API is the same. The shard implementation hasn’t violated the API (yet). Sharded versions of Session, Criteria, and Factory are available so the programmer doesn't need to change code. Query isn't implemented yet because features like aggregation and grouping are very difficult to implement across databases.
  • How does it compare to MySQL's horizontal partitioning? Shards is for situations where you have too much data to fit in a single database. MySQL partitioning may allow you to delay when you need to shard, but it is still a single database and you’ll eventually run into limits.

    Schema Design for Shards

  • When sharding you have to consider the general issues of distributed data design for high data volumes. These aren’t Hibernate Shards specific issues, but are general to the problem space.
  • Schema design is the most important part of the sharding process and you’ll have to do it up front.
  • You need to pick a dimension, a root level entity, that is easily sharded. Users and customers are common examples.
  • Accept the fact that those entities and all the entities that hang off those entities will be stored in separate physical spaces. Querying across different shards will be difficult. As will management and just about anything else you take for granted.
  • Control over how data are distributed is determined by a pluggable strategies layer.
  • Plan for the future by picking a strategy that will last you a long time. Repartitioning/resharding the data is operationally very difficult. No management tools for this yet.
  • Build simpler models that don't contain as many relationships because you don't have cross shard relationships. Your object graphs should be contained on one shard as much as possible.
  • Lots and lots of objects pointing to each other may not be a good candidate for sharding.
  • Because the shards design doesn’t modify Hibernate core, you can design using shards from the start, even though you only have one database. Then when you need to start scaling it will be easier to grow.
  • Existing systems with shardable tables shouldn’t take very long to get up and running.
  • Policy decisions can drive sharding. For example, let's say customers don't want their data intermingling, so each customer would get their own database. In this case the application would shard on the customer as a matter of policy, not simply scaling concerns.

    The Sharding Code’s Relationship to Hibernate

  • Hibernate Shards encapsulates knowledge of all the shards. This knowledge is not in the database or the application. It's at the Hibernate persistence layer which provides a unified view of all the databases so the application doesn't have to know.
  • Shards doesn't have full support for Hibernate’s query interface. Hibernate has a criteria or a query interface. Criteria interface is robust, but not good for JPA (Java persistence API), which is query based.
  • Sharding should work across all databases Hibernate works on since shards is a layer on top of Hibernate core beneath the standard Hibernate interfaces. Programmers aren’t aware of it.
  • What they are doing is figuring out how to do standard things like save objects, update, and query objects across multiple databases using standard Hibernate interfaces. If Hibernate can talk to it they can talk to it.
  • A sharded session is used to contain Hibernate’s sessions so Hibernate capabilities are preserved.
  • Cannot manage cross shard foreign relationships (yet). They do have runtime checks to detect when cross shard relations are used accidentally. There's no foreign key constraint checking and there’s no Hibernate lazy loading. From a programming perspective you can have IDs that reference other objects on other shards, it’s just that Hibernate won’t know about these relationships.
  • Now that the base software is done these more advanced features can be considered. It may take changes in Hibernate core.

    Pluggable Strategies Determine How Data Are Split Across Shards

  • A Strategy dictates how data are spread across the shards. It’s an interface you need to implement. There are three Strategies:
    * Shard Resolution Strategy - how you will retrieve your objects.
    * Shard Selection Strategy - defines where objects are saved to.
    * Access Strategy - once you figure out which shards you are talking to, how do you want to access them (serially, 2 at a time, in parallel, etc.)?
  • Goal is to have Strategies as flexible as possible so you can decide how your data are sharded.
  • A couple of implementations are provided out of the box:
    * Round Robin - the first object goes to the first shard, the second to the second shard, and then it loops back.
    * Attribute Based - look at attributes in the data to determine which shard. You can shard users by country, for example.
  • Configuration is set by creating a prototype configuration for all shards (remember, same schema). Then you specify what's different from shard to shard, like URL, user name and password, and dialect (MySQL, Postgres, etc). Then a sharded session factory is created for Hibernate so developers use the standard interfaces (see the sketch below).
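    In code, that configuration boils down to a ShardStrategyFactory wiring together the three strategies. The sketch below follows the structure of the Hibernate Shards reference guide, but the class and package names are from memory of that guide, so treat it as an outline rather than copy-paste code.

      import java.util.List;
      import org.hibernate.shards.ShardId;
      import org.hibernate.shards.loadbalance.RoundRobinShardLoadBalancer;
      import org.hibernate.shards.strategy.ShardStrategy;
      import org.hibernate.shards.strategy.ShardStrategyFactory;
      import org.hibernate.shards.strategy.ShardStrategyImpl;
      import org.hibernate.shards.strategy.access.SequentialShardAccessStrategy;
      import org.hibernate.shards.strategy.resolution.AllShardsShardResolutionStrategy;
      import org.hibernate.shards.strategy.selection.RoundRobinShardSelectionStrategy;

      public class ShardStrategyConfig {
          // One factory wires together the three pluggable strategies for the shard set.
          public static ShardStrategyFactory buildShardStrategyFactory() {
              return new ShardStrategyFactory() {
                  public ShardStrategy newShardStrategy(List<ShardId> shardIds) {
                      RoundRobinShardLoadBalancer balancer = new RoundRobinShardLoadBalancer(shardIds);
                      return new ShardStrategyImpl(
                          new RoundRobinShardSelectionStrategy(balancer),  // where new objects are saved
                          new AllShardsShardResolutionStrategy(shardIds),  // how objects are found again
                          new SequentialShardAccessStrategy());            // how work fans out across shards
                  }
              };
          }
      }

    As I recall the guide, the prototype Configuration plus the per-shard settings and this factory are then handed to a ShardedConfiguration, and its buildShardedSessionFactory() returns the SessionFactory the application uses through the normal Hibernate interfaces.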

    Some Limitations

  • Full Hibernate HQL is not yet supported (maybe it is now, but I couldn’t tell).
  • Distributed queries are handled by applying a standard HQL query to each shard, merging the results, and applying the filters. This all happens in the application server so using very large data sets could be a problem. It’s left to the intelligence of the developers to do the right thing to manage performance.
  • No mirroring or data replication. Replication is having common tables, like zip codes, available on all shards.
  • No clean way to manage read only data you want on every shard for performance and referential integrity reasons. Say you have country data. It makes sense to replicate that data on each shard so all queries using that data can stay on the shard.
  • No handling of fail over situations, which is just like Hibernate. You could handle it in your connection pool or some other layer. It’s not considered part of the shard/OR mapping layer.
  • There’s a need for management tools that work across shards. For example, repartition data on a live system.
  • It’s possible to shard across different databases as long as you keep the same schema in each database.
  • The number of shards you can have is somewhat limited because each shard is backed by a connection pool, which means a lot of database connections. And ORDER BY operations across databases must be done in memory, so a lot of memory could be used on large data sets.

    Related Articles

  • An Unorthodox Approach to Database Design: The Coming of the Shard.


  • Saturday
    Jul262008

    Google's Paxos Made Live – An Engineering Perspective

    This is an unusually well written and useful paper. It talks in detail about experiences implementing a complex project, something we don't see very often. They shockingly even admit that creating a working implementation of Paxos was more difficult than just translating the pseudo code. Imagine that, programmers aren't merely typists! I particularly like the explanation of the Paxos algorithm and why anyone would care about it, working with disk corruption, using leases to support simultaneous reads, using epoch numbers to indicate a new master election, using snapshots to prevent unbounded logs, using MultiOp to implement database transactions, how they tested the system, and their openness with the various problems they had. A lot to learn here. From the paper:

    We describe our experience building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.

    Introduction

    It is well known that fault-tolerance on commodity hardware can be achieved through replication [17, 18]. A common approach is to use a consensus algorithm [7] to ensure that all replicas are mutually consistent [8, 14, 17]. By repeatedly applying such an algorithm on a sequence of input values, it is possible to build an identical log of values on each replica. If the values are operations on some data structure, application of the same log on all replicas may be used to arrive at mutually consistent data structures on all replicas. For instance, if the log contains a sequence of database operations, and if the same sequence of operations is applied to the (local) database on each replica, eventually all replicas will end up with the same database content (provided that they all started with the same initial database state). This general approach can be used to implement a wide variety of fault-tolerant primitives, of which a fault-tolerant database is just an example. As a result, the consensus problem has been studied extensively over the past two decades. There are several well-known consensus algorithms that operate within a multitude of settings and which tolerate a variety of failures. The Paxos consensus algorithm [8] has been discussed in the theoretical [16] and applied community [10, 11, 12] for over a decade. We used the Paxos algorithm (“Paxos”) as the base for a framework that implements a fault-tolerant log. We then relied on that framework to build a fault-tolerant database. Despite the existing literature on the subject, building a production system turned out to be a non-trivial task for a variety of reasons:

    • While Paxos can be described with a page of pseudo-code, our complete implementation contains several thousand lines of C++ code. The blow-up is not due simply to the fact that we used C++ instead of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into a practical, production-ready system involved implementing many features and optimizations – some published in the literature and some not.
    • The fault-tolerant algorithms community is accustomed to proving short algorithms (one page of pseudo code) correct. This approach does not scale to a system with thousands of lines of code. To gain confidence in the “correctness” of a real system, different methods had to be used.
    • Fault-tolerant algorithms tolerate a limited set of carefully selected faults. However, the real world exposes software to a wide variety of failure modes, including errors in the algorithm, bugs in its implementation, and operator error. We had to engineer the software and design operational procedures to robustly handle this wider set of failure modes.
    • A real system is rarely specified precisely. Even worse, the specification may change during the implementation phase. Consequently, an implementation should be malleable. Finally, a system might “fail” due to a misunderstanding that occurred during its specification phase.

    This paper discusses a selection of the algorithmic and engineering challenges we encountered in moving Paxos from theory to practice. This exercise took more R&D efforts than a straightforward translation of pseudo-code to C++ might suggest. The rest of this paper is organized as follows. The next two sections expand on the motivation for this project and describe the general environment into which our system was built. We then provide a quick refresher on Paxos. We divide our experiences into three categories and discuss each in turn: algorithmic gaps in the literature, software engineering challenges, and unexpected failures. We conclude with measurements of our system, and some broader observations on the state of the art in our field.
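    The core idea in the quoted passage, that applying the same log of operations in the same order yields the same state on every replica, is easy to see in miniature. The sketch below is not Paxos (Paxos is about agreeing on the log in the presence of failures); it only illustrates the deterministic replay step that Paxos makes safe.

      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      public class ReplicatedLogDemo {
          // A trivial deterministic "database": applying the same ops gives the same state.
          static class Replica {
              final Map<String, String> state = new LinkedHashMap<>();

              void apply(String[] op) { // op = {"put", key, value} or {"del", key}
                  if (op[0].equals("put")) state.put(op[1], op[2]);
                  else state.remove(op[1]);
              }
          }

          public static void main(String[] args) {
              List<String[]> log = new ArrayList<>();
              log.add(new String[] {"put", "balance:alice", "100"});
              log.add(new String[] {"put", "balance:bob", "50"});
              log.add(new String[] {"del", "balance:bob"});

              Replica a = new Replica();
              Replica b = new Replica();
              for (String[] op : log) { a.apply(op); b.apply(op); } // same log, same order

              System.out.println(a.state.equals(b.state)); // true: the replicas converge
          }
      }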

    Related Articles

  • ZooKeeper - A Reliable, Scalable Distributed Coordination System


  • Tuesday
    Jul222008

    Scaling Bumper Sticker: A 1 Billion Page Per Month Facebook RoR App  

    Several months ago I attended a Joyent presentation where the spokesman hinted that Joyent had the chops to support a one billion page per month Facebook Ruby on Rails application. Even under a few seconds of merciless grilling he would not give up the name of the application. Now we have the big reveal: it was LinkedIn's Bumper Sticker app. For those not currently sticking things on bumps, Bumper Sticker is quite surprisingly a viral media sharing application that allows users to express their individuality by sticking small virtual stickers on Facebook profiles. At the time I was quite curious how Joyent's cloud approach could be leveraged for this kind of app. Now that they've released a few details, we get to find out.

    Site: http://www.Facebook.com/apps/application.php?id=2427603417

    Information Sources

  • Video: Scaling to 1 Billion Page Views Per Month (very flashy)
  • Web Scalability Practices: Bumper Sticker on Rails by Ikai Lan and Jim Meyer from LinkedIn
  • 1 Billion Page Views a Month by David Young from Joyent
  • Ruby on Rails: scaling to 1 billion page views per month by Dennis Howlett from ZDNet
  • Joyent's Grid Accelerators for Web Applications by Jason Hoffman from Joyent
  • On Grids, the Ambitions of Amazon and Joyent by Jason Hoffman from Joyent
  • Scaling Ruby on Rails to 1 Billion Page Views a Month by Joe Pruitt from DevCentral

    The Platform

  • MySQL
  • Nginx
  • Mongrel
  • CDN
  • Ruby on Rails (rapid prototype development approach)
  • Facebook
  • Joyent Accelerator - provides a highly scalable on-demand infrastructure for running web sites, including rich web applications written in Ruby on Rails, PHP, Python and Java. Joyent Accelerators are next-generation virtual computers that can grow and multiply (or shrink and consolidate) depending on the real world demands faced by your Web application. Accelerators are built on OpenSolaris, multi-core (8+), RAM-rich servers (32GB+ each) and vast amounts of NAS storage.
  • Masochism Plugin - provides an easy solution for Ruby on Rails applications to work in a replicated database environment. Connection proxy sends some database queries (those in a transaction, update statements, and ActiveRecord::Base#reload) to a master database, and the rest to the slave database.

    The Stats

  • 1 billion page views per month
  • 13.5 million installations
  • 1.5 million daily active users. Recruited 1 million users in first 46 days.
  • 20-27 million canvas page views a day
  • 13 web application servers running Nginx and Mongrel
  • 8 static asset servers serving over 3,500,000 stickers (migrating to a CDN)
  • 4 MySQL servers in a master/slave configuration using Masochism as a proxy to load balance database operations.
  • Cost is about $25K/month.

    The Architecture

  • Bumper Sticker was an experiment to see how fast the Light Engineering Development (LED) team at LinkedIn could build a Ruby on Rails Facebook application.
  • RoR was an easy environment to prototype in, but they needed a production environment in which they could quickly develop, deploy, and scale. Joyent was selected.
  • Some Notes on Joyent:
    * Joyent is a scale on demand cloud. Allows customers to have a dynamic data center instead of being stuck using their own rigid infrastructure.
    * There's an API if you need one. The service is unmanaged, you get root on all your boxes.
    * They consider their infrastructure to be better and more open than Amazon. You get access to a high end load balancer and the capabilities of OpenSolaris (Dtrace, Zones, lower request processing overhead, sub 10 second reboot times).
    * Joyent's primary scalability principle is to organize apps around silos built from their powerful Accelerator blocks: put applications on different servers based on the quality of service you want to give them. For example, put static content on their own servers so the static content is always served fast and reliably. This allows you to prioritize based on what's important to you. You could, for example, prioritize the virality of your application by putting the Invite Friends functionality on their own servers, thus assuring the growth of your application through your viral functionality possibly at the expense of less important functionality.
    * Has three data centers in the US and are opening a fourth, none in Europe.
    * Considers their secret sauce to be their highly sophisticated administration system which allows a few people to easily manage a large infrastructure.
    * Has a peering relationship with Facebook. That means there are direct high-speed fiber links between Joyent’s data center in Emeryville and Facebook’s data center in San Francisco.

  • 80% of the content for Bumper Sticker is static. The Facebook API can directly render content at a specified memory location. Bumper Sticker was able to use the scripting feature of the F5 BIG-IP load balancer to directly load static content by passing a pointer to the Facebook API.

    The Lessons

  • Rails scales exactly like any other app. Take into account all the components from the moment the request is received at the load balancer all the way down and all the way back again.
  • The development process is: put some measurements in place, find problems, fix problems, more people adopt and scale you out of your solution, and the cycle repeats. Sun's Dtrace feature makes it easy to instrument the stack to identify bottlenecks.
  • Rails scales as long as the development team using it understands that many of the bottlenecks are exactly those faced by developers on any other database-driven web platform.
  • Hit a disk spindle and you are screwed. Avoid going to the database or the file system. The more they avoided disk the fewer timeouts they experienced.
  • Convert anything dynamic into static content. Dynamic content is your enemy. Convert anything dynamic into static content so it can be removed from the disk path.
  • Push content to the edge. Move content as close to the client as possible. Move cache to the CDN. Reduce time going across the network.
  • Faster means more viral. On a viral system the better the performance the more people can play with your application. The more people who play with your system the more likely they are to pull more people in, which means the more the app will spread and go viral. Bumper Sticker has been successful at creating a community of fans who enjoy uploading and sharing their own stickers.

    Some issues:
  • Since most of the content is static and served by the load balancer, the impact of Rails in the system is not clear.
  • The functionality of Bumper Sticker is relatively simple. What would the impact be on scalability if other often requested features like search were added?

    Related Articles

  • Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App
  • Monday
    Jul212008

    Eucalyptus - Build Your Own Private EC2 Cloud

    Update: InfoQ links to a few excellent Eucalyptus updates: Velocity Conference Video by Rich Wolski and a Visualization.com interview, Rich Wolski on Eucalyptus: Open Source Cloud Computing. Eucalyptus is generating some excitement on the Cloud Computing group as a potential vendor neutral EC2 compatible cloud platform. Two reasons why Eucalyptus is potentially important: private clouds and cloud portability.

    Private clouds. Let's say you want a cloud-like infrastructure for architectural purposes but you want it to run on your own hardware in your own secure environment. How would you do this today? Hm....

    Cloud portability. With the number of cloud offerings increasing, how can you maintain some level of vendor neutrality among this "swarm" of different options? Portability is a key capability for cloud customers, as the only real power customers have is in where they take their business, and the only way you can change suppliers is if there's a ready market of fungible services. And the only way there can be a market is if there's a high degree of standardization. What should you standardize on? The options are usually to form a great committee and take many years to spec out something that doesn't exist, nobody will build, and will never really work. Or have each application create a high enough layer interface that portability is potentially difficult, but possible. Or you can take a popular existing API, make it the general API, and everyone else is accommodated using an adapter layer and the necessary special glue to take advantage of value add features for each cloud.

    With great foresight Eucalyptus has chosen to create a cloud platform based on Amazon's EC2. As this is the most successful cloud platform it makes a lot of sense to use it as a model. We see something similar with the attempts to port Google AppEngine to EC2, thus making GAE a standard framework for web apps. So developers would see GAE on top of EC2. A lot of code would be portable between clouds using this approach. Even better would be to add ideas in from RightScale, 3Tera, and Mosso to get a higher level view of the cloud, but that's getting ahead of the game. Just what is Eucalyptus? From their website:

    Overview

    Elastic Computing, Utility Computing, and Cloud Computing are (possibly synonymous) terms referring to a popular SLA-based computing paradigm that allows users to "rent" Internet-accessible computing capacity on a for-fee basis. While a number of commercial enterprises currently offer Elastic/Utility/Cloud hosting services and several proprietary software systems exist for deploying and maintaining a computing Cloud, standards-based open-source systems have been few and far between. EUCALYPTUS -- Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems -- is an open-source software infrastructure for implementing Elastic/Utility/Cloud computing using computing clusters and/or workstation farms. The current interface to EUCALYPTUS is interface-compatible with Amazon.com's EC2 (arguably the most commercially successful Cloud computing service), but the infrastructure is designed to be modified and extended so that multiple client-side interfaces can be supported. In addition, EUCALYPTUS is implemented using commonly-available Linux tools and basic web service technology making it easy to install and maintain.

    Overall, the goal of the EUCALYPTUS project is to foster community research and development of Elastic/Utility/Cloud service implementation technologies, resource allocation strategies, service level agreement (SLA) mechanisms and policies, and usage models. The current release is version 1.0 and it includes the following features:
    * Interface compatibility with EC2
    * Simple installation and deployment using Rocks cluster-management tools
    * Simple set of extensible cloud allocation policies
    * Overlay functionality requiring no modification to the target Linux environment
    * Basic "Cloud Administrator" tools for system management and user accounting
    * The ability to configure multiple clusters, each with private internal network addresses, into a single Cloud
    The initial version of EUCALYPTUS requires Xen to be installed on all nodes that can be allocated, but no modifications to the "dom0" installation or to the hypervisor itself. For more discussion see:

  • James Urquhart's excellent blog The Wisdom of Clouds.
  • Simon Wardley's post Open sourced EC2 .... not by Amazon.
  • Google Cloud Computing Group.
  • Eucalyptus and You by James Urquhart
  • Open Virtual Machine Format on LayerBoom. The Open Virtual Machine Format, or OVF is a proposed universal format that aims to create a secure, extensible method of describing and packaging virtual containers.


  • Sunday
    Jul202008

    The clouds are coming

    A report from the CloudCamp conference on cloud computing, held in London in July 2008.


    Sunday
    Jul202008

    Strategy: Front S3 with a Caching Proxy

    Given S3's recent failure (Cloud Status tells the tale) Kevin Burton makes the excellent suggestion of fronting S3 with a caching proxy server. A caching proxy server can reply to service requests without contacting the specified server, by retrieving content saved from a previous request, made by the same client or even other clients. This is called caching. Caching proxies keep local copies of frequently requested resources. In normal operation when an asset (a user's avatar, for example) is requested the cache is tried first. If the asset is found in the cache then it's returned. If the asset is not in the cache it's retrieved from S3 (or wherever) and cached. So when S3 goes down it's likely you can ride out the down time by serving assets out of the cache. This strategy only works when using S3 as a CDN. If you are using S3 for its "real" purpose, as a storage service, then a caching proxy can't help you. Amazon doesn't use S3 as a CDN either (see Amazon Not Building Out AWS To Compete With CDNs); they use Limelight Networks. Some proxy options are: Squid, Nginx, Varnish. A sketch of the proxy's read path follows the list below. Planaroo shares how a small startup responds to an S3 outage (summarized):

  • Up-to-date backups are a good thing. Keep current backups such that you can switch to a new URL for your assets. Easier said than done I think.
  • Switch it, don't fix it. Switch to your backup rather than wait for the system to come up quickly, because it may not.
  • Serve CSS, JavaScript, icons, and Google AJAX libraries from alternate sources. Don't rely on S3 or Google to always be able to serve your crown jewels.
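    Back to the caching proxy idea itself, here is a toy illustration of the read path it buys you (Squid, Varnish, and Nginx do this, plus expiry and eviction, far better than this sketch): serve the asset from a local copy when possible, refresh it from S3 when the origin is reachable, and fall back to the possibly stale copy when S3 is down. The URL and cache directory are made up.

      import java.io.InputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.nio.file.Files;
      import java.nio.file.Path;

      public class NaiveCachingFetcher {
          private final Path cacheDir;

          public NaiveCachingFetcher(Path cacheDir) {
              this.cacheDir = cacheDir;
          }

          // Return the asset bytes, refreshing the local copy from S3 when it is reachable
          // and falling back to the cached copy (even if stale) when it is not.
          public byte[] fetch(String url) throws Exception {
              Path cached = cacheDir.resolve(Integer.toHexString(url.hashCode()));
              try {
                  HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                  conn.setConnectTimeout(2000);
                  conn.setReadTimeout(5000);
                  try (InputStream in = conn.getInputStream()) {
                      byte[] fresh = in.readAllBytes();
                      Files.write(cached, fresh);        // refresh the local copy
                      return fresh;
                  }
              } catch (Exception s3Down) {
                  if (Files.exists(cached)) {
                      return Files.readAllBytes(cached); // ride out the outage on stale data
                  }
                  throw s3Down;
              }
          }

          public static void main(String[] args) throws Exception {
              Path dir = Files.createTempDirectory("asset-cache");
              byte[] avatar = new NaiveCachingFetcher(dir)
                  .fetch("https://example-bucket.s3.amazonaws.com/avatars/42.png");
              System.out.println(avatar.length + " bytes");
          }
      }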


  • Friday
    Jul182008

    Robert Scoble's Rules for Successfully Scaling Startups

    Robert Scoble, in an often poignant FriendFeed thread commiserating PodTech's unfortunate end, shared what he learned about creating a successful startup. Here's a summary of Robert's rules and why Machiavelli just may agree with them:

  • Have a story.
  • Have everyone on board with that story.
  • If anyone goes off of that story, make sure they get on board immediately or fire them.
  • Make sure people are judged by the revenues they bring in. Those that bring in revenues should get to run the place. People who don't bring in revenues should get fewer and fewer responsibilities, not more and more.
  • Work ONLY for a leader who will make the tough decisions.
  • Build a place where excellence is expected, allowed, and is enabled.
  • Fire idiots quickly.
  • If your engineering team can't give a media team good measurements, the entire company is in trouble. Only things that are measured ever get improved.
  • When your stars aren't listened to the company is in trouble.
  • Getting rid of the CEO, even if it's all his fault, won't help unless you replace him/her with someone who is visionary and who can fix the other problems.

    An excellent list that meshes with much of my experience, which is why I thought it worth sharing :-) My take-away from Robert's rules can be summarized in one word: focus. Focus is the often overlooked glue binding groups together so they can achieve great things. When Robert says have "a story" that to me is because a story provides a sort of "decision box" giving a group its organizing principle for determining everything that comes later. Have a decision? Look at your story for guidance. Have a problem? Look at your story for guidance. Following your story's guidance is another matter completely. Without a management strong enough to act in accordance with the story, centripetal forces tear an organization apart. It takes a lot of will to keep all the forces contained inside the box. Which is why I think Robert demands a "focused" leadership.

    Machiavelli calls this idea "virtue." Princes must exhibit virtue if they are to keep their land. Machiavelli doesn't mean virtue in the modern sense of be good and eat your peas, but in the ancient sense of manliness (sorry ladies, this was long ago). Virtue shares the same root as virility. So to act virtuously is to be bold, to act, to take risks, be aggressive, and make the hard unpopular decisions. For Machiavelli that's the only way to reach your goals in accordance with how the world really works, not how it ought to work. Any Prince who acts otherwise will lose their realm. Firing people (yes, I've been fired) not contributing to your story is by Machiavelli's definition a virtuous act. It is a messy ugly business nobody likes doing. It requires admitting a mistake was made, a lot of paperwork, and looking like the bad guy. And that's why people are often not fired as Robert suggests. The easy way is to just ignore the problem, but that's not being virtuous. Addition by subtraction is such a powerful force precisely because it maintains group focus on excellence and purpose. It's a statement that the story really matters. Keeping people who aren't helping is a vampire on a group's energy. It slowly drains away all vivacity until only a pale corpse remains.

    Robert's rules may seem excessively ruthless and cruel to many. Decidedly unmodern. But in true Machiavellian fashion let's ask what is preferable: a strong secure long-lived state ruled by virtue or a state ruled according to how the world ought to work that is constantly at the mercy of every invader?

    If you are still hungry for more starter advice, Gordon Ramsay has some unintentionally delicious thoughts on developing software as well. Serve yourself at Gordon Ramsay On Software and Gordon Ramsay's Lessons for Software Take Two. Kevin Burton shares seven deadly sins startups should avoid and makes an inspiring case that his company is stronger and better able to compete by not taking VC funds. Really interesting. Diary of a Failed Startup says to solve a problem, not build a platform to solve a class of problems. Truer words were never spoken.


  • Wednesday
    Jul162008

    The Mother of All Database Normalization Debates on Coding Horror

    Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow. Here are a few of the pithier points (summarized):

  • Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. (Jeff)
  • Normalize until it hurts, denormalize until it works. (Jeff)
  • Use materialized views which are tables created and maintained by your RDBMS. So a materialized view will act exactly like a de-normalized table would - except you keep you original normalized structure and any change to original data will propagate to the view automatically. (Goran)
  • According to Codd and Date table names should be singular, but what did they know. (Pablo, LOL)
  • Denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed. Denormalization brings with it its own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh - or accepting that there might be temporary or permanent data integrity issues) and so on. (Dare Obasanjo)
  • What happens, is that people see "Normalisation = Slow", that makes them assume that normalisation isn't needed. "My data retrieval needs to be fast, therefore I am not going to normalise!" (Tubs)
  • You can read fast and store slow or you can store fast and read slow. The biggest performance killer is so called physical read. Finding and accessing data on disk is the slowest operation. Unless child table is clustered indexed and you're using the cluster index in the join you will be making lots of small random access reads on the disk to find and access the child table data. This will be slow. (Goran)
  • The biggest scalability problems I face are with human processes, not computer processes. (John)
  • Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend. (Chris)
  • Normalization is about design, denormalization is about optimization. (Peter Becker)
  • You're just another knucklehead. (BuggyFunBunny)
  • Lets unroll our loops next. RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter.(Christog)
  • Jeff, are you awake? (wiggle)
  • Denormalization may be all well and good when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster -- ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations. (Sean)
  • Do you sex a cat, or do you gender it? (Simon)
  • This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue. (Marcel)
  • This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). (TED)
  • Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application. (superjason)
  • Lastly -- no one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have to keep pulling stuff apart and putting it back together in new configurations of the db. (Ed)
  • Keep a de-normalized replica for expensive operations (e.g. reports); cache results for repeat queries (Memcache); partition the database for scalability (vertical or horizontal). (Gareth)
  • Speaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan "child" records, etc. Everybody says "we rely on the application to maintain that", and it never, never does. (A. Lloyd Flanagan)
  • I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals. (Wayne)
  • De-normalization is for reporting, not for OLTP. (Eric)
  • Your six-way join is only needed because you used surrogate keys instead of natural keys. When you use natural keys, you will discover you need far fewer joins, because the data you need is often already present as foreign keys. (Leandro)
  • What I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues. (Sam)
  • You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CRUD operations, and retrieving small sets of data in order to support those CRUD operations. Think about people modifying their profiles, recording cash register operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. (RevMike)
  • Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data. (Chris)
  • Cache the generated view pages first. Then cache the data. You have to think about your content -- very infrequently will anyone be updating it; it's all inserts. So you don't have to worry about normalization too much. (Matt)
  • I wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set. (Thomi)
  • Denormalization is the last possible step to take, yet the first to be suggested by fools. (Jeremy)
  • There is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. (David)
  • The whole issue is pretty simple 99% of the time: normalized databases are write optimized by nature. Since writing is slower than reading and most databases are used for CRUD, normalizing makes sense -- unless you are doing *a lot* more reading than writing. Then, if all else fails (indexes, etc.), create de-normalized tables (in addition to the normalized ones). (Rob)
  • Don't fear normalization. Embrace it. (Charles)
  • I read on and discovered all the loons and morons who think they know a lot more than they do about databases. (Paul)
  • Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone -- hey, this is the future!; Put the rest of the info into a TEXT variable, like "extra_info", in JSON format. This could be educational history, or anything else that you would never search by; If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need. (Greg)
  • How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting? How one stores data vs. how one uses data vs. how one collects data vs. how quickly new data must be visible to the world vs. whether OLTP data should be put into an OLAP cube each night, etc. are all factors that matter. (Steve)
  • It might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized. (PRMAN)
  • Your LOGICAL model should *always* be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system specific tools (materialized views, cache tables, etc) to improve performance, but such things should be done *after* the application level solutions are exhausted (page caching, data caches, etc.) (Jonn)
  • For very large scale applications I have found that application partitioning can go a long way to giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher level code. (John)
  • People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond "3 affiliations". (PRMan)
  • I normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don't get rebuilt more than necessary). (Mo)
  • Stop playing with mySQL. (Jonathan)
  • What a horrible, cowboy attitude to DB design. I hope you don't design any real databases. (Dave)
  • IOW, scalability is not a problem, until it is. Strip away the scatological reference, and all you have is a boring truism. (Yawn)
  • Is my Site OLTP? If the answer is yes then Normalize. Is my site OLAP? If the answer is yes then De-Normalize! (WeAreJimbo)
  • This is a dangerous article, or perhaps you just haven't seen the number of horrific "denormalised" databases I have. People use these posts as excuses to build some truly horrific crimes against sanity. (AbGenFac)
  • Be careful not to confuse a denormalised database with a non-normalised database. The former exists because a previously normalised database needed to be 'optimised' in some way. The latter exists because it was 'designed' that way from scratch. The difference is subtle, but important. (Bob)

    OK, more than a few quotes. There's certainly no lack of passion on the issue! One thing I would add is to organize your application around an application-level service layer rather than allowing applications to access the database directly at a low level. Amazon is a good example of this approach. Many of the denormalization comments have to do with the problems of data inconsistency, which is of course true because that's why normalization exists. Many of these problems can be reduced if there's a single service access point over which to get data. It's when data access is spread throughout an application that we see serious problems. Two of the ideas above -- a denormalized cache table maintained alongside the normalized schema, and a single service access point -- are sketched in code below.
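    Several of the comments above -- Goran's materialized views, Rob's "create de-normalized tables (in addition to the normalized ones)", Mo's transient denormalised cache tables -- come down to the same pattern: keep the normalized schema as the system of record and maintain a read-optimized copy next to it. Below is a minimal sketch of that pattern in Python with SQLite; the users/affiliations schema, the user_summary_cache table, and the refresh_user_summary() function are all invented for illustration, and a real deployment would refresh the copy with triggers, a job queue, or the RDBMS's own materialized views rather than an inline call.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        -- Normalized source-of-truth tables.
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );
        CREATE TABLE affiliations (
            user_id     INTEGER NOT NULL REFERENCES users(user_id),
            affiliation TEXT NOT NULL
        );
        -- Denormalized copy kept alongside the normalized schema.
        -- Reads hit this table; writes still go to the tables above.
        CREATE TABLE user_summary_cache (
            user_id      INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            affiliations TEXT NOT NULL  -- comma-separated, read-only copy
        );
        """)

        def refresh_user_summary(user_id):
            """Rebuild one user's denormalized row from the normalized tables."""
            row = conn.execute("""
                SELECT u.user_id,
                       u.name,
                       COALESCE(GROUP_CONCAT(a.affiliation, ','), '')
                FROM users u
                LEFT JOIN affiliations a ON a.user_id = u.user_id
                WHERE u.user_id = ?
                GROUP BY u.user_id, u.name
            """, (user_id,)).fetchone()
            if row:
                conn.execute(
                    "INSERT OR REPLACE INTO user_summary_cache VALUES (?, ?, ?)", row)
                conn.commit()

        # Writes go to the normalized tables, then the copy is refreshed.
        conn.execute("INSERT INTO users VALUES (1, 'alice')")
        conn.execute("INSERT INTO affiliations VALUES (1, 'acme'), (1, 'globex')")
        refresh_user_summary(1)

        # Reads become a single-row lookup instead of a join.
        print(conn.execute(
            "SELECT affiliations FROM user_summary_cache WHERE user_id = 1").fetchone())

    The property several commenters insisted on is preserved here: the normalized tables stay authoritative, and the denormalized table is disposable -- you can drop and rebuild it at any time without losing data.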
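    On the service-layer point, a single access point is also the natural place to hide whatever caching or denormalization you end up doing, which keeps the rest of the application honest. Here is a rough sketch of the shape rather than anyone's actual API: the UserService class, the InMemoryCache stand-in, and the users table are made up for illustration, and in production the cache would be a memcached client exposing the same get/set/delete surface and db would be your real database connection.

        import time

        class InMemoryCache:
            """Stand-in for a memcached client (get/set/delete with a TTL)."""
            def __init__(self):
                self._store = {}

            def get(self, key):
                value, expires = self._store.get(key, (None, 0.0))
                return value if expires > time.time() else None

            def set(self, key, value, ttl):
                self._store[key] = (value, time.time() + ttl)

            def delete(self, key):
                self._store.pop(key, None)

        class UserService:
            """Single access point for user data; callers never touch the database directly."""
            def __init__(self, db, cache, ttl=300):
                self.db = db        # DB-API connection: the system of record
                self.cache = cache  # only get/set/delete are assumed of the cache client
                self.ttl = ttl

            def get_profile(self, user_id):
                key = "user:%d:profile" % user_id
                profile = self.cache.get(key)
                if profile is None:                        # cache miss -> read the database
                    profile = self._load_profile(user_id)
                    self.cache.set(key, profile, self.ttl)
                return profile

            def update_name(self, user_id, name):
                # Writes hit the system of record first...
                self.db.execute("UPDATE users SET name = ? WHERE user_id = ?",
                                (name, user_id))
                self.db.commit()
                # ...then the cached copy is invalidated so readers can't see stale data.
                self.cache.delete("user:%d:profile" % user_id)

            def _load_profile(self, user_id):
                row = self.db.execute(
                    "SELECT user_id, name FROM users WHERE user_id = ?",
                    (user_id,)).fetchone()
                return {"user_id": row[0], "name": row[1]} if row else None

        if __name__ == "__main__":
            import sqlite3
            db = sqlite3.connect(":memory:")
            db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
            db.execute("INSERT INTO users VALUES (1, 'alice')")
            db.commit()
            svc = UserService(db, InMemoryCache())
            print(svc.get_profile(1))   # miss -> loads from SQLite
            print(svc.get_profile(1))   # hit  -> served from the cache

    Because every caller goes through UserService, any inconsistency introduced by caching or denormalized read tables is confined to one module instead of being smeared across the codebase, which is the point of the single-access-point argument above.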

    Related Articles

  • Denormalization Patterns by Kenneth Downs
  • When Databases Lie: Consistency vs. Availability in Distributed Systems by Dare Obasanjo
  • Stored procedure reporting & scalability by Jason Young
  • When Not to Normalize your SQL Database by Dare Obasanjo
