« Scalable virus scanning for web-applications | Main | eBay Architecture »

How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale

Update 3: ReadWriteWeb says Google App Engine Announces New Pricing Plans, APIs, Open Access. Pricing is specified but I'm not sure what to make of it yet. An image manipulation library is added (thus the need to pay for more CPU :-) and memcached support has been added. Memcached will help resolve the can't write for every read problem that pops up when keeping counters.
Update 2: threw a GAE load party and a lot of people came. The results at Load test : Google App Engine = 1, Community = 0. GAE handled a peak of 35 requests/second and a sustained 10 requests/second. Some think performance was good, others not so good. My GMT watch broke and I was late to arrive. Maybe next time. Also added a few new design rules from the post.
Update: Added a few new rules gleaned from the GAE Meetup: Design By Explicit Cost Model and Puts are Precious.

How do you structure your database using a distributed hash table like BigTable? The answer isn't what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don't. You--pause for dramatic effect--duplicate data instead of normalize it. *shudder*

Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation. I don't know how that decision was made, but it must have gone against every fiber in their relational bones...

But Flickr’s reasoning was genius. To scale you need to partition. User data must spread across the shards. So where do comments belong in a scalable architecture?

From one world view comments logically belong to a relation binding comments and users together. But if your unit of scalability is the user shard there is no separate relation space. So you go against all your training and decide to duplicate the comments. Nerd heroism at its best. Let inductive rules derived from observation guide you rather than deductions from arbitrarily chosen first principles. Very Enlightenment era thinking. Voltaire would be proud.

In a relational world duplication is removed in order to prevent update anomalies. Error prevention is the driving force in relational modeling. Normalization is a kind of ethical system for data. What happens, for example, if a comment changes? Both copies of the comment must be updated. That leads to errors because who can remember where all the data is stored? A severe ethical violation may happen. Go directly to relational jail :-)

BigTable data ethics are more Mardi Gras than dinner with the in-laws. Data just wants to have fun. BigTable won’t stop you from hurting yourself. And to get the best results you may have to engage in some conventionally risky behaviors. But if those are the glass bead necklaces you have to give for a peak at scalability, why not take a walk on the wild side?

For a more modern post-relational discussion of data ethics I’m using as my primary source a thread of conversations from JA Robson, Ben the Indefatigable, Michael Brunton-Spall, and especially Brett Morgan. According to our new Voltaire, Locke, Bacon, and Newton, here’s what it takes to act ethically in a BigTable world:

  • Don’t bother with BigTable unless your goal is to create a web site that scales to millions of users. The techniques for building scalable read-mostly web applications are difficult and require a radical mindset change. Standard relational techniques work very well until you scale to huge numbers of users. It is at that point you need to break the rules and do something counter-intuitively different. More of the same will not work. If you don’t plan to get to that point it may not be worth the effort to change. BigTable is targeted at building web applications, It's nature makes it a poor match for OLAP, data warehousing, data mining, and other applications performing complex data manipulations.

  • Assume slower random data access rather than fast sequential access. Every get of an entity could be from a different disk block on a different machine in a cluster. Calculating, for example, the average over a column in SQL can be efficient because data is stored together on disk. In BigTable data can be anywhere so iterating over every value in a column is expensive. Each read is potentially a random block from anywhere which means the average retrieval time can be relatively high. The implication is to use BigTable you must adopt some unfamiliar and unintuitive strategies in order to deal with such a very different performance profile. Using relational database we are used to writing applications against fast highly performant databases. With BigTable you have to become familiar with the rules for developing against a slower but more scalable database. Neither approach is better for all purposes, but BigTable has the edge for high scalability.

  • Group data for concurrent reads. Given the high cost of reading data from BigTable your application will not scale if every page requires a large number of reads. The solution: denormalize. Store data in the same entity based on what data needs to be read concurrently. Relational modeling groups data together based on the “minimize problems” rule. BigTable’s new rule is “maximize concurrent reads” which implies denormalization. Store entities so they can be read in one access rather than performing a join requiring multiple reads. Instead of storing attributes in separate entities in order to remove duplication, duplicate the attributes and store them where they need to be used. Following this rule minimizes the number of reads required to return an entity.

  • Disk and CPU are cheap so stop worrying about them and scale. A criticism of denormalization is storing duplicate data wastes disk space. Google’s architecture trades disk space for better performance. Disk is (relatively) cheap, so don’t fight it. On the CPU front a data center’s worth of CPU is at your service. As long as you structure your application in the way GAE forces you to, your application can scale as large as it needs to simply by running on more machines. All scalability bottlenecks have been removed.

  • Structure data around how it will be used. Trade SQL sets for application based entities. Queries are slow so the closer data is to the format it is to be used the faster pages will render. It’s like the database model becomes the model previously used at the caching layer. Complete entities tend to be cached, not low level detail rows. That’s what BigTable models should look like because that’s how concurrent reads are maximized. This isn’t the same as an object oriented database because the behavior is provided by applications, behavior is not bound to the entity so multiple applications can read the same entities yet implement very different behaviors.

  • Compute attributes at write time. Since looping over large columns of data is inefficient with BigTable the idea is to calculate values at write time instead of read time. For example, instead of calculating an average by reading an entire column at read time, track the total number and the total value at write time so the average can be calculated with one read on page display. Programmer effort is made up front at write time to minimize the work needed at read time. Preventing applications from iterating over huge data is key for making applications scale. Given the limitations of GAE transactions and quotas, GAE may not be appropriate for business applications that need exact summary statistics. Warning: if the summary stat is written on every read request then this approach will not scale as writes don't scale.

  • Create large entities with optional fields. Normalization creates lots of small entities. Instead, create larger entities with optional parts so you can do one read and then determine what’s present at run time. This shifts work from the database to the CPU while minimizing the number joins.

  • Define schemas in models. Denormalization requires user developed code to properly keep data consistent across multiple entities. The database won’t do it for you anymore. Schemas are really defined in code because it’s only code that can track all the relationships and maintain correctness. All database access must go through the models or otherwise the much feared inconsistency problems will result.

  • Hide updates using Ajax. Updates are slow so big bang updates of many entities will appear slow to users . Instead, use Ajax to update the database in little increments. As a user enters form data update the database so the update cost is amortized over many calls rather than one big call at the end. The result is a good user experience and a more scalable app.

  • Puts are Precious. Updating entities in large batches, say even 200 at a time, isn't
    part of the BigTable model. Entity attributes are automatically and synchronously indexed on writes. Indexing is an expensive operation that accumulates a lot of CPU time so the number updates that can be performed in one query is quite limited. The work around is to perform updates in smaller batches driven by an external CPU. Even when GAE provides the ability run batches within GAE the programming model for writes needs to be accounted for in a design.

  • Design By Explicit Cost Model. If you are going to be charged for an operation GAE wants you to explicitly ask for it. This is why some automatic navigation between objects isn't provided because that will force an explicit query to be written. Writing an explicit query is a sort of EULA for being charged. Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.

  • Place a many-to-many relation in the entity with the fewest number of elements. One way to create a many-to-many relationship is to have a list property that contains keys to the other related entities. A Company entity, for example, could contain a list of keys to Contact entities or a Contact entity could contain a list of keys to Company entities. Since it's likely a Contact is associated with fewer Companies the list should be contained in the Contact. The reasoning is maintaining large lists is relatively inefficient so you want to minimize the number of items in a list as much as possible.

  • Avoid unbounded queries. Large queries don't scale. Consider showing only the most recent 10 or so values from an attribute.

  • Avoid contention on datastore entities. If every request to your app reads or writes a particular entity, latency will increase as your traffic goes up because reads and writes on a given entity are sequential. One example construct you should avoid at all costs is the global counter, i.e. an entity that keeps track of a count and is updated or read on every request.

  • Avoid large entity groups. Any two entities that share a common ancestor belong to the same entity group. All writes to an entity group are sequential, so large entity groups can bog down popular apps quickly if there are a lot of writes to that group. Instead, use small, localized groups in your design.

  • Shard counters. Increment one of N counters and sum those N counters on the read side. This avoids the dreaded write bottleneck. See Efficient Global Counters by App Engine Fan for more details.

    An excellent example showing some of these principles in action can be found in this GQL thread.

    Take this nicely normalized schema:

    - Name
    - Country
    - Code
    - Name
    - Description
    - Reference to Product Entity
    - Reference to Customer Entity
    - Date of order

    Anyone from a relational background would look at this schema and give it a big thumbs up. With a little effort we can imagine the original physical purchase order that has now been normalized into three different tables.

    To recreate the original purchase order a join on purchases, produce and customer is needed. Read speed is not optimized, safety is optimized.

    Here’s what the same schema looks like optimized for reading:

    - Customer Name
    - Customer Country
    - Product Code
    - Product Name
    - Purchase Order Number
    - Date Of Order

    The three original tables have been folded into one entity. Now a purchase order can be read in one get operation. No join necessary. Notice how the entity looks more like an original purchase order. It is also what would probably be cached and is what our model would probably look like.

    But what if you want to update a product name or a customer name? Those attributes are duplicated in all entities. Here’s where the protection offered by the relational model comes in. Only one entity needs updating in a normalized model.

    In BigTable you have to remember everywhere a customer name and product name and change every instance to new values. It’s not a simple, safe, or reliable approach. But it does optimize for read speed and scalability.

    For an application with a high proportion of updates to reads this approach wouldn’t make sense. But on the web reads usually dominate. How often do you really change a customer name or a product name? Seldom. How often do you read them? All the time.

    Designing to scale for reads and taking the pain on writes takes some getting used to. It’s a massive change to standard relational tactics. But this is what it takes to scale web applications, even if it feels a little strange at first.

    Related Articles

  • ER-Modeling with Google App Engine (updated)
  • Tips on writing scalable apps
  • Reader Comments (35)

    Very interesting article. Thanks for putting it up Todd.

    You're not up to an easy task if you have to migrate your application, that is based on a relational model, into a "bigtable" application. Even more changes are required than when you need to create an OLAP database, based on your OLTP one in production.

    I particularly like the AJAX tip to make the users think that the site has awesome response. You can see exactly that in Gmail, or even at some of the busiest forums that I visit.

    Best regards,

    November 29, 1990 | Unregistered CommenterDBA

    I've read somewhere a while ago that ebay uses a scheme where no joins in the database are permitted, just like the BigTable approach. Any joins that need to be done are done completely in code on the web-server. I guess this also helps scalability because you take away work from the database (of which there is typically only one) and move it to the web-server (of which you can have many). Ofcourse a cluster of databases is an option as well, but multiplying the web-server is much easier.

    November 29, 1990 | Unregistered CommenterElmer

    Very nice summary. Especially the notion of more natural data mapping on the DB level. This is why CouchDB et. al. use a "document"-oriented approach to storing data. Take the data in the form it exists and store it in that form.

    November 29, 1990 | Unregistered CommenterJan

    "use Ajax to update the database in little increments"

    That sounds ridiculous.

    You'll have quirky UI without "Save" button. It's nice in preference panels on OS X, but quite unexpected in web apps. User errors/re-edits of the form will cause more updates.

    Instead you should make updates asynchronous. Get the data, update what you can update quickly, store rest in a queue to update later and tell user it's done.

    November 29, 1990 | Unregistered CommenterkL

    The save button would simply commit the data. Use a bit to toggle the data from an unfinished to finished state. This satisfies your need for a save button.

    November 29, 1990 | Unregistered CommenterAnonymous

    I begin to understand how to write the data layer in GAE but I am not
    quite there yet... what has been said in the article referenced here
    makes sense for scenarios where an app does way more reads than
    writes (indeed, taking from the article example, username and
    productname seldomly change). However, what about highly dynamic and
    interactive social apps where writes are on par (almost) with reads.
    For example, take an app where users rate products. Right now I have a
    rating model (one rating per user per product), a user model and a
    product model. The rating model caches/duplicates/denormalizes a
    couple of the product attributes that are always displayed together
    with the rating... those attributes rarely, if never, change (e.g.,
    product code). However, every time I display a rating I also want to
    display the average rating and the total number of ratings. Right now
    these are calculated at write time and stored/cached in the product
    model. Every time I display the rating, I pull out the associated
    product instance
    from a reference property and display its rating average and count.
    Effectively for every page read I have to pull two model instances out
    of the datastore. If I follow what's written in the article I should
    instead store, in a highly duplicated fashion, the rating average and
    count in every rating instance... which means that every time a user
    rates a product possibly hundred of thousands rating instances need to
    be updated... instead of simply read from the datastore. How much
    slower are writes compared to reads in the cloud? That's why we need
    real-worl benchmarks and guidelines from Google engineers who have
    already a lot of experience... theoretical enunciations are good for
    an introduction but are not that informative (I am a neuroscientist...
    I need to see hardcore data before making a decision). Another
    question arise, right now ratings have the user who created them as
    parent. Should I instead use the product to speed up the
    aforementioned reads/writes used to aggregate ratings data into
    product statistics?

    Thanx you all for your suggestions to come,


    November 29, 1990 | Unregistered CommenterEdoardo "Dado" Marcora


    I don't think that the article would recommend duplicating the rating information to each rating. In a sql situation, one could potentially store only the rating in the product instance and then calculate the total rating /# of ratings with a simple SQL statement.

    In BigTable, you would want to keep the design that you have used in your sql of keeping the rating count and current average in the product instance. Then updating the rating consists of exactly one read and one write.

    Duplicating the data in this case would make matters harder in bigtable as well.

    November 29, 1990 | Unregistered CommenterJess Sightler

    The shift to denormalization in performance-critical situations is hardly a new thing; don't dismiss all traditional DBAs as being stuck in 1993, problems of this nature have been encountered and solved well before Google was around.

    The whole study of data warehousing is partly an exercise in how much extra space you can use to improve the performance of queries that involve huge amounts of data and relate many different entities. An OLTP schema of 50 tables might be compacted down to a few fact tables and a handful of dimensions, in order to reduce the effort to relate information. This is in direct response to the difficulty in writing reporting-type queries against normalized tables.

    Of course a data warehouse and a web database are two entirely different things. My point is that a real DBA can understand the trade-off between safety/data integrity and performance, as you've pointed out, and not throw a fit upon seeing a table not in third normal form.

    November 29, 1990 | Unregistered CommenterAlex Tomic

    Don't conflate logical design with physical design. The only reason you're flouting relational convention is that most RDBMS map the logical relation -directly- into physical relationships.

    "Denormalization requires user developed code to properly keep data consistent across multiple entities. The database won’t do it for you anymore. Schemas are really defined in code because it’s only code that can track all the relationships and maintain correctness"

    ....And that's a good thing?

    I mean, I understand expediency and all that jazz, but, shouldn't we just be _fixing_ the databases instead of moving to a dumber technology?

    November 29, 1990 | Unregistered CommenterAnonymous

    A small doubt. While you propose that people should try to denormalize their whole model if they were to deploy their application in GAE. Fair enough. But this is what is contradictory.

    While you say that, duplicate information wherever you want, you also say puts(writes) are precious. Say for example in the Purchase Order example, if the name of the Product is modified, then if the Product is sold to more than 1000 customers then i have to update in thousand PurchaseOrders, which according to GAE is slow. Right?

    Does GAE guarantee that all the 1000 entries will be updated, even if its slow.

    November 29, 1990 | Unregistered CommenterR.Sriram

    > While you say that, duplicate information wherever you want, you also say puts(writes) are precious.

    You are 100% correct that this is a contradiction. It will take some experimentation to see what is the proper mix is for GAE. It's probably true that for simple data models using object relationships will work well. Now when you get to more complex data models when you are integrating many many data sources to make up a page then it seems unlikely you'll be fast enough and you have none of your normal tools (caching, threading) to work around the problem. We'll have to see. It's a new platform with a very specific niche and it will take a while to best figure out how to exploit that niche.

    > Does GAE guarantee that all the 1000 entries will be updated, even if its slow.

    Not if it exceeds your CPU cap.

    November 29, 1990 | Unregistered CommenterTodd Hoff

    The sometimes-implied and sometimes-explicit insistence by Google App Engine (GAE) groupies that developers somehow gain by abandoning relational database management systems (RDBMS) is a sad and ridiculous end-result of overhyped object technology. Today many, if not most, young developers are wary of relational databases, are heavily invested in the use of OO tools and very aware of the impedance mismatch between the two. Unfortunately this leads to a belief that RDBMS should be supplanted by "something better". I have several objections to this belief's presence in the discussion of GAE.

    About scalability: all that is required is that one wait. Today's software problem will be solved when faster hardware arrives, possibly as soon as next year. Stories describing this abound on the website.

    But suppose I do invest the effort in creating a GAE app. My app is scalable, runs fast but needs ad-hoc patching and has inconsistencies between views of data, God knows when and where. Can I trust this platform? The answer is obviously no. Would I recast it in relational form were a scalable RDBMS available? Undoubtedly the answer is yes.

    A priori, the chance is extremely low that any particular web app will prove popular enough to require the scalability of Google's BigTable. Very few web apps (a miniscule fraction, really) require anything near the throughput that Google does. Perhaps GAE is adequate for those few apps. But for most web apps the RDBMS approach will be more fruitful. You can therefore choose to deploy a
    - buggy but super-scalable application using an OOP design (GAE) and fix it all the time,
    - reliable but non-super-scalable application using proven RDBMS design.
    In either case, over time you will end up using the RDBMS approach. But in the latter case you'll only need to code it once.

    GAE is most valuable for it's integration and choice of tools, not for it's quirky underlying structure. Google's engineers drank the "OOP-is-superior-to-RDBMS" koolaid on this one. I encourage developers to stay the true RDBMS path. It's easier, more widely used, and even has an aesthetically pleasing theoretical foundation (OOP databases have none).

    November 29, 1990 | Unregistered Commentertndal

    There were a few comments along the lines of "if you don't need to scale, why bother?".

    I don't think the individual scaling needs of an application are the only justification for the architecture. Forcing the apps into this model allows the service, as a whole, to scale.

    Or in other words: doing it like this allows your app, regardless of scale, to run on Google's infrastructure. That the application itself becomes scalable is an added bonus.

    November 29, 1990 | Unregistered CommenterCarsten Neubert

    However, what about highly dynamic and
    interactive social apps where writes are on par (almost) with reads.

    .... hmm, you might just have put your finger on why (social website that emerged from within google) was both slow and highly inconsistent, for the longest time.

    November 29, 1990 | Unregistered Commenterhko

    Suppose a table that stores articles. Each article can be editted by hundrets of diffrent people. These people are stored in another table (Users). This is a many-to-many-situation, as far as I understood it.

    Now the problem: I want to find out, which articles can be editted by a given user.

    Now my denormalized proposal: Store the (many) different alphanumerical UIDs of the Editors in a FULLTEXT indexed field in the row of the article? When trying to find all the articles assigned to an Editor: SELECT article_uid FROM articles WHERE MATCH (editors_text) AGAINST ('my_user_364542_uid').

    Is that cool? The only other denormalized approach would be to store the article seperatly for each editor. (there might be hundrets of editors).

    PS: I am trying to avoid JOINs since all my tables are sharded over several DBs

    November 29, 1990 | Unregistered CommenterTobias

    How would I go about implementing a technology such as Google BigTable? "Download BigTable" in Google Search doesn't return useful results...

    November 29, 1990 | Unregistered CommenterJason

    Well,I didn't knew about that how google got into storing big huge sets of data using the "bigtable". Capable of even dealing with perabytes..!!
    -----">sea plants">Sea grapes...">plant roots

    November 29, 1990 | Unregistered Commenterfarhaj

    Sounds like this builds upon the design of dimensional data warehouses from back in the 90's:

    Some of us stopped worrying a long time ago. Nice to see it catching on.

    November 29, 1990 | Unregistered CommenterChris Kerlin

    With the advancement of technology, disk space and bandwidth have become cheaper and cheaper over the years. Webmaster now does not need to worry about lack of diskspace and can concentrate on building their sites.

    November 29, 1990 | Unregistered CommenterWebmaster Forum

    A small doubt. While you propose that people should try to denormalize their whole model if they were to deploy their application in GAE. Fair enough. But this is what is contradictory.

    November 29, 1990 | Unregistered CommenterOyun

    With the latest advancement in technology, bandwidth and disk space are becoming cheaper as I learned it in my"> 70-270 exam which tests you in your knowledge of installing and configuring application on server, but if you want to become more expert in server matter then"> 70-640 is best which leads to Server Administrator Core Requirements and give you huge knowledge to pass your"> 70-649 with full confidence and become expert of your certification. By passing these certification you can manage all your work easily as other perofessionals do.

    November 29, 1990 | Unregistered CommenterTestKing

    hello guyz

    yeah if you want learn any thing try and try and try .
    about scalability: all that is required is that one wait. Today's software problem will be solved when faster hardware arrives, possibly as soon as next year. Stories describing this abound on the website.


    November 29, 1990 | Unregistered CommenterAnonymous

    I have used xcache instead of memcached for my projects" title="недвижимость киев">Dom &" title="free online dating">Live Date Search. It can cache both PHP variables and compiles PHP pages which is a great advantages.
    Of course, it can not cache MySQL directly like memcached but it's possible to create own SQL cache based on query hash on xcache. And it gives great space saving as a bonus :-)

    November 29, 1990 | Unregistered CommenterAnthony Date

    I find amazing how a good database architecture is still underestimated, everybody is worry about the last technologic piece of jewellery.

    Tom Green" target="_blank">Gold Price Blog" target="_blank">Personal finance advice" target="_blank">Insurance blog

    November 29, 1990 | Unregistered CommenterAnonymous

    Great article. Nowadays people seem to forget how design of a database may facilitate or slow down work. I guess that in pursuit of technological novelties many of us have forgot about the good programming practices.
    Best regards,">Websites for sale

    November 29, 1990 | Unregistered CommenterAna Backstone

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>