advertise
« Robert Scoble's Rules for Successfully Scaling Startups | Main | ZooKeeper - A Reliable, Scalable Distributed Coordination System »
Wednesday
Jul162008

The Mother of All Database Normalization Debates on Coding Horror

Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow.

Here are a few of the pithier points (summarized):

  • Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. (Jeff)
  • Normalize until it hurts, denormalize until it works. (Jeff)
  • Use materialized views which are tables created and maintained by your RDBMS. So a materialized view will act exactly like a de-normalized table would - except you keep you original normalized structure and any change to original data will propagate to the view automatically. (Goran)
  • According to Codd and Date table names should be singular, but what did they know. (Pablo, LOL)
  • Denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed. Denormalization brings with it it's own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh or accepting that there might be temporary or permanent data integrity issues) and so on. (Dare Obasanjo)
  • What happens, is that people see "Normalisation = Slow", that makes them assume that normalisation isn't needed. "My data retrieval needs to be fast, therefore I am not going to normalise!" (Tubs)
  • You can read fast and store slow or you can store fast and read slow. The biggest performance killer is so called physical read. Finding and accessing data on disk is the slowest operation. Unless child table is clustered indexed and you're using the cluster index in the join you will be making lots of small random access reads on the disk to find and access the child table data. This will be slow. (Goran)
  • The biggest scalability problems I face are with human processes, not computer processes. (John)
  • Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend. (Chris)
  • Normalization is about design, denormalization is about optimization. (Peter Becker)
  • You're just another knucklehead. (BuggyFunBunny)
  • Lets unroll our loops next. RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter.(Christog)
  • Jeff, are you awake? (wiggle)
  • Denormalization may be all well and good, when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster, ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations. (Sean)
  • Do you sex a cat, or do you gender it? (Simon)
  • This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue. (Marcel)
  • This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). (TED)
  • Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application. (superjason)
  • Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, (Ed)
  • Keep a de-normalized replica for expensive operations (e.g. reports), Cache Results for repeat queries (Memcache), Partition the database for scalability (vertical or horizontal) (Gareth)
  • Speaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan "child" records, etc. Everybody says "we rely on the application to maintain that", and it never, never does. (A. Lloyd Flanagan)
  • I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals. (Wayne)
  • De-normalization is for reporting, not for OLTP. (Eric)
  • Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys. (Leandro)
  • What I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues. (Sam)
  • You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CrUD operations, and retrieving small sets of data in order to support those crud operations. Think about people modifying their profiles, recording cash registers operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. (RevMike)
  • Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data. (Chris)
  • Cache the generated view pages first. Then cache the data. You have to think about your content- very infrequently will anyone be updating it, it's all inserts. So you don't have to worry about normalization too much. (Matt)
  • I wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set. (Thomi)
  • Denormalization is the last possible step to take, yet the first to be suggested by fools. (Jeremy)
  • There is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. (David)
  • The whole issue is pretty simple 99% of the time - normalized databases are write optimized by nature. Since writing is slower then reading and most database are for CRUD then normalizing makes sense. Unless you are doing *a lot* more reading then writing. Then if all else fails (indexes, etc.) then create de-normalized tables (in addition to the normalized ones). (Rob)
  • Don't fear normalization. Embrace it. (Charles)
  • I read on and discovered all the loons and morons who think they know a lot more then they do about databases. (Paul)
  • Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone -- hey, this is the future!; Put the rest of the info into a TEXT variable, like "extra_info", in JSON format. This could be educational history, or anything else that you would never search by; If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need. (Greg)
  • How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting? How one stores data vs. how one uses data vs. how one collects data vs. how timely must new data be visible to the world vs. should be put OLTP data into a OLAP cube each night? etc. are all factors that matter. (Steve)
  • It might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized. (PRMAN)
  • Your LOGICAL model should *always* be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system specific tools (materialized views, cache tables, etc) to improve performance, but such things should be done *after* the application level solutions are exhausted (page caching, data caches, etc.) (Jonn)
  • For very large scale applications I have found that application partitioning can go a long way to giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher level code.
    (John)
  • People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond "3 affiliations". (PRMan)
  • People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond "3 affiliations". (Steve)
  • I normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don't get rebuilt more than necessary). (Mo)
  • Stop playing with mySQL. (Jonathan)
  • What a horrible, cowboy attitude to DB design. I hope you don't design any real databases. (Dave)
  • IOW, scalability is not a problem, until it is. Strip away the scatalogical reference, and all you have is a boring truism. (Yawn)
  • Is my Site OLTP? If the answer is yes then Normalize. Is my site OLAP? If the answer is yes then De-Normalize! (WeAreJimbo)
  • This is a dangerous article, or perhaps you just haven't seen the number of horrific "denormalised" databases I have. People use these posts as excuses to build some truly horrific crimes against sanity. (AbGenFac)
  • Be careful not to confuse a denormalised database with a non-normalised database. The former exists because a previously normailsed database needed to be 'optimised' in some way. The latter exists because it was 'designed' that way from scratch. The difference is subtle, but important. (Bob)

    OK, more than a few quotes. There's certainly no lack of passion on the issue!

    One thing I would add is to organize your application around an application level service layer rather than allowing applications to access the database directly at a low level. Amazon is a good example of this approach. Many of the denormalization comments have to do with the problems of data inconsistency, which is of course true because that's why normalization exists. Many of these problems can be reduced if there's a single service access point over which to get data. It's when data access is spread throughout an application that we see serious problems.

    Related Articles



  • Denormalization Patterns by Kenneth Downs
  • When Databases Lie: Consistency vs. Availability in Distributed Systems by Dare Obasanjo
  • Stored procedure reporting & scalability by Jason Young
  • When Not to Normalize your SQL Database by Dare Obasanjo
  • Reader Comments (7)

    My least favorite saying is "Normalize till it hurts, Denormalize till it works." It implies that normalization is painful and you must de-normalize to achieve functionality.

    In fact, normalization optimizes for correct writes. Denormalization optimizes for convenient reads. If you don't need a lot of fast reads, you don't really need to denormalize.

    The only redeeming characteristic of the saying is that it gets things in the right order. You normalize first to establish the correct record-keeping scheme, then denormalize by copying, extending, and aggregating. You cannot denormalize unless you have first normalized.

    BTW, the six way JOIN in the original article was way off track, it would give a completely meaningless result. To display a person's profile he'd need at least 3 separate queries.

    November 29, 1990 | Unregistered CommenterKenneth Downs

    Having normalized form be the religious doctrine preached by professors is a good thing -- those who aren't willing to explore the trade-offs of different approaches will be best served by that, but for high-use systems whose performance wont be satisfied by those traditional methods, take the blinders off and look around. A relational-database is a tool. Normalized form is a tool. Transactions are a tool. Indexes are a tool. Do you think Google uses relational databases for their web indexes?

    November 29, 1990 | Unregistered CommenterAc5p

    Normalization is a great idea, but like all great ideas, it can be carried to useless extremes. It is possible, and there are implementations of databases that use only pairs (lexicons) of data. A database that does this is automatically in about 7th normal form. However highly extensible this level of normalization is, it does wonders to obscure the intent of the rows.

    Just like the guy who has a margarita at happy hour and the guy who takes two shots of tequila for breakfast are two different people neither, both, or one of whom may have a problem (and it might not be the one that is obvious), it is sometimes useful to have a small number of attributes that might look like repeats in a row in order to AVOID problems. If two somewhat unrelated pieces use the same key and are related 1-to-1, it may not be a bad idea for them to share sections of the same row. They may be out of date, but they can't get disassociated. If you think that you can fix this all by transaction boundaries, THINK AGAIN. I have scars to prove it.

    However, an unpredictable number of attributes (or groups of them) that are sparse and counted by another attribute in the row is a bad idea. When you logically think about deleting and adding attributes, you should put them in a lexicon with some distinguishing key based upon the main part of the row.

    November 29, 1990 | Unregistered CommenterMich Ravera

    when you read you want to use de-normalized data
    when you search you use normalized data.
    when you write, you do both.

    November 29, 1990 | Unregistered CommenterAnonymous

    The trouble with normalization is that people usually get their first experience with it using MySQL.
    MySQL doesn't work well with JOINs, but it's easy to assume that the trouble lies with the method, not the RDBM. However, it inherently lies in MySQL's inability to plan a reasonable join (it's been plaguing MySQL for years), and database administrator's inability to correctly optimize indexes.

    There is no better way to get inconsistent data, clumsy and unreadable schema than with denormalization.

    Denormalization is a beginner's mistake, don't fall into that.

    November 29, 1990 | Unregistered Commenterdenis

    hi denis, some of your thought are reasonable,
    but you miss the whole picture I think.

    you simply cannot always use normalized data.

    take this example:

    search on a video site, results show 12,400 videos about "cats" every video has from 1 to 10 tags. every page shows 15 videos.

    you need to show the tags of each of 15 videos while paging the results of this query:
    "whatever IN ('cat')"

    the site has thousands of people running searches on dogs, cat, horses etc. at the same time.
    are you still using normalized search on the tags to videos table?
    I hope you don't.

    November 29, 1990 | Unregistered Commentersnake

    I'm not one to place fault on anyone about their believes or the way they decide to do things. I will, however, give me opinion on the matter and let you do with it what you will. Having said that, I can not imagine, even consider for a minute, designing a database without normalization. After all, the bottom line objective of any database is to convert plain everyday data into sound, solid, and correct information on which sound calculated decision can based on. It's that simple.

    So then, if a database is designed in such a fashion that it allows inconsistent data to be entered and manipulated, the end result is going to a production of bad information which can potentially cost an organization money and business, not to mention failure. You should always remember, when it comes to databases, the saying "garbage in, garbage out" is oh so true.

    Normalization has a very specific purpose and place in database design. Its main purpose is to ensure the transformation of plain data entry into meaningful and useful information; the key word here being “Information”, as in Information Technology. If you cannot pull good information out of your database to effectively and efficiently run a business, then what’s the point?

    May 27, 2011 | Unregistered CommenterSantos Velasquez

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Post:
     
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>