Database People Hating on MapReduce

Todd Hoff's picture

Update: Typical Programmer tackles the technical issues in Relational Database Experts Jump The MapReduce Shark. The culture clash is still what fascinates me.

David DeWitt writes in the Database Column that MapReduce is a major step backwards:

  • A giant step backward in the programming paradigm for large-scale data intensive applications
  • A sub-optimal implementation, in that it uses brute force instead of indexing
  • Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  • Missing most of the features that are routinely included in current DBMS
  • Incompatible with all of the tools DBMS users have come to depend on

    Listening to databasers and map reducers talk is like eavesdropping on your average family holiday mashup. Every holiday, people who have virtually nothing in common are thrown together because they incidentally share a little DNA or are married to the shared DNA. In desperation everyone gravitates to some shared enemy they can all confidently bash. But once that moment of relief passes and awkward silence looms again, nothing is left but more drinking and tackling sensitive topics you just know will end badly.

    Database folks love their schemas, relational purity and their Swiss Army knife indexes. You soon learn that really map reduce is just another form of an index and indexes really can scale to any heights with just a little tweaking. Map reducers love their pure functional models, their self-healing, cluster-filled ecosystems, and the sheer joy of the semi-organized chaos of letting 10,000 CPUs simultaneously bloom.

    I for one stand firmly by the relish tray. Transactions have a place and so does structured data. That's why Google contributes heavily to MySQL. Yet, I too like my map reduce engine, distributed file system combo platter. With map reduce I can implement any complex behavior over any data set. With enough machines that work can be performed in a predictable amount of time. You aren't limited to set logic, SQL types, and tweaked indexes. That's pretty good stuff too.

    Much like a staunchly conservative nail crunching father and his too soft pansy liberal son, these two camps will never understand each other. Every sign of beauty in one person's eyes is just another confirmation to the other side of impending senility. Why even try? Just hug in a manly way and agree to meet again next year.

  • Comments

    Re: Database People Hating on MapReduce

    Slight typo, first paragraph after the bullet list, you say "ease dropping" when I think you mean "eavesdropping". :)

    Cheers - Callum.

    Re: Database People Hating on MapReduce

    PS> Todd, whatever you've been smoking over the last few weeks that has led to this new philosophical, art of war quoting author, I want me some! :)

    Re: Database People Hating on MapReduce

    > You soon learn that really map reduce is just another form of an index and indexes really can scale to any heights with just a little tweaking.

    How so? Map reduce iterates over every data element, akin to a table scan. There's nothing sorted or indexed about it at all.

    After reading the article, I'm convinced the author has no idea what MapReduce is used for. The main purpose of a database is to find particular pieces of information quickly among a large data set. MapReduce is used to run distributed calculations on every piece of information in a data set. Indexes are largely useless if you need to hit every data element (unless it's a covering index, but I digress).
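
    The scan-versus-lookup distinction in this comment can be sketched in a few lines of Python. This is only a toy illustration, not how any real MapReduce engine or database is built; the `RECORDS` data and all function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy data set: (record_id, city, amount) tuples.
RECORDS = [
    (1, "Oslo", 10),
    (2, "Lima", 25),
    (3, "Oslo", 5),
    (4, "Pune", 40),
]


def map_reduce_total_by_city(records):
    """Touch every record once (a full scan), then aggregate per key."""
    # Map phase: emit a (key, value) pair for every record.
    pairs = [(city, amount) for _, city, amount in records]
    # Shuffle phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: combine each group's values into one result.
    return {key: sum(values) for key, values in groups.items()}


def build_city_index(records):
    """Build an index once so later lookups avoid scanning everything."""
    index = defaultdict(list)
    for record in records:
        index[record[1]].append(record)
    return index


totals = map_reduce_total_by_city(RECORDS)
index = build_city_index(RECORDS)
oslo_rows = index["Oslo"]  # a direct lookup, no scan of RECORDS
```

    The aggregate has to read every record no matter what, so an index on `city` buys it nothing; the lookup for one city is where the index pays off.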

    I also doubt that Google runs a MapReduce job when you perform a search query... In fact, I'm positive they have indexes that would make most of our heads spin ;)

    Two completely different problems. Or, put another way, when all you have is a hammer, everything looks like a nail.

    Sean

    Todd Hoff's picture

    Re: Database People Hating on MapReduce

    > Slight typo

    Ack, thanks.

    > Todd, whatever you've been smoking

    Just sniffing blue sky and sipping organically distilled rain water. :-)

    > How so?

    This take stems from an actual overheard conversation between two entrenched advocates from both sides. Really fascinating. Databases with columnar and adaptive indexes were capable of great things and map reduce didn't need no stinkin' indexes to deliver. So I think each side "understands" each other, they just don't understand each other.

    Re: Database People Hating on MapReduce

    >I also doubt that Google runs a MapReduce job when you perform a search query

    Huh? How else do you think it's done? I just did a search for "The quick brown fox jumps over the lazy dog". Do you really think that the above search query is hitting a database?

    The amount of data that is being searched is HUGE. The number of requests per second is HUGE. The response time is TINY.

    How do you index every single word on a page using a database using indexes? Is this scalable to the number of requests that Google handles? What about response times it provides? (0.14 secs for me for the above search term)

    I'll go back over to the relish tray.. :)

    Re: Database People Hating on MapReduce

    I don't think that MapReduce is invoked for the query. I don't work for Google, so obviously this is an educated guess.

    Reviewing what technology they have opened to the public, I would imagine that MapReduce is used to generate the index. Once the index is created, a query is really doing a query against the index and combining the results for relevance and presentation.
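
    That split, a batch job that builds the index and a cheap lookup at query time, might look roughly like the sketch below. It is a guess at the general shape only, run in a single process; `PAGES`, `map_page`, `reduce_postings`, `build_index`, and `query` are all hypothetical names, and nothing here reflects Google's actual code.

```python
from collections import defaultdict

# Hypothetical crawled pages: page id -> text.
PAGES = {
    "p1": "the quick brown fox",
    "p2": "the lazy dog",
    "p3": "quick dog",
}


def map_page(page_id, text):
    """Map: emit (word, page_id) for each distinct word on a page."""
    return [(word, page_id) for word in set(text.lower().split())]


def reduce_postings(word, page_ids):
    """Reduce: collapse all page ids for a word into a sorted posting list."""
    return word, sorted(set(page_ids))


def build_index(pages):
    """Offline batch job: run map, shuffle, and reduce over every page."""
    shuffled = defaultdict(list)
    for page_id, text in pages.items():
        for word, pid in map_page(page_id, text):
            shuffled[word].append(pid)
    return dict(reduce_postings(w, ids) for w, ids in shuffled.items())


def query(index, word):
    """Online path: a plain lookup against the prebuilt index."""
    return index.get(word.lower(), [])


INDEX = build_index(PAGES)
```

    Calling `query(INDEX, "quick")` never re-reads `PAGES`; all the expensive work happened once in `build_index`.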

    Re: Database People Hating on MapReduce

    >>I also doubt that Google runs a MapReduce job when you perform a search query
    > Huh? How else do you think it's done?

    They use a specialized, massively scalable, massively parallelized database called BigTable.

    > I just did a search for "The quick brown fox jumps over the lazy dog". Do you really think that the above search query is hitting a database?

    Yes.

    > How do you index every single word on a page using a database using indexes?

    By using a specialized, massively scalable, massively parallelized database.

    > Is this scalable to the number of requests that Google handles?

    Apparently it is.

    > What about response times it provides?

    Achievable if a) the search is parallelized, and b) extensive memory caching (and other optimizations) is used. Think about it: (after stripping out irrelevant words) each word in your query gets sent to a different DB machine. Each machine looks its word up in its index, and gets back a list of pages that contain that word. In most cases, the lookup can be fetched from a memory cache. When it can't, they've optimized it so that it only takes 1 disk seek, which is the next fastest thing. They then union the lists of pages together, and serve you up your results.
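
    The flow described here (drop stop words, route each word to a machine, try the cache before disk, union the page lists) can be sketched as a toy model. Everything in it is an assumption for illustration: the `Shard` class, the stop-word list, the shard count, and the use of `zlib.crc32` for routing. The lookups run one after another here, where the comment imagines them running in parallel.

```python
import zlib

STOP_WORDS = {"the", "a", "over"}  # assumed list of "irrelevant" words
NUM_SHARDS = 3                     # assumed number of index machines


class Shard:
    """One index machine: a slow 'disk' store fronted by a memory cache."""

    def __init__(self):
        self.disk = {}       # word -> set of page ids (stands in for disk)
        self.cache = {}      # word -> set of page ids (stands in for RAM)
        self.disk_reads = 0  # counts cache misses

    def lookup(self, word):
        if word in self.cache:
            return self.cache[word]
        self.disk_reads += 1
        pages = self.disk.get(word, set())
        self.cache[word] = pages
        return pages


def shard_for(word):
    """Pick a shard with a hash that is stable across runs."""
    return zlib.crc32(word.encode("utf-8")) % NUM_SHARDS


def search(shards, query):
    """Drop stop words, look each word up on its shard, union the results."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    results = set()
    for word in words:
        results |= shards[shard_for(word)].lookup(word)
    return results


shards = [Shard() for _ in range(NUM_SHARDS)]
for word, pages in {"quick": {"p1", "p3"}, "dog": {"p2", "p3"}}.items():
    shards[shard_for(word)].disk[word] = pages
```

    The union follows the comment as written; requiring pages to contain every query word would mean intersecting the lists instead.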

    Read the BigTable paper at http://labs.google.com/papers/bigtable.html if you want to learn more. (And be amazed!)

    Re: Database People Hating on MapReduce

    The previous commenter is correct. MapReduce is used to gather the table for indexing, to build the index. BigTable is used to store that data and is what is searched against.

    I haven't met any MapReduce people, but I do know DBAs and they really don't like their little kingdoms threatened in any way. I once made the (humorous) mistake of telling a DBA that a database is just a place to store data. You could see the sweat beading up, the veins in his head throbbing as he was trying to control his outrage.

    Heaven forbid there is another solution available. Besides, no RDBMS has come close to scaling its processing as massively as a MapReduce implementation. I love the "tools" argument. Do you really think some Google suit wants to run Crystal Reports against the kind of data that their index is built from? It isn't a business application after all.

    Re: Database People Hating on MapReduce

    Actually I am pretty sure that Google's index data is not stored in BigTable. BigTable is used for a variety of other tasks related to search (as well as other apps) such as storing your search history for personalization purposes. BigTable does plug very nicely into Google's architecture for providing versioned lightweight database functionality. Also check out Sawzall. Building an index to provide near instant lookup times is a very different problem. I can speak from experience using Hadoop and HBase for building search technology.

    MapReduce is used to process massive amounts of data. I have used it to process data in relational databases although I don't let the job talk to the database directly.

    Interesting discussion though.

    Re: Database People Hating on MapReduce

    No inside knowledge here but my understanding was that BigTable was a storage mechanism and MapReduce was a distributed calculation infrastructure. One of the big issues in getting the M/R and RDBMS folks talking is that most of the M/R folks come from or work in the NLP field, where answers are inherently subjective and fuzzy matching algorithms are king. I find similar frustrations getting strongly typed and dynamically typed language people to see that each has its sweet spot.

    I've found "Managing Gigabytes" (Witten) and "Foundations of Statistical Natural Language Processing" (Manning/Schütze) to be the best inoculation for RDBMS folks trying to think about NLP and text search problems.

    BTW - One criticism I would have on M/R is that it seems horribly inefficient in terms of computation. Their goal was probably development agility for parallel computation, so that's not a big ding. I worry about inter-node bandwidth, though. When working on a text search engine back in '93, I "cleaned up" some code in a way that pushed one data map from L1 cache to main memory and dropped indexing speed by 40%.

    Cheers,
    Clark

    Re: Database People Hating on MapReduce

    BigTable (http://209.85.163.132/papers/bigtable-osdi06.pdf) is used by the crawler, but to my understanding, not the search index (according to the paper, at least, and the fact that BigTable came about long after the index).

    The search index is split across many (thousands?) of machines. When you fire off a query, it is searched across many of these computers (search for "google query shard") and the results are collected for further ranking (http://www.linesave.co.uk/google_search_engine.html).
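
    A rough sketch of that scatter-gather pattern: every shard holds its own slice of the documents, the query goes to all of them, and the partial hits are merged and ranked. The `SHARDS` data, the scores, and the function names are invented for illustration, and the shards are queried in a simple loop rather than in parallel.

```python
import heapq

# Hypothetical shards: each holds postings for its own slice of documents,
# stored as word -> list of (page_id, made-up relevance score).
SHARDS = [
    {"fox": [("p1", 0.9), ("p4", 0.2)]},
    {"fox": [("p7", 0.5)], "dog": [("p8", 0.7)]},
    {"dog": [("p9", 0.3)]},
]


def search_shard(shard, word):
    """Scatter step: one shard answers from its own slice of the index."""
    return shard.get(word, [])


def search(shards, word, k=2):
    """Gather step: collect every shard's hits and keep the k best scores."""
    hits = []
    for shard in shards:
        hits.extend(search_shard(shard, word))
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])
```

    Unlike routing by word, every shard sees every query here, which is why the merge-and-rank step at the end matters.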

    Reading the GFS paper (http://labs.google.com/papers/gfs.html) is also instructive on how they store huge files (i.e. a big-assed index) over many computers and retrieve them.

    Reading http://www.google.com/librariancenter/articles/0512_01.html is also instructive about search engine indexes, more so if you try and code up the search engine he describes.

    Sean

    Re: Database People Hating on MapReduce

    I think this document is comparing things that are not comparable. They are talking about MapReduce as if it were a distributed database. But that's completely wrong. Hadoop is a distributed computing platform, not a distributed database prepared for OLAP. In some situations where scalability is important, Hadoop could be used instead of a database. But these cases are very specific ones.

    They said that distributed databases were invented a long time ago. Maybe that is true, but it seems that they did not succeed. Otherwise, why is there no distributed database that scales properly now?

    Re: Database People Hating on MapReduce

    The reason for it is its incompatibility to fulfill the requirements where others are there from programmers but are not as likly as that but able to work out databases more easily and convieniently.
