Database People Hating on MapReduce

Update: Typical Programmer tackles the technical issues in Relational Database Experts Jump The MapReduce Shark. The culture clash is still what fascinates me. David DeWitt writes in the Database Column that MapReduce is a major step backwards:

  • A giant step backward in the programming paradigm for large-scale data intensive applications
  • A sub-optimal implementation, in that it uses brute force instead of indexing
  • Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  • Missing most of the features that are routinely included in current DBMS
  • Incompatible with all of the tools DBMS users have come to depend on

Listening to databasers and map reducers talk is like eavesdropping on your average family holiday mashup. Every holiday, people who have virtually nothing in common are thrown together because they incidentally share a little DNA or are married to the shared DNA. In desperation everyone gravitates to some shared enemy they can all confidently bash. But after that moment of relief passes and awkward silence once again looms, nothing is left but more drinking and tackling sensitive topics you just know will end badly.

Database folks love their schemas, relational purity and their Swiss army knife indexes. You soon learn that map reduce is really just another form of an index, and indexes really can scale to any heights with just a little tweaking. Map reducers love their pure functional models, their self-healing cluster-filled ecosystems, and the sheer joy of the semi-organized chaos of letting 10,000 CPUs simultaneously bloom.

I for one stand firmly by the relish tray. Transactions have a place and so does structured data. That's why Google contributes heavily to MySQL. Yet, I too like my map reduce engine and distributed file system combo platter. With map reduce I can implement any complex behavior over any data set. With enough machines that work can be performed in a predictable amount of time. You aren't limited to set logic, SQL types, and tweaked indexes. That's pretty good stuff too.

Much like a staunchly conservative nail-crunching father and his too soft pansy liberal son, these two camps will never understand each other. Every sign of beauty in one person's eyes is just another confirmation to the other side of impending senility. Why even try? Just hug in a manly way and agree to meet again next year.
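To make the "any behavior over any data set" idea concrete, here is a minimal single-process sketch of the map reduce pattern. It is not Google's implementation: the names `map_reduce`, `word_mapper` and `count_reducer` are made up for illustration, and a real system would spread the map and reduce phases across many machines instead of running them in one loop.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # The mapper may emit any number of (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer collapses all values seen for one key into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Hypothetical reducer: add up the counts emitted for one word."""
    return sum(values)


word_counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
```

The mapper and reducer are ordinary functions, which is the point being made above: the logic is not restricted to what a query language or an index can express.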

    Click to read more ...

  • Thursday

    Moving old to new. Do not be afraid of the re-write -- but take some help

    Recently I had to help users of one of my open source projects, ISPMan. This project started in 2001 because I was unwilling to take care of the DNS and VirtualHosting stuff, which was a side-thing for the company I worked for (so I wrote software that took care of all these little details).

    Summary: A large project that needs a rewrite can be rewritten in a matter of days. I will not give you a full case study of a project that went through a re-write, but a case study of how easy it is to re-write something.

    Details: My boss was cool enough to let me open-source the project and, obviously, I got a lot of cool-cred out of it. Later on I also did some support and implementation work and earned quite some money with it. Eventually I had to let the project go out of my hands to the community, as I only wrote it to take over a job that I wasn't willing to do (set up DNS zones on multiple servers, find out which host should host the website and put a VirtualHost section there, find out which mail server should take care of the mailbox and create a mailbox, etc.). The project was quite successful and a number of users are using it. One of the project members took on the role of project manager and has been running the project since.

    I have been out of this project for almost 5 years (not much time: I have 2 kids, 4.5 years and 8 months old, and my job was very demanding). The stress from my job has eased a bit now (it really took me 3.5 years to bring them to a stable state, "stable" actually being an oxymoron when we are talking about high scalability).

    Back to the topic. We have been getting complaints about this open source product not having this feature or that. I tried to put in one such feature ("I was the main author. How complicated could it be for me to add this new feature?") and eventually I went "What the ...". Yup, I hated my code. I hated everything about it (as a programmer, as a sysadmin, as an op). Yes, it was me who coded it 7 years ago, it was me who insisted it be like that, etc. But times have changed. It's not 2001 any more. So I want to re-write it.

    It's not the first time the idea of a re-write has come up. A couple of years ago the senior members (anyone who had to deal with the code) wanted to re-write it. It's not extensible, it's not pluggable, blah, blah, blah. Yup, it was not written to be. It was just written so I don't have to do the work I did not want to do, and it served that purpose well. So if you want more: extend it, re-write it, fork it. You get my point (from the guy who wrote an app and open-sourced it). I understand the developers' point of view too: "Why do I have to know how your LDAP schema works, or how the Cyrus mail server works, etc.?"

    The re-write part: Using a PHP framework (something that almost did not exist in 2001) I was able to get something up and running in a couple of hours. This whole thing would not have been possible without:

    • PHP5 (we moved from Perl to PHP; in 2001 PHP was really "Pretty Home Pages")
    • Zend Framework (open source frameworks were scarce in 2001)
    • Some experience that LDAP, as great as it is, should not be where you put all your eggs

    This particular example does not say that Perl (or any other language) is bad and PHP is good, or that LDAP (or any other directory) is bad and MySQL is good. It's just how we did it for this particular project.

    Oh, and I forgot about the "Get Some Help" part: telling your colleagues that they have to move to a new API can be more difficult than hearing a few blahs from the secretaries who were moved from XP to Vista.

    This was really about moving from Perl to PHP, from PHP to Java, from Java to Perl, from Perl to Ruby. The whole point is that it does not matter. If you can do X faster than Y, then I'll take you (in a compute-intensive scenario). If you do all your calculations this way, it might go somewhere.

    Click to read more ...


    Strategy: Asynchronous Queued Virus Scanning

    Atif Ghaffar has a nice strategy to deal with virus checking uploads:

  • Upload item into a safe area. If necessary, the uploader blocks waiting for a result.
  • Queue a work order into a job system so all the work can be distributed throughout your cluster.
  • A service in your cluster performs the virus scan and informs the uploader of the result.
  • Move the vetted item into your system.

    This removes the CPU bottleneck from your web servers and distributes it through your cluster. Keep your web servers providing prompt service to users. Let your cluster do the heavy lifting. This minimizes response time and maximizes throughput. A similar system can be used for creating thumbnails, transcoding, copyright checks, updating indexes, event notification or any other kind of intensive work.
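    The four steps above can be sketched in a single process, with worker threads standing in for the cluster. This is only an illustration under stated assumptions, not Atif Ghaffar's implementation: `ScanCluster`, `ScanJob` and `FAKE_SIGNATURE` are invented names, and the "scan" is a toy byte-pattern check where a real deployment would call an actual virus scanner.

```python
import queue
import threading
from concurrent.futures import Future
from dataclasses import dataclass, field

# Assumption: a toy signature standing in for a real virus scanner.
FAKE_SIGNATURE = b"FAKE-VIRUS"


@dataclass
class ScanJob:
    """A work order: which upload to scan and where to report the verdict."""
    name: str
    payload: bytes
    result: Future = field(default_factory=Future)


class ScanCluster:
    """Toy 'cluster': worker threads pulling work orders off a shared queue."""

    def __init__(self, workers=2):
        self.jobs = queue.Queue()
        self.safe_area = {}  # uploads waiting for a verdict
        self.system = {}     # vetted items only
        self.threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(workers)
        ]
        for thread in self.threads:
            thread.start()

    def upload(self, name, payload):
        """Steps 1 and 2: park the item in the safe area and queue a work order."""
        self.safe_area[name] = payload
        job = ScanJob(name, payload)
        self.jobs.put(job)
        return job.result  # the uploader may block on .result() if it needs to

    def _worker(self):
        """Steps 3 and 4: scan, move clean items into the system, report back."""
        while True:
            job = self.jobs.get()
            if job is None:  # shutdown sentinel
                break
            clean = FAKE_SIGNATURE not in job.payload
            payload = self.safe_area.pop(job.name)
            if clean:
                self.system[job.name] = payload
            job.result.set_result(clean)

    def shutdown(self):
        """Stop every worker by sending one sentinel per thread."""
        for _ in self.threads:
            self.jobs.put(None)
        for thread in self.threads:
            thread.join()
```

    A web handler would call `upload()` and either return immediately or wait on the returned future, so the request thread never spends CPU on the scan itself. Infected items are simply dropped from the safe area in this sketch.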

    Click to read more ...

  • Tuesday

    Does Sun Buying MySQL Change Your Scaling Strategy?

    Sun is buying MySQL for $1 billion. The MySQL team has worked long and hard so I don't begrudge them their pay day. Strike while the iron is offering a lot of cash I say. And I have nothing against Sun. Yet I can't help but think this changes the mental calculation of what database to use. When Oracle acquired Innobase a new independent storage engine was needed for MySQL. How is this different? Does this change your thinking any? Would Martha say it's a good thing? Like Luke I've searched my feelings, but the force is not with me and I don't really know how I feel about it.

    Click to read more ...


    Sun to Acquire MySQL

    So what are we announcing today? That in addition to acquiring MySQL, Sun will be unveiling new global support offerings into the MySQL marketplace. We'll be investing in both the community, and the marketplace - to accelerate the industry's phase change away from proprietary technology to the new world of open web platforms.

    Read more on Jonathan Schwartz's Blog. What do you think about this?

    Click to read more ...

  • Jan 14, 2008

    Community site launched - framework for building scale-out applications

    GigaSpaces launched a community web site for developers who wish to utilize and contribute to the open source OpenSpaces development framework. OpenSpaces extends the Spring Framework for enterprise Java development, and leverages the GigaSpaces eXtreme Application Platform (XAP) for data caching, messaging and as the container for application business logic. It is designed for building highly-available, scale-out applications in distributed environments, such as SOA, cloud computing, grids and commodity servers. OpenSpaces is widely used in a variety of industries, including financial services, telecommunications, manufacturing and retail -- and across the web in e-commerce, Web 2.0 applications such as social networking sites, search and more.

    The site already lists more than two dozen projects submitted by the developer community, including GigaSpaces customers, partners and employees. Innovative projects include an instant messaging platform, integration with PHP, configuration via JRuby, an implementation of Spring Batch and a scalable dynamic RSS feed delivery system.

    GigaSpaces recently announced the OpenSpaces Developer Challenge, a developer competition with $25,000 in total prizes and a $10,000 grand prize. The prizes will be awarded to the most innovative applications built using the OpenSpaces framework or plug-ins that extend it. The Challenge deadline is April 2, 2008 and ‘early bird’ prizes are available for those who submit their concepts by February 13, 2008. Additionally, in November of 2007 GigaSpaces launched its Start-Up Program, which provides free software licenses for qualifying individuals and companies.

    Click to read more ...


    Google Reveals New MapReduce Stats

    The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, and the average completion time was six and a half minutes. The paper mentions that Google's indexing system processes more than 20 TB of raw data.

    Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link.

    Greg Linden notices that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time."

    It is interesting to compare this to Amazon EC2:

    • $0.40 Large Instance price per hour x 400 instances x 10 minutes (1/6 hour) = $26.67
    • 1 TB data transfer in at $0.10 per GB = $100
    For a bit over a hundred bucks (about $127 all in) you could also process a TB of data!
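    The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. This is just a sketch: the function names are made up, the prices are the ones quoted in this post, and 1 TB is assumed to be 1,000 GB.

```python
def ec2_compute_cost(instances, minutes, price_per_instance_hour):
    """Cost of running `instances` machines for `minutes`, billed pro rata by the hour."""
    return instances * price_per_instance_hour * minutes / 60.0


def transfer_in_cost(gigabytes, price_per_gb):
    """Cost of moving data into the cluster."""
    return gigabytes * price_per_gb


# Figures quoted in the post; 1 TB taken as 1,000 GB (assumption).
compute = ec2_compute_cost(instances=400, minutes=10, price_per_instance_hour=0.40)
transfer = transfer_in_cost(gigabytes=1000, price_per_gb=0.10)
total = compute + transfer

print(f"compute: ${compute:.2f}, transfer: ${transfer:.2f}, total: ${total:.2f}")
```

    Note that the data transfer dominates the bill: the ten minutes of compute on 400 instances is only about a fifth of the total.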

    Click to read more ...


    A Note on How to Create Teasers When Posting 

    I fully and enthusiastically encourage anyone who wants to share a relevant topic to register and post. People have added a lot of good and useful content. Don't be shy.

    It's been asked how a teaser is created when posting so the full article doesn't display on the front page. A teaser is a paragraph interesting enough to convince readers to click on the "read more" link to get the full article. Creating a teaser in Drupal is accomplished by inserting < ! -- break -- > on a separate line directly after the text you want to be the teaser. Only DO NOT include the spaces. So your post looks like:

        Teaser Content
        < ! -- break -- > (no spaces in real life)
        Rest of Content

    It's a bit kludgey, but it works.
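    For the curious, the effect of the break marker can be illustrated with a tiny sketch. This is not Drupal's code: `split_teaser` is a made-up helper, and returning the whole post as the teaser when no marker is present is an assumption of this sketch only.

```python
# The marker as it is typed in a real post (no spaces).
BREAK_MARKER = "<!--break-->"


def split_teaser(post):
    """Split a post into (teaser, rest) at the first break marker."""
    teaser, marker, rest = post.partition(BREAK_MARKER)
    if not marker:
        # Assumption for this sketch: without a marker the whole post is the teaser.
        return post.strip(), ""
    return teaser.strip(), rest.strip()


teaser, rest = split_teaser("Teaser Content\n<!--break-->\nRest of Content")
```

    Only the first marker counts here, so anything after it, including a second marker, ends up in the rest of the article.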

    Click to read more ...

  • Jan 12, 2008

    French registrar launches in granular server resources

    A French domain registrar has launched a very flexible VPS service with dynamically allocated resources.

    Click to read more ...


    FTP Sanity: Redundancy, archiving, consolidation.

    Easy FTP redundancy and consolidation with the open source project Generic-FTP. It probably works with any Linux FTP server (ProFTPD is the only one tested). Get rid of some single points of failure. A very easy to set up solution using scripts written in PHP. Tested thoroughly in a production environment.

    Click to read more ...