Hadoop

Todd Hoff's picture

Hadoop Getting Closer to 1.0 Release

Update: Yahoo! Launches World's Largest Hadoop Production Application. A 10,000 core Hadoop cluster produces data used in every Yahoo! Web search query. Raw disk is at 5 Petabytes. Their previous 1 petabyte database couldn't handle the load and couldn't grow larger. Greg Linden thinks the Google cluster has way over 133,000 machines.

From an InfoQ interview with project lead Doug Cutting, it appears Hadoop, an open source distributed computing platform, is making good progress towards their 1.0 release. They've successfully reached a 1000 node cluster size, improved file system integrity, and jacked performance by 20x in the last year.

How they are making progress could be a good model for anyone:

Looking for good business examples of compaines using Hadoop

I have read the blog about Mailtrust/Rackspace as well the interesting things with Google and Yahoo. Who else is using Hadoop/MapReduce to solve business problems.

TIA
johnmwillis.com

Todd Hoff's picture

How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct.

Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text file stored on each machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential.

Todd Hoff's picture

Video: Dryad: A general-purpose distributed execution platform

Dryad is Microsoft's answer to Google's map-reduce. What's the question: How do you process really large amounts of data? My initial impression of Dryad is it's like a giant Unix command line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars?

Syndicate content