Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, HBase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph, and Sherpa: Hosted Data Serving.
Update: Kevin Burton points out Hadoop now has a blog and an introductory video starring Beyonce. Well, the Beyonce part isn't quite true.
Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders:
For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy.
The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy.
It's hard work. And it needs to be commoditized, just like the hardware has been...
Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.
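To make the map/reduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API and it does no distribution or failure handling; the function names (map_fragment, shuffle, reduce_counts, run_job) and the sample input are hypothetical, chosen only for illustration. It shows the shape of the paradigm: the input is split into fragments, each fragment is mapped independently (so any node could take it), the intermediate pairs are grouped by key, and each group is reduced.

```python
from collections import defaultdict


def map_fragment(lines):
    """Map step: emit a (word, 1) pair for every word in one input fragment."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def run_job(fragments):
    """Run map, shuffle, and reduce in-process over a list of fragments."""
    # Fragments are independent of each other, which is what would let a
    # real cluster hand each one to whichever node is available.
    mapped = [pair for fragment in fragments for pair in map_fragment(fragment)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input, already split into two fragments.
    fragments = [
        ["hadoop runs on commodity hardware"],
        ["commodity hardware fails", "hadoop handles failures"],
    ]
    counts = run_job(fragments)
    # Print the most frequent words first.
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        print(n, word)
```

Because the map step never looks outside its own fragment, a failed fragment could simply be rerun elsewhere, which is roughly why a framework can hide node failures from the application.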
The obvious question of the day is: should you build your website around Hadoop? I have no idea.
There seem to be a few types of things you do with lots of data: process, transform, and serve.
Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on. That is: figure out affinity, categorization, popularity, click-throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does for Google. But does it make sense for your website?
If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store.
YouTube does seem like it could use a distributed job mechanism, like the one you can build with Amazon's services. With that you could create thumbnails, generate previews, transcode media files, and so on.
Once the Hadoop team has HBase up and running, that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, high-performance data store. That's an exciting prospect for me.
I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?
Comments
Hadoop on Amazon's EC2/S3
Tom White recently wrote an article about Hadoop on EC2/S3:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873...
re: Hadoop on Amazon's EC2/S3
Very nice article. Thanks. I added it as a reference to a few posts.
Re: Product: Hadoop
Hadoop is mainly used for running jobs that are the equivalent of "grep thing filename | sort | uniq -c | sort -nr" on very large datasets. While the distributed storage part is very cool, it's not a system that you'd want to be backing a website with, from what I understand. It's not for "real-time" work, it's for massive batch work. (I'm told)
Re: Product: Hadoop
Hadoop is Apache and Yahoo's answer to Google Map/Reduce, Google File System, and Google BigTable -- and is based off papers written about those technologies.
Re: Product: Hadoop
Is there any framework I can use to answer queries in real time? I have searched a lot but have not found anything very useful. Hadoop is very good, but not for web search.