Product: Hadoop

Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph. and Sherpa: Hosted Data Serving. Update: Kevin Burton points out Hadoop now has a blog and an introductory video staring Beyonce. Well, the Beyonce part isn't quite true. Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders: For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy It's hard work. And it needs to be commoditized, just like the hardware has been... Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters. The obvious question of the day is: should you build your website around Hadoop? I have no idea. There seems to be a few types of things you do with lots of data: process, transform, and serve. Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on. That is: figure out affinity, categorization, popularity, click throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does Google. But does it make sense for your website? If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store. YouTube would seem like it could use a distributed job mechanism, like you can build with Amazon's services. With that you could create thumbnails, previews, transcode media files, and so on. When they have Hbase up and running that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, highly performing data store. That's an exciting prospect for me. I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?

See Also

  • Open Source Distributed Computing: Yahoo's Hadoop Support by Jeremy Zawodny
  • Yahoo!'s bet on Hadoop by Tim O'Reilly
  • Hadoop Presentations
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3

    Click to read more ...

  • Friday

    Wolfram|Alpha Architecture

    Making the world's knowledge computable

    Today's Wolfram|Alpha is the first step in an ambitious, long-term project to make all systematic knowledge immediately computable by anyone. You enter your question or calculation, and Wolfram|Alpha uses its built-in algorithms and growing collection of data to compute the answer.

    Answer Engine vs Search Engine

    When Wolfram|Alpha launches later today, it will be one of the most computationally intensive websites on the internet. The Wolfram|Alpha computational knowledge engine is an "answer engine" that is able to produce answers to various questions such as
    • What is the GDP of France?
    • Weather is Springfield when David Ortiz was born
    • 33 g of gold
    • LDL vs. serum potassium 150 smoker male age 40
    • life expectancy male age 40 finland
    • highschool teacher median wage
    Wolfram|Alpha excels at different areas like mathematics, statistics, physics, engineering, astronomy, chemistry, life sciences, geology, business and finance as demonstrated by Steven Wolfram in his Introduction screencast.

    The Stats

    • Abour 10,000 CPU cores at launch
    • 10+ trillion of pieces of data
    • 50,000+ types of algorithms
    • Able to handle about 175 million queries per day
    • 5+ million lines of symbolic Mathematica code

    The Computers Powering Computable Knowledge

    There is no way to know exactly how much traffic to expect, especially during the initial period immediately following the launch, but the Wolfram|Alpha team is working hard to put reasonable capacity in place. As Stephen writes in the Wolfram|Alpha blog Alpha will run in 5 distributed colocation facilities. What computing power have they gathered in these facilities for launch day? Two supercomputers, just about 10,000 processor cores, hundreds of terabytes of disks, a heck of a lot of bandwidth, and what seems like enough air conditioning for the Sahara to host a ski resort. One of their launch partners, R Systems, created the world’s 44th largest supercomputer (per the June 2008 TOP500 list - it is listed as 66th per the latest Top500 list). They call it the R Smarr. It will be running Wolfram|Alpha on launch day! R Smarr has a Sum Rmax of 39580 GFlops using Dell DCS CS23-SH, QC HT 2.8 GHz computers, 4608 cores, 65536 GB of RAM and Infiniband interconnect. Dell is another of the launch partners with a data center full of quad-board, dual-processor, quad-core Harpertown servers. What does it all add up to? The ability to handle 175 million queries (yielding maybe a billion) per day—over 5 billion queries (encompassing around 30 billion calculations) per month.

    The Launch of Wolfram|Alpha

    Watch a live webcast of the Wolfram|Alpha system being brought online for the first time on
    • Friday, May 15, beginning at 7pm CST

    The First Killer App of The New Kind of Science

    The Genius behind Wolfram|Alpha is Stephen Wolfram. He is best know for his ambitious projects: Mathematica and A New Kind of Science (NKS). May 14, 2009 marks the 7th anniversary of the publication of his book A New Kind of Science. Stephen explains is his blog post: But for me the biggest thing that’s happened this year is the emergence of Wolfram|Alpha. Wolfram|Alpha is, I believe, going to be the first killer app of NKS.


    That it should be possible to build Wolfram|Alpha as it exists today in the first decade of the 21st century was far from obvious. And yet there is much more to come. As of now, Wolfram|Alpha contains 10+ trillion of pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains. Built with Mathematica—which is itself the result of more than 20 years of development at Wolfram Research—Wolfram|Alpha's core code base now exceeds 5 million lines of symbolic Mathematica code. Running on supercomputer-class compute clusters, Wolfram|Alpha makes extensive use of the latest generation of web and parallel computing technologies, including webMathematica and gridMathematica.

    How Mathematica Made Wolfram|Alpha Possible?

    Wolfram|Alpha is a major software engineering development to make all systematic knowledge immediately computable by anyone. It is developed and deployed entirely with Mathematica—in fact, Mathematica has uniquely made Wolfram|Alpha possible. Here's why.
    • Computational knowledge and intelligence
    • High-performance enterprise deployment
    • One coherent architecture
    • Smart method selection
    • Dynamic report generation
    • Database connectivity
    • Built-in, computable data
    • High-level programming language
    • Efficient text processing and linguistic analysis
    • Wide-ranging, automated visualization capabilities
    • Automated importing
    • Development environment

    Information Sources

    Congratulations Stephen!

    Click to read more ...


    Who Has the Most Web Servers?

    An interesting post on DataCenterKnowledge!

    • 1&1 Internet: 55,000 servers
    • Rackspace: 50,038 servers
    • The Planet: 48,500 servers
    • Akamai Technologies: 48,000 servers
    • OVH: 40,000 servers
    • SBC Communications: 29,193 servers
    • Verizon: 25,788 servers
    • Time Warner Cable: 24,817 servers
    • SoftLayer: 21,000 servers
    • AT&T: 20,268 servers
    • iWeb: 10,000 servers
    • How about Google, Microsoft, Amazon, eBay, Yahoo, GoDaddy, Facebook? Check out the post on DataCenterKnowledge and of course here on!



    P2P server technology?

    Is there any type of server technology that allows visitors to a website to become part of the server? Like with bittorrent, users share some of their bandwidth, so would this be possible with web servers where a person goes to a website, downloads and runs the software which makes their internet connection and cpu and hdd become part of the web server?

    Click to read more ...


    GemStone Unveils GemFire Enterprise 6.0

    GemFire Enterprise is in-memory distributed data management platform that pools memory (and CPU, network and optionally local disk) across multiple processes to manage application objects and behavior. With the 6.0 release, GemFire has reached a stage of maturity in its evolution. GemStone touts this version as the true 'best of breed' distributed caching technology, solving scalability issues in all industries.

    Click to read more ...


    Facebook, Hadoop, and Hive

    Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run applications that process vast amounts of data), Yahoo being the first.

    Learn how they do it and what are the challenges on DBMS2 blog, which is a blog for people who care about database and analytic technologies.


    Publish/subscribe model does not scale?

    on Wiki someone posted "...For relatively small installations, pub/sub provides the opportunity for better scalability than traditional client-server, through parallel operation, message caching, tree-based or network-based routing, etc. However, as systems scale up to become datacenters with thousands of servers sharing the pub/sub infrastructure, this benefit is often lost; in fact, scalability for pub/sub products under high load in large deployments is very much a research challenge." Does anyone have something to say regarding scaling Publish/subscribe models?

    Click to read more ...


    Eight Best Practices for Building Scalable Systems

    Wille Faler has created an excellent list of best practices for building scalable and high performance systems. Here's a short summary of his points:

  • Offload the database - Avoid hitting the database, and avoid opening transactions or connections unless you absolutely need to use them.
  • What a difference a cache makes - For read heavy applications caching is the easiest way offload the database.
  • Cache as coarse-grained objects as possible - Coarse-grained objects save CPU and time by requiring fewer reads to assemble objects.
  • Don’t store transient state permanently - Is it really necessary to store your transient data in the database?
  • Location, Location - put things close to where they are supposed to be delivered.
  • Constrain concurrent access to limited resource - it's quicker to let a single thread do work and finish rather than flooding finite resources with 200 client threads.
  • Staged, asynchronous processing - separate a process using asynchronicity into separate steps mediated by queues and executed by a limited number of workers in each step.
  • Minimize network chatter - Avoid remote communication if you can as it's slower and less reliable than local computation.

    Click to read more ...

  • Wednesday


    The goal of DryadLINQ is to make distributed computing on large compute cluster simple enough for ordinary programmers. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

    Click to read more ...



    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Click to read more ...