With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future.
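For anyone who hasn't seen the programming model, here is a minimal single-process sketch of the idea using word count as the example. It is an illustration only, not Google's or Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_word_count`) and the sample documents are made up for this example, and a real system spreads each phase across many machines and handles failures.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the user's map function to every (key, value) input record."""
    for key, value in records:
        yield from map_fn(key, value)


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# User-supplied functions for the word count example (hypothetical names).
def word_count_map(doc_id, text):
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    return sum(counts)


def run_word_count(documents):
    pairs = map_phase(documents.items(), word_count_map)
    return reduce_phase(shuffle(pairs), word_count_reduce)


# Toy input, invented for illustration.
counts = run_word_count({"a": "the quick fox", "b": "the lazy dog the end"})
print(counts)
```

The point of the split is that the programmer only writes the two small functions; partitioning, grouping, and scheduling belong to the framework.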
Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.
One common criticism from ex-Googlers is that it takes months to get up to speed and be productive in the Google environment. Hopefully a way will be found to flatten the learning curve and make programmers productive sooner.
From the abstract:
The recent Data-Intensive Computing Symposium brought together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities in the development and application of large-scale computing systems, and to explore future opportunities.
Google Fellow Jeff Dean had a very interesting presentation on Handling Large Datasets at Google: Current Systems and Future Directions. He discussed:
• Hardware infrastructure
• Distributed systems infrastructure:
–Scheduling system
–GFS
–BigTable
–MapReduce
• Challenges and Future Directions
–Infrastructure that spans all datacenters
–More automation
It is really like a "How does Google work" presentation in ~60 slides.
Update 2: Rumor no more. Google Jumps Head First Into Web Services With Google App Engine. The quick and dirty of it: developers simply upload their Python code to Google, launch the application, and can monitor usage and other metrics via a multi-platform desktop application. There were 10,000 developer slots open and of course I was too late. More as the cobra strikes.
Update: TechCrunch reports Google To Launch BigTable As Web Service next week. It competes with Amazon's SimpleDB, though it won't be truly comparable until they also release an EC2 and S3 equivalent. An internet hit for each data access is a little painful. As Jimmy says in Goodfellas, "That's the way. You don't take no sh*t from nobody."
First Dave Winer hallucinates a pig on the mean streets of Walnut Creek that told him Google's long foretold cloud offering will be free for bloggers of "modest needs." GigaOM then says a free cloud service is how Google could eat Amazon's bacon for lunch.
The reason for this free cloud buffet is said to be easier integration of acquisitions, which presumably must already be running in the Google cloud to be bought out. All the free stuff Google offers earns almost no money. They make money on search. Hosting every last CPU cycle on earth has to be costly. What's the return? Cheaper integration of new startups that will also provide no new revenue?
Perhaps I am simply not clever enough to see the revolutionary brilliance in this line of thought. Though I would be quite pleased to have Google shareholders subsidize my projects.
Folknologist thinks Google may keep costs down by requiring developers to code to a Cloud Virtual Machine based on Java byte codes...
Update: Google added videos on Cluster Computing and MapReduce. There are five lectures: Introduction, MapReduce, Distributed File Systems, Clustering Algorithms, and Graph Algorithms.
Advanced website design depends on deep distributed system design knowledge. Where do you get this knowledge? Try Google. They have a whole Code for Educators program with tutorials and lectures on AJAX programming, distributed systems, and web security. Looks pretty nice.
The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce.
The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, and the average completion time was six and a half minutes. The paper mentions that Google's indexing system processes more than 20 TB of raw data.
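To put those numbers in more familiar terms, here is a quick back-of-the-envelope calculation. It assumes the input was spread evenly over the 30 days of September and uses decimal units (1 PB = 1,000 TB). Multiplying the two averages is only a rough proxy for machine time per job, not a figure from the paper.

```python
# Figures quoted above for September 2007.
input_tb = 403_152           # total MapReduce input, in terabytes
avg_machines_per_job = 394   # average machines allocated per job
avg_job_minutes = 6.5        # average completion time per job

# Assumption: input spread evenly across a 30-day month, decimal units.
days_in_september = 30
tb_per_day = input_tb / days_in_september
pb_per_day = tb_per_day / 1000

# Rough proxy only: product of two averages, not the true average of products.
machine_minutes_per_job = avg_machines_per_job * avg_job_minutes

print(f"{tb_per_day:,.1f} TB/day (about {pb_per_day:.1f} PB/day)")
print(f"roughly {machine_minutes_per_job:,.0f} machine-minutes per job")
```

So the quoted monthly figure works out to something over 13 PB of input per day under these assumptions.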