Update: InfoQ interview: HBase Leads Discuss Hadoop, BigTable and Distributed Databases. "MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing."
Hbase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop (product), which implements functionality similar to Google's GFS and Map/Reduce systems.
The recent Data-Intensive Computing Symposium brought together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities in the development and application of large-scale computing systems, and to explore future opportunities.
Google Fellow Jeff Dean had a very interesting presentation on Handling Large Datasets at Google: Current Systems and Future Directions. He discussed:
• Hardware infrastructure
• Distributed systems infrastructure:
–Scheduling system
–GFS
–BigTable
–MapReduce
• Challenges and Future Directions
–Infrastructure that spans all datacenters
–More automation
It is really like a "How does Google work" presentation in ~60 slides?
Update: InfoQ interview on Hypertable Lead Discusses Hadoop and Distributed Databases. Hypertable differs from HBase in that it is a higher performance implementation of Bigtable.
Skrentablog gives the heads up on Hypertable, Zvents' open-source BigTable clone. It's written in C++ and can run on top of either HDFS or KFS. Performance looks encouraging at 28M rows of data inserted at a per-node write rate of 7mb/sec.
Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.
Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing higher scaling infrastructure to support their products. How do they do that?
Google Talk is Google's instant communications service. Interestingly the IM messages aren't the major architectural challenge, handling user presence indications dominate the design. They also have the challenge of handling small low latency messages and integrating with many other systems. How do they do it?
Recent comments
8 hours 15 min ago
14 hours 30 min ago
14 hours 46 min ago
15 hours 14 min ago
16 hours 47 min ago
21 hours 12 min ago
1 day 7 hours ago
1 day 8 hours ago
1 day 13 hours ago
1 day 14 hours ago