Friday, April 30, 2010 at 7:56AM
- I Want a New Data Store. Jeremy Zawodny of Craigslist wants a new database, one that does what it should: performs alter table operations faster, runs efficient queries when most of the data is on disk rather than in RAM, and fits their data, which now looks more document oriented than relational. A lot of people are willing to help.
- Computer Science Unplugged. An extensive collection of free resources that teach principles of Computer Science such as binary numbers, algorithms and data compression through engaging games and puzzles that use cards, string, crayons and lots of running around. And it's free! See also a fascinating interview with Tim Bell on teaching complex computing concepts, creating makers not just users, and how to change schools. From O'Reilly Radar.
- Akamai’s Network Now Pushes Terabits of Data Every Second. Akamai handles 12 million requests per second, logs more than 500 billion requests for content per day, and sends 3.45 terabits per second of data.
- Google’s MapReduce Programming Model — Revisited. We reverse-engineer the seminal papers on MapReduce and Sawzall, and we capture our findings as an executable specification.
- Facebook Flashcache. James Hamilton describes "a simple write back persistent block cache designed to accelerate reads and writes from slower rotational media by caching data in SSD."
- Pigz – parallel gzip OMG. John Allspaw plays with single-core and multicore versions of gzip. With one core, zipping a 418M log file took 12.4 seconds. On a 16-core machine with parallel gzip it took 1.6 seconds.
- Hadoop Meetup Videos. Videos on Using Hadoop to fight spam at Yahoo! Mail, Hive/HBase integration, Public Terabyte Dataset Project - Web crawling with Amazon's EMR.
- How TokuDB Fractal Tree Indexes Work by Bradley C. Kuszmaul. Fractal Trees are functionally equivalent to B-trees, but run significantly faster. Fractal Trees convert random I/O, which involves painfully slow disk seeks, into sequential I/O, which provides up to two orders of magnitude more performance.
- Scaling writes in MySQL by Philip Tellis. After partitioning (12 partitions per day, 2 hours of data per partition) we were able to sustain an insert rate of around 8500 rows per second.
- Attempts at Analyzing 19 million documents using MongoDB map/reduce by Steve Eichert. We still feel that Mongo will be a great place to persist the output of all of our Map/Reduce steps, however, we don’t feel that it’s well suited to the type of analysis that we want to do.
- TR10, and a Bloom/BOOM FAQ. BOOM is the name of a research project based at Berkeley, which seeks to enable programmers to build Orders Of Magnitude bigger systems in O.O.M. less code. Our focus is on enabling developers to easily harness the power of many computers at once, e.g. in the setting of cloud computing.
- Storing log messages in Hadoop by Peter Dikant. Using Hadoop to store 20 TB of data.
- Ceph: The Distributed File System Creature from the Object Lagoon by Jeffrey B. Layton. Ceph is a distributed parallel file system promising scalability and performance, something that NFS lacks.
- Horizontal Scalability via Transient, Shardable, and Share-Nothing Resources by Adam Wiggins. Now is the time of horizontal scalability achieved by using resources that are transient, shardable and share nothing with other resources. He gives as examples several applications and a language: memcached, CouchDB, Hadoop, Redis, Varnish, RabbitMQ, Erlang, detailing how each one applies those principles.
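
For anyone curious how the parallel gzip item above can work at all, here is a minimal Python sketch of the general idea: split the input into chunks, compress the chunks concurrently, and concatenate the results. It relies on the fact that a gzip file may consist of several members back to back. This is an illustration only and not a description of how pigz itself is implemented; the function name `parallel_gzip`, the 128 KB chunk size and the worker count are all assumptions made up for this example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Assumed chunk size, chosen for illustration only.
CHUNK_SIZE = 128 * 1024


def parallel_gzip(data: bytes, workers: int = 4, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress `data` by gzipping fixed-size chunks concurrently.

    Each chunk becomes an independent gzip member; the members are
    concatenated in their original order, which is still a valid gzip stream.
    """
    # Always produce at least one chunk so empty input yields valid output.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the chunks stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


if __name__ == "__main__":
    payload = b"GET /index.html 200\n" * 100_000
    packed = parallel_gzip(payload)
    # The standard decompressor reads all concatenated members.
    assert gzip.decompress(packed) == payload
    print(len(payload), "bytes in,", len(packed), "bytes out")
```

Compressing chunks independently usually costs a little compression ratio, since each chunk starts without the history of the previous one. How much speedup you actually see depends on core count and input size.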