Hot Scalability Links for July 2, 2010

What says 4th of July like Nathan's ultimate scalable hot dog eating contest? This totally requires a scale-up strategy.
Facebook at 60,000 servers and counting.
Deepak Singh has collected some impressive massive data stats on extreme Hadoop usage: Facebook: 36 PB of uncompressed data, 2250 machines, 23,000 cores, 32 GB of RAM per machine, processing 80-90 TB/day; Yahoo: 70 PB of data in HDFS, 170 PB spread across the globe, 34,000 servers, processing 3 PB per day, 120 TB flowing through Hadoop every day; Twitter: 7 TB/day into HDFS; LinkedIn: 120 billion relationships, 82 Hadoop jobs daily (IIRC), 16 TB of intermediate data.
Who knew DevOps could be so funny? Adam Jacob, CTO of Opscode, gave a hilarious talk at the Velocity conference on the true nature of DevOps. Warning: your neck may get sore from nodding in agreement so much and your belly may ache from laughing so much.
Pig at LinkedIn. Not your average article: For me, understanding my work over the last year by understanding Pig was profound. It gave it more meaning, because strangely enough Pig has become a big part of my life. By the numbers, I’ve spent as much time in the last year with Pig as with anything or anyone else in my life excepting my wife.
On the performance of clouds. How does cloud performance compare? Here's a test that runs four applications (a small object, a large object, a million calculations, and a 500,000-row table scan) on five different clouds. There's no single "best" cloud: PaaS (App Engine, Force.com) scales easily, but locks you in; IaaS (Rackspace, Amazon, Terremark) offers portability, but leaves you doing all the scaling work yourself.
Brett Slatkin ran a 1.5 billion row MapReduce on Google App Engine. It ran on 16 cores, sustaining over 2000 writes per second for over 8 days. The details are on Brett's blog. If you find the stenographic encoding of his site to be too much, there's a buzz version.
Daniel Lemire in NoSQL or NoJoin? thinks NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.
John Rauser gave a really top-notch presentation at Velocity titled TCP and the Lower Bound of Web Performance, focusing on how TCP limits web performance. The takeaways: 1) Carefully consider every byte of content; 2) Think about what goes into those first few packets; 3) Accept the speed of light; 4) If your application is delivered on the web, you need to understand how the network functions.
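To see why the first few packets matter so much, here's a rough sketch that counts the round trips needed to deliver a payload under an idealized TCP slow start. The numbers are assumptions for illustration and do not come from the talk: a 1460-byte segment size, an initial congestion window of 3 segments, a window that doubles every round trip, no packet loss, and no connection handshake. The function names are made up for this example.

```python
import math


def round_trips(payload_bytes, mss=1460, initial_cwnd=3):
    """Count round trips to deliver a payload under idealized slow start.

    Assumptions (illustrative only): the congestion window starts at
    `initial_cwnd` segments and doubles every round trip, with no loss
    and no receive-window limit.
    """
    segments = math.ceil(payload_bytes / mss)
    cwnd = initial_cwnd
    trips = 0
    while segments > 0:
        segments -= cwnd  # send one full window this round trip
        cwnd *= 2         # slow start: window doubles after each round
        trips += 1
    return trips


def min_transfer_ms(payload_bytes, rtt_ms, mss=1460, initial_cwnd=3):
    """Lower bound on transfer time: round trips times the round-trip time."""
    return round_trips(payload_bytes, mss, initial_cwnd) * rtt_ms


small = round_trips(3 * 1460)        # fits in the first window
larger = round_trips(14600)          # ten segments
bound = min_transfer_ms(14600, 80)   # same payload over an 80 ms RTT link
```

Under these assumptions a response that fits in the first three segments arrives in one round trip, while a ten-segment response needs three, and no amount of bandwidth changes that. Only fewer bytes or a shorter round trip will.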
Apples and Oranges and Llamas by Jeff Darcy. You can’t just slap some basic consistent hashing on top of several single-machine data stores and claim to be in the same league as some of the real distributed data stores I’ve mentioned. You need to have a reasonable level of partitioning and replication and membership-change handling integrated into the base project to be taken seriously in this realm.
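For reference, this is roughly what "basic consistent hashing" looks like: a minimal sketch, not anything taken from Jeff's post. The HashRing class, its method names, and the choice of MD5 with 64 virtual nodes per server are all assumptions made for the example. Note what's missing: there's no replication, no data migration on membership change, and no failure detection, which is exactly the gap the post is pointing at.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._keys = []   # sorted ring positions
        self._owner = {}  # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys around the ring, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        """Place `vnodes` positions for this node on the ring."""
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos not in self._owner:
                bisect.insort(self._keys, pos)
            self._owner[pos] = node

    def remove(self, node):
        """Drop every ring position owned by this node."""
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                del self._keys[bisect.bisect_left(self._keys, pos)]

    def lookup(self, key):
        """Return the node owning the first position clockwise of the key."""
        if not self._keys:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]


ring = HashRing(["store-a", "store-b", "store-c"])
keys = [f"user:{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("store-c")
after = {k: ring.lookup(k) for k in keys}
```

Removing a node only reassigns the keys that node owned; everything else stays put. That's the nice property, but the ring only answers "where should this key live?" Actually moving the data, keeping copies, and agreeing on who is in the ring are the hard parts.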
Capacity Planning at Internet Scale. Rich Miller has a great summary of a capacity planning panel talk at Structure 2010 featuring folks from Zynga, Facebook, Yahoo, PayPal, and Engine Yard.
SQLite explains how they take advantage of Write-Ahead Logging, replacing their previous rollback journal for atomic commit and rollback.
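If you want to try it, here's a small sketch using Python's built-in sqlite3 module. It assumes the SQLite library bundled with your Python is version 3.7.0 or newer and that the database lives on a local filesystem; the open_wal helper, the file name, and the table are made up for the example. WAL mode is switched on with the journal_mode pragma, which reports the mode actually in effect.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a database and ask SQLite to use write-ahead logging.

    Returns the connection and the journal mode SQLite reports back,
    which may differ from "wal" if the mode could not be changed.
    """
    conn = sqlite3.connect(path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


# WAL needs a real file; in-memory databases cannot use it.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn, mode = open_wal(db_path)

conn.execute("CREATE TABLE links (title TEXT)")
conn.execute("INSERT INTO links VALUES ('Hot Scalability Links')")
conn.commit()  # the commit is appended to the -wal file

# A rolled-back transaction leaves the committed row untouched.
conn.execute("INSERT INTO links VALUES ('never committed')")
conn.rollback()

row_count = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
wal_exists = os.path.exists(db_path + "-wal")
```

While the connection is open you should see a demo.db-wal file next to the database: commits are appended there and copied back into the main file later during a checkpoint, instead of the old approach of saving original pages to a rollback journal first.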