Hot Scalability Links for June 3, 2010

  • How Big is a Yottabyte? Not so big that the NSA can't hope to store it, says CrunchGear: There are a thousand gigabytes in a terabyte, a thousand terabytes in a petabyte, a thousand petabytes in an exabyte, a thousand exabytes in a zettabyte, and a thousand zettabytes in a yottabyte. In other words, a yottabyte is 1,000,000,000,000,000GB.
  • The CMS data aggregation system. The Large Hadron Collider project is using MongoDB as a cache. Here we discuss a new data aggregation system which consumes, indexes and delivers information from different relational and non-relational data sources to answer cross data-service queries and explore meta-data associated with petabytes of experimental data.
  • Google I/O 2010 Videos are now available (many of them anyway). You might be particularly interested in Google Storage for Developers, Building high-throughput data pipelines with Google App Engine, Batch data processing with App Engine, BigQuery and Prediction APIs, and Measure in milliseconds redux: Meet Speed Tracer.
  • Scale at Facebook by Director of Engineering, Aditya Agarwal. You can't scale Facebook using traditional horizontal partitioning. People make friends across many networks. Every new user can potentially access any other user. There's no way to cut the data into partitions such that access stays within a single partition.
  • dbShards is a software product that allows database sharding to be applied to existing applications and databases with little or no modification to existing code. I talked a little with them at Gluecon and attended their talk. Worth a look, hopefully more on them later.
  • Real World Ruby and Cassandra. Mike Subelsky gives a very good developer-oriented overview of Cassandra from his experiences developing a QA system.
  • Google App Engine may be having problems (they are pioneering a new development and deployment model, after all), but you gotta love how they deal with problems head on: Datastore Performance Growing Pains.
  • Non-volatile Memories Workshop 2010. If you want the down and dirty on the current state of the solid state memory industry, this is your Nirvana.
  • Advanced Squid Caching in Scribd: Cache Invalidation Techniques. Caching is great until the data changes. Alexey Kovyrin looks at a few techniques we use at Scribd to solve cache invalidation problems. Very helpful article as are many of his other cache related articles.
  • How Facebook satisfied a need for speed by Mac Slocum. The two biggest changes were to pipeline the page content to overlap generation, network, and render time, and to move to a very small core JavaScript library for features that are required on the initial page load.
  • When Very Low-Power, Low-Cost Servers Don't Make Sense. James Hamilton: CPU-intensive workloads and workloads with poor scaling characteristics are poor choices to be hosted on very low-power, low-cost servers. CPU-intensive workloads are a lose because they are CPU-bound and so run best where there is maximum CPU per server in the cluster.
  • Joe Stump on the economics of Cassandra over MySQL: It's [SimpleGeo] running a 50-node cluster, which spans three data centers on Amazon's EC2 service for about $10,000 a month, says CTO Joe Stump, who previously used Cassandra at Digg. By contrast, MySQL premium support would cost about $5,000 per year per node, or $250,000 per year--more than double the Cassandra setup, Stump says, and Microsoft SQL Server can cost as much as $55,000 per processor per year. The $10,000 is an operational expense as opposed to a capital expense, and that's "a bit nicer on the books," he says.
  • Has anyone ever witnessed a hash collision in the wild (MD5, SHA, etc)? on reddit. I'm still not sure of the answer, but great discussion.
  • Scrub & Spin: Stealth Use of Formal Methods in Software Development by Gerard Holzmann. The mission of JPL is to launch software into space and operate it remotely from as far away as possible. On the first moon trip in 1969, the Lunar Lander had 10,000 lines of code, 36K of ROM, 2K of RAM, and a 43 kHz clock. The estimate for going to the moon in 2019 is 10 million lines of code, 1GB of RAM, and a 1 GHz clock. Why, when the problem hasn't changed, is software still growing? How will we make sure all that software works?
  • Namecast - Welcome to the self-healing network. Yours.
  • Schooner - Memcached/NoSQL Appliance.
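
Two of the items above quote figures that are easy to check by hand: the yottabyte conversion and Joe Stump's cost comparison. Here is a minimal Python sketch of that arithmetic. The function and constant names are made up for illustration, the storage units are assumed to be decimal (factors of 1,000, as the CrunchGear quote states), and the dollar amounts are simply the ones quoted, so this only checks the sums, not whether the comparison itself is apples to apples.

```python
# Sanity-check two figures quoted in the links above.
# All names here are illustrative; the numbers come from the quoted text.

# Storage units: each step up from a gigabyte is a factor of 1,000.
STEPS_FROM_GB = ["terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]


def gigabytes_in(unit: str) -> int:
    """Number of gigabytes in one `unit`, using decimal factors of 1,000."""
    return 1000 ** (STEPS_FROM_GB.index(unit) + 1)


# Cost figures as quoted from Joe Stump.
NODES = 50
EC2_PER_MONTH = 10_000               # whole 50-node Cassandra cluster on EC2
MYSQL_SUPPORT_PER_NODE_YEAR = 5_000  # MySQL premium support, per node


def cassandra_per_year() -> int:
    """Yearly EC2 bill for the cluster."""
    return EC2_PER_MONTH * 12


def mysql_support_per_year() -> int:
    """Yearly MySQL premium support bill for the same node count."""
    return NODES * MYSQL_SUPPORT_PER_NODE_YEAR


if __name__ == "__main__":
    print(f"1 yottabyte = {gigabytes_in('yottabyte'):,} GB")
    print(f"Cassandra on EC2: ${cassandra_per_year():,} per year")
    print(f"MySQL premium support: ${mysql_support_per_year():,} per year")
    ratio = mysql_support_per_year() / cassandra_per_year()
    print(f"Ratio: {ratio:.2f}x")
```

Both quoted figures hold up: five factors of 1,000 give the fifteen zeros in the yottabyte item, and $250,000 is a little over twice the $120,000 yearly EC2 bill.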

NorthScale - Get started with NorthScale Memcached Server today!
