Stuff The Internet Says On Scalability For February 18, 2011

Submitted for your reading pleasure on this cold and rainy Friday...

  • Quotable Quotes:
    • Cary Millsap: You can't hardware yourself out of a performance problem you softwared yourself into.
    • @juokaz: schema-less databases doesn't mean data should have no structure
  • Scalability Porn:
  • Hadoop has hit a scalability limit at a whopping 4,000 machines, and its developers are looking to create the next-generation architecture. Their target is clusters of 10,000 machines and 200,000 cores. The fundamental idea of the re-architecture is to divide the two major functions of the Job Tracker, resource management and job scheduling/monitoring, into separate components.
  • If you need one presentation to help someone understand the world of NoSQL, you could do a lot worse than Dwight Merriman's interview at MongoSF 2010 with Software Engineering Radio. Thorough, informative, tight, insightful. Nicely done.
  • Wonderful thread on Stack Overflow about Optimizing Kohana-based Websites for Speed and Scalability. The advice applies to most sites, especially PHP sites.
  • Amazon brings back the big sexy of static web sites by enabling the serving of websites directly out of S3. Of course static ain't what it used to be. Now you can farm out all the dynamic portions, like comments, to specialized services and just manage the content yourself for cheap. Amazon's Werner Vogels has already made the switch.
  • StorageMojo with an interesting list of storage-related papers from FAST '11: A Study of Practical Deduplication, Tradeoffs in Scalable Data Routing for Deduplication Clusters, Exploiting Half-Wits: Smarter Storage for Low-Power Devices, and more.
  • Greg Linden observes on the Google Megastore: The problem with providing Megastore's level of consistency is performance. The paper mostly describes Megastore's performance in sunny terms, but, when you look at the details, it does not compare favorably with other databases. Megastore has "average read latencies of tens of milliseconds" and "average write latencies of 100-400 milliseconds". In addition, Megastore has a limit of "a few writes per second per entity group" because higher write rates will cause conflicts, retries, and even worse performance.
  • Database Technology used today for Large Scale Data. A really good summary look at Massive Parallel Processing / Parallel DBMS; Column-Oriented Databases; Stream Processing / ESP or CEP; Key-Value Store / MapReduce.
  • What do you do with all that Big Data? Visualize it with the Google Public Data Explorer. Some nice looking charts and maps.
  • On the theme of mobile devices becoming the new supercomputers, Stacey Higginbotham with What Chips Tell Us About the Future of Mobile: new chips designed to go inside base stations and other aspects of an operator’s core network also show how a proliferation of devices and demand for data are changing the way networks are built. On a related note, Qualcomm Reveals Quad-core Processors for Mobile Devices. Think of all the power you will hold in your hand...
  • An Interesting Problem: Scaling Graph Databases by Alex Popescu. Nice list of articles talking about the difficult challenge of spanning graph databases across more than one process.
  • Case Study: MultiLane - a concurrent blocking multiset.  Dmitry Vyukov describes a way to improve scalability of a producer-consumer system by means of partitioning. Basically, it presents a way to wrap any producer-consumer queue in order to get a queue with the same properties but better scalability.
  • That Tune, Named - How does the music-identifying app Shazam work its magic? Farhad Manjoo with an absolutely fascinating look at how Shazam recognizes songs from short snippets of audio recorded over a cell phone. They concentrate on just the intense parts of the music, create a fingerprint, and search their library of 8 million songs for a match. Algorithm aficionados will appreciate this paper for more details.
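
For the curious, the partitioning idea from the MultiLane item above can be sketched in a few lines of Python. This is only a rough illustration of wrapping an existing queue type in several independent lanes, not Dmitry Vyukov's implementation. The class name PartitionedQueue, the lane count, the thread-id based lane choice, and the polling fallback are all assumptions made for the example.

```python
import queue
import threading


class PartitionedQueue:
    """Hypothetical wrapper that spreads items over several independent lanes.

    Ordering across lanes is not preserved, so the result behaves like a
    blocking multiset rather than a strict FIFO queue.
    """

    def __init__(self, lanes=4, make_queue=queue.Queue):
        # Any queue type with put/get/get_nowait can be wrapped (assumption).
        self._lanes = [make_queue() for _ in range(lanes)]

    def _home(self):
        # Each thread gets a "home" lane derived from its thread id, so
        # different threads tend to contend on different underlying queues.
        return threading.get_ident() % len(self._lanes)

    def put(self, item):
        self._lanes[self._home()].put(item)

    def get(self, poll_interval=0.01):
        n = len(self._lanes)
        start = self._home()
        while True:
            # Fast path: scan every lane without blocking, home lane first.
            for i in range(n):
                try:
                    return self._lanes[(start + i) % n].get_nowait()
                except queue.Empty:
                    continue
            # Slow path: wait briefly on the home lane, then rescan the others.
            try:
                return self._lanes[start].get(timeout=poll_interval)
            except queue.Empty:
                continue
```

Producers and consumers call put() and get() exactly as they would on a single queue. The polling fallback keeps the sketch short; a production version would want a smarter way to block across all lanes.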