Stuff The Internet Says On Scalability For June 8, 2012

It's HighScalability Time:

  • 21TB : Tumblr relational data
  • Quotable Quotes:
    • @ajbaird: Scalability is not a "feature" tacked on at the end of development.
    • @h_ingo: I like Doron's comparison: Build a MySQL scale-out cluster instead, then buy 2 Ferrari's with the money saved :-) 
  • You might figure Harry Potter would have some sort of scaling spell, but no, he has to rely on the muggle-powered Azure. Pottermore uses Azure to handle 110 million page impressions a day.
  • Ian Bogost in What Should We Do for a Living? brings up a sobering idea from The Facebook Illusion: the Internet economy will not save us; it sucks at scaling jobs and exists only because it is subsidized by surpluses from the old economy it was supposed to replace. Where is that replicator when we need it?
  • In Praise of Idleness. Bruce Dawson argues against busy waiting and for locks, in most cases. Couldn't agree more: busy waiting is a lot of CPU doing nothing, and programmers quickly lose track of the overall flow of the program, though Adaptive spinning looks interesting. A minimal spin-versus-wait sketch follows the list. On Reddit.
  • Mikio Braun tackles Scalability challenges in Big Data Science. Scaling out your database won't scale out your computations; solving problems like k-means clustering requires multi-threading, actors, message queues, or map-reduce. He then goes into large scale learning with stochastic gradient descent and stream processing, concluding that you don't scale in real-time. A bare-bones SGD sketch follows the list.
  • Interesting discussion on Hacker News on Amazon EC2 Instance Comparison. The usual "it costs a lot but you get a lot" talk, but there was a consensus that there really needs to be a better way to conveniently compare all prices for all features across all services.
  • Storage Mojo on SSDs: It appears that, with the glaring exception of the SPARC Cluster, the 3 systems with the lowest latency at the average, 90th percentile and maximum response times do not use SSDs. Some SSD-based systems equal or exceed the non-SSD systems at some points, but overall the non-SSD systems seem to have an important latency advantage.
  • Continuing on the SSD theme, Jay Kreps has written a dazzling article on SSDs and Distributed Data Systems and how they may impact your system design: store more data per machine and do fewer network requests; potentially abandon partitioning altogether and store all data on all machines; co-locate storage and the application it serves; get rid of RAM caches; graph databases fit well with the ability of SSDs to handle many seeks per request; do we need MapReduce when random IO is cheap?; use log-structured formats for storage.
  • Fast Intersection of Sorted Lists Using SSE Instructions. Ilya Katsov with an awesome analysis of algorithms for finding the intersection of sorted lists. The partitioned vector version written in C killed the other versions written in C, Java, and Lucene; the scalar baseline they all improve on is sketched after the list. Also: Searching for substrings in a massive string.
  • Testing 3 million hyperlinks, lessons learned. Sam Saffron with a wonderfully detailed article on finding broken links on a site. It turns out clay tablets last a lot longer than hyperlinks. Lessons: throttle requests per domain; write a robust validation function; set yourself up with a proper User Agent string; handle 302s, 303s and 307s; and many more. A rough link-checker sketch follows the list.
  • Dan Rayburn with his usual measured and thoughtful analysis in Netflix's CDN News Being Overblown By Many Wall Street Analysts, Focus On The Facts: the caches being deployed by Netflix are for video traffic only; Netflix still uses Akamai for other services like small object delivery. Netflix's caches are set up to only proactively cache Netflix's content, they are not demand-driven caches: they use a pre-population method that works for a set library with a known set of subscribers, but would not be appropriate for a general-purpose CDN or caching system.
  • Flickr parses Exif image data on the client side using Web Workers: Parsing Exif on the client-side is both fast and efficient. It allows us to show the user a thumbnail without putting the entire image in the DOM, which uses a lot of memory and kills performance. 
  • Dave Cramer is right that the cloud isn't great for spiky traffic flows, and wrong about how easy it is to do all the plumbing yourself.
  • Loading half a billion rows into MySQL. Gives a lot of details on why and how they did it: use as few indices as you can; loading the data in sequential order not only makes the loading faster, but also makes the resulting table faster; if you can, load the data directly from MySQL (instead of going through a flat file intermediary), it will be much faster. A comment war broke out between MySQL, Oracle, and NoSQL gladiators. A hedged bulk-load sketch follows the list.
  • Facebook's networking plans: "The interaction between servers and networking devices is going to become a very blurry line over time. What I’m envisioning is that the top-of-rack switch…will evolve to be more than just an Ethernet switch. These switches will likely incorporate the boot devices for the servers as well as the equivalent of network interface cards — the cards that connect today’s servers to switches." This looks a lot like your typical telco rack.
  • A new blog you might be interested in: The Database Scalability Blog by Doron Levari, founder and CEO/CTO of ScaleBase. Why shared-storage DB clusters don't scale: with a shared disk, all of that coordination needs to be done "globally" between all participating servers, through network adapters and cables, introducing latency. Scale-out your DB on ARM-based servers: A database on an iPad? Naa, I prefer breaking another record in Fruit Ninja. A database on 20 ARM-based servers? If it's 5x faster and costs 60x less in electricity and cooling - then yes, definitely.
  • Twitter is open sourcing Zipkin: a distributed tracing system that we created to help us gather timing data for all the disparate services involved in managing a request to the Twitter API. The Zipkin collector receives the data via Scribe and stores it in Cassandra along with a few indexes. The indexes are used by the Zipkin query daemon to find interesting traces to display in the web UI.
  • Tumblr is open sourcing Jetpants: its in-house toolchain for managing huge MySQL database topologies. Jetpants offers a command suite for easily cloning replicas, rebalancing shards, and performing master promotions. It’s also a full Ruby library for use in developing custom billion-row migration scripts, automating database manipulations, and copying huge files quickly to multiple remote destinations.
  • Facebook is open sourcing Folly - C++ utilities that are used heavily in production, like AtomicHashMap.
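
A few quick sketches to make some of the items above concrete. These are hedged illustrations in Python, not code from the linked articles, and every name in them is made up. First, the busy-waiting-versus-locks point from In Praise of Idleness: the spinning consumer burns a whole core doing nothing, while the condition-variable version sleeps until the producer signals it.

    import threading

    # Busy-wait version: the consumer spins, burning a core until a value shows up.
    def busy_wait_take(shared):
        while shared.get("value") is None:   # 100% CPU doing nothing useful
            pass
        return shared["value"]

    # Idle version: the consumer sleeps on a condition variable until signaled.
    class Mailbox:
        def __init__(self):
            self.value = None
            self.cond = threading.Condition()

        def put(self, v):
            with self.cond:
                self.value = v
                self.cond.notify()           # wake the sleeping consumer

        def take(self):
            with self.cond:
                while self.value is None:    # re-check guards against spurious wakeups
                    self.cond.wait()         # releases the lock and sleeps, no CPU burned
                return self.value

Adaptive spinning is the middle ground: spin briefly in case the lock frees up right away, then fall back to sleeping.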
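
Next, the stochastic gradient descent idea behind Mikio Braun's large-scale learning discussion: instead of computing a gradient over the whole data set, update the model one example at a time, which is what makes online and streaming learning feasible. This is a generic least-squares example, not code from his post.

    import random

    def sgd_linear_regression(examples, dims, epochs=5, lr=0.01):
        """Plain SGD for least squares: w <- w - lr * (w.x - y) * x, one example at a time."""
        w = [0.0] * dims
        for _ in range(epochs):
            random.shuffle(examples)          # "stochastic": visit examples in random order
            for x, y in examples:             # x is a list of length dims, y is a float
                pred = sum(wi * xi for wi, xi in zip(w, x))
                err = pred - y
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        return w

    # e.g. learn y = 2*x + 1 from a handful of points, using [x, 1] as the feature vector
    data = [([x, 1.0], 2 * x + 1) for x in range(10)]
    weights = sgd_linear_regression(data, dims=2, epochs=200, lr=0.01)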
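
For the sorted-list intersection article, the interesting part is the SSE-accelerated partitioned vectors, but the baseline every version is measured against is the classic two-pointer merge, sketched here in scalar form (no SIMD; Katsov's fast versions are C with intrinsics).

    def intersect_sorted(a, b):
        """Two-pointer intersection of two sorted lists: O(len(a) + len(b))."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] < b[j]:
                i += 1
            elif a[i] > b[j]:
                j += 1
            else:                  # equal: element is in both lists
                out.append(a[i])
                i += 1
                j += 1
        return out

    assert intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 9, 12]) == [3, 5, 9]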
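
The link-checking lessons boil down to being a polite, robust HTTP client. A rough sketch of the per-domain throttling, User-Agent, and redirect points; Sam Saffron's actual tooling is not this, and the User-Agent string and interval are placeholders.

    import time
    import urllib.request
    from urllib.parse import urlparse

    USER_AGENT = "Mozilla/5.0 (compatible; LinkChecker/0.1; +http://example.com/bot)"  # placeholder
    last_hit = {}   # domain -> timestamp of the last request, for per-domain throttling

    def check_link(url, min_interval=2.0, timeout=10):
        """Return True if the URL resolves to a live response, throttling per domain."""
        domain = urlparse(url).netloc
        wait = min_interval - (time.time() - last_hit.get(domain, 0.0))
        if wait > 0:
            time.sleep(wait)                  # throttle requests per domain
        last_hit[domain] = time.time()

        req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": USER_AGENT})
        try:
            # urlopen follows 301/302/303/307 redirects itself, so a clean return means the link is alive
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except Exception:
            return False                      # broken, timed out, or refused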
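
And finally, the MySQL bulk-loading tips (few indexes, insert in primary-key order) in sketch form. The table, connection details, and batch size are invented for illustration; this is not the authors' migration script.

    import mysql.connector   # any DB-API driver would do; connection parameters are placeholders

    conn = mysql.connector.connect(host="db", user="app", password="secret", database="warehouse")
    cur = conn.cursor()

    # Start with only the primary key: every secondary index slows every insert down.
    cur.execute("CREATE TABLE events (id BIGINT PRIMARY KEY, user_id BIGINT, payload TEXT)")

    def bulk_load(rows, batch_size=10000):
        """rows: an iterable of (id, user_id, payload) tuples, pre-sorted by id (the primary key)."""
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                cur.executemany("INSERT INTO events VALUES (%s, %s, %s)", batch)
                conn.commit()
                batch = []
        if batch:
            cur.executemany("INSERT INTO events VALUES (%s, %s, %s)", batch)
            conn.commit()

    # Only after the load is done, add the secondary indexes you actually need:
    # cur.execute("CREATE INDEX idx_events_user ON events (user_id)")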

This week's musical selection: