Stuff The Internet Says On Scalability For July 27, 2012

It's HighScalability Time:

  • Almost 1 Billion Users: Facebook; 30,000 connections across 94 locations: the Olympics Network; 2.5 quintillion: bytes of data created each day; 80K QPS: MemSQL.
  • In some early results, Zencoder found EC2 was faster than GCE in their video transcoding tests, saying Google needs larger instances with faster CPUs. Love how Google's jbeda said they would take a look at the results. +1 for competition and benchmarks. Something to keep in mind: on GCE a virtual CPU means a hyperthread, so an n1-standard-8 instance gets 8 hyperthreads on 4 physical cores, not 8 physical cores.
  • Kevin Rose recommends hiring generalists rather than developers with niche skills, not giving away your company, and making advisors investors. Founders should also probably stick around, and managers shouldn't blame developers.
  • Is MemSQL the world's fastest database? BS meter on high, but it was created by two former Facebook engineers, who should know their MySQL. And recall Cassandra started out of Facebook. The magic sauce of this MySQL-compatible database is that it operates in-memory and optimizes queries by translating SQL into C++ (a sketch of the query-compilation idea follows this list).
  • That pesky power law ruins everything: 20% of iOS applications make 97% of the revenue. Directly related to The Sparrow Problem, where low software prices make it hard to build standalone software companies, which makes spirit-killing buyouts the goal.
  • Remember when "man bites dog" was news? Alexander Haislip put Twitter's recent few minutes of downtime in a new context, calling it the Era of the Outage: a time when expectations of 100% uptime and always-on availability make any outage a news event. It turns out Twitter had an infrastructure double whammy. Is that really news? Aren't these just websites?
  • In perhaps the nicest rant ever, Hadapt founder Daniel Abadi asks Why Database-to-Hadoop Connectors are Fundamentally Flawed and Entirely Unnecessary, in which he predicts Hadoop and structured databases will become one, saying: "There is absolutely no technical reason why there needs to be two separate systems." Others like Trevor C. think differently, saying: "Every time we tweak Hadoop to make it more 'database fancy' it crushes all of our other workloads."
  • 230 slides on High performance network programming on the JVM by Urban Airship's Erik Onnen. I think the C++ version could be done in 100 slides. But really, very thorough and very good.
  • Great summary of the Coprocessors vs MapReduce? thread on Google Groups by Andrew Purtell: if you receive a lot of bulk data and need to transform it before storing it into HBase, then a MapReduce process is the efficient option. Even with an identity transform it is more efficient to drop all of the new data into place in one transaction rather than a transaction per item; this is the rationale for HBase bulk loading. On the other hand, if the data arrives in a streaming fashion, Coprocessors make it possible to conveniently transform it inline as it is persisted, via Observers. (An Observer sketch follows this list.)
  • Thoughts on Big Data Technologies (4): Our Love-Hate relationship with the Relational Database. Ben Stopford with a thoughtful series of articles on Big Data. I liked this POV: "Rearchitecting for different points in the performance trade-off curve leaves traditional architectures lacking. This is important from the perspective of the big data movement because these new or previously niche technologies are now serious contenders for big data. The architectural shifts are not new per se: in-memory technology, column orientation and shared-nothing architectures have been around for many years but only recently have hardware advances allowed them to be serious contenders."
  • Harish Ganesan with an extremely detailed look at The art of infrastructure elasticity, covering the design of an elastic and scalable infrastructure for a queuing application. Very useful.
  • A new database to take a look at, FoundationDB: it decouples checking whether transactions conflict (which provides isolation) from making transactions durable, so the two jobs can be scaled independently. The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast (a toy conflict-check sketch follows this list). The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes and provide a limited window of MVCC snapshots for reads. On Hacker News.
  • How Galaxy Maintains Data Consistency in the Grid: Galaxy employs a software implementation of a cache-coherence protocol, similar to the one CPUs use to coordinate their L1 caches. The purpose of the protocol is to allow any cluster node to read and write any data item while maintaining coherence (a toy MSI state machine follows this list).
  • Clarifying Throughput vs. Latency. Aater Suleman found interview candidates didn't have a good handle on the difference between latency and throughput, so he came up with a helpful explanation: "Throughput is defined as the amount of water entering or leaving the pipe every second and latency is the average time required for a droplet to travel from one end of the pipe to the other. The time a droplet waits in the queue before entering the pipe is called the queuing delay." (A worked Little's Law example follows this list.)
  • Good Google Groups discussion on sizing hardware for a Cassandra cluster.
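
The MemSQL item above mentions translating SQL into C++. As a rough illustration of why compiling a query beats interpreting it, here is a minimal Java sketch (Java just to keep all the sketches below in one language; MemSQL's real codegen emits C++, and every name here is invented):

```java
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of why compiled queries beat interpreted ones.
// An interpreter re-walks the WHERE-clause tree for every row, while
// the "compiled" form bakes the comparison into one tight predicate.
public class QueryCompilationSketch {

    // Interpreted form: WHERE col0 > 30 as an expression tree.
    interface Expr { int eval(int[] row); }

    static Expr column(final int index)  { return row -> row[index]; }
    static Expr literal(final int value) { return row -> value; }
    static Expr gt(final Expr left, final Expr right) {
        return row -> left.eval(row) > right.eval(row) ? 1 : 0;
    }

    static long countInterpreted(List<int[]> rows, Expr where) {
        return rows.stream().filter(r -> where.eval(r) == 1).count();
    }

    // "Compiled" form: the same predicate specialized ahead of time,
    // costing one comparison per row instead of a tree walk.
    static long countCompiled(List<int[]> rows, IntPredicate generated) {
        return rows.stream().filter(r -> generated.test(r[0])).count();
    }

    public static void main(String[] args) {
        List<int[]> rows = List.of(new int[]{25}, new int[]{42}, new int[]{31});
        Expr where = gt(column(0), literal(30));
        System.out.println(countInterpreted(rows, where));        // 2
        System.out.println(countCompiled(rows, age -> age > 30)); // 2
    }
}
```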
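
For the streaming case in Andrew Purtell's summary, an HBase Observer can transform data inline as it is persisted. A minimal sketch against the 0.94-era coprocessor API; the column family, qualifier, and lower-casing transform are invented for illustration:

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical Observer that normalizes one column as writes stream in,
// so the transform happens inline at persist time rather than in a
// separate MapReduce pass. Family/qualifier names are made up.
public class NormalizingObserver extends BaseRegionObserver {
    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] QUALIFIER = Bytes.toBytes("email");

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, boolean writeToWAL)
            throws IOException {
        List<KeyValue> kvs = put.getFamilyMap().get(FAMILY);
        if (kvs == null) {
            return;
        }
        for (int i = 0; i < kvs.size(); i++) {
            KeyValue kv = kvs.get(i);
            if (!Bytes.equals(kv.getQualifier(), QUALIFIER)) {
                continue;
            }
            // Rewrite the cell in place with a lower-cased copy of the value.
            byte[] normalized =
                Bytes.toBytes(Bytes.toString(kv.getValue()).toLowerCase());
            kvs.set(i, new KeyValue(kv.getRow(), FAMILY, QUALIFIER,
                                    kv.getTimestamp(), normalized));
        }
    }
}
```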
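
FoundationDB's conflict-resolution job amounts to range-based optimistic concurrency: at commit, reject a transaction if anything it read was overwritten after its read version. A toy sketch of that check, with entirely hypothetical data structures (a real resolver is far more sophisticated and also garbage-collects old versions):

```java
import java.util.ArrayList;
import java.util.List;

// Toy range-based optimistic conflict detection. Note the check needs
// only key ranges and versions, not database contents or written values,
// which is why it can be small and fast relative to the durable store.
public class ConflictResolverSketch {

    record Range(String begin, String end) { // half-open [begin, end)
        boolean overlaps(Range other) {
            return begin.compareTo(other.end) < 0
                && other.begin.compareTo(end) < 0;
        }
    }

    record CommittedWrite(long commitVersion, Range range) {}

    private final List<CommittedWrite> recentWrites = new ArrayList<>();

    /** Commit iff no write committed after readVersion overlaps a read range. */
    public synchronized boolean tryCommit(long readVersion,
                                          List<Range> readRanges,
                                          List<Range> writeRanges,
                                          long newCommitVersion) {
        for (CommittedWrite w : recentWrites) {
            if (w.commitVersion() <= readVersion) continue; // reader saw it
            for (Range r : readRanges) {
                if (w.range().overlaps(r)) return false;    // conflict: abort
            }
        }
        for (Range wr : writeRanges) {
            recentWrites.add(new CommittedWrite(newCommitVersion, wr));
        }
        return true; // the durable store may now apply the writes in order
    }
}
```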
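
For the Galaxy item: the cache-coherence protocols CPUs use (the MSI/MESI family) reduce to a small state machine per cached line. A toy MSI sketch of one node's view of one item; this is the generic textbook protocol, not Galaxy's actual implementation, which adds ownership transfer over the network:

```java
// Toy MSI state machine for one cached item on one node.
public class CoherenceSketch {
    enum State { MODIFIED, SHARED, INVALID }

    private State state = State.INVALID;

    // Local read: an INVALID line must be fetched, becoming SHARED.
    State onLocalRead() {
        if (state == State.INVALID) state = State.SHARED; // fetch from owner
        return state;
    }

    // Local write: gain exclusive ownership, invalidating all other copies.
    State onLocalWrite() {
        state = State.MODIFIED; // broadcast invalidate to sharers
        return state;
    }

    // Another node reads: a MODIFIED line is flushed and demoted to SHARED.
    State onRemoteRead() {
        if (state == State.MODIFIED) state = State.SHARED;
        return state;
    }

    // Another node writes: our copy becomes stale.
    State onRemoteWrite() {
        state = State.INVALID;
        return state;
    }
}
```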
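
Aater Suleman's pipe analogy maps straight onto Little's Law (items in flight = throughput × latency). A worked example with made-up numbers:

```java
// Little's Law with the pipe analogy: in-flight = throughput * latency.
// All numbers below are invented for illustration.
public class LittlesLaw {
    public static void main(String[] args) {
        double throughput   = 200.0;  // droplets (requests) per second
        double serviceTime  = 0.050;  // seconds inside the pipe
        double queuingDelay = 0.150;  // seconds waiting to enter the pipe

        double latency  = queuingDelay + serviceTime; // end-to-end: 0.200 s
        double inFlight = throughput * latency;       // L = lambda * W

        System.out.printf("latency = %.3f s, in flight = %.0f requests%n",
                          latency, inFlight);         // 0.200 s, 40 requests

        // Note the decoupling: a fatter pipe doubles throughput without
        // changing latency; a shorter pipe cuts latency without moving
        // more water per second.
    }
}
```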

This week's selection: