Stuff The Internet Says On Scalability For June 22, 2012

It's HighScalability Time:

  • Quotable Quotes:
    • @xinqiyang: Partition, replicate, index. Many efficiency and scalability problems are solved the same way.
    • @SnideLemon: "Let's switch to a bottle of wine for economic scalability." -- the best justification for additional drinking ever
    • @cloudbees: "A whopping 57% of respondents cited desire for scalability as their chief motivation to go to cloud"
  • You are the next computer. Cells are capable of arithmetic using two naturally occurring molecules: erythromycin, an antibiotic, and phloretin, found in apple trees. "These act as inputs, switching a reaction within the two types of cell on or off. The reaction leads to the production of a red or green fluorescent protein that signals the result of the calculation. For example, in the half adder cell, the presence of both molecules makes it glow red." (A half-adder logic sketch appears after this list.)
  • AddThis with an excellent series of articles on their infrastructure: CPU cycles in the cloud are about 3x more expensive than the fully loaded cost of dedicated cycles in our own data center. Data stored in the cloud is at least 10x more expensive than in our own clusters. Even if data storage were free, we would still be paying for CPU cycles to process the data and network costs to expose it. But the killer is still latency and IO bottlenecks. At our scale, there aren’t meaningful pricing options to overcome these in the cloud. (A toy cost comparison appears after this list.)
  • The node.js scalability myth is exploded by Felix Geisendörfer. Love the point that events aren't any more scalable than threads. Node.js is single threaded, so a single process can't take advantage of multiple cores; scaling means running more processes. It concludes by saying there is no silver bullet, to the happiness of all Werewolves. (An event-loop demo appears after this list.)
  • Sweet discussion of improving 99th percentile read latency on the Oracle boards. Take a look at: Java GC activity, checkpointing, log cleaning, cache misses, and disk bottlenecks. Reducing the number of threads and processes brought down the 99th percentile latency significantly. (A percentile calculation sketch appears after this list.)
  • Netflix Recommendations: Beyond the 5 stars. Fascinating mystery tour through what works and doesn't work when predicting what you'll enjoy watching next. An excellent description of their process. 
  • Shite happens. Twitter goes down. AWS goes down. Complex things fail. It doesn't mean anyone is incompetent. It just means there's a continuous learning process of how to keep these beasts fed and happy. You could do better?
  • Will Google ever adopt Go on the backend? I thought yes because Go has incredibly fast startup times, which you need to fluidly schedule work across nodes. Waiting 30 minutes for Java to start isn't very agile. Jonathan Shapiro brashly says it will never happen, Google will never use Go on the backend: it's insecure, it multiplies the thermal load by a factor of 3 so they can't afford the heat, its garbage collection uses 3x the memory of C++, and the performance sucks.
  • Excellent talk on how Kafka is used at LinkedIn. It's not a metamorphosis. Kafka is used to push data from data sources to data consumers. It solves the data scale and access problem. Data is only useful when you get it all in one place and Kafka helps you do that. (A producer/consumer sketch appears after this list.)
  • Are your systems as tough as a mouse? Now here's how you stop an attack. To protect themselves during malaria infections, mice can kill their own healthy red blood cells, cutting off the parasite’s primary resource.
  • The catch-22 of read/write splitting. Doron Levari on using replicated database servers for reads: it's an OK scale-out solution and relatively easy to implement, but improvements in caching systems, changing requirements in online applications, and big data and big concurrency are rapidly driving it towards its fate. (A router sketch appears after this list.)
  • An oh so sweet discussion of Big O Notation on Stack Overflow.
  • Good Google Groups discussion on the pitfalls of using timestamps as keys. Hotspotting is the problem and the HBase Writes Distributor is the solution. (A key-salting sketch appears after this list.)
  • Sean Cribbs on Eventually Consistent Data Structures. The holy grail of distributed computing: Convergent Replicated Data Types are data structures that tolerate eventual consistency. They replace traditional data-structure implementations and all have the property that, given any number of conflicting versions of the same datum, there is a single state on which they converge. Also, Strong Eventual Consistency and Conflict-free Replicated Data Types. (A G-Counter sketch appears after this list.)
  • He with lots of data, computers, and smart folks wins.  Building high-level features using large scale unsupervised learning:  we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days.
  • Large-Scale Machine Learning at Twitter. This is the paper behind a great presentation at the Hadoop Summit. Seems like very practical stuff.
  • Towards a Unified Architecture for in-RDBMS Analytics: two key factors that we found impact performance: the order data is stored, and parallelization of computations on a single-node multicore RDBMS.
  • Rob Diana with some good Geek Reading for June 21, 2012
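
A few sketches promised above. First, the half adder from the cell-computing story: it's the smallest two-bit arithmetic circuit, and the "both molecules present" case the article describes as glowing red would line up with the carry output. A minimal sketch in plain Python (nothing to do with the actual biology):

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two 1-bit inputs: sum is XOR, carry is AND."""
    return a ^ b, a & b

# Truth table. In the cell version, both inputs present is the
# carry=1 case -- the one the article says glows red.
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"a={a} b={b} -> sum={s} carry={c}")
```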
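
On the AddThis numbers: their posts give multipliers, not absolute prices, so the base rates below are placeholders; only the 3x CPU and 10x storage factors come from the articles. A toy model of how they compound:

```python
# Placeholder base rates (invented for illustration); only the 3x and
# 10x multipliers come from the AddThis articles.
dedicated_cpu = 1.0       # fully loaded cost per compute unit
dedicated_storage = 1.0   # cost per storage unit

cloud_cpu = 3 * dedicated_cpu
cloud_storage = 10 * dedicated_storage

# A storage-heavy mix (typical of analytics pipelines) feels it most:
cpu_units, storage_units = 100, 1_000
dedicated = cpu_units * dedicated_cpu + storage_units * dedicated_storage
cloud = cpu_units * cloud_cpu + storage_units * cloud_storage
print(f"cloud costs {cloud / dedicated:.1f}x dedicated for this mix")
```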
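
On the node.js point: the core issue is that CPU-bound work monopolizes a single-threaded event loop. The demo below uses Python's asyncio rather than node, but the shape of the problem is the same: work runs serially on the one loop thread until you reach for multiple processes.

```python
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # While this runs on the loop's one thread, nothing else is served.
    return sum(i * i for i in range(n))

async def main() -> None:
    loop = asyncio.get_running_loop()

    start = time.perf_counter()
    for _ in range(4):
        cpu_bound(2_000_000)  # strictly serial on the event loop thread
    print(f"on the loop thread: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        jobs = [loop.run_in_executor(pool, cpu_bound, 2_000_000)
                for _ in range(4)]
        await asyncio.gather(*jobs)
    print(f"across processes:   {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```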
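
On the 99th percentile thread: p99 is easy to compute but worth being precise about, since the tail is exactly where GC pauses, checkpoints, and cache misses show up. A minimal nearest-rank sketch (function name mine):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample >= p percent of the data."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# e.g. p99 of read latencies in milliseconds (made-up numbers):
latencies_ms = [1.2, 0.9, 1.1, 45.0, 1.0, 0.8, 1.3]
print(f"p99 = {percentile(latencies_ms, 99):.1f} ms")  # the 45 ms outlier
```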
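
On the Kafka talk: the push-from-sources, pull-by-consumers pattern is simple to show. A hedged sketch using the kafka-python client (broker address, topic name, and payload are placeholders, not anything from the talk):

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# A data source pushes events onto a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"url": "/home", "ms": 42}')
producer.flush()

# Any number of downstream systems (Hadoop, search, monitoring) can
# consume the same stream independently, each at its own pace.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:  # blocks, reading messages as they arrive
    print(record.value)
```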
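
On read/write splitting: part of why the pattern spread is that it fits in a dozen lines. A minimal router sketch (connection objects elided, names mine); the catch Doron describes is baked in, since a read issued right after a write may hit a replica that hasn't caught up:

```python
import random

class ReadWriteRouter:
    """Writes go to the primary; reads are spread across replicas."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, primary, replicas):
        self.primary = primary    # single read/write connection
        self.replicas = replicas  # list of read-only connections

    def execute(self, sql, *params):
        if sql.lstrip().upper().startswith(self.WRITE_VERBS):
            return self.primary.execute(sql, *params)
        # Replication lag means this read may be slightly stale.
        return random.choice(self.replicas).execute(sql, *params)
```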
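
On timestamp keys: the standard fix for hotspotting is to salt the key so consecutive timestamps fan out across regions instead of all landing on the region that owns "now". A sketch of the idea (bucket count is arbitrary):

```python
NUM_BUCKETS = 16  # roughly the number of regions to spread writes over

def salted_key(timestamp_ms: int) -> bytes:
    """Prefix a timestamp with a deterministic salt to spread writes."""
    salt = timestamp_ms % NUM_BUCKETS
    return f"{salt:02d}-{timestamp_ms}".encode()

# The trade-off: a time-range read must now scan all NUM_BUCKETS salt
# prefixes and merge the results.
print(salted_key(1340323200000))  # b'00-1340323200000'
```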
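
On CRDTs: the canonical starter example is the G-Counter, a grow-only counter with one slot per node where merge is an element-wise max. Because max is commutative, associative, and idempotent, any two conflicting replicas converge to the same state. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merge = element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas diverge, then converge no matter the merge order.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```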

This week's selection: listen to Darwin Tunes, music evolved by consumer choice: