Stuff The Internet Says On Scalability For September 28, 2012

It's HighScalability Time:

  • Quotable Quotes:
    • @dbasch: The world is full of "scalability engineers" who would die from an orgasm if their software ever saw 10,000 requests in a day.
    • @mtnygard: “Scaling issues are always expressed as a queue backing up somewhere.” —@moonpolysoft #strangeloop
    • @rbranson: If your data fits in main memory, you're doing it wrong. #strangeloop
    • @peakscale: Using schemaless DBs an "overreaction" & "confuses the poor impl. of schemas with the value that schemas provide"
    • @adrianco: GM: Performance analysis is complicated by your brain thinking LINEARLY about a computer system that is NONLINEAR. 
    • @littleidea: it's better to have infinite scalability and not need it, than to need infinite scalability and not have it
  • Looks like Google is on the right track with their language understanding efforts. How hierarchical is language use: In this paper, we review evidence from the recent literature supporting the hypothesis that sequential structure may be fundamental to the comprehension, production and acquisition of human language. Moreover, we provide a preliminary sketch outlining a non-hierarchical model of language use and discuss its implications and testable predictions.

  • Lots of techniques for Enhancing the Scalability of Memcached. Very detailed and filled with many potential wins in your own code. The optimized memcached increases throughput by 6X and performance per watt by 3.4X over the baseline, though a commenter pointed out the tests were against an older version of memcached. Some of the changes: Hash table locking mechanism changed to allow for parallel access; Bag LRU – The data structure is changed to an array of different sized LRUs with singly linked-list bags of cache items; DELETE and STORE operations now use a parallel hash table approach with striped locks; remove locks on GETs; and many more.
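
    The striped-lock idea mentioned above can be sketched in a few lines. This is only an illustration, not the memcached code from the article: the class name `StripedDict`, the stripe count, and the method names are all assumptions made up for the example. It also still takes a lock on GET, so it does not attempt the lock-free GET path the article describes.

```python
import threading


class StripedDict:
    """Hash table guarded by several locks ("stripes") instead of one global lock.

    Hypothetical sketch: names and stripe count are illustrative assumptions.
    """

    def __init__(self, num_stripes=16):
        # One lock and one bucket dict per stripe.
        self._locks = [threading.Lock() for _ in range(num_stripes)]
        self._buckets = [{} for _ in range(num_stripes)]

    def _stripe(self, key):
        # Keys that hash to different stripes never contend for the same lock.
        return hash(key) % len(self._locks)

    def store(self, key, value):
        i = self._stripe(key)
        with self._locks[i]:
            self._buckets[i][key] = value

    def delete(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            # True if the key was present and removed.
            return self._buckets[i].pop(key, None) is not None

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._buckets[i].get(key)
```

    With a single global lock every STORE and DELETE serializes; with stripes, only operations landing on the same stripe wait on each other.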

  • Copying data 3 times for safety has to be expensive, especially as storage requirements skyrocket. StorageMojo in More efficient erasure coding in Windows Azure storage shows how using advanced erasure codes can provide reliability and dramatically reduce storage requirements. We'll probably see more of this in the future.
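
    To make the trade-off concrete, here is a toy sketch. It uses a single XOR parity fragment, which is the simplest possible erasure code and not the more advanced codes the article covers; the function names and the fragment counts are assumptions chosen for illustration. It compares raw storage overhead and shows one lost fragment being rebuilt from the survivors.

```python
from functools import reduce


def storage_overhead(data_fragments, parity_fragments):
    """Bytes stored per byte of user data for a (data + parity) layout."""
    return (data_fragments + parity_fragments) / data_fragments


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments):
    """Append one XOR parity fragment to equal-length data fragments."""
    parity = reduce(xor_bytes, fragments)
    return list(fragments) + [parity]


def recover(stored, lost_index):
    """Rebuild the single missing fragment by XOR-ing all surviving ones."""
    survivors = [f for i, f in enumerate(stored) if i != lost_index]
    return reduce(xor_bytes, survivors)


if __name__ == "__main__":
    stored = encode([b"abcd", b"efgh", b"ijkl"])
    print(recover(stored, 1))       # rebuilds the second data fragment
    print(storage_overhead(1, 2))   # three full copies: 1 original + 2 replicas
    print(storage_overhead(3, 1))   # 3 data fragments + 1 parity fragment
```

    Three full copies cost 3x the raw data, while the toy 3+1 layout costs about 1.33x but survives only one lost fragment. Production codes add more parity fragments to tolerate more failures.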

  • Beginning on October 8th Stanford is having what looks like a really cool course: An Introduction to Computer Networks. One of the teachers is Nick McKeown. I've listened to him speak a few times and he's excellent. 

  • This week's selection for Werner Vogels' book club: on Leases

  • The Strange Loop conference managed to trend on Twitter for a while. By the volume of tweets it looks like people were having a good old time. Here are some notes from the conference you might find interesting: Day 2, Day 1, and here's an exhaustive set of notes. Machine Learning for Hackers by John White looks especially good.

  • Nature decouples. You can understand nature one layer of the onion at a time. You don't need to know about quarks and gluons to understand water turbulence. What makes science possible is that you can study the different layers of the onion independently. Software is still usually a Big Ball of Mud.

  • I've been taking a course on architecture so I found the article Fundamental: Stress-Strain Curves In Web Engineering by John Allspaw quite thoughtful. It is tempting to draw parallels between structural engineering and software engineering, but engineering is a conscious balancing of forces with goals, and the problem for software engineering is that there are no equations of equilibrium to guide structural decisions. Also good: A Mature Role for Automation: Part I.

  • Architecture Without an End State. The always excellent Michael Nygard outlines 8 rules for dealing with complex systems: Embrace Plurality, Contextualize Downstream, Beware Grandiosity, Decentralize, Isolate Failure Domains, Data Outlives Applications, Applications Outlive Integrations, Increase Discoverability. 

  • Since you can't predict the future, your best bet is to measure and react. Generate lots and lots of bets. Put them in the field. Measure which ones are succeeding. And then scale up the ones that win.

  • On spinlocks and sleep(): Yes, we really did achieve a 3.7X speedup on a garbage collection benchmark by removing a call to sleep().

  • Scaling Riak to 25 million ops/day at Kiip. Excellent set of notes on the talk. Kiip team found Riak extremely solid. Some advice: Scale early, Don’t use secondary index (2i) in real-time queries, The JavaScript engine requires a lot of RAM, Don’t restart nodes in rapid succession.

  • Good Cassandra Counters thread on Google Groups. Rohit Bhatia with a nice TLDR: if you want 99.99% accurate counters and can manage with eventual consistency, Cassandra works nicely.

  • The new data haves and have-nots. Commerce Weekly: Big data in retail: This model not only caters to large retailers over smaller retailers because of the size of their wallets, but because it’s easier for brands to interact with the corporate headquarters of a major retailer with 1,000 stores than to interact with 1,000 owners of independent stores, Hawkins writes. He goes into detail about how this business model will affect the industry on several fronts — you can read his piece in its entirety here.

  • Client Side Load Balancing and Failover with Javascript and Cookies: Combining a server list stored in cookies with a client-side web application gives better reliability for web application content such as pictures, and makes it possible to control traffic, failover, and load balancing on both the client and the server side.
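
    The selection logic behind this approach is language-independent, so here is a rough sketch in Python rather than the article's JavaScript. Everything specific in it is an assumption for illustration: the cookie name `servers`, the `|` separator, and the caller-supplied `fetch(host, path)` function are hypothetical and not taken from the article.

```python
import random


def servers_from_cookie(cookie_header, name="servers"):
    """Extract a '|'-separated server list from a raw Cookie header string."""
    for part in cookie_header.split(";"):
        key, _, value = part.strip().partition("=")
        if key == name and value:
            return [s for s in value.split("|") if s]
    return []


def fetch_with_failover(path, servers, fetch, rng=random):
    """Try servers in random order (load balancing), moving on when one fails.

    `fetch(host, path)` is supplied by the caller and is expected to raise
    OSError when a host is unreachable.
    """
    candidates = list(servers)
    rng.shuffle(candidates)  # spread load across the listed hosts
    last_error = None
    for host in candidates:
        try:
            return fetch(host, path)
        except OSError as err:
            last_error = err  # remember the failure, try the next host
    raise ConnectionError(f"all {len(candidates)} servers failed") from last_error
```

    The server controls the list by setting the cookie, and the client decides which host to hit and when to give up on it, which is the split of responsibilities the article describes.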

  • Oracle NoSQL Database Exceeds 1 Million Mixed YCSB Ops/Sec: We ran a set of YCSB performance tests on Oracle NoSQL Database using SSD cards and Intel Xeon E5-2690 CPUs with the goal of achieving 1M mixed ops/sec on a 95% read / 5% update workload.

  • The shape of the internet has changed: It now lives life on the edge. Most people still don't know the tricksy role CDNs play in serving content on the Internet. 45 percent of internet traffic today is served from CDNs and 98 percent of internet traffic now consists of content that can be cached.

  • Twitter has released Algebird: Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).

  • Dmitriy Samovskiy with a Concise Introduction to Infrastructure as Code: Once you achieve high levels in monitoring and deployment (not necessarily highest though), you can start doing things like self-healing, autoscale, testing through fault injection and other cool things. A nice, short list of all the things you can do to create IaC.

  • WordCamp 2011 videos on all things WordPress are now available.

This week's selection:

My record is 14 skips. 51 seems impossible!