Stuff The Internet Says On Scalability For July 13, 2012

It's HighScalability Time (Good luck today):

  • A Friday the 13th Postmorterama:
    • James Hamilton with some high-powered perspective on the report for the Fukushima Nuclear Accident. Apparently they haven't heard of the blameless post-mortem. Lots of interesting stuff, but this general lesson learned could potentially avert a disaster: operators can’t figure out what is happening or take appropriate action without detailed visibility into the state of the system.
    • Evernote with a nicely detailed note on a recent outage. A kernel panic happened while upgrading two new “shard” servers with 3x as much RAM, SSDs instead of 15krpm disks, bonded networking, and an updated kernel. They had to revert and shite loves to happen when other shite happens.
    • Heroku with their postmortem on what happened when AWS went down. They lost 30% of their instances across 3 AZs in the US-East region. Rich detail on the impact of the AWS outage, but not much on what they can do about it in the future, probably because there's not much to do unless you want to take the multi-region hit.
    • Forget the money, follow the lack of power. Salesforce, like Amazon, suffered an outage because of a power failure. Why don't these expensive backup power systems seem to work?
    • CloudBees’ Postmortem on Two Recent Outages: AWS and Leap Second Linux Bug: customers with a database running in a healthy zone weren’t impacted by this outage. No databases were lost.
  • Scaling lessons learned at Dropbox: theme is robustify. Run with extra load, create custom stats, master the shell for analytics, verbose logging is good, test weak points, test by running code, keep it consistent, and many more. Nice article.
  • @b6n: @jaykreps "All consistent state is local." -- Protip O'Neill
  • Urban Airship on how they used HBase to construct a high write mobile backend to support hundreds of millions of devices and a frontend API that sustains thousands of requests per second: HBase offers operational ease and a low latency, high throughput system with known scalability characteristics.
  • Bitly on Debugging a Specialized Database Cache. They use simpleleveldb to implement their view network history feature. Make reads fast by doing the hard work at insertion time. Valgrind helped find memory leaks. Malloc overhead was very high, so they redesigned memory usage. Graphs and metrics uncovered a lot of bugs.
  • Social and engineering are two words that don't usually go together in a good way. That's what makes this effort by Facebook to add social to their tools quite interesting. Phabricator, for example, is their internal code-review tool, which was extended using Open Graph to publish a single Open Graph action against a "diff" object when changes are created, accepted, closed, or requested. Annoying no doubt, but an interesting use. A social dimension was added to a large number of other tools like Pixelcloud, Scuba, SIOG, StayFitB, and Pokemon. StayFitB tracks workouts using Open Graph actions each time an employee badges in to the fitness center.
  • They are coming. They are coming in 2014 says HP. Memristors that is...and yes...memristors could still change everything.
  • Ricky Ho with another great article, this time it's an expansive description of the Couchbase Architecture. Some key features: load balancing, async writes, append-only updates, automatic compaction, map function generated query indexes.
  • HBase Log Splitting. Jimmy Xiang with a detailed explanation of how log splitting is used to "recover lost updates from region server failures."
  • Architecting Data Center Networks in the era of Big Data and Cloud by Brad Hedlund: This session for data center architects will explore the transition from traditional scale-up chassis based Layer 2 centric networking, to the next generation of scale-out Layer 3 Leaf/Spine CLOS based fabrics of fixed switches.
  • Scalr's Sebastian Stadil and Julien Rialan with a wonderful interview on FLOSS Weekly 217. You may recognize Sebastian as the founder of the Cloud Meetup in Sillycon Valley. Today's fun fact: Scalr's software was used to help discover the Higgs. 
  • Scalability and Reliability in the Cloud by Greg Thompson. Nice slide deck covering scalability factors, vertical vs horizontal, stateless applications, connection management, segmenting traffic, segmenting responsibility, clustering, and much more.
  • Now this is testing: NBC, Google, Stage ‘War Games’ To Prepare for Olympic Disruptions. Let's hope those emergency power generators cooperate.
  • Don't get a divorce Mommy and Daddy memcached, I hate it when you argue.

This week's musical selection: