Stuff The Internet Says On Scalability For July 6, 2012

It's HighScalability Time (with 33% more goodness for the same NoPrice):

  • 2.5 terabits per second: Infinite-capacity wireless vortex beams; One Trillion Tons: Carbon Pollution; 1 Megawatt: more power used due to the leap second; 100,000 terabytes: information storage capability of one gram of human stool; 2.8B social interactions per day: Zynga; 2 trillion requests a day: Akamai
  • Hugh E. Williams on eBay by the numbers: 10 petabytes of data, 300 million items for sale, 250 million user queries per day, 100 million active users,  2 billion pages served per day, $68 billion in merchandise sold in 2011, 75 billion database calls each day.
  • @adrianco: At best you are only a few one-line code changes away from an outage. Experience finds those bugs, then there is a big dose of random luck.
  • With strangely little reaction to Amazon's post-mortem, it was nice to see Dmitriy Samovskiy's take in Applying 5 Whys to Amazon EC2 Outage: AWS effectively lost its control plane for an entire region as a result of a failure within a single AZ. This was not supposed to be possible. Also: Power outage, seriously?; On Hacker News, Multi-AZ failover never happened. Not even manually. Totally disappointed; My Friday Night With AWS
  • Lessons Netflix Learned from the AWS Storm. As leaders by example in the cloud space, Netflix is the canary in the cloud mine for most of us. The canary didn't die, though it was a bit wobbly. Regional isolation worked. Europe was unaffected as US-EAST went down. Highly distributed Cassandra survived with 1/3rd of its nodes down. What did go wrong? Further proof that code to deal with failures is the most likely code to fail. "Our internal mid-tier load-balancing service. This caused unhealthy instances to fail to deregister from the load-balancer which black-holed a large amount of traffic into the unavailable zone. In addition, the network calls to the instances in the unavailable zone were hanging, rather than returning no route to host."
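The hanging calls Netflix describes are exactly what explicit deadlines prevent. As a minimal sketch (the function and timeout value here are assumptions for illustration, not Netflix's actual code), wrapping every remote call in a bounded wait turns a silent hang into a fast, handleable failure:

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run fn on a worker thread and give up after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        # result() raises concurrent.futures.TimeoutError when the deadline passes.
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)  # don't block waiting for the hung worker

def hung_backend_call():
    time.sleep(1.0)  # stands in for a remote call that never returns
    return "ok"

try:
    call_with_deadline(hung_backend_call, 0.2)
except concurrent.futures.TimeoutError:
    print("deadline hit, failing fast instead of black-holing the request")
```

A call that errors immediately ("no route to host") is easy to route around; a call that hangs consumes a thread and a load-balancer slot until something times it out.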
  • In the whippings-will-continue-until-morale-improves department is the least scalable team-building practice since eunuchs: stack ranking. It's like throwing a bone into a hungry dog pack and expecting anything but the gnashing of teeth and the spilling of blood.
  • Cloud Independence Day. Inspired by a spirit of independence, Benjamin Black declares: The Google Cloud Platform is the biggest deal in IT since Amazon launched EC2 and will cause the cloud market to explode. Google has core infrastructure services on par with Amazon's RDS, EBS, EC2, and CloudWatch, additional services like BigQuery and PageSpeed, and new world-class services just waiting to be productized, like a CDN, DNS, and load balancing. Plus Google is an infrastructure company at planetary scale. So we now have true competition, which means we now have a true utility market.
  • Love this find from Pete Warden: Networks of book makers in late Medieval England - Alex Gillespie's talk on medieval manuscripts was eye-opening in a lot of ways. I never realized that you could get cheap books before printing arrived, on demand from local scribes. The impact of the technology wasn't so much due to the price, as the fact that mass production made books far more plentiful than ever before, with a much more centralized distribution model.
  • How do debuggers keep track of the threads in your program? Joe Damato with an excellent description of  the "relatively undocumented API for debuggers (or other low level programs) that can be used to enumerate the existing threads in a process and receive asynchronous notifications when threads are created or destroyed."
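The API Joe Damato describes is libthread_db; the observable effect it exposes can be seen more crudely on Linux, where every thread of a process appears as a kernel task under /proc/<pid>/task. A rough sketch (Linux-only, and only a poor man's substitute for the real notification API):

```python
import os
import threading

def kernel_thread_ids(pid):
    # On Linux each thread of a process is a directory under /proc/<pid>/task;
    # a debugger-like tool can enumerate (or poll) these to track threads.
    return sorted(int(t) for t in os.listdir(f"/proc/{pid}/task"))

stop = threading.Event()
workers = [threading.Thread(target=stop.wait) for _ in range(3)]
for w in workers:
    w.start()

before = kernel_thread_ids(os.getpid())  # main thread + 3 workers
stop.set()
for w in workers:
    w.join()

print(f"saw {len(before)} kernel tasks while 3 workers were alive")
```

Polling /proc misses short-lived threads, which is why real debuggers use the asynchronous thread-creation/destruction notifications the article covers.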
  • Videos from Google I/O without all the cool tchotchkes, unfortunately.
  • Scalable Logging and Tracking. Kedar Sadekar explains how Netflix collects data at scale: push logging data to a separate thread; fail fast if the expected data is not present; keep the logging path free of dependencies; use canaries to select the best GC strategy; auto-scale based on requests per second; slowly ramp up traffic; log collectors gather the log data and send it on to Hive for persistence; the data is then analyzed by numerous AI bots.
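The first two steps, push logging to a separate thread and fail fast, combine naturally into a bounded queue that drops records rather than stalling the request path. A minimal sketch of that idea (class name, queue size, and in-memory "sink" are all assumptions, not Netflix's implementation):

```python
import queue
import threading

class AsyncLogger:
    """Hand log records to a background thread; never block the request path."""
    def __init__(self, maxsize=1000):
        self.q = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self.shipped = []  # stands in for the downstream collector/Hive sink
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record):
        try:
            self.q.put_nowait(record)  # fail fast: drop rather than stall
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        while True:
            record = self.q.get()
            if record is None:  # sentinel: shut down the worker
                break
            self.shipped.append(record)

    def close(self):
        self.q.put(None)
        self._worker.join()

log = AsyncLogger()
for i in range(100):
    log.log({"event": "page_view", "n": i})
log.close()
print(f"shipped={len(log.shipped)} dropped={log.dropped}")
```

Dropping a log line is cheap; blocking a customer request on logging is not, which is the whole point of keeping the logging path out of the critical path.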
  • HyperDex. NoSQL database offering: a richer API, stronger consistency guarantees and predictable failure recovery, while also performing better than other systems for most real-world workloads.
  • Scale up or out? Good Google Group discussion on choosing AWS instance types so as to minimize the impact of noisy neighbors by selecting m1.xlarge, m2.4xlarge, c1.xlarge, cc1.4xlarge, or cc2.8xlarge.
  • Memory allocation overhead is the number one undiagnosed performance killer of most programs. In Impact of memory allocators on MySQL performance, Alexey Stroganov shows that newer is not always better: the newer glibc, with its new malloc implementation, may be unsuitable and can show worse results than older platforms.
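The usual way to run this kind of comparison yourself is to swap the allocator at load time with LD_PRELOAD, no recompile needed. A sketch (library paths and the benchmark binary are illustrative; they vary by distro and workload):

```shell
# Run the same workload under different allocators and compare throughput.
# glibc malloc is the default; jemalloc and tcmalloc are loaded in its place.
./my_benchmark                                                   # glibc malloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my_benchmark
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 ./my_benchmark
```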
  • Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs: we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
  • Future-predicting system cuts app loading time. Cut 6 seconds from phone app loading time by predicting when the application will be used, based on signals like location and time. Uses 2% of battery per day.
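The core idea, prefetch the app most often launched in the current context, can be sketched as a tiny frequency model. This is only the bucket-and-count intuition (the class and its signals are assumptions; the research system uses richer models):

```python
from collections import Counter, defaultdict

class LaunchPredictor:
    """Toy model: predict the next app from an (hour, location) context."""
    def __init__(self):
        self.history = defaultdict(Counter)  # context -> app launch counts

    def record(self, hour, location, app):
        self.history[(hour, location)][app] += 1

    def predict(self, hour, location):
        counts = self.history.get((hour, location))
        if not counts:
            return None  # no history for this context yet
        return counts.most_common(1)[0][0]

p = LaunchPredictor()
for _ in range(5):
    p.record(8, "home", "news")   # mornings at home: mostly the news app
p.record(8, "home", "email")
print(p.predict(8, "home"))        # -> news
```

Preloading the predicted app ahead of the tap is where the 6 seconds comes from; the 2% battery cost is the price of sampling the context signals.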
  • High performance vehicular connectivity using opportunistic erasure coding: Motivated by poor network connectivity from moving vehicles, we develop a new loss recovery method called opportunistic erasure coding (OEC).
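OEC itself makes coding decisions opportunistically per transmission; the underlying erasure-coding idea it builds on is simpler and worth seeing. A minimal sketch (single XOR parity over k equal-length packets, which recovers any one lost packet; real codes tolerate more losses):

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of k equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Rebuild a single missing packet (marked None) from the rest plus parity."""
    missing = received.index(missing_marker) if False else received.index(None)
    repaired = bytearray(parity)
    for j, pkt in enumerate(received):
        if j != missing:
            for i, b in enumerate(pkt):
                repaired[i] ^= b
    return bytes(repaired)

packets = [b"hell", b"o wo", b"rld!"]
parity = xor_parity(packets)
lost_one = [b"hell", None, b"rld!"]   # one packet dropped in transit
print(recover(lost_one, parity))       # b'o wo'
```

The receiver never has to ask for a retransmission, which is exactly what you want on a vehicular link where round trips are slow and connectivity is fleeting.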
  • Lots of good stuff on Watson from IBM. All behind a very tall and wide paywall of course.
  • Riak behind a Load Balancer?
  • Diamonds are a qubit's best friend. A group of Harvard scientists made a cool discovery using diamonds, which led to being able to "store information in qubits for nearly two seconds — an increase of nearly six orders of magnitude."
  • Why mobile performance is difficult: on mobile networks packets get lost for reasons other than congestion: you move around your house while surfing a web page, or you're on the train, or you just block the signal some other way. When that happens it's not congestion, but TCP thinks it is and reacts by slowing down the connection. John Graham-Cumming solves mobile performance problems via customized TCP stack settings and by actively monitoring and classifying connections so as to give the best performance. These custom adaptations are not explained; hopefully they will be in another post.
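Since the post doesn't say which settings were tuned, here are two generic per-socket TCP knobs of the kind such a customized stack might touch (purely assumed examples, not the post's actual configuration):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle: don't delay small writes waiting to coalesce them,
# which matters for chatty request/response traffic on high-RTT links.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Cap how long unacknowledged data may sit before the connection errors
# out, rather than hanging on a dead mobile link (Linux-only option).
if hasattr(socket, "TCP_USER_TIMEOUT"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10_000)  # ms

nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(f"TCP_NODELAY set: {nodelay}")
s.close()
```

The harder part, telling loss-from-interference apart from loss-from-congestion so the stack doesn't back off unnecessarily, is exactly what the post leaves unexplained.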