Stuff The Internet Says On Scalability For April 5, 2013

Hey, it's HighScalability time:


(Dr. Who Scaling Up the Shard; click for cool animated gif)

  • 50 sextillion: # of earth-like planets in universe; 100,000: stars
  • Quotable Quotes:
    • @petdance: "I wish I had enough money to run Oracle instead of Postgres." "Why do you want to do that?" "I don't, I just wish I had enough money to."
    • @JBossMike: Java is old. Java is verbose. Java is boring. Java is dead… Java is FAST. 
    • @old_sound: We need a "shrink conf" for when scaling is not what we actually need.
    • Carsten Puls: At first, customers want to get going. Understanding what's going on under the hood isn't that important. As they grow, they want more control and to go under the hood. Managing that balance through the lifecycle is important.
    • @rbranson: What does almost every memcache library do during a multi-get when 1 out of 10 boxes times out? F*cking whole thing fails. < Reminded me of this
    • @heyavie: I wonder if King Kong's creators ever talked scalability?
    • arkitaip: It seems very risky to base your core business on a language that's only been around for two years. Sometimes web development seems to be more volatile than the fashion industry.
    • @_Mblueberries: I hate scaling and cleaning the fish.
    • @jcoglan: Reminder that 'scalability' is a property of system architecture and data layouts, not language runtimes (mostly)
    • @SuperLuckyHappy: Latest Headlines:  CIA: “Collect everything and hang on to it forever.” CIA Chief Technology Officer Big Data and Cloud Computing Pre
    • joelgrus: Finally, a way to combine the elegance of functional programming with the unwieldy, verbose syntax of Java!
    • The Archimedes Codex: The transition from the roll to the codex—the book format we know today—was a revolution in the history of data storage. The genius of the codex is that it contains knowledge not in two dimensions, like a roll, but in three. The roll has height and width; the codex has height, width, and depth. Because it has depth, it doesn’t need to be nearly as wide. A codex with 200 folios (400 pages), 6 inches wide, has the same potential data-storage area as a roll of the same height that is 200 feet long. To access data in a codex, you only have to travel through the depth dimension, which is just a couple of inches thick. 

  • Custom Silicon + ZFS + Andy Bechtolsheim = DSSD, a chip startup to "improve the performance and reliability of flash memory for high performance computing, newer data analytics and networking." Pushing compute down to the edge, onto the disk, was long ago predicted by Jim Gray, who declared "locality is king" and "processors are going to migrate to where the transducers are." Sounds like DSSD is taking a shot at fulfilling Jim's vision. They are minimizing OS overhead, excellent. ZFS probably comes in because there's a long-standing holy grail of implementing small fine-grained objects in the file system, which is a performance nightmare, but extends the promise of doing away with the database layer. The gotcha with all these plays is commodity hardware. It's hard to maintain orders-of-magnitude performance gains over commodity players who are riding much cheaper cost curves and ever-improving performance curves. Initial leads fade quickly and the capital costs keep rising. Sounds like there is a real software systems aspect here, so maybe that will decapitate those curves. In any case, it's great to see people developing hardware rather than staying mediocre with software.

  • Wired with a wonderful look at the weird: Computers Made Out of DNA, Slime and Other Strange Stuff. Certainly, they cover slime mold, but there's also liquid crystal computing, computational DNA, evolution, particle collisions, quantum computers, and frozen light. There are more things in heaven and earth, Horatio, Than are dreamt of in your philosophy...

  • Not to be outdone: How to Make a Computer from a Living Cell: The Stanford researchers' genetic logic gate can be used to perform the full complement of digital logic tasks, and it can store information, too. It works by making changes to the cell's genome, creating a kind of transcript of the cell's activities that can be read out later with a DNA sequencer.
  • Alan Kay: One way to think of all of these organizations is to realize that if they require a charismatic leader who will shoot people in the knees when needed, then the corporate organization and process is a failure. It means no group can come up with a good decision and make it stick just because it is a good idea. All the companies I’ve worked for have this deep problem of devolving to something like the hunting and gathering cultures of 100,000 years ago. If businesses could find a way to invent “agriculture” we could put the world back together and all would prosper.

  • George Dyson: "Coalitions hold the key, a conclusion to which all observed evidence, including Nils Barricelli’s experiments with numerical symbioorganisms, lends support. These coalitions are forged on many levels—between molecules, between cells, between groups of neurons, between individual organisms, between languages, and between ideas. The badge of success is worn most visibly by the members of a species, who constitute an enduring coalition over distance and over time. Species may in turn form coalitions, and, perhaps, biology may form coalitions with geological and atmospheric processes otherwise viewed as being on the side of nature, not on the side of life."

  • Quite a discussion on .NET and Node.JS - Performance Comparison: On average, Node.js wins hands down. 

  • Amen... a spot-on observation by Andrew Binstock in The Quiet Revolution in Programming: during the last 24 months, the sheer volume of change in the computing paradigm has been so great that programming has felt its imprint right away. Multiple programming paradigms are changing simultaneously: the ubiquity of mobile apps; the enormous rise of HTML and JavaScript front-ends; and the advent of big data. The movement from few to many languages has important ramifications. For example, it's now more difficult to find programmer talent that satisfies all the needs of a project; and it's more difficult as a programmer to be deeply fluent in all the necessary languages and idioms. These obstacles might suggest division of labor along the lines of programming languages (which likely reflect different concerns and separate components), but as the first chart shows, this is not happening. 

  • Simon Peyton Jones describes "Software Transactional Memory (STM), a promising new approach to programming shared-memory parallel processors, that seems to support modular programs in a way that current technology does not."

  • If you can pay for a Gold Support Package, you can now get on Google Compute Engine. Let them eat cake.

  • SimCity outages, traffic control and Thread Pool for MySQL shows how throughput and response times can be balanced using thread pools while providing protection against DDoS attacks. The mechanism seems to be the reduction of internal communication between the kernel and the threads. With too many threads, more work is spent on managing the threads than doing the work. Only a few threads in a thread pool are necessary to support tens of thousands of connections. Also a reminder that tools have to work differently at scale. You can't just list 30,000 connections to the screen. 
    • Also, MySQL thread pool and scalability examples: Real scalability is when the throughput graph is neither dropping nor going flat - it goes up and up and up with a stable response time. This can be achieved only by scaling out. Get 7,500 TPS with 1 database and 32 connections, then add an additional database and the straight line going up will reach, say, 14,000. A system with 3 databases can support 96 connections and 21,000 TPS... and on and on it goes. A toy sketch of the thread-pool idea follows.
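
A minimal Python sketch of the thread-pool idea, not MySQL's implementation: a small, fixed set of worker threads drains a much larger queue of connections instead of one thread per connection, so scheduling overhead stays bounded. The `handle_query` function and the counts are made up.

```python
# Illustration only: a bounded pool of worker threads services many queued
# "connections", rather than spawning one thread per connection.
from concurrent.futures import ThreadPoolExecutor
import time

def handle_query(conn_id):
    time.sleep(0.001)          # stand-in for the work done for one client
    return conn_id

# A few dozen workers can service tens of thousands of queued requests;
# the pool size caps thread-management overhead.
with ThreadPoolExecutor(max_workers=32) as pool:
    handled = list(pool.map(handle_query, range(10_000)))

print(f"handled {len(handled)} connections with 32 worker threads")
```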

  • Bryan Summersett sets up a salon for reading and appreciating code. The first offering: Analyzing mbostock's queue.js. Now that's documentation!

  • Pinching pennies when scaling in The Cloud. Scott Hanselman is "starting to code differently now that I can see the pennies add up based on specific changes. The point here being, you can squeeze a LOT out of a small VM, and I plan to keep squeezing, caching, and optimizing as much as I can so I can pay as little as possible in the Cloud. If I can get this VM to do less, then I will be able to run more sites on it and save money."

  • 600k concurrent HTTP connections, with Clojure & http-kit on a single PC. Granted, the server isn't doing anything, but memory and CPU usage weren't crazy when the connections were held open. Shows how to set up a virtual network interface to bypass the per-IP 65,536-connection limit (a rough sketch of why that works follows).
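
A hedged sketch of why extra source addresses matter: a TCP connection is identified by (source IP, source port, destination IP, destination port), so a single client IP tops out around 64K connections to one server address. Binding sockets to several local addresses (e.g. aliases on a virtual interface) multiplies that ceiling. The 10.0.0.x addresses and port below are hypothetical and must exist locally for this to run.

```python
# Assumes 10.0.0.10-12 exist as local aliases and a test server is listening.
import socket

SOURCE_IPS = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]  # hypothetical aliases
SERVER = ("10.0.0.1", 8080)                           # hypothetical server

conns = []
for i in range(10):  # tiny count for illustration; the benchmark held 600k
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((SOURCE_IPS[i % len(SOURCE_IPS)], 0))  # 0 = any ephemeral port
    s.connect(SERVER)
    conns.append(s)
```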

  • More impressive than a speeding bullet, Ilya Grigorik on Breaking the 1000 ms Time to Glass Mobile Barrier (slides). Network alone won't do it: Fiber-to-the-home services provided 18 ms round-trip latency, cable-based services averaged 26 ms, and DSL-based services averaged 43 ms, 4G 150-250ms. Gives a great explanation of the life of web-page in the browser and the budget needed to get under 1000ms. Result: One Request. Inline. Defer the rest.
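
A back-of-the-envelope version of that budget, using the 4G round-trip figure quoted above. The round-trip counts (roughly one each for DNS, the TCP handshake, and the HTTP request) are the usual rule of thumb, and the server time is an assumption, not a number from the slides.

```python
rtt_ms = 200                                  # mid-range of the 150-250 ms 4G figure
dns = tcp_handshake = http_request = rtt_ms   # ~1 round trip each
server_time = 200                             # assumed server processing time
spent = dns + tcp_handshake + http_request + server_time
print(f"{spent} ms on network + server, {1000 - spent} ms left to render")
# -> 800 ms spent, 200 ms left: hence "One Request. Inline. Defer the rest."
```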

  • dynamic-dynamodb provides automatic read and write provisioning for DynamoDB. Looks interesting and would save a lot of work.

  • When Netflix is not busy ruining the cloud, it is winning the World Series of how to build stuff posts with another winner: System Architectures for Personalization and Recommendation: We want the ability to use sophisticated machine learning algorithms that can grow to arbitrary complexity and can deal with huge amounts of data. We also want an architecture that allows for flexible and agile innovation where new approaches can be developed and plugged-in easily. Plus, we want our recommendation results to be fresh and respond quickly to new data and user actions. Finding the sweet spot between these desires is not trivial. Also, Karyon: The nucleus of a Composable Web Service

  • Designing Fault Tolerant Distributed Applications, nicely summarized by Edward Ribeiro:  the seven principles and tips exposed by Scott in the first part of the talk are much more important to the talk as a whole while the software that he demo'ed (ordasity, pica) are examples of application of those principles in a clean and clever design. Those principles exposed in the first part will certainly be more valuable to the general distributed systems engineer in the long term, even though I liked the demos too. :)

  • BloomJoin: BloomFilter + CoGroup to implement asymmetric joins, "where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don't share a common key with the smaller set. Before performing the reduce itself, we can use the bloom filter to filter out most of the records which would not have matched anything in the other side of the join". A rough sketch of the idea follows.
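
A rough Python sketch of that idea, with a toy Bloom filter standing in for the real implementation: build the filter over the small side's keys, drop large-side records whose keys cannot possibly match, then do the actual join on what's left.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; real systems use tuned sizes and hash functions."""
    def __init__(self, size_bits=1 << 20, hashes=3):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

small = {"a": 1, "c": 3}                                   # small side of the join
large = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]   # large side

bloom = BloomFilter()
for key in small:
    bloom.add(key)

# Filter the large side before the join; false positives are possible,
# so the join itself still checks the key for real.
candidates = [(k, v) for k, v in large if bloom.might_contain(k)]
joined = [(k, small[k], v) for k, v in candidates if k in small]
print(joined)  # [('a', 1, 'x'), ('c', 3, 'z')]
```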

  • Wonderfully clear and informative talk by Jimmy Lin: Scaling Big Data Mining Infrastructure: The Twitter Experience

  • Are futures the future? Yes, if you believe this interesting exploration by James Coglan: Callbacks are imperative, promises are functional: Node’s biggest missed opportunity. Others don't think it would help in common cases. A small Python transposition of the contrast follows.
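
To make the contrast concrete outside of JavaScript, here it is transposed into Python futures (the `fetch_*` functions are made-up stand-ins for async I/O): with callbacks you spell out what happens next at every step, while a future is a value representing the pending result that you can hold, pass around, and combine.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_user(user_id):
    return {"id": user_id, "name": "alice"}   # stand-in for a network call

def fetch_orders(user):
    return [f"order-for-{user['name']}"]      # stand-in for a network call

with ThreadPoolExecutor() as pool:
    # Callback style: control flow is threaded through by hand.
    done = threading.Event()
    def on_user(user_future):
        print("callback style:", fetch_orders(user_future.result()))
        done.set()
    pool.submit(fetch_user, 42).add_done_callback(on_user)
    done.wait()

    # Future style: the pending result is a first-class value.
    user_future = pool.submit(fetch_user, 42)
    orders_future = pool.submit(fetch_orders, user_future.result())
    print("future style:", orders_future.result())
```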

  • Jesse Johnson with a well-written and accessible series of articles on the study of large data sets: Big Data and the Topologist, The geometry of neural networks, Principal component analysis, Making linear data algorithms less linear: kernels, Topological exploration of data sets: persistent homology, Refining information from persistent homology, Finding clusters in data, Topological Clustering, Digital spheres and Reeb graphs.

  • While obviously promotional, low-latency deduping to lower the cost of storing data in expensive flash memory makes a lot of sense. A toy sketch of block-level dedup follows.
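
A toy sketch of block-level dedup, just to show the mechanism the pitch relies on: content-hash each fixed-size block and physically store each unique block once, so repeated data costs only a reference. Real inline dedup engines do this at low latency in the write path; the block size and in-memory "store" here are illustrative.

```python
import hashlib

BLOCK = 4096
store = {}   # content hash -> block bytes (the expensive "flash" we pay for)

def write(data: bytes):
    refs = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # physical write only if block is new
        refs.append(digest)
    return refs                           # the logical file: a list of refs

def read(refs):
    return b"".join(store[d] for d in refs)

refs = write(b"A" * BLOCK * 3 + b"B" * BLOCK)   # 4 logical blocks, 2 unique
print(f"{len(store)} unique blocks stored for {len(refs)} logical blocks")
assert read(refs) == b"A" * BLOCK * 3 + b"B" * BLOCK
```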

  • Why dart2js produces faster JavaScript code from Dart: The dart2js compiler, which converts Dart to JavaScript, now produces smaller and faster code thanks to its global type inferencer. By analyzing the entire program, the compiler can eliminate bailouts and redundant checks, resulting in code that runs faster on modern JavaScript engines.

  • If you don't know jack about Beanstalk, Ashley Schroder has written an informative intro to what it is and why it has value: Magento and AWS Elastic Beanstalk – The Scalability Silver Bullet? Git is used for deployment. Rather than manually managing instances, as long as you follow Beanstalk's way of doing things, it brings together various parts of Amazon’s infrastructure into an auto-scalable framework. Many details in the article, more promised in a follow-up.

  • Dynamically Scalable, Fault-Tolerant Coordination on a Shared Logging Service: In this paper, we explore a fresh approach which uses a shared log as a flexible, raw substrate for implementing coordination primitives. We show that we can reduce an existing coordination service, ZooKeeper, to use a shared log, and show that a shared log offers benefits such as high performance and dynamic reconfigurability.
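
A toy illustration of the paper's premise, shrunk to a single process: coordination primitives built on nothing but an append-only shared log, where the log's total order does the arbitration and clients learn outcomes by replaying it. The in-memory `SharedLog` and the lock example are stand-ins, not the paper's system or ZooKeeper's API.

```python
import threading

class SharedLog:
    """In-memory stand-in for a shared, totally ordered, append-only log."""
    def __init__(self):
        self._entries, self._mutex = [], threading.Lock()

    def append(self, entry):
        with self._mutex:
            self._entries.append(entry)
            return len(self._entries) - 1   # position in the total order

    def read(self):
        with self._mutex:
            return list(self._entries)

def try_acquire(log, client, resource):
    """Lock built on the log: whoever's 'acquire' entry appears first wins;
    every client discovers the winner the same way, by replaying the log."""
    log.append(("acquire", resource, client))
    for kind, res, owner in log.read():
        if kind == "acquire" and res == resource:
            return owner == client
    return False

log = SharedLog()
print(try_acquire(log, "client-A", "leader"))  # True: A's entry is first
print(try_acquire(log, "client-B", "leader"))  # False: B replays and sees A
```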


  • Sage Observations on Product Development from Chip Overclock that will help scare away the evil spirits of failed projects future. The kind of stuff you learn on your way to Oz.