Stuff The Internet Says On Scalability For March 7th, 2014

Hey, it's HighScalability time:


Twitter valiantly survived an Oscar DDoS attack by non-state actors.

  • Several Billion: Apple iMessages per Day along with 40 billion notifications and 15 to 20 million FaceTime calls. Take that WhatsApp. Their architecture? Hey, this is Apple, only the Shadow knows.
  • 200 qubit quantum computer: more states than atoms in the universe; 10 million matches: Tinder's per day catch; $1 billion: Kickstarter's long tail pledge funding achievement
  • Quotable Quotes:
    • @cstross: Let me repeat that: 100,000 ARM processors will cost you a total of $75,000 and probably fit in your jacket pocket.
    • @openflow: "You can no longer separate compute, storage, and networking." -- @vkhosla #ONS2014
    • @HackerNewsOnion: New node.js co-working space has 1 table and everyone takes turns
    • @chrismunns: we're reaching the point where ease and low cost of doing DDOS attacks means you shouldn't serve anything directly out of your origin
    • @rilt: Mysql dead, Cassandra now in production using @DataStax python driver.
    • @CompSciFact: "No engineered structure is designed to be built and then neglected or ignored." -- Henry Petroski
    • Arundhati Roy: Revolutions can, and often have, begun with reading.
    • Brett Slatkin: 3D printing is to design what continuous deployment is to code.
  • Well Facebook got on that right quick: Facebook wants to use drones to blanket remote regions with Internet. We talked about a drone driven Internet back in January. This is good news IMHO. Facebook will have the resources to make this really happen. Hopefully. Maybe. Cross your fingers.

  • A vast hidden surveillance network runs across America, powered by the repo industry. This intelligence database is built by individuals driving around and taking pictures of license plates to track cars. Imagine how Google Glass will enable the tracking of people, without any three-letter government agencies in the loop. Crowdsourcing is fun!

  • Francis Bacon way back in the 1600s was all over BigData with his ant, spider, and honey bee analogy: good scientists are not like ants (mindlessly gathering data) or spiders (spinning empty theories). Instead, they are like bees, transforming nature into a nourishing product. This essay examines Bacon's "middle way" by elucidating the means he proposes to turn experience and insight into understanding. The human intellect relies on "machines" to extend perceptual limits, check impulsive imaginations, and reveal nature's latent causal structure, or "forms."

  • Game programmers, take a look at Amit Patel and his Red Blob Games blog. It's full of deep and delightfully detailed articles on different topics for game programmers. The primary focus is on grid games. His most recent post, Graph theory for pathfinding, is a great tutorial on the "parts of graph theory we use in graph-based pathfinding algorithms, and how grids are represented." Highly recommended.
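
    A minimal sketch of the grid-as-graph idea (my illustration, not code from Amit's post): grid cells are nodes, open adjacent cells are edges, and breadth-first search finds a shortest path.

    ```python
    from collections import deque

    def neighbors(cell, walls, width, height):
        """A grid is a graph: each cell is a node; open adjacent cells are its edges."""
        x, y = cell
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in walls:
                yield (nx, ny)

    def bfs_path(start, goal, walls, width, height):
        """Breadth-first search: shortest path on an unweighted graph (here, a grid)."""
        came_from = {start: None}
        frontier = deque([start])
        while frontier:
            current = frontier.popleft()
            if current == goal:               # reconstruct the path by walking backwards
                path = []
                while current is not None:
                    path.append(current)
                    current = came_from[current]
                return path[::-1]
            for nxt in neighbors(current, walls, width, height):
                if nxt not in came_from:
                    came_from[nxt] = current
                    frontier.append(nxt)
        return None  # goal unreachable

    print(bfs_path((0, 0), (3, 3), walls={(1, 1), (2, 1), (2, 2)}, width=4, height=4))
    ```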

  • Interesting ratio: two thirds of the brain's energy budget is used to help neurons or nerve cells "fire" or send signals. The remaining third is used for housekeeping or cell-health maintenance.

  • If you love Erlang's hot deployment capability then you might like what Netflix has done with Java: they've made it so you can inject code into a running Java application at any time. The behavior of any application can be altered without a full scale deployment. A good example is injecting new adapter code into the API handler layer, a source of frequent change as clients change. It looks like Netflix will soon become a viable PaaS layer in their own right, better than even Amazon for a particular problem type.
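
    Netflix's mechanism is Java-specific and their own, but the underlying idea of swapping handler code in a running process without a redeploy can be sketched generically. A hypothetical Python illustration (none of these names come from Netflix):

    ```python
    import threading

    class HandlerRegistry:
        """Hypothetical sketch: handlers can be replaced while the app keeps serving."""
        def __init__(self):
            self._lock = threading.Lock()
            self._handlers = {}

        def register(self, name, func):
            with self._lock:
                self._handlers[name] = func      # "inject" new behavior at runtime

        def dispatch(self, name, *args):
            with self._lock:
                func = self._handlers[name]
            return func(*args)

    registry = HandlerRegistry()
    registry.register("format_movie", lambda title: {"title": title})
    print(registry.dispatch("format_movie", "Metropolis"))

    # Later, a new adapter is registered without restarting the process:
    registry.register("format_movie", lambda title: {"title": title, "version": 2})
    print(registry.dispatch("format_movie", "Metropolis"))
    ```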

  • More Quick Links from Greg Linden. Always a good mix of stuff. In the news: mobile gesture detection, Tesla, Google, Yahoo, app stores, and "Customers who bought this."

  • How does Google use Percolator, Dremel and Pregel? Percolator: incremental indexing, a faster and more efficient way to update the index than batch re-indexing. Dremel: interactive analysis and extraction over huge datasets. Pregel: a tool for large-scale graph problems.
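
    Pregel is an internal Google system, but its vertex-centric style is easy to sketch. A toy Python loop (my illustration, not Google's API) that labels connected components: each superstep, a vertex keeps the smallest label it has seen and messages its neighbors when its label changes.

    ```python
    def connected_components(graph):
        """Pregel-flavored sketch: vertices exchange messages in supersteps
        until no vertex learns anything new."""
        label = {v: v for v in graph}             # each vertex starts with its own id
        active = set(graph)                       # vertices with new information
        while active:
            inbox = {v: [] for v in graph}
            for v in active:                      # send phase
                for u in graph[v]:
                    inbox[u].append(label[v])
            active = set()
            for v, msgs in inbox.items():         # compute phase
                if msgs and min(msgs) < label[v]:
                    label[v] = min(msgs)
                    active.add(v)
        return label

    g = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
    print(connected_components(g))   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
    ```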

  • More brain system analogies. Art and the Default Mode Network: “There is a paradigm shift going on. The focus has been on getting the brain to do things, rather than studying what it’s doing all the time.” In approaching the DMN, Raichle’s musings demand we reorient our binary notions of active versus inactive, for with the DMN we find the omnipresent “baseline” brain, the parts that brain imaging studies always seek to cancel out so that the true point of “activation” can be seen.

  • Jakob Engblom with a fascinating analysis of "The Mill," a new general-purpose high-performance processor design from Out-of-the-Box Computing, and declares "It just might Work!" Mill is VLIW-style, does away with general-purpose registers, and claims impressive power and performance numbers. Jakob: Mill is a 2.0 design that has taken many good ideas that have failed in their version 1.0 incarnations, refreshed them, fixed their flaws, and made them work.

  • How to Choose a Hard Drive. Don't listen to Backblaze, advises Henry Newman. Instead: choosing a hard drive requires attention to detail about the drive and about how you are going to use it. Think about this: if you have a RAID device with SSDs on a write budget and one drive fails, you have to spend a lot of time and writes rebuilding parity or re-mirroring. This is something you might not have thought about when calculating your write budget. The whole concept of a write budget was unheard of 5 years ago, except among a few people serious about flash drives.
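
    A rough back-of-the-envelope of why a rebuild matters to a write budget, using made-up numbers rather than anything from the article:

    ```python
    # Illustrative, assumed numbers only: a failed drive in a mirrored pair means
    # the replacement gets a full-capacity rewrite -- time and writes that an SSD
    # write budget may not have accounted for.
    capacity_bytes = 4 * 10**12          # 4 TB drive (assumed)
    rebuild_rate = 100 * 10**6           # 100 MB/s sustained rebuild rate (assumed)
    drive_endurance_tbw = 400            # rated terabytes written (assumed)

    rebuild_hours = capacity_bytes / rebuild_rate / 3600
    rebuild_tb = capacity_bytes / 10**12

    print(f"rebuild time: {rebuild_hours:.1f} hours of degraded operation")
    print(f"extra writes: {rebuild_tb:.0f} TB, or "
          f"{100 * rebuild_tb / drive_endurance_tbw:.0f}% of a {drive_endurance_tbw} TBW budget")
    ```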

  • 245 million people visit a Walmart store: This past holiday season is when mobile commerce really went mainstream; more than half of the traffic to Walmart.com came from mobile devices.

  • Wonderful. Understanding Throughput and Latency Using Little’s Law: What I think’s interesting, and is probably a major source of confusion for many, is how throughput at one level determines latency at the next higher level. The simplest example of this is an instruction. We can design a processor that wastes no time in latching the results and that executes an instruction in one long cycle. But we don’t because we can get much higher throughput by pipelining the instructions, while causing only a nominal increase in latency due to the additional time taken to latch the results. So, by increasing single-instruction latency we decrease the latency of processing a complete stream of instructions.
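
    Little's Law (L = λW: items in flight equal throughput times latency) makes the trade-off easy to quantify. A toy calculation with invented cycle times, not figures from the article:

    ```python
    # Little's Law: L = lambda * W  (in-flight items = throughput * latency).
    # Invented numbers, just to show the pipelining trade-off described above.

    # Unpipelined: one instruction completes per long 5 ns cycle.
    unpipelined_latency_ns = 5.0
    unpipelined_throughput = 1 / unpipelined_latency_ns      # 0.2 instructions/ns

    # Pipelined into 5 stages of 1 ns each, plus 0.2 ns latch overhead per stage.
    stage_ns = 1.0 + 0.2
    pipelined_latency_ns = 5 * stage_ns                      # 6 ns per instruction (worse)
    pipelined_throughput = 1 / stage_ns                      # ~0.83 instructions/ns (better)

    in_flight = pipelined_throughput * pipelined_latency_ns  # L = lambda * W = 5 stages busy

    stream = 1_000_000  # instructions
    print(f"unpipelined stream time: {stream / unpipelined_throughput / 1e3:.0f} us")
    print(f"pipelined stream time:   {stream / pipelined_throughput / 1e3:.0f} us")
    print(f"instructions in flight (Little's Law): {in_flight:.1f}")
    ```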

  • With version 1.2, Cassandra has made it easier to store more data on a single node. With off-heap data structures, virtual nodes, and improved JBOD support you can now run nodes with several terabytes of data. Also, a good experience report in Application Failure Scenarios with Cassandra.

  • Murat with a great explanation of Naiad: A timely dataflow system: Naiad is especially useful for incremental processing of graphs. As has been observed before, MapReduce is inappropriate for graph processing because of the large number of iterations needed in graph applications. MapReduce is a functional model, so using it requires passing the entire state of the graph from one stage to the next, which is inefficient. And for real-time applications the batch processing delays of MapReduce become unacceptable.

  • Recent papers by Susskind and Tao illustrate the long reach of computation. I can't say I really understand what Scott Aaronson is saying, but I like how he says it.

  • How to Balance (Mobile) Traffic Across Applications Using PF_RING: In essence as long as you have enough cores on your system to allocate to the balancer, you can avoid purchasing costly yet not-so-versatile traffic balancers. As we release the source code of apps such as the pfdnacluster_master, you can change your traffic balancing policies the way you want/like without relying on any hardware traffic balancer manufacturer. Nice and cheap, isn’t it?

  • StorageMojo on FAST ’14: the big picture: Which points to the democratization of advanced file and storage research – or perhaps a return to the old normal. These technologies are fundamental to a digital civilization – a transition we’ve just begun – and much remains to be done before we have the robust persistence we need.

  • You know how programmers, after seeing pre-existing code, want to demolish it and start over? Well, in Japan they do that with houses. Why Are Japanese Homes Disposable? People demolish homes and build their own custom homes. In Japan homes lose their entire value in 15 to 30 years, so there's a huge appetite for new homes. And lots of architects willing to build you a house to your own preferences. Houses don't have to last. Local governments don't regulate. Yet the houses are very high quality. Why this dynamic? Earthquakes make people think structures are temporary? Building codes are getting stricter? Since there's no incentive to maintain a house, there's no Home Depot or development of maintenance skills. The downside is that disposable homes mean no building of wealth; the investment is thrown away. I can't figure out if there's a deep parallel with software here or not. But it is interesting.

  • Through a Table, Sparsely: There are several ways to deal with the sparseness of the world around us. You can rearrange your schema, maybe factor out that address table, but essentially continue to treat sparseness as one of the miseries of life. You can read up on the Fourth Normal Form and slice your tables into lots of thin vertical strips, but that presents obvious practical problems. You can switch to a “NoSQL” system that promises more flexibility, but blaming the syntax has never made sense to me. SQL is not the problem, nor is your schema. The implementation underneath is what matters. There is nothing inherent in SQL that prevents it from adding columns and indexes without locking the table, or otherwise handling sparse dimensions gracefully.
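
    A tiny sqlite3 sketch (my illustration, not from the article) of the two shapes being contrasted: a wide table where most columns are NULL for most rows, versus the "thin vertical strips" that store only the attributes a row actually has:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Wide, sparse shape: every optional attribute is a column, mostly NULL.
    conn.execute("CREATE TABLE person_wide (id INTEGER PRIMARY KEY, name TEXT, "
                 "fax TEXT, pager TEXT, ham_radio_callsign TEXT)")
    conn.execute("INSERT INTO person_wide (id, name) VALUES (1, 'Ada')")   # the rest stay NULL

    # Thin vertical strip: one row per attribute that actually exists.
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE person_attr (person_id INTEGER, attr TEXT, value TEXT)")
    conn.execute("INSERT INTO person VALUES (1, 'Ada')")
    conn.execute("INSERT INTO person_attr VALUES (1, 'ham_radio_callsign', 'W1AW')")

    rows = conn.execute("SELECT p.name, a.attr, a.value FROM person p "
                        "JOIN person_attr a ON a.person_id = p.id").fetchall()
    print(rows)   # [('Ada', 'ham_radio_callsign', 'W1AW')]
    ```

    The article's point is that neither contortion should be strictly necessary; an implementation that handles sparse columns and online schema changes gracefully would let the plain wide schema work fine.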

  • Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading: We believe what makes multithreading hard is rather quantitative: multithreaded programs have too many schedules. The number of schedules for each input is already enormous because the parallel threads may interleave in many ways, depending on such factors as hardware timing and operating system scheduling. Aggregated over all inputs, the number is even greater. Finding a few schedules that trigger concurrency errors out of all enormously many schedules (so developers can prevent them) is like finding needles in a haystack. Although Deterministic Multi-Threading reduces schedules for each input, it may map each input to a different schedule, so the total set of schedules for all inputs remains enormous. < Good discussion at Lambda the Ultimate. Erlang is not an Actor language?
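
    The "too many schedules" claim is easy to make quantitative: k threads of n sequential steps each can interleave in (kn)!/(n!)^k ways. A quick calculation (my illustration, not from the paper):

    ```python
    from math import factorial

    def num_schedules(threads, steps_per_thread):
        """Distinct interleavings of `threads` threads with `steps_per_thread`
        sequential steps each: (k*n)! / (n!)**k."""
        return factorial(threads * steps_per_thread) // factorial(steps_per_thread) ** threads

    for steps in (5, 10, 20):
        print(f"2 threads x {steps:2d} steps: {num_schedules(2, steps):,} schedules")
    # 2 threads x  5 steps: 252 schedules
    # 2 threads x 10 steps: 184,756 schedules
    # 2 threads x 20 steps: 137,846,528,820 schedules
    ```

    Finding the handful of those schedules that trigger a race is the needle-in-a-haystack problem the paper describes.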