Stuff The Internet Says On Scalability For September 5th, 2014

Hey, it's HighScalability time:


Telephone Tower, late 1880s, 5000 telephone lines. Switching FTW.

  • 1.3 trillion: rows in a SQL Server table; 100,000: galaxies in the Laniakea supercluster.
  • Quotable Quotes:
    • @pbailis: OLAP: data at rest, queries in motion. Stream processing: data in motion, queries at rest. PSoup: data in motion, queries in motion.
    • @ronpepsi: Scaling rule: addressing one bottleneck always starts the clock ticking on another one. (The same goes for weak links in chains.)
    • @utahkay: Our mental models are deterministic, and break down when you reach high utilization in a stochastic system. 

  • Instagram introduced Hyperlapse, their answer to a world that doesn't move fast enough already. And here's the story of how they did it: The Technology behind Hyperlapse from Instagram. It combines time travel and psychedelics; I think you'll enjoy it.

  • Etsy CEO to Businesses: If Net Neutrality Perishes, We Will Too. The idea of being a common carrier is old, deep, and powerful. It creates markets that grow rather than monopolies that choke economies to death. Ferries were required to be common carriers, that is, they had to carry all people and goods at the same price; otherwise communities would not survive. AT&T became a monopoly on the promise of universal service and becoming a common carrier for all. The Internet is a more important version of the same idea.

  • To make lots and lots of money you need to hitch your star to a fast growing something. Google placed ads on an exponentially expanding inventory of 3rd party web content. Winner. Now Google is exploiting another phenomenon experiencing an exponential growth curve: data. This time they aren't placing ads, they are calculating functions with BigQuery. Put On Your Streaming Shoes is a story showing just why and how this jump to another fast growing something will likely succeed.

  • Just an incredible look into the structure behind PhotoGate. Notes on the Celebrity Data Theft. These aren't just script kiddies. These are sophisticated and organized groups. Are hacker networks the new roving band of Vikings looking to rape and pillage? Though it would help if the villages were better protected.

  • It's JavaScript all the way down. The MEAN stack -- MongoDB, Express, AngularJS, and Node.js -- is available on Google Compute Engine.

  • Donald MacKenzie on high-frequency trading. Latency is the new geography. Data is the new spy network. Algorithms are the new order of battle. Loss is the new battle wound. Risk has been abstracted.

  • After a drink or two talk often turns to a game of dueling war stories. Here Chip Overclock shares a few of his stories. One involves loose solder. Another is a "rm -rf" story that is also in my deck of war stories. And some other good ones. What's yours?

  • If during a physics course you ever wondered why we can't power devices from all the RF energy flying through the air: they're working on it. You may have also wondered why we can't power devices from differences in temperatures. They're working on that too. Powering Wireless Sensor Nodes with Ambient Temperature Changes. The smaller and more power-efficient electrical components become, the more we can run on these "free" energy sources.

  • Cool interactive visual explanation of a great evil: Gridlock vs. Bottlenecks: Gridlock occurs when a queue from one bottleneck makes a new bottleneck in another direction, and so on in a vicious cycle.
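
    A toy sketch of that vicious cycle, assuming a deliberately crude model of two intersections that each need free space at the other to drain (the capacities and arrival rate are invented for illustration):

    ```python
    # Two intersections feed each other in a loop: a car leaving one queue
    # must enter the other, so once both queues are full neither can
    # discharge. That mutual blocking is gridlock.
    CAPACITY = 10

    def step(queues, arrivals=1):
        """One time step: try to advance one car from each queue into the next."""
        moved = 0
        names = list(queues)
        for i, name in enumerate(names):
            downstream = names[(i + 1) % len(names)]
            if queues[name] > 0 and queues[downstream] < CAPACITY:
                queues[name] -= 1
                queues[downstream] += 1
                moved += 1
        # New cars keep arriving whether or not anything is moving.
        for name in names:
            queues[name] = min(CAPACITY, queues[name] + arrivals)
        return moved

    queues = {"north_south": 3, "east_west": 3}
    for t in range(20):
        moved = step(queues)
        print(t, queues, "moved:", moved)
        if moved == 0:
            print("Gridlock: every queue is waiting on a full neighbor.")
            break
    ```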

  • Push our limits - reliability testing at Twitter. In staging they load test, duplicate requests against old and new systems and compare results, and send production traffic to a new service. In production they use canary requests and stress testing. Twitter has learned testing the entire creation path doesn't work as well as they thought: Many of our services have unique architectures that make load testing complicated. We had to focus on prioritizing our efforts, review the main call paths, then design and cover the major scenarios.
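
    A bare-bones sketch of the duplicate-and-compare (shadowing) idea from the staging list above; the service URLs and the JSON comparison are assumptions for illustration, not Twitter's actual tooling:

    ```python
    # Send the same request to the old and new service, diff the responses,
    # and keep serving from the old system. Real tooling would also sample
    # traffic, strip non-deterministic fields, and track mismatch rates.
    import json
    import urllib.request

    OLD = "http://old-service.internal"   # hypothetical endpoint
    NEW = "http://new-service.internal"   # hypothetical endpoint

    def fetch(base, path):
        with urllib.request.urlopen(base + path, timeout=2) as resp:
            return resp.status, json.loads(resp.read())

    def shadow_compare(path):
        old_status, old_body = fetch(OLD, path)
        new_status, new_body = fetch(NEW, path)
        if (old_status, old_body) != (new_status, new_body):
            # In production you would record the diff asynchronously, not block.
            print(f"MISMATCH on {path}: {old_status} vs {new_status}")
        return old_status, old_body   # the old system stays authoritative

    # shadow_compare("/timeline?user_id=12345")
    ```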

  • We don't like it when a company only pursues profit. How about when a company endlessly pursues market share over profit? Does the free market still work? Why Amazon Has No Profits (And Why It Works): But while he [Bezos] certainly does seem to be having fun, he is also building a company, with all the cash he can get his hands on, to capture a larger and larger share of the future of commerce. When you buy Amazon stock (the main currency with which Amazon employees are paid, incidentally), you are buying a bet that he can convert a huge portion of all commerce to flow through the Amazon machine. The question to ask isn’t whether Amazon is some profitless Ponzi scheme, but whether you believe Bezos can capture the future. That, and how long are you willing to wait?

  • Don't forget unikernels is the message in Containers vs Hypervisors: The Battle Has Just Begun. A VM in under 1MB. If you want lightweight, that's lightweight. 

  • Awesomesauce. How L1 and L2 CPU caches work, and why they’re an essential part of modern chips.

  • HAProxy Is Still An Arrow in the Quiver for Those Scaling Apps. Good examples of how HAProxy is used at GitHub, Airbnb, Stack Exchange, and Instagram. "It just works at scale" seems to be the common theme.

  • Humans and machines don't do as well in image classification when an image is complex and has no clear focus. Humans are slightly better classifiers overall. Machines don't do as well when an object is distorted or abstracted in some way, but machines are much faster than humans. What I learned from competing against a ConvNet on ImageNet: It is clear that humans will soon only be able to outperform state of the art image classification models by use of significant effort, expertise, and time.

  • Gene Kim with Twitter notes on Flowcon Day 1. Some thoughts are more amenable to tweets, but it's a fun read.

  • Question: Could moneyball ever select moneyball as the proper way of creating a team? Extreme Moneyball: The Houston Astros Go All In on Data Analysis makes me wonder how we go about creating axioms.

  • This looks like it will be a great series. Oodles of code examples and explanations. The Science of Crawl (Part 1): Deduplication of Web Content. 20% of the web is duplicate content. To dedupe run content through a funnel. Remove duplicate URLs with a bloom filter. DUST off similar looking URLs. Content must be cleaned of HTML before bloom filters are applied to identify duplicates. A locality sensitive hash is used to find text that is similar. Simhash is explored in loving detail. "The process is relatively fast - a series of bloom filters, string cleaners, bit hashes and, finally, a logarithmic time lookup. This funnel does not catch every duplicate page, but, on average, it cleans our corpus of web documents and ultimately provides more diverse search results to users."
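
    A compressed sketch of two stages of that funnel: a bloom filter to drop already-seen URLs and a Charikar-style simhash whose Hamming distance flags near-duplicate text. Sizes, hash functions, and thresholds here are illustrative, not the article's exact choices:

    ```python
    # Stage 1: a tiny bloom filter to skip URLs we've already crawled.
    # Stage 2: a 64-bit simhash so near-identical documents produce
    # fingerprints that differ in only a few bits.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = 0  # a big int used as a bit array

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def __contains__(self, item):
            return all(self.bits >> pos & 1 for pos in self._positions(item))

    def simhash(text, bits=64):
        """Weighted bitwise vote over token hashes (Charikar-style)."""
        counts = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    seen_urls = BloomFilter()
    seen_urls.add("http://example.com/a")
    print("http://example.com/a" in seen_urls)  # True: no false negatives
    print("http://example.com/b" in seen_urls)  # almost certainly False

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumped over the lazy dog"
    # A low distance (vs ~32 bits for unrelated docs) signals near-duplicates.
    print(hamming(simhash(doc1), simhash(doc2)))
    ```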

  • Good discussion on Quora: For high scalability of websites using backend databases, is it better to use sharding or do full replication of database? Is there any practical benefit of using full replication? Yes and yes.
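
    The "yes and yes" usually cashes out as both at once: shard writes by key so no single primary holds everything, and replicate each shard so reads can fan out and a node failure isn't fatal. A minimal routing sketch; the topology and modulo scheme are invented for illustration:

    ```python
    # Route writes to a shard chosen by key hash; route reads to any
    # replica of that shard. Purely illustrative topology.
    import hashlib
    import random

    SHARDS = {
        0: {"primary": "db0-primary", "replicas": ["db0-r1", "db0-r2"]},
        1: {"primary": "db1-primary", "replicas": ["db1-r1", "db1-r2"]},
        2: {"primary": "db2-primary", "replicas": ["db2-r1", "db2-r2"]},
    }

    def shard_for(key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(SHARDS)

    def write_target(key):
        return SHARDS[shard_for(key)]["primary"]

    def read_target(key):
        shard = SHARDS[shard_for(key)]
        # Reads can go to the primary or any (possibly stale) replica.
        return random.choice([shard["primary"], *shard["replicas"]])

    print(write_target("user:42"))  # always the same primary for this key
    print(read_target("user:42"))   # any replica of that key's shard
    ```

    In practice a consistent hash ring or a lookup table usually replaces the bare modulo so shards can be added without remapping most keys.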

  • Pflua: a high-performance network packet filtering library written in Lua. It supports filters written in pflang, the filter language of the popular tcpdump tool.

  • When to use cloud platforms vs. dedicated servers. A nice introduction and overview. Something a newbie could really benefit from. 

  • Wondering what broadband options are available at a specific address? Here's the National Broadband Map, a search engine you can use to find out.

  • An Overview of Linux Kernel Lock Improvements. The jump to 240-core/16-socket systems will introduce significant lock and cache line contention inside the kernel. Minimizing cache line contention within Linux kernel locking primitives is the subject of the slide deck. Way too much detail to reproduce here, but it's interesting reading.

  • Crate: A shared-nothing, fully searchable, document-oriented cluster data store: a quick and powerful massively scalable backend for data intensive apps or analytics.

  • In an age when our Heartbleeds. Bulletproof SSL and TLS. Looks like a comprehensive and practical book on a nearly impossible to understand subject. But no Kindle version?

  • Detecting Near-Duplicates for Web Crawling: In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar’s fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k.
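
    The expensive part is the query: given a new fingerprint, find every stored f-bit fingerprint within Hamming distance k without scanning them all (the paper does this with permuted, sorted fingerprint tables). The naive baseline it improves on looks like this, with f=64 and k=3 picked arbitrarily:

    ```python
    # Naive O(n) scan for fingerprints within Hamming distance k of a query.
    # The paper replaces this with permuted, sorted fingerprint tables that
    # answer the same question in roughly logarithmic time per table.
    F_BITS = 64   # fingerprint width (the paper's "f")
    K = 3         # maximum differing bit-positions (the paper's "k")

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def near_duplicates(query, fingerprints, k=K):
        return [fp for fp in fingerprints if hamming(query, fp) <= k]

    corpus = [0x8F1D3A4B0C2E5F61, 0x8F1D3A4B0C2E5F71, 0x123456789ABCDEF0]
    query = 0x8F1D3A4B0C2E5F61
    print([hex(fp) for fp in near_duplicates(query, corpus)])  # first two match
    ```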

  • Sibyl: A System for Large Scale Machine Learning at Google: Large scale machine learning is playing an increasingly important role in improving the quality and monetization of Internet properties. A small number of techniques, such as regression, have proven to be widely applicable across Internet properties and applications. Sibyl is a research project that implements these primitives at scale and is widely used within Google. In this talk I will outline Sibyl and the requirements that it places on Google's computing infrastructure.

  • Greg Linden with more quick links.