Stuff The Internet Says On Scalability For August 30, 2013

Hey, it's HighScalability time:


(Nerd Power: Paul Kasemir software engineer AND American Ninja Warrior)

  • Two billion documents, 30 terabytes: GitHub source code indexed
  • Quotable Quotes:
    • David Krakauer: We fail to make intelligent machines because engineering is about putting together stupid components to make smart objects. Evolution is about putting together smart components into intelligent aggregates. Your brain is like an ecosystem of organisms. It's not like a circuit of gates.
    • @spyced: At this point if you depend on EBS for critical services you're living in denial and I can't help you. 
    • @skilpat: TIL Friedrich Engels, not Leslie Lamport, invented logical clocks in a 1844 letter to Karl Marx
    • Dan Geer: Risk is a necessary consequence of dependence.
    • @postwait: OS Rule 1. The version of /usr/bin/X you want today will never be what your OS ships. Use /keep/your/shit/to/yourself/X instead.
    • @antirez: the moral is: "Hey, that's a completely harmless commit about error reporting! It will never affect behavior". hehe
    • Robert Scoble: My next-door neighbor was on the first iPhone team and he told me he almost killed himself working for Steve Jobs, because he demands so much from you. He did not take substandard performance, and he would keep you up, and he would call you on a Sunday when you’re having family time … and essentially randomize your whole life.
  • LOL. Interview with an Ex-Microsoftie: "Program Files", is that one of yours? —Sure is. I thought, "This is where we'll put the programs. On the file system. And since they're gonna be made out of files, I'll stick the word "Files" on the end. With a space in the middle. People are going to be typing this all the time, so they'll want it to be as long and descriptive as possible."

  • SSD RAID Load Testing Results from a Dell PowerEdge R720. Brent Ozar finds amazing results by switching to a bare metal SQL Server and cheap commodity-grade local solid state storage. Interesting findings from load testing SSDs in RAID arrays. Shows that, given the high licensing fees for SQL Server Enterprise, it's comparatively cheap to buy a very high-end hardware configuration.

  • Kyle Kingsbury has a great blog. Burn the Library is an uncontentious look at contention where we consider ontological considerations such as: “Simultaneous” is about causality, not clocks, and how a last-write-wins policy increases the entropy of the universe via data loss. The network is reliable is a wonderfully detailed look at the evidence for and sources of network partitions, concluding that network partitions are not apparitions; they are real, created by an astonishing array of forces, so unless you've designed your network to be rock solid, failure is an option you must design for.

  • HELib: implements a homomorphic encryption scheme. Unlike some earlier HE schemes, HELib uses a SIMD-like optimization known as ciphertext packing.

  • Myth: Select * is bad: The reason select * actually is bad—hence the reason the myth is very resistant—is because the star is just used as an allegory for “selecting everything without thinking about it”. This is the bad thing.

  • LinkedIn: Using set cover algorithm to optimize query latency for a large scale distributed graph: we have developed a distributed and partitioned graph system that scales to hundreds of millions of members and their connections and handles hundreds of thousands of queries per second. The system is composed of GraphDB, a network cache layer, and an API layer for client access. The problem solved is that the routing algorithm would route requests for two pieces of data that were actually collocated to different nodes.

  • "If you write a lock statement you’re doing something wrong" says Jim in Don't lock: I’ve found that the most effective way to write multithreading code is to avoid mutable data and use inter-thread communications protocols in much the same way that we use inter-process communications protocols when working with cooperating processes.

  • That Wibbly Wobbly Real-Timey Wimey stuff. Good discussion of how to handle dynamic real-time pub-sub based on tags. JSON-RPC servers are NodeJS. Events are published using Redis. Plan on moving to ZeroMQ.

  • EAI Patterns Series by @VaughnVernon: Request-Reply, Return Address, Envelope Wrapper, Content Enricher, Content Filter, Splitter, Content-Based Router, Routing Slip, Recipient List, Aggregator, Scatter-Gather, Resequencer, Claim Check, Message Expiration, Message Bus, Message Channel.

  • Good question: Nasdaq glitch: Why no backup? Please, someone bail them out so they can afford to do the right thing.

  • Security is hard. User Sessions, What Data Should Be Stored Where. Wow, a tricky attack on sessions that depends on sessions not being transactional. Verifying the user and setting the session do not happen in one transaction, and that gap can be used as an attack. This leads to a good idea of what should be stored in a session: user id, session id, temporary state. Though I'm not sure why user id is necessary if you have a session id.

  • An excellent Interview with the GitHub Elasticsearch Team: we have a good 40 to 50 search indexes, just for all the different things that we keep track of; It was decided a few months ago to move all of the calculations to elasticsearch using the histogram facets; We definitely have different clusters. The exception tracking lives in a total separate cluster; It’s only the head of the master branch that you’re going to get in there and still that’s a lot of data, two billion documents, 30 terabytes; we’re trying to move away from EC2 infrastructure, replacing 44 EC2 instances with the eight pieces of physical hardware.

  • Go After 2 Years in Production. Off to a good start. Performance: generally slower than Java, faster than Ruby, but fast enough, the bottleneck has always been the database; Memory: no problems; Concurrency: cheap and easy; Reliability: robust; Deployment: easy because it's a single compiled image; Talent: good quality people.

  • Isaac Asimov predicted this. Cliodynamics: (from Clio, the muse of history, and dynamics, the study of temporally varying processes) is the new transdisciplinary area of research at the intersection of historical macrosociology, economic history/cliometrics, mathematical modeling of long-term social processes, and the construction and analysis of historical databases. Mathematical approaches – modeling historical processes with differential equations or agent-based simulations; sophisticated statistical approaches to data analysis – are a key ingredient in the cliodynamic research program. But ultimately the aim is to discover general principles that explain the functioning and dynamics of actual historical societies.

  • The 'third era' of app development will be fast, simple, and compact: "We are looking to bring about applications that blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing of DSP via high bandwidth shared memory access with greater application performance at low power consumption."

  • On NoSQL. Interview with Rick Cattell: 2x2x2 Requirements for Scalability. The first 2x means that there are two different kinds of scalability: horizontal scaling over multiple servers, and vertical scaling for performance on a single server. The remaining 2×2 means that there are two key features needed to achieve the horizontal and vertical scaling, and for each of those, there are two additional things you have to do to make the features practical.

  • If you want to build low power hardware that has features nobody else has (read Moto X) then you'll need to go custom. How do you program all that? Here's a very good introduction to FPGA Programming for the Masses: FPGAs have programmable logic cells that could be used to implement an arbitrary logic function both spatially and temporally. FPGA designs implement the data and control path, thereby getting rid of the fetch and decode pipeline. The distributed on-chip memory provides much-needed bandwidth to satisfy the demands of concurrent logic. The inherent fine-grained architecture of FPGAs is very well suited for exploiting various forms of parallelism present in the application, ranging from bit-level to task-level parallelism. In addition to the conventional reconfiguration capability where the entire FPGA fabric is programmed with an image before execution, FPGAs are also capable of undergoing partial dynamic reconfiguration. This means part of the FPGA can be loaded with a new image while the rest of the FPGA is functional. This is similar to the concept of paging and virtual memory in the processor taxonomy.

  • Moore's Law Dead by 2022: "I don't expect to see another 3,500x increase in electronics -- maybe 50x in the next 30 years," unfortunately, "I don't think the world's going to give us a lot of extra money for 10 percent [annual] benefit increases."

  • Project Iris: Peer-to-peer messaging for backend decentralization. Completely decentralized. Secure against passive and active attacks. Beautiful, simple and language agnostic API.

  • Robert May with A Simple Introduction To Effective Caching in Ruby on Rails. Good, straightforward, actionable advice.

  • On Mechanical Sympathy. Lock-Based vs Lock-Free Concurrent Algorithms: While attending the session a couple of things occurred to me. Firstly, I thought it was about time I reviewed the current status of Java lock implementations. Secondly, that although StampedLock looks like a good addition to the JDK, it seems to miss the fact that lock-free algorithms are often a better solution to the multiple reader case.

  • If you are looking to ditch batch processing then be sure and read In-Stream Big Data Processing: This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.

  • Evernote on How to upgrade 50M+ user indexes to a new search engine without anybody noticing. By separating the what from the how. Each user's data lives in its own dedicated Lucene index, which means users can be moved independently without other users noticing. An interesting approach.

  • A Fast Candidate Generation for Real-Time Tweet Search with Bloom Filter Chains: We introduce Bloom filter chains, a novel extension of Bloom filters that can dynamically expand to efficiently represent an arbitrarily long and growing list of monotonically increasing integers with a constant false positive rate. Using a collection of Bloom filter chains, a novel approximate candidate generation algorithm called BWAND is able to perform both conjunctive and disjunctive retrieval.

  • BOA: The Bayesian Optimization Algorithm: In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed.

  • Cognitive Computing Programming Paradigm: A Corelet Language for Composing Networks of Neurosynaptic Cores: Marching along the DARPA SyNAPSE roadmap, IBM unveils a trilogy of innovations towards the TrueNorth cognitive computing system inspired by the brain’s function and efficiency. The sequential programming paradigm of the von Neumann architecture is wholly unsuited for TrueNorth. The programming paradigm consists of (a) an abstraction for a TrueNorth program, named Corelet, for representing a network of neurosynaptic cores that encapsulates all details except external inputs and outputs. “Traditional architecture is very sequential in nature, from memory to processor and back,” explained Dr. Dharmendra Modha in a recent Forbes article. “Our architecture is like a bunch of LEGO blocks with different features. Each corelet has a different function, then you compose them together.”

  • A Few Useful Things to Know about Machine Learning: This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.
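
The LinkedIn item above mentions using a set cover algorithm so that a query touches as few nodes as possible. For the curious, here is a minimal sketch of the classic greedy set cover approximation in Python. It is not LinkedIn's implementation; the node names, member ids, and the function name `greedy_set_cover` are all made up for illustration.

```python
def greedy_set_cover(needed, node_contents):
    """Pick nodes until every needed key is covered (greedy approximation)."""
    uncovered = set(needed)
    chosen = []
    while uncovered:
        # Take the node that serves the most still-uncovered keys.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(node_contents),
                   key=lambda n: len(node_contents[n] & uncovered))
        gained = node_contents[best] & uncovered
        if not gained:
            raise ValueError(f"no node holds: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical layout: which member ids each node can serve.
nodes = {
    "node-a": {"m1", "m2", "m3"},
    "node-b": {"m3", "m4"},
    "node-c": {"m4", "m5"},
    "node-d": {"m1", "m5"},
}
plan = greedy_set_cover({"m1", "m2", "m3", "m4", "m5"}, nodes)
print(plan)  # ['node-a', 'node-c']
```

Greedy does not always find the smallest cover, but it is simple and cheap, which matters when the choice sits on the query path.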
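
The "Don't lock" item above argues for avoiding mutable shared data and having threads talk through messages instead. A rough Python sketch of that style, assuming a single worker thread and the standard library's `queue.Queue` as the channel (the queue does its own synchronization internally, so the application code writes no lock statements). The names `worker` and `square_all` are invented for this example.

```python
import queue
import threading


def worker(inbox, outbox):
    # The worker owns no shared state; it only reads messages and replies.
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: no more work
            break
        outbox.put(msg * msg)


def square_all(numbers):
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for n in numbers:
        inbox.put(n)
    inbox.put(None)
    t.join()
    # One worker processes messages in order, so replies arrive in order.
    return [outbox.get() for _ in numbers]


print(square_all([1, 2, 3]))  # [1, 4, 9]
```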
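
The tweet search item above builds on Bloom filters. The paper's Bloom filter chains and BWAND are not reproduced here; this is only a sketch of the plain, fixed-size Bloom filter they extend, with arbitrary sizes (`num_bits`, `num_hashes`) and salted SHA-256 standing in for the hash functions. Those choices are assumptions for illustration, not values from the paper.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical use: remember which tweet ids contain some term.
seen = BloomFilter()
for tweet_id in (1001, 1002, 1003):
    seen.add(tweet_id)
print(1002 in seen)  # True (added items are never reported missing)
```

A membership test can return a false positive but never a false negative, which is why the structure suits candidate generation: anything it rejects can be skipped safely.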