Stuff The Internet Says On Scalability For January 18, 2013

Hey, it's HighScalability time:

  • 1 trillion nodes : The Near Future; 1 trillion connections : Facebook Now; 1 billion celestial objects observed : Gaia mission
  • Quotable Quotes:
    • Van Jacobson : IP started as an overlay on the phone system; today the phone system is an overlay on IP.
    • @MarkDurbin104 : Unit of Logic: a Fathom?
    • @somic : virtual infra with API is a cloud as much as a bunch of shell scripts are infra as code
    • @xaprb : I'm going to settle the argument about linear scalability once and for all. Pianos are linearly scalable. Fish are not. End of story.
    • @gigastacey : Facebook's cold storage is 1 exabyte per room with 1.5MW per room power requirement with no redundant power #ocpsummit
    • dgb75 : PHP, not my first choice but the right choice
    • @iaboyeji : Scalability is a rich man's problem
  • Joe Stump on Sprintly on the impact of reducing ORM overhead: On a primed cache, your @sprintly experience should be 100x faster now. On an unprimed cache a mere 10-15x faster.
  • Not a lot of technological detail on Facebook's new Graph Search, but here's the broad story of how it came about. Facebook has a lot of structured data about people, so search needed to take advantage of that. In 2011 they decided to build a unified search. A quick prototype proved the concept. Next they built a substring parser that could generate and rank all the potential page titles matching a query. To answer queries, with privacy filters applied, they leveraged an already existing search engine within Facebook called Unicorn. What's missing is an index of all posts and comments shared on Facebook.
  • A great list of Papers that have influenced the design of Akka. Actors have a long history of working in practice. STM? Not so much.
  • I didn't know Wolfram Alpha would give travel times for light in fiber and in vacuum. Cool.
  • Lessons learned from grid computing: Moving data is expensive. Good look at the issues--bulk inserts and transfers, replication, backup, RDBMS loading, sync vs async: Grid computing is a good fit when data is to live in memory for a long time and processed several times. Otherwise, take a look at streaming platforms like Twitter Storm.
  • Configuration is good until you actually have to tune a complex mix of dependent parameters. Peter Bailis shows how to use PBS (Probabilistically Bounded Staleness) predictions with Cassandra to: profile your existing Cassandra cluster and determine which configuration of N, R, and W are the best fit for your application, expressed quantitatively in terms of latency, consistency, and durability.
  • Lloyd Hilaiel with a neat idea to handle DoS attacks in Node.js by sending a 503 HTTP return code if the server is too busy. An internal ping through the event loop is used to measure event handling delay; once the delay exceeds a threshold, the server is considered too busy. A little crude, as you may actually want to kill off low priority work in order to handle a new request, but it is simple and easy to use. Commenters on Hacker News also bring up the idea of using a reverse proxy to shed load from the Node.js server.
  • Introduction to Algorithms from MIT: This course provides an introduction to mathematical modeling of computational problems. It covers the common algorithms, algorithmic paradigms, and data structures used to solve these problems.
  • If you don't like putting all your digital eggs in one AWS basket then take a look at Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies: find datacenters near Amazon, parallelization, find alternatives to AWS services, and explore private clouds. Also, Deploying and Scaling Stackato Private PaaS on Open Cloud System.
  • Nginx Load Balancing Basics. Simple to set up, stable, flexible, good SSL support, and supports SPDY.
  • Running programs inside other containers is always tricky: Coprocessor / threading model for HBase. Also, Maximizing throughput by looking at configuration options, measurement methods, network IO, tuning GC, and bringing the number of cores in line with the number of disks.
  • Surge 2011 ~ Closing Plenary ~ Theo Schlossnagle - fun and fatal discussion of disaster porn on Reddit.
  • Curt Monash with good posts on database options Tokutek and NuoDB.
  • Maybe the revolution won't be televised...in hidef...over the Internet if, as Dan Rayburn predicts, Streaming Video Can’t Scale At Cable TV Quality, Will Never Replace Traditional TV Distribution: Streaming media has limitations and that’s not a bad thing, you simply have to apply it to the best set of applications as you would any other technology. But many are hell-bent on the concept that one technology has to replace another, when in fact, most times, one complements the other. Streaming media is never going to be as reliable, scalable or as high-quality as cable TV, even in the future.
  • Sounds vaguely religious, but here's an AWS Advent 2012 Recap with lots of good materials available.
  • If you are not a reader of Nat Torkington's Four short links, you should be.
  • In the unexpected department, Walmart Labs has a lot of open source code on GitHub. Includes code related to Node, monitoring, Scala, MapReduce, JavaScript, benchmarking, and something called faketoe, which I just kind of like as a name for an XML to JSON converter.
  • Need a Twitter fix? The Twitter stack and Scaling Scalability: Evolving Twitter Analytics.
  • How to avoid relying on GitHub: mirror your repository. Dean Clatworthy nicely explains the magic incantations needed to make git push to two different places so a GitHub failure is not a failure of all. Good discussion of the nature of hosting and other related topics on Reddit.
  • Key to running computations securely on any machine: Multi-Party Computation: From Theory to Practice: it allows, in theory, a set of parties to compute any function on their secret input without revealing anything bar the output of the function.
  • Hadoop Storage: JBOD vs RAID-0: Hadoop prefers a set of separate disks to the same set managed as a RAID-0 disk array. Read speeds are particularly important to the performance of a Hadoop cluster.
  • Good article by Stacey Higginbotham on the implications of Facebook's Open Compute project: Open Compute has managed to give customers — from financial services firms to web properties — a platform on which to build custom and modular servers.
  • The Economics of Long-Term Digital Storage: A Blue Ribbon panel described economic sustainability as the major issue facing long term digital preservation. This is despite Kryder’s Law, the 30-year history of the cost of digital storage media dropping exponentially.
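
For the item above on travel times for light in fiber and in vacuum, here is a rough back-of-the-envelope sketch of the arithmetic. The refractive index of 1.468 is an assumed typical value for silica fiber, and the 5,570 km distance is just a hypothetical example; neither comes from Wolfram Alpha or the original post.

```python
# Speed of light in vacuum, exact by SI definition.
C_VACUUM_M_PER_S = 299_792_458

# Assumption: a typical refractive index for silica fiber; real cables vary.
FIBER_REFRACTIVE_INDEX = 1.468


def one_way_delay_ms(distance_km: float, refractive_index: float = 1.0) -> float:
    """One-way propagation delay in milliseconds over a straight path."""
    speed_m_per_s = C_VACUUM_M_PER_S / refractive_index
    return distance_km * 1000 / speed_m_per_s * 1000


# Hypothetical distance, roughly a transatlantic hop.
distance_km = 5570
vacuum_ms = one_way_delay_ms(distance_km)
fiber_ms = one_way_delay_ms(distance_km, FIBER_REFRACTIVE_INDEX)
print(f"vacuum: {vacuum_ms:.1f} ms, fiber: {fiber_ms:.1f} ms")
```

Real routes are longer than the straight-line distance and add switching delay, so treat the result as a lower bound.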
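
To make the N, R, and W trade-off from the PBS item above a bit more concrete, here is a minimal sketch of the simplest possible staleness estimate. It is not Peter Bailis's tool or Cassandra's API. It assumes a read picks R of the N replicas uniformly at random immediately after a write has reached exactly W replicas, and it ignores the latency distributions and anti-entropy that the real PBS predictions account for. The function name is hypothetical.

```python
from math import comb


def stale_read_probability(n: int, r: int, w: int) -> float:
    """Chance that a read of r replicas misses all w replicas holding the
    latest write, under the simplified model described above."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    # Ways to pick r replicas from the n - w stale ones, over all ways to
    # pick r replicas. comb() returns 0 when r > n - w, i.e. when R + W > N.
    return comb(n - w, r) / comb(n, r)


for n, r, w in [(3, 1, 1), (3, 1, 2), (3, 2, 2)]:
    p = stale_read_probability(n, r, w)
    print(f"N={n} R={r} W={w}: P(stale) = {p:.3f}")
```

When R + W > N the read and write sets always overlap, so the estimate drops to zero; smaller quorums trade that guarantee for lower latency.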
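
The "too busy" idea from the Node.js item above can be sketched in a few lines. This version uses Python's asyncio instead of Node.js, and it is not Lloyd Hilaiel's code: the class name, the 50 ms ping interval, and the 70 ms lag threshold are all assumptions chosen for illustration.

```python
import asyncio
import time


class LoopLagMonitor:
    """Measures how late a periodic timer fires on the event loop."""

    def __init__(self, interval_s: float = 0.05, max_lag_s: float = 0.07):
        self.interval_s = interval_s  # how often to ping the loop (assumed)
        self.max_lag_s = max_lag_s    # lag above this means "too busy" (assumed)
        self.lag_s = 0.0              # most recent measured lag

    async def run(self) -> None:
        """Background task: sleep, then record how much we overslept."""
        while True:
            start = time.monotonic()
            await asyncio.sleep(self.interval_s)
            elapsed = time.monotonic() - start
            self.lag_s = max(0.0, elapsed - self.interval_s)

    def too_busy(self) -> bool:
        return self.lag_s > self.max_lag_s


def status_for(monitor: LoopLagMonitor) -> int:
    """Shed load with a 503 when the loop is lagging, otherwise serve."""
    return 503 if monitor.too_busy() else 200
```

In a server you would start `monitor.run()` once with `asyncio.create_task` and call `status_for(monitor)` at the top of each request handler, before doing any expensive work.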