Stuff The Internet Says On Scalability For January 13, 2012

With a name like HighScalability... it has to be good:

  • Facebook: 1 Billion Users?; Internet Archive: 500,000 users/day, 6PB of data, 150 billion pages, 1000 queries a second; 6,180: The number of patents granted to IBM in 2011; 676: The number of patents granted to Apple in 2011; Live TV is Dead; Kickstarter: 10,000 successfully funded projects; $82bn: Apple's cash hoard; 100 Billion Planets: Our home sweet galaxy; Creative: 100-core system-on-a-chip; 15 million: Lines of code in the Linux kernel; According to Twitter: Justin Bieber > President Obama.
  • Quotable quotes:
    • @florind : I just realized that the Santa story is a classical scalability myth.
    • @juokaz : doesn't always use dating sites, but when he does, he finds out about them on High Scalability http://bit.ly/xYfBmq. True story
    • @niclashulting : "The Yahoo! homepage is updated 45,000 times every five minutes." A content strategy is vital.
  • Google’s Data Center Engineer Shares Secrets of ‘Warehouse’ Computing. Cade Metz interviews Google's guru of the datacenter, Luiz André Barroso: delivering good Internet service requires designing the software and hardware of the entire datacenter to work together as one computer; split application pieces across an array of computers; modesty is key, select modest machines with modest processors and spread applications as thin as possible; wimpy cores won't process work fast enough to be useful, that's too thin.
  • Hackerspace Global Grid. You know the nation state is in trouble when hackers are proposing to create their own space program and satellite network.
  • Brewster Kahle - Universal Access to All Knowledge - The Long Now. The Internet Archive snapshots one copy of every web page every 2 months. In January 1996 you could go to AltaVista and literally look at the web, it was 30 million pages, about the size of two coke machines. 
  • MongoDB on AWS. Miles Ward has written an excellent white paper focussing on MongoDB on AWS, but many of the ideas will work for other systems as well. Also, Using AWS for Disaster Recovery. Also also, SQL Server High Availability and Disaster Recovery Basics Webcast.
  • LinkedIn Engineering with a Recap: Improving Hadoop Performance by (up to) 1000x, a presentation given by the ever insightful Daniel Abadi on how to make Hadoop perform: use a faster storage layer and a specialized graph processing layer.
  • Need long lasting persistence? Is 2000 years enough? Take a look at the Rosetta Project. They've figured out how to micro-etch text onto a nickel disk using a 10 micron wide excimer beam. It's not digital, but human readable text, as long as you have a 500x microscope hanging around.
  • Your Ideal Performance: Consistency Tradeoff. Paul Cannon with a clear explanation of different tuning strategies for N (replication), W (write nodes), R (read nodes). Priority = no data loss: N=5, W=3, R=3. Since W+R > N, any node set chosen for reading will always intersect with any node set chosen for writing, and so Abby’s data is guaranteed to be consistent, even if she loses up to two nodes within a replication set. Priority = speed + low cost: N=W=R=1. Priority = many nodes + fast consistency + high consistency: N=3 and R=W=1. Sound impossible? Read up on the magic of Probabilistically Bounded Staleness.
  • Notes on the Oracle Big Data Appliance. Curt Monash puts hand to forehead and has a vision: $450,000 for 18 12-core servers, plus $54,000/year maintenance; uses Cloudera Enterprise; 1 core/4 GB RAM/3 TB raw disk. Also, Splunk update. Also also, Big data terminology and positioning.
  • Mark Atwood - A Modest Proposal for a heretical Key Value Store. Proposes a new KV store that works on real hardware, uses fast disk streaming IO, and can use random writes. Such a system would have no networking,  no REST, no JSON. It should be implemented in the kernel with about six system calls, and a buffer mediated API. It should have simple string based hierarchical name spaces.  It would store binary objects. It would have some simple access control and have mutable objects. I must admit I did not get the joke until it was explained. My cheeks are red with shame. 
  • Jeff Darcy on Scaling Filesystems vs. Other Things. I'm not exactly sure what this thread is about, but it was interesting. It started off with "Block devices are the wrong place to scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster)" and went somewhere from there.
  • We need a more efficient SQS. Everything is fine with HTTP until usage-based billing makes you question your most basic assumptions as a programmer: HTTP everywhere may not be such a good idea. The HTTP overhead for SQS is 4x, which increases bandwidth costs and decreases performance.
  • Facebook tells the gripping story of The Life of a Typeahead Query. A great architecture breakdown. "Many technical decisions in search boil down to a trilemma among performance, recall, and relevance. Our typeahead is perhaps unusual in our high priority on performance; spending more than 100 msec to retrieve a result will cause the typeahead to 'stutter,' leading to a bad user experience that is almost impossible to compensate for with quality."
  • Along the same lines, the Google bots aren't looking so interesting now either: Google Bot Is Still the enemy. "We really appreciate that the Crawl team helps us prove that we could serve 8.6M users a day with our product (and against unique page requests, no less), but it would be nice if they could do it once a month rather than once a day."
  • How to reliably achieve unique constraints with Cassandra? A good thread showing the difficulty of doing what an RDBMS can do without breaking a relation.
  • Lock Free Skip Tree. Provides better cache access patterns than a skiplist. Which was found via Heap fragmentation in region server. Which was found via What’s new in Cassandra 1.0: Performance. Cassandra dramatically improved write performance by using arena allocation for memtables, which basically means preallocating a lot of memory up front.
  • A complete set of notes from the Stanford Machine Learning class.
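
The W+R > N rule from the consistency tradeoff item above can be checked by brute force. The sketch below is only an illustration, not code from Paul Cannon's post: the function names `quorums_always_overlap` and `tolerated_failures` are made up here, and replicas are modeled as plain integers. It enumerates every possible write set and read set and confirms they intersect exactly when W+R > N, using the three configurations quoted in the item.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be down while reads and writes can both still reach enough nodes."""
    return n - max(w, r)


# The three configurations mentioned in the item: (N, W, R)
for n, w, r in [(5, 3, 3), (1, 1, 1), (3, 1, 1)]:
    print(
        f"N={n} W={w} R={r}: "
        f"always overlap={quorums_always_overlap(n, w, r)}, "
        f"W+R>N={w + r > n}, "
        f"tolerated failures={tolerated_failures(n, w, r)}"
    )
```

With N=5, W=3, R=3 every read is guaranteed to see at least one replica that took the latest write, and two replicas can be lost. With N=3, R=W=1 the sets need not overlap, which is where probabilistic staleness bounds come in.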