
Stuff The Internet Says On Scalability For April 6, 2012

It's HighScalability Time:

  • Exascale Supercomputer: how IBM plans to understand data from a universe of light;  905 Billion Objects and 650,000 Requests/Second: S3; 64-cores: PostgreSQL shows linear read scalability;
  • Quotable quotes:
    • pkaler: Programming is hard. Scaling is harder.
    • @crucially: As far as I can tell, openstack is what happens when ops people write code. 
    • @DEVOPS_BORAT: Goal of sysadmin is replace itself with small shell script. Goal of devops is replace itself with small REST API.
    • @fowlduck: ec2, where dynamic scalability means them running out of instances :(
    • hcarvalhoalves: You know what is amazing? Is that as soon you hit bigger or more general problems, you always face the compromise of "trading X resource for accuracy". Which leads me to believe that software, so far, has only been deterministic by pure accident.
    • Kyle Lemmons: Clearly Go is a superior weapon if the goal is to shoot everyone in the foot at the same time. The GIL in python forces you to shoot each person in the foot in sequence :).
    • Geva Perry: In the Game of Clouds, You Win Or You Die: CloudStack
  • Exclusive: a behind-the-scenes look at Facebook release engineering. Ryan Paul with a fascinating blow by blow of a Facebook software release: Facebook's entire code base is compiled down to a single 1.5GB binary executable; the size of the binary explains why BitTorrent is used to distribute updates; a minor update is distributed every single business day; major updates go out once a week on Tuesday afternoon; IRC is used to coordinate releases between developers; Facebook operates at nearly full capacity during an update; reverting a release is only done as a last resort; Facebook gamifies the process by assigning developers karma scores to know when code being merged in is at higher risk; a new HipHop virtual machine promises to reduce release times as only thin bytecode deltas need to be shipped; Facebook employees see an experimental build of the site based on the very latest code; Facebook tracks positive and negative tweets about Facebook. On HackerNews
  • MongoDb Architecture. Ricky Ho with an epic look at the finer details of how MongoDb web scales. Covers: Major difference from RDBMS, Query processing, Storage Model, Data update and Transaction, Replication Model, Sharding Model, Map/Reduce Execution. Conclusion is MongoDb is very powerful and easy to use. 
  • Paul Graham on Y Combinator scalability: Our whole approach to scaling Y Combinator is the standard approach to scaling software. 1) You can’t predict in advance where the bottlenecks will be so you just keep going until you hit the next one and 2) You can always scale a lot more than you originally predicted. 
  • The Total Cost of (Non) Ownership of a NoSQL Database Cloud Service. Not surprisingly, since this is an Amazon paper, DynamoDB comes out the winner in a TCO competition among DynamoDB, an on-premise database, and a database hosted on EC2/EBS. The key differentiators are the administration costs and the cost of replicating the 3x redundancy provided by DynamoDB. YMMV a lot on these estimates. Also, My Three Gripes about DynamoDB
  • The Architecture of One High-Performance Back-End for eCommerce Site.  Ilya Katsov with an epic description of a: pretty typical eCommerce service for hierarchical and faceted navigation built on top of Oracle Coherence and designed our own data structures and indexes. Covers the following patterns: Homogeneous Cluster Nodes, Maintenance Node, Data Loading Pipeline, Replicated Custom Index, Probabilistic Test.
  • Money saving tip: If you have large files, Eric Hammond reminds us, S3 can be used to seed torrents.
  • Exclusive: Google, Amazon, and Microsoft Swarm China for Network Gear. Delightfully detailed look at the inner workings of building out networking for a datacenter. Like commodity hardware for servers, we are seeing the same process happening in networking gear, with custom gear made  in Asia. It makes sense, with lots of good systems designers out there and cheap contract manufacturing, why not build your own? Inspect all you want, security makes me nervous. It's very easy to slip backdoors into any and every part of the hardware/software stack.
  • Another way to push code closer to the data is to actually host applications in networking equipment. Arista claims: We are bringing a 5-10 times improvement in the system-level latency of a financial trading environment, reducing it from 5-10 microseconds to sub microsecond for trade executions. 
  • Building real-time feed updates for NewsBlur with Redis and WebSockets. A simple and straightforward explanation: it’s effectively a small circle composed of subscribers and publishers, using Redis to maintain pubsub connections between the many clients and their many feeds.
  • Common sense solutions to calculating scores and average ratings are wrong. Evan Miller shows a correct solution in How Not To Sort By Average Rating. Good discussion On HackerNews.
  • The train as metaphor:  “The Tube, in a sense, generates its own traffic. As soon as you upgrade something, as soon as you put in another couple of trains per hour, you find that the capacity is taken up … The more you expand, the more people use it. As soon as you fix this congestion point, there’s another one along the line somewhere else to fix. So it’s a never ending task. What we have to sometimes do is deliver on the impossible.” Allan thinks this is like the "traditional bottleneck triangle we always talk about with computer systems (I/O and disk, CPU, and memory)."
  • Compression on client side vs server side. Good thread on where to compress data. Compressing on the client uses less server CPU and fewer network resources. The server can compress across a wider scope of data.
  • Multi-cloud IP address failover with Heartbeat and vCider. Configuring HA across your own datacenter to clouds like Rackspace and Amazon is tricky. This is a clear explanation of how to make it work.
  • Excellent background on OpenStack: The Secret History of OpenStack, the Free Cloud Software That’s Changing Everything. Essential intel if you are trying to make sense of all the OpenStack, CloudStack, Eucalyptus news churning up the social media sphere. Also, Some Brutally Honest Thoughts on Citrix’s Defection; Analysis: CloudStack Goes To Apache Foundation And Embraces AWS APIs; Citrix Joins Apache and Contributes CloudStack: Bold Move or Brash Decision?
  • Your Mouse is a Database. Interesting way to look at BigData: big data is not just about volume, but also about velocity and variety. If we draw a picture of the design space for big data along these three dimensions of volume, velocity, and variety, then we get the big-data cube shown in figure 1. Each of the eight corners of the cube corresponds to a (well-known) database technology.
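
On the average rating item above: the approach Evan Miller's article recommends is to sort by the lower bound of the Wilson score confidence interval rather than by the raw average. Here is a minimal sketch of that idea, assuming simple positive/total vote counts. The function name, the sample items, and the default z value of 1.96 (roughly 95% confidence) are illustrative choices, not taken from the article.

```python
import math


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a fraction of positive votes.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total == 0:
        return 0.0  # no votes yet: rank at the bottom
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


# Hypothetical items: name -> (positive votes, total votes)
items = {"a": (1, 1), "b": (90, 100), "c": (5, 10)}

# Sort best-first by the lower bound instead of the raw average.
ranked = sorted(items, key=lambda name: wilson_lower_bound(*items[name]), reverse=True)
print(ranked)
```

With a raw average, item "a" (one vote, 100% positive) would rank first. The lower bound penalizes small sample sizes, so the item with 90 of 100 positive votes comes out on top.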
