Stuff The Internet Says On Scalability For December 16, 2011

A HighScalability is forever:

  • eBay: tens of millions of lines of code; Google code base change rate per month: 50%; Apple: 100 million downloads; Internet: 186 Gbps
  • Quotable quotes:
    • @OttmarAmann: Scalability is not as important as managing complexity
    • @amankapur91: Does scalability imply standardization, and then does standardization imply loss of innovation?
  • Spotify uses a P2P architecture and this paper, Spotify – Large Scale, Low Latency, P2P Music-on-Demand Streaming, describes it.
  • The Faving spam counter-measures. Ironically, deviantART relates a gripping story of how they detected and stopped a deviant user from attacking their servers with an automated faving script that faved every 10 seconds, 24 hours a day. The same spam filter they use on the rest of the site caught it. Problem solved. Would like some detail on their spam filter though.
  • Interesting Google Groups thread on the best practices for simulating transactions in Cassandra. Ah, the quest for atomic writes of multiple records is never-ending. Another good thread on message queue selection. And is $0.08/hr ($1.92/day, $59.52/month) a little expensive for a 600 MHz CPU instance with 128 MB of memory?
  • Translation Memory. Etsy has a problem. They deploy software continuously, which can be tricky, but is doable. What is far trickier is supporting multiple languages, because it implies translations must also be continuous. Usually translation is an arduous process that takes a schedule hit. Etsy's clever solution is to integrate translation into the deployment process using Lucene/Solr's MoreLikeThis feature, which suggests possible translations in real-time. Very nice.
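
The translation-memory idea can be sketched with simple token-overlap similarity: given strings already translated, suggest the closest existing translation for a new string. This is only a toy illustration of the "more like this" concept — the function names and scoring are invented here, not Etsy's actual Solr/MoreLikeThis setup.

```python
# Toy translation-memory lookup: suggest the translation of the most
# similar previously-translated string, using Jaccard token overlap.
# Illustrative sketch only; Etsy's real system uses Lucene/Solr's
# MoreLikeThis over a proper index.

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def suggest(memory, new_string, threshold=0.5):
    """memory: dict of {source_string: translation}."""
    best, score = None, 0.0
    for source, translation in memory.items():
        s = jaccard(source, new_string)
        if s > score:
            best, score = translation, s
    return best if score >= threshold else None

memory = {"Add item to cart": "Artikel in den Warenkorb legen"}
print(suggest(memory, "Add this item to cart"))  # close enough to reuse
```

In production the win is exactly what Etsy describes: the lookup happens at deploy time, so a near-match from the memory ships immediately instead of waiting on a human translation cycle.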
  • Facebook explains how their HipHop Virtual Machine dynamically translates PHP code into native machine code. Lots of good details, and it's well written.
  • Why wireless mesh networks won’t save us from censorship. Shaddi Hasan harshes the buzz on the utopian vision of a darknet freeing us from a SOPA/RIAA/everything tyranny. The reasons: Management is hard and expensive; Omni-directional antennas suck; Single-radio equipment doesn’t work; Multi-radio equipment is very expensive; Your RF tricks won’t help you here; Unplanned mesh networks break routing. My take: what can't be routed around must be crushed.
  • Faster, More Effective Flowgraph-based Malware Classification. Silvio Cesare presents good ways to bust bad software, in real-time, using graph algorithms. Using this system, over 30 previously unknown vulnerabilities were identified in Linux distributions. Looks cool.
  • Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks. René Pickhardt shows us how to make retrieval of social news feeds very efficient. By using graph databases (like Neo4j), it can dynamically retrieve more than 10,000 temporally ordered news feeds per second in social networks with millions of users, like Facebook and Twitter. His index is O(1) for reads and O(d) for writes. Great discussion of something a lot of people want to know how to do.
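
For context on what Graphity's index avoids, here's the naive baseline for building a top-k feed: a k-way heap merge of each friend's reverse-chronological activity stream on every read. The sketch below is just that textbook merge with made-up data — Graphity's contribution is an index structure that sidesteps this per-read work.

```python
import heapq
import itertools

# Naive top-k news feed: k-way merge of friends' activity streams,
# each already sorted newest-first by timestamp. Graphity's index
# exists to avoid paying this merge cost on every read.

def top_k_feed(streams, k):
    """streams: lists of (timestamp, item) pairs, each sorted newest-first."""
    merged = heapq.merge(*streams, key=lambda e: -e[0])
    return [item for _, item in itertools.islice(merged, k)]

alice = [(30, "alice: photo"), (10, "alice: joined")]
bob = [(40, "bob: status"), (20, "bob: link")]
print(top_k_feed([alice, bob], 3))
# -> ['bob: status', 'alice: photo', 'bob: link']
```

The merge is O(k log f) per read for f friends; the appeal of a Graphity-style index is pushing that work to write time so reads stay constant per item.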
  • James Hamilton with a nice gloss on Hyder: Transactional Indexed Record Manager for Shared Flash Storage. In the Hyder system, all cores operate on a single shared transaction log. Each core (or thread) processes Optimistic Concurrency Control (OCC) database transactions one at a time.
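
The OCC idea can be sketched in a few lines: a transaction reads versioned values, buffers its writes, and at commit time validates that nothing it read has changed; on conflict it aborts and retries. A minimal single-process sketch of that pattern only — not Hyder's actual log-structured meld protocol:

```python
# Minimal optimistic concurrency control sketch: validate read versions
# at commit time against a shared store; abort on conflict.
# Illustrates the OCC idea only, not Hyder's shared-log machinery.

class Store:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def commit(self, reads, writes):
        # Validate: every version this transaction read must still be current.
        for key, version in reads.items():
            if self.read(key)[0] != version:
                return False  # conflict: another transaction got there first
        for key, value in writes.items():
            self.data[key] = (self.read(key)[0] + 1, value)
        return True

store = Store()
v, _ = store.read("x")
ok = store.commit(reads={"x": v}, writes={"x": 42})
print(ok)  # True: nothing changed x between our read and our commit
```

A second transaction holding the stale version would get False back from commit and simply re-read and retry — no locks held during the read phase.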
  • Is Your Kernel Reading /proc Too Slowly? Mark Seger, author of Collectl, with a detailed analysis of a serious problem for anyone running an HPC cluster, particularly if you worry about system noise and its impact on fine-grained MPI codes: reads of /proc/stat have been measured to be slower by over a factor of 50 on a system with 8 sockets and 48 cores. If you love to run top continually in the background you may be in for a shock. And you should be using collectl anyway.
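
You can check your own kernel with a quick timing loop over /proc/stat (a sketch; the absolute numbers depend heavily on the kernel version and core count, which is the article's whole point):

```python
import time

def time_reads(path, n=100):
    """Average seconds per full read of a (pseudo-)file."""
    start = time.perf_counter()
    for _ in range(n):
        with open(path, "rb") as f:
            f.read()
    return (time.perf_counter() - start) / n

# On Linux, compare a per-CPU /proc file against a tiny one:
# print(time_reads("/proc/stat"), time_reads("/proc/uptime"))
```

If the /proc/stat number is dramatically larger on a many-core box, that's the overhead every top-style monitor pays on each sample interval.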
  • An Illustrated Guide to Cryptographic Hashes. Loved this article by Steve Friedl. This stuff is wicked tough to understand and he does a great job making it understandable. 
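
Two of the properties the guide illustrates — fixed-size output and the avalanche effect — are easy to see for yourself with Python's standard hashlib:

```python
import hashlib

# Fixed-size output and the avalanche effect: a tiny input change
# produces a completely different digest of the same length.
h1 = hashlib.sha256(b"hello").hexdigest()
h2 = hashlib.sha256(b"hellp").hexdigest()  # last letter changed

print(len(h1), len(h2))   # always 64 hex chars, regardless of input length
print(h1[:8], h2[:8])     # prefixes share nothing in common
```

That unpredictability under small input changes is what makes hashes useful for integrity checks: you can't tweak a message while keeping its digest.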
  • eBay on Rapid Development Setup in Large Environments. Mahesh Somani with a great description of the release process at eBay, which tries to balance agility and code sharing against a huge code base. While many web properties have gone branchless, eBay uses a more traditional feature-branch approach. To handle large projects they: split large projects into several small projects, decouple applications from common code areas, create meta-information (a DSL) instead of compiled source code, and use source element changes in combination with binary bundles.
  • Google predicts what will bug your code.
  • For a cloud to really work, everything has to be in software appliances rather than hardware appliances, which requires the ability to effectively scale virtual appliances. Greg Ferro and Ivan Pepelnjak both have good articles on how a new startup called Embrane hopes to do this using IP flows. An IP flow is the stateful conversation of IP packets between a specific source and destination: not just one IP packet, but the whole two-way, full-duplex, stateful session of TCP or UDP packets that form the Layer 5 session flow. Embrane scales out by managing IP flows and directing them to other appliances, in effect creating what I would call two-tier load balancing.
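
The first tier of that scheme boils down to: hash the flow's 5-tuple so every packet of a session, in both directions, lands on the same appliance. A sketch of that idea with invented names and details — not Embrane's actual mechanism:

```python
import hashlib

# Flow-based distribution: map both directions of a conversation to the
# same appliance by sorting the endpoints before hashing the 5-tuple.
# Illustrative sketch only; appliance names and hashing are made up.

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    # Sorting the endpoints makes the key direction-independent.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (a, b, proto)

def pick_appliance(key, appliances):
    digest = hashlib.sha1(repr(key).encode()).digest()
    return appliances[int.from_bytes(digest[:4], "big") % len(appliances)]

appliances = ["vfw-1", "vfw-2", "vfw-3"]
fwd = flow_key("10.0.0.1", 4321, "10.0.0.2", 80, "tcp")
rev = flow_key("10.0.0.2", 80, "10.0.0.1", 4321, "tcp")
print(pick_appliance(fwd, appliances) == pick_appliance(rev, appliances))  # True
```

Keeping both directions on one appliance is what lets each appliance hold full per-session state, which is exactly why the unit of distribution has to be the flow rather than the packet.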