Stuff The Internet Says On Scalability For February 24, 2012

This is not your father's HighScalability:

  • 13,000 times the world’s GDP: Cost of the Death Star
  • Quotable quotes:
    • @chrissalzman: Scalability is the enemy of right now.
    • @resatsch: I like our IT team: "We used Redis before Youporn did it"
    • @virtual_bill: Mixing flash and spinning disk to balance cost is like strapping a rocket to a turtle.
    • @jaksprats: HDDs got slower at random access as they got bigger, cuz disk seeks stayed almost the same, similar phenomenon w/ Flash
  • Priam, king of Troy, begat a daughter, Cassandra, and Netflix, king of true distributed Amazon infrastructure, begat a co-processor for Cassandra, Priam, used for backup and recovery, bootstrapping, centralized configuration management, and RESTful monitoring and metrics. This is why Troy was never actually destroyed; it was simply backed up in situ to another region.
  • Evernote is everfaithful to SQL because SQL gives it all the ACID it needs to keep its billion Notes and almost 2 billion Resource files in order. But is keeping an attribute map to facilitate painless schema upgrades and partitioning users really being faithful to SQL?
  • With Amazon's Simple Workflow service Amazon is moving up the stack, leaving traditional IaaS behind, and heading directly into enterprise PaaS territory. It's an interesting choice. Workflow systems are a highly researched area that tends not to do well in practice because real-life state machines and dependencies quickly outstrip the expressiveness and capabilities of the underlying workflow engine. Typically, using a queuing system, application logic, and publish-subscribe event notification will get you where you need to go. But enterprises need workflow and approval processes, and that requirement may be one of the tethers being cut with a workflow service. We'll see if users are willing to put that much of their application structure into lockdown.
  • Moving 6 Billion Messages Without Being Noticed. DeviantArt explains a few tech details behind a recent database cluster migration. Their new cluster has 8 servers as masters and 8 identical machines as hot backups. Each machine has 12 cores, 96GB RAM, and 200GB SSDs.
  • HyperDex is a new key-value store that places objects on servers so that both search and key-based operations contact a small subset of all servers in the system. It uses hyperspace hashing to create a multidimensional Euclidean space into which objects are mapped. The first question of course is how does HyperDex compare to Redis? The usual benchmark cat fights ensue.
  • Stochastic chips process signals in parallel, which can make them up to three orders of magnitude faster than a conventional microprocessor at solving a pattern recognition task.
  • Near Neighbor Search in High Dimensional Data. Locality-Sensitive Hashing (continued): LS Families and Amplification; LS Families for Common Distance Measures.
  • How Mailinator compresses email by 90%. Explores how to compress streaming email that is stored for a short period of time in RAM. Uses multi-line compression and finding the shortest uncached sequence of consecutive lines.
  • Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons. Great summary of the first book club meeting contrasting Pregel with Signal/Collect. Pregel wins as it is a real framework, whereas Signal/Collect is considered more a proof of concept.
  • Will we witness a gradual decrease in cloud prices? Charles Babcock, in Amazon Brings Price-Cutter Mentality To AWS, thinks so. Amazon has lowered prices 18 times in six years, which will hopefully put downward pressure on costs generally. Though don't expect them to rush down that curve.
  • The many layers of Advanced Caching in Rails. Rich in details on page caching, action caching, and fragment caching, plus lots of other good advice, like consolidating cache expiration logic in one place.
  • How key-based cache expiration works: The cache key is the fluid part and the cache content is the fixed part. A given key should always return the same content. You never update the content after it’s been written and you never try to expire it either.
  • The Unreasonable Effectiveness of Data by Alon Halevy, Peter Norvig, and Fernando Pereira, Google: Follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
  • Linear Scalability of Distributed Applications: In this thesis, we deal with adaptive management of cloud resources under specific application requirements. Our approach responds effectively to sudden load increases or failures and makes best use of the geographical distance between nodes to improve application-specific data availability. We then propose a decentralized approach for adaptive management of computational resources for applications requiring high availability and performance guarantees under load spikes, sudden failures or cloud resource updates. Our approach involves a virtual economy among service components (similar to the one among data replicas) and an innovative cascading scheme for setting up the performance goals of individual components so as to meet the overall application requirements.
  • Real-Time Web Technologies Guide. For those wondering how to send messages around, here's a big list of Hosted Realtime Services, Self Hosted Realtime Services, and WebSocket Client Libraries.
  • Multithreading support in memcached. The lure of improving performance via threads is a powerful drug that cannot be denied. Memcached uses a wrapper approach: in single-threaded mode the lock calls are no-ops, and in multithreaded mode they take one of a few locks. Each thread listens on its own sockets.
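
For the curious, here is a rough sketch of the idea behind the hyperspace hashing mentioned in the HyperDex item above. It is not HyperDex's code: the function names, the SHA-256 hash, and the fixed grid of `cells` per axis are all assumptions made for illustration. Each attribute is one axis, an object lands in the grid region given by hashing every attribute, and a search that names only some attributes has to visit only the regions that agree on those axes.

```python
import hashlib
from itertools import product


def axis_coord(value, cells):
    """Hash one attribute value to a coordinate on its axis (hypothetical helper)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cells


def region_for(obj, attrs, cells):
    """Grid region that stores an object: one coordinate per attribute."""
    return tuple(axis_coord(obj[a], cells) for a in attrs)


def regions_for_search(query, attrs, cells):
    """Regions a search must contact.

    Axes named in the query are pinned to a single coordinate;
    unspecified axes range over every cell.
    """
    axes = [
        [axis_coord(query[a], cells)] if a in query else range(cells)
        for a in attrs
    ]
    return set(product(*axes))


ATTRS = ("first", "last", "phone")  # illustrative schema
person = {"first": "John", "last": "Smith", "phone": "555-0100"}
home = region_for(person, ATTRS, cells=4)
by_last = regions_for_search({"last": "Smith"}, ATTRS, cells=4)
```

With three attributes and four cells per axis there are 64 regions; a search on `last` alone touches 16 of them, and a search on all three attributes touches exactly one. In a real deployment regions would be assigned to servers, which this sketch leaves out.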
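
The Mailinator item above is about sharing repeated lines across emails held in RAM. A much-simplified sketch of that flavor of compression follows. It is an assumption-laden toy, not Mailinator's algorithm: the `LineStore` class is hypothetical, and it only deduplicates single lines, whereas the article describes multi-line sequences and a bounded cache.

```python
class LineStore:
    """Stores each distinct line once; an email becomes a list of line ids."""

    def __init__(self):
        self._ids = {}    # line text -> integer id
        self._lines = []  # integer id -> line text

    def put(self, email):
        """Store an email and return its compact list of line ids."""
        refs = []
        for line in email.split("\n"):
            if line not in self._ids:
                self._ids[line] = len(self._lines)
                self._lines.append(line)
            refs.append(self._ids[line])
        return refs

    def get(self, refs):
        """Rebuild the original email text from its line ids."""
        return "\n".join(self._lines[i] for i in refs)

    def unique_lines(self):
        return len(self._lines)


store = LineStore()
a = "From: news@example.com\nSubject: Weekly deals\n\nHello Alice\nUnsubscribe here"
b = "From: news@example.com\nSubject: Weekly deals\n\nHello Bob\nUnsubscribe here"
refs_a = store.put(a)
refs_b = store.put(b)
```

The two sample emails have ten lines between them but only six distinct ones, so bulk mail with identical headers and footers shrinks quickly under even this naive scheme.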
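
The key-based cache expiration item above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Note` record, the `updated_at` version stamp, and the `KeyBasedCache` class are all hypothetical, and a plain dict stands in for a real cache server. The point is that the key changes whenever the record changes, so cached content is written once and never updated or explicitly expired.

```python
from dataclasses import dataclass


@dataclass
class Note:
    id: int
    updated_at: int  # assumed version stamp, bumped on every change
    body: str


class KeyBasedCache:
    """Write-once cache: the key moves, the content under a key never does."""

    def __init__(self):
        self._store = {}
        self.renders = 0  # counts cache misses, for demonstration only

    def cache_key(self, note):
        # The key embeds the version stamp, so an edited note gets a new key.
        return f"notes/{note.id}-{note.updated_at}"

    def fetch(self, note):
        key = self.cache_key(note)
        if key not in self._store:
            self.renders += 1
            self._store[key] = f"<p>{note.body}</p>"  # stand-in for real rendering
        return self._store[key]


cache = KeyBasedCache()
note = Note(id=1, updated_at=100, body="hello")
first = cache.fetch(note)   # miss: renders and stores
second = cache.fetch(note)  # hit: same key, same content
note.body, note.updated_at = "hi", 101
third = cache.fetch(note)   # new key, so a fresh render
```

The stale entry under the old key is simply left behind here; with a bounded cache it would eventually be evicted, which is why no explicit expiration logic is needed.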
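
The wrapper approach in the memcached item above, where the same code path either takes a real lock or a no-op, can be sketched as follows. This is a Python analogy, not memcached's C code: the `Counter` class and `run_workers` helper are made up for illustration, and `contextlib.nullcontext` plays the role of the no-op lock.

```python
import threading
from contextlib import nullcontext


class Counter:
    """Same code in both modes; only the lock object differs."""

    def __init__(self, threaded):
        # Real lock in multithreaded mode, a do-nothing context manager otherwise.
        self._lock = threading.Lock() if threaded else nullcontext()
        self.value = 0

    def incr(self):
        with self._lock:
            self.value += 1


def run_workers(counter, workers, increments):
    """Hypothetical helper: hammer the counter from several threads."""
    def work():
        for _ in range(increments):
            counter.incr()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


single = Counter(threaded=False)
single.incr()

shared = Counter(threaded=True)
run_workers(shared, workers=4, increments=1000)
```

The design choice is that callers never branch on the threading mode; the single-threaded build just pays for an empty enter and exit around each critical section.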