Stuff The Internet Says On Scalability For February 15, 2013

Hey, it's HighScalability time:

  • The Herokulypse. A cautionary tale of what can happen when scalability is left for later. Rap Genius created quite a stir (reddit, Hacker News) when they documented high costs ($20K/month for 15 million monthly uniques) and poor performance (6 second average response times) using Heroku's random routing mesh. The cause was traced to queueing at the dyno level, when the expectation was that requests would be routed to free dynos. Heroku admits this is a problem. So poor load balancing combined with RoR single-threading = poor performance, and adding more dynos and spending more money won't necessarily help. While Heroku clearly failed to make this aspect of their system explicit, the incident has generated a lot of teaching moments, if you slog through it all (a small routing simulation appears after this list). This is a developing story.
  • You need money to feed the beast. Fred Wilson has some revenue ideas for you: Paid App Downloads - ex. WhatsApp; In-app purchases - ex. Zynga Poker; In-app subscriptions - ex. NY Times app; Advertising - ex. Flurry, AdMob; Digital-to-physical - ex. Red Stamp, Postagram; Transactions - ex. Hailo.
  • Netflix is amassing an impressive Open Source infrastructure. As an introduction Adrian Cockcroft and Ruslan Meshenberg wrote up meeting notes on their First NetflixOSS Meetup. A project called Denominator, implementing multi-region failover and traffic sharing patterns, looks especially cool. Also, The Netflix API Optimization Story.
  • @jackrusher: "The 20-year-old five-minute rule for RAM and disks still holds, but for ever-larger disk pages." < The break-even math behind the rule is sketched after this list.
  • Researchers create crash-proof, self-repairing, inspired-by-nature computer: Instead of being procedural, UCL’s computer is “systemic” — each process that it executes is actually its own system, with its own data and instructions. Instead of these processes being executed sequentially, a pseudorandom number generator acts as the task scheduler. Each of the systems acts simultaneously, and independently of the other systems — except for a piece of memory where similar systems can share context-sensitive data. A toy sketch of the scheduling idea appears after this list.
  • Solid look at How do I freaking scale Oracle? covering Oracle RAC, mirroring, transaction replication, partitioning, and a hybrid NoSQL/cache approach (sketched after this list). The conclusion: Oracle can do almost everything NoSQL can, but at greater cost.
  • Why average latency is a terrible way to track website performance: We've gotten a lot of responses saying "duuuh, averages are bad!" and "use the 95th percentile". This post is about more than these simple statements. It's about the common logic errors that result, and how to select metrics that help you make better decisions. A worked example of mean vs. percentiles appears after this list.
  • Todd Kennedy on why they Switched to Node.js. I wonder how often technologies are selected because they aren't boring? Discussion on reddit. Also, Realtime Node.js App: Building a Server.
  • Making sense of exascale data sets using Bullet Time, the famous Hollywood filming technique: The idea is to surround the simulated action with thousands, or even millions, of virtual cameras that all record the action as it occurs. Humans can later “fly” through the action by switching from one camera angle to the next, just like bullet time.
  • StorageMojo’s best papers of FAST ’13. Some interesting papers with classic StorageMojo takes: Erasure codes, high update costs make this most attractive for active archives, not primary storage; SSDs help with contention, but they aren’t affordable for large-scale deployments; Using flash only for reads means ignoring half – or more – of the I/O problem; We may be trusting SSDs more than they deserve.
  • Great look at the memory hierarchy, cache coherence, and other concurrency issues: CPU Cache Flushing Fallacy.
  • Exponential Decay of History, Improved: Exponential decay provides a deep history – potentially keeping years or centuries of context in a bounded volume. The tradeoff is losing much of the intermediate information. < Exciting applications for compressing system logs, game save files, caches, history, undo, dashboards, and other situations with huge streams of data. A minimal retention sketch appears after this list.
  • Fatcache: memcache on SSD. Think of fatcache as a cache for your big data. Fatcache achieves performance comparable to an in-memory cache by focusing on two design criteria: minimize disk reads on cache hit; eliminate small, random disk writes. A toy sketch of both criteria appears after this list.
  • The Trinity Graph Engine: we can perform efficient graph analytics on web-scale, billion-node graphs using dozens of commodity machines. Also, How to Partition a Billion-Node Graph. A baseline hash-partitioning sketch appears after this list.
  • Great answer by Gil Tene to Are there any issues with objects order during concurrent compaction? IMO, much of the actual "beneficial effect" of original-order and access-order layout-during-compaction schemes out there is dominated by a single Java class: String. For one, it's a fairly hot class in many applications. And since most implementations represent String as two separate heap objects (a String and a char array), which live and die together and are accessed together in a clear pattern, co-locating the two object parts of a String linearly in memory provides a clear benefit (over splitting them).
  • Open Systems - Actors and Cloud: In this talk we will show how highly-available stateful Actors provide a flexible and easy to use programming model for the Cloud on the outside, while still allowing for the traditional programming models on the inside. A minimal actor sketch appears after this list.
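
From the Herokulypse item: a minimal simulation sketch of why random routing to single-threaded dynos queues up while routing to the least-busy dyno doesn't. All numbers here (50 dynos, 300ms requests, arrival rate) are hypothetical, chosen only to make the effect visible:

```python
import random

def simulate(n_dynos, n_requests, service_time, mean_gap, smart):
    """Queueing with single-threaded dynos: each dyno serves one request
    at a time, so a request stuck behind a busy dyno waits even while
    other dynos sit idle. All parameters are hypothetical."""
    free_at = [0.0] * n_dynos            # when each dyno next goes idle
    now = total_wait = 0.0
    for _ in range(n_requests):
        now += random.expovariate(1.0 / mean_gap)          # next arrival
        if smart:                                          # route to least busy
            d = min(range(n_dynos), key=free_at.__getitem__)
        else:                                              # Heroku-style random
            d = random.randrange(n_dynos)
        start = max(now, free_at[d])
        total_wait += start - now
        free_at[d] = start + service_time
    return total_wait / n_requests

random.seed(1)
for smart in (False, True):
    wait = simulate(n_dynos=50, n_requests=100_000,
                    service_time=0.3, mean_gap=0.01, smart=smart)
    print("least-busy" if smart else "random    ", f"routing: avg wait {wait:.3f}s")
```

At this (made-up) load each dyno is only ~60% busy, yet the random router leaves requests waiting behind busy dynos while others sit idle; the least-busy router's wait is near zero. Adding dynos shrinks the random router's queues more slowly than you'd hope, which is the "more money won't necessarily help" part.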
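
On @jackrusher's five-minute-rule tweet: the Gray/Putzolu break-even interval says to cache a page in RAM if it's re-referenced more often than the interval below. Plugging in hypothetical 2013-ish prices shows what "still holds, but for ever-larger disk pages" means:

```python
def break_even_seconds(pages_per_mb_ram, disk_iops, disk_price, ram_price_per_mb):
    # Gray & Putzolu: keep a page in RAM if it is re-referenced within
    # (pages per MB of RAM / accesses per second per disk)
    #   * (price per disk drive / price per MB of RAM) seconds.
    return (pages_per_mb_ram / disk_iops) * (disk_price / ram_price_per_mb)

# Hypothetical prices: an $80, 100-IOPS disk and RAM at $0.005/MB.
for page_kb in (4, 64, 512):
    secs = break_even_seconds(1024 // page_kb, 100, 80.0, 0.005)
    print(f"{page_kb:3d} KB pages: break-even ~{secs / 60:6.1f} minutes")
```

With these made-up numbers only ~512KB pages land near five minutes; small 4KB pages should now stay cached for hours, because RAM keeps getting cheaper while disk IOPS barely move.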
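
On the UCL "systemic" computer: a toy sketch of the core idea, entirely my own construction rather than UCL's code. Each system owns its data and instructions, a PRNG picks what runs next, and one piece of shared memory lets similar systems exchange context:

```python
import random

class System:
    """A toy 'system': its own data plus its own instruction."""
    def __init__(self, name, instruction):
        self.data = {"name": name}
        self.instruction = instruction
    def step(self, shared):
        self.instruction(self.data, shared)

def count_and_share(data, shared):
    data["n"] = data.get("n", 0) + 1
    shared[data["name"]] = data["n"]     # context shared with similar systems

systems = [System(name, count_and_share) for name in "abc"]
shared = {}                              # the one piece of shared memory
rng = random.Random(42)                  # the PRNG acting as task scheduler

for _ in range(30):
    rng.choice(systems).step(shared)     # no fixed execution order

# Any single system can fail without halting the rest; redundant copies
# of the same instructions are what make self-repair possible.
print(shared)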
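
On scaling Oracle: a sketch of the hybrid NoSQL/cache option via the classic cache-aside pattern, so the database only sees reads the cache can't absorb. The cache class and query function here are hypothetical stand-ins, not any particular product's API:

```python
import time

class TtlCache:
    """Tiny stand-in for the cache/NoSQL tier from the article."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        hit = self.store.get(key)
        return hit[0] if hit and hit[1] > time.time() else None
    def set(self, key, value, ttl):
        self.store[key] = (value, time.time() + ttl)

def get_account(cache, fetch_from_oracle, account_id, ttl=300):
    """Cache-aside: read the cache first, fall through to Oracle on a
    miss, then populate the cache. `fetch_from_oracle` is hypothetical."""
    key = f"account:{account_id}"
    row = cache.get(key)
    if row is None:
        row = fetch_from_oracle(account_id)
        cache.set(key, row, ttl)     # remember to invalidate on writes
    return row

cache, calls = TtlCache(), []
fake_db = lambda i: calls.append(i) or {"id": i, "name": "ada"}
print(get_account(cache, fake_db, 7), get_account(cache, fake_db, 7))
print("oracle queries:", len(calls))   # 1 -- the second read was a cache hit
```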
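
On average latency: a quick illustration, using a synthetic (hypothetical) latency distribution, of how the mean misleads and why one percentile isn't the full story either:

```python
import random, statistics

random.seed(7)
# Hypothetical latencies: a fast bulk plus a slow tail (GC pauses,
# cache misses, queueing). Real traffic is rarely bell-shaped.
latencies = [random.gauss(80, 10) for _ in range(9500)] + \
            [random.gauss(1500, 300) for _ in range(500)]

ordered = sorted(latencies)
pct = lambda q: ordered[int(len(ordered) * q)]
print(f"mean {statistics.mean(latencies):.0f}ms  median {pct(.50):.0f}ms  "
      f"p95 {pct(.95):.0f}ms  p99 {pct(.99):.0f}ms")
# The mean (~150ms here) describes nobody: the typical request takes
# ~80ms while roughly 1 in 20 takes the better part of a second or more.
# And p95 alone still hides how bad the tail gets -- pick the metric
# that matches the decision you need to make.
```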
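
On Exponential Decay of History: a minimal sketch of one way to keep a bounded history whose samples thin out exponentially with age. This is my own simplification of the idea, not the linked post's exact algorithm:

```python
import math

class DecayingHistory:
    """Bounded history: recent entries stay dense, ancient entries go
    sparse, so a fixed buffer can span years of context."""
    def __init__(self, capacity):
        self.capacity, self.seq, self.items = capacity, 0, []

    def append(self, value):
        self.items.append((self.seq, value))   # items kept oldest-first
        self.seq += 1
        if len(self.items) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Drop the interior entry whose neighbors are closest together in
        # log(age) -- the most redundant sample on an exponential scale.
        now = self.seq
        span = lambda i: (math.log(now - self.items[i - 1][0]) -
                          math.log(now - self.items[i + 1][0]))
        del self.items[min(range(1, len(self.items) - 1), key=span)]

h = DecayingHistory(capacity=16)
for t in range(10_000):
    h.append(f"event-{t}")
print([seq for seq, _ in h.items])   # gaps grow roughly geometrically with age
```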
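
On fatcache: a toy sketch, not Twitter's implementation, of its two design criteria. A RAM-resident index means a hit costs at most one SSD read, and values are batched into large sequential slab writes instead of small random ones:

```python
import tempfile

class FatcacheSketch:
    """Toy memcache-on-SSD: index in RAM, values on 'disk' in slabs."""
    SLAB_SIZE = 1 << 20                      # flush in 1MB sequential chunks

    def __init__(self):
        self.ssd = tempfile.TemporaryFile()  # stand-in for the SSD
        self.index = {}                      # key -> (offset, length), in RAM
        self.buffer = bytearray()            # pending slab, batched in RAM
        self.pending = {}                    # key -> (buffer offset, length)

    def set(self, key, value):
        self.pending[key] = (len(self.buffer), len(value))
        self.buffer += value
        if len(self.buffer) >= self.SLAB_SIZE:
            self._flush_slab()

    def get(self, key):
        if key in self.pending:              # still in the write buffer
            off, n = self.pending[key]
            return bytes(self.buffer[off:off + n])
        if key in self.index:                # hit: exactly one disk read
            off, n = self.index[key]
            self.ssd.seek(off)
            return self.ssd.read(n)
        return None                          # miss: no disk read at all

    def _flush_slab(self):
        base = self.ssd.seek(0, 2)           # append-only: sequential write
        self.ssd.write(self.buffer)
        for key, (off, n) in self.pending.items():
            self.index[key] = (base + off, n)
        self.buffer, self.pending = bytearray(), {}

cache = FatcacheSketch()
cache.set(b"user:1", b"some big value")
print(cache.get(b"user:1"))
```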
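
On partitioning a billion-node graph: the baseline every smarter partitioner is measured against is hashing vertices across machines and paying a network hop for each edge that crosses a partition. A small sketch of that baseline and the "cut" it produces:

```python
from collections import defaultdict

def hash_partition(edges, n_machines):
    """Naive vertex partitioning by hash -- the scheme the paper's
    multilevel approach tries to beat. Returns per-machine adjacency
    lists plus the cut: edges whose endpoints land on different
    machines and so cost a network hop during traversal."""
    part = lambda v: hash(v) % n_machines
    machines = [defaultdict(list) for _ in range(n_machines)]
    cut = 0
    for u, v in edges:
        machines[part(u)][u].append(v)
        if part(u) != part(v):
            cut += 1                 # remote neighbor: network traffic
    return machines, cut

edges = [(i, (i + 1) % 1000) for i in range(1000)]    # a toy ring graph
_, cut = hash_partition(edges, n_machines=8)
print(f"{cut}/{len(edges)} edges cross machines")     # high cut = chatty queries
```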
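
On Actors and Cloud: a minimal sketch of a stateful actor (private state, a mailbox, one message at a time) just to ground the terminology; this is a generic toy, not the talk's framework:

```python
import queue, threading

class Actor:
    """Minimal actor: private state plus a mailbox drained by a single
    thread, so the state never needs locks."""
    def __init__(self, **state):
        self.state = state
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        self.mailbox.put(msg)

    def _run(self):
        while True:
            self.receive(self.mailbox.get())   # one message at a time
            self.mailbox.task_done()

class Counter(Actor):
    def receive(self, msg):
        self.state["count"] = self.state.get("count", 0) + msg

c = Counter()
for _ in range(1000):
    c.send(1)
c.mailbox.join()          # wait until the mailbox is drained
print(c.state)            # {'count': 1000}: exactly one thread mutates it
```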