Stuff The Internet Says On Scalability For September 7, 2012

It's HighScalability Time:

  • Quotable Quotes:
  • Evolution of SoundCloud’s Architecture: The way we develop SoundCloud is to identify the points of scale then isolate and optimize the read and write paths individually, in anticipation of the next magnitude of growth.
  • How We Built Our 60-Node (Almost) Distributed Web Crawler. Semantics3 crawls 1-3 million pages a day at a cost of ~$3 a day (excluding storage costs) using micro-instances, Gearman, Redis, Perl, Chef, and Capistrano.
  • Werner Vogels continues his 50 Shades of Programming book club with Back-to-Basics Weekend Reading - Granularity of locks. Highlight is a touching remembrance of Jim Gray.
  • Speaking of locks and stories, the MySQL Performance Blog in Write contentions on the query cache tracks down a performance problem to contention/excessive waiting somewhere in the server. A good candidate for the contention issues was the query cache, as it was enabled.
  • Identifying and Eliminating Extra DB Latency. Embrace your inner Sherlock as you hunt down a tricky performance problem. Love these kinds of posts. The solution: We increased the statement cache size parameter, which dictates how many parsed SQL statements can remain in cache at a given time, from 80 to 500.
  • Harness unused smartphone power for a computing boost: Büsching and colleagues joined six low-powered Android phones into a network. Each can carry out 5.8 million calculations per second, or megaflops. When connected via Wi-Fi the phones could carry out a combined 26.2 megaflops, about 75 per cent of the theoretical maximum of 6 × 5.8 ≈ 34.8 megaflops. Connecting the phones together via USB upped this to 29 megaflops.
  • Baron Schwartz with an interesting take on in-memory databases and flash: Most people I know using Fusion-IO aren't doing it with in-memory databases, but with databases much larger than memory. They do it because they are overwhelmingly disk-bound on reads. If you're using an in-memory database, spending a large amount of money on that caliber of storage is likely to be a waste. An in-memory database pretty much needs durable storage for a (relatively) occasional and (mostly) sequential write workload, which can be handled quite well by a good RAID controller with a battery-backed write cache.
  • Why Tarsnap doesn't use Glacier. While the should-we-or-shouldn't-we aspects of their decision not to use Glacier are interesting, what I loved was the explanation of Tarsnap's data and index deduplication design and how it would interface with Glacier. Fascinating reading.
  • In AWS RDS Benchmark and Modeling, Roberto Gaiser finds: EBS size has no effect on I/O performance; small instances are more affected by other instances on the same server; larger instances have a greater share of the physical resources; and more network throughput translates to more EBS I/O.
  • Herb Sutter shows How to write a CAS loop using std::atomics. Simple it is not. (A minimal sketch of the pattern appears after this list.)
  • Scaling out Postgres, Part 1: Partitioning. Discusses how connect.me plans to scale out in anticipation of potential viral growth.  On HackerNews
  • Interesting analysis of What Do Real-Life Hadoop Workloads Look Like? suggests: at any given time some subsets of the data are more valuable than the rest and get accessed more often. That smaller, hotter subset could be maintained in more expensive storage with lower latency and/or higher replica counts without significantly affecting total cost, if HDFS placement were made sensitive to access patterns.
  • Damn Cool Algorithms: Homomorphic Hashing. Nick Johnson does a good job explaining the paper On-the-fly verification of rateless erasure codes for efficient content distribution. Homomorphic Hashing is useful for "verifying data from untrusted peers" and "creat[ing] an impressively fast and efficient content distribution network."
  • Cache Lines Are The New Registers: it’s clear that the most important optimizations are not the micro-fiddling with instruction order and register allocation. It’s memory access planning. Would it be crazy to suggest that compilers might as well be able to provide the CPU with a much more detailed and accurate picture of the program’s memory access patterns? (A toy illustration of the cache-line effect appears after this list.)
  • Netflix has made Eureka available. Service discovery and load balancing are intimately connected and Netflix has united them in one secure, cross availability zone package. Available servers are cached on the client rather than in an intermediate service layer. Services register with Eureka and then send heartbeats to renew their leases every 30 seconds (a sketch of the renewal loop appears after this list). Even if you don't want to sign up for the entire framework, the ideas are worth considering.
  • Cloud Tech III is happening in Mountain View on October 6th. Over 1300 people attended last year. Speakers include Jeff Dean, Andy Bechtolsheim and many other luminaries from some of your favorite companies. Sebastian always puts on a great event.
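
A minimal sketch of the CAS loop Herb describes, assuming we just want to multiply a shared counter without a lock (the atomicMultiply function is illustrative, not Herb's code):

```cpp
#include <atomic>
#include <iostream>

// Multiply a shared counter by a factor without taking a lock: read the
// current value, compute the replacement, and retry if another thread
// changed the value in the meantime.
void atomicMultiply(std::atomic<int>& value, int factor) {
    int expected = value.load();
    // On failure, compare_exchange_weak refreshes 'expected' with the value
    // it actually saw, so each retry recomputes the replacement from fresh data.
    while (!value.compare_exchange_weak(expected, expected * factor)) {
        // spin until the exchange succeeds (spurious failures are fine here)
    }
}

int main() {
    std::atomic<int> counter{21};
    atomicMultiply(counter, 2);
    std::cout << counter.load() << "\n";  // prints 42
}
```

The subtle part is the retry: the replacement value has to be recomputed from the refreshed expected value on every iteration, or the loop silently publishes stale data.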
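
And a toy illustration of the cache-lines point (a hypothetical example, not from the article): both loops below do the same additions over the same row-major array, so any difference in run time comes purely from memory access order.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<int> m(n * n, 1);  // row-major n x n matrix of ones

    // Time a traversal and print its result so the work can't be optimized away.
    auto time = [](const char* label, auto&& walk) {
        auto start = std::chrono::steady_clock::now();
        long long sum = walk();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cout << label << ": sum=" << sum << " in " << ms << " ms\n";
    };

    // Row order: consecutive accesses fall in the same cache line.
    time("row order", [&] {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) s += m[i * n + j];
        return s;
    });

    // Column order: each access jumps n*sizeof(int) bytes, touching a new cache line.
    time("column order", [&] {
        long long s = 0;
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i) s += m[i * n + j];
        return s;
    });
}
```

On typical hardware the row-order walk is markedly faster because consecutive accesses land in the same cache line, which is the kind of memory access planning the article argues matters more than register-level fiddling.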
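
Finally, a rough sketch of the Eureka lease-renewal idea. The registerService and sendHeartbeat functions below are hypothetical stand-ins for the calls a Eureka client actually makes; this shows only the shape of the 30-second renewal loop, not Eureka's API:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical stand-ins for the registry calls a service client would make.
bool registerService(const std::string& app, const std::string& instanceId) {
    std::cout << "register " << app << "/" << instanceId << "\n";
    return true;
}

bool sendHeartbeat(const std::string& app, const std::string& instanceId) {
    std::cout << "heartbeat " << app << "/" << instanceId << "\n";
    return true;
}

int main() {
    const std::string app = "my-service";
    const std::string instanceId = "host-1";

    // Register once, then renew the lease every 30 seconds.
    registerService(app, instanceId);
    for (int i = 0; i < 3; ++i) {  // bounded loop so the sketch terminates
        std::this_thread::sleep_for(std::chrono::seconds(30));
        if (!sendHeartbeat(app, instanceId)) {
            registerService(app, instanceId);  // re-register if the lease lapsed
        }
    }
}
```

The appeal of leases is that an instance which stops heartbeating eventually expires from the registry on its own, so crashed servers drop out of the cached server list without an explicit deregistration.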

This week's selection: