Stuff The Internet Says On Scalability For June 17, 2011

Submitted for your scaling pleasure:

  • Google's code base receives 20+ code changes per minute and 50% of the files change every month. Learn how they test all that. Graph analysis? Of course.
  • Quatrains of quotably quotable quotes:
    • mehals: Reading the AmazonStorage Wiki and Game of Thrones. Lannister took a query as an insult to their scalability, laid siege to House Oracle.
    • jfpaccini: Werner Vogels at #awssummit: big data is one of the strongest driver to cloud computing.
    • johncmunoz: The average obscures the sexiness in your data. Show the distribution in your #bigdata. #SAS #Tableau #JMP #R, even #Excel will do.
    • Boss1881: The era of cloud-computing is approaching, but some are concerned - can wireless carriers keep up with network and data demands
    • bernardlunn: Just saw the Fail Whale (very briefly, well done scalability engineers), felt like old times
    • cloudcompete: Petabytes aren't cool... you know what's cool, Exabytes. #bigdata
  • Achievements for this week: Instagram: We have 5M users, nearly 100M photos; Facebook Nears 700 Million Users Worldwide.
  • Did hell freeze over? iCloud Uses Windows Azure Services For Hosting Data. Hard to believe: 1) it costs $$$ 2) Apple doesn't see this as a core competency 3) Apple hates MS. Easy to believe: 1) shorter time to market 2) Apple requires geographical redundancy and has only one datacenter location.
  • Sweet article by Peter Hizalev on Scaling Redis by adding a consistent hashing layer on top of it. With a goal of providing generalized Google Instant type results, Meshin turned to an in-memory database using a scale-out approach on commodity hardware. Redis was chosen, but a growable sharding scheme was needed, and the solution is discussed in detail. The system has 14 nodes each with 96GB of RAM, which amounts to a total of 1.3TB. The nodes have 16 hardware threads each. We deployed 160 partitions each replicated twice. Useful Redis-related discussion in the comments as well.
  • Cache me if you can. Benjamin Pollack talks about how they changed Kiln to cache LINQ database entities in Memcache, which cut the amount of data pulled from SQL by 75%.
  • Scaling SQL Server: Growing Out. Jeremiah Peschka says scaling reads is easy: cache. How do you scale writes? Some techniques: use a write master; write to many servers; use a specialty store.
  • deviantART shows how to repel an attack using spam counters to bounce bots who favorite too much. 
  • Heaps of trouble in this Reddit thread: Erlang memory architecture vs Java memory architecture: "the simple fact that heaps are private to a thread relieves the threads of all forms of lock checking on their own data. Add to that the fact that there are no destructive writes." Good discussion of memory management, tasks, flow control, queue depths, and other tasty issues.
  • Which is a better compiler, V8 or GCC? V8 is faster than GCC says Andy Wingo in a controversial benchmark that found V8's compiled result is simply better. As you might imagine the reaction was, well, intense.
  • 12 videos from the Data Scientist Summit 2011
  • Yes, You Can Run 18 Static Sites on a 64MB Link-1 VPS says LowEndBox. No, not using Apache. Use: Debian 5, Lighttpd, WordPress, MySQL and PHP in CGI/FastCGI mode. Have to say when I saw MySQL I was quite doubtful, but they show you how. The conclusion: 64MB is more than enough to serve a few low traffic static websites. You can actually run a few WordPress sites with a few hundred visitors a day — at the price equivalent to many heavily oversold shared hosting and you get root access! On Hacker News.
  • smhasher - A code library for testing hash functions. Hash functions are hard to test. This is a nice-looking library that makes it easy to run a lot of different tests you may not have even thought about.
  • Multicore Garbage Collection with Local Heaps. A garbage collector with local per-processor heaps can outperform a stop-the-world parallel garbage collector in raw parallel throughput, and exhibits more robust performance through having fewer all-core synchronisations.
  • Why SQL Sucks for NoSQL Unstructured Database. Nuno Job with a great checklist of things to look for in a query language for unstructured information: Navigation Language, Data Model, Regular expressions, Lambdas, High order functions, Functional flavor, Good string handling, Modules so you can build your own libraries, App Server aware: has functions that serve REST.
  • Omar Al Zabir with A Simple Way to Cache Objects and Collections for Greater Performance and Scalability. After implementing caching, it became significantly faster, around 32 requests/sec. Page load time decreased significantly as well to 0.41 sec only. During the load test, CPU utilization was around 60%.
  • eHarmony Switches from Cloud to Atom Servers. “We were paying huge amounts for a few hours of compute in the cloud,” says Cormac Twomey, director of software engineering at eHarmony. “We worked closely with SeaMicro to bring the SM10000 servers in-house, and since then we have enjoyed a dramatic reduction in operating expense and have seen a substantial reduction in variability around job completion times. We now have an additional 20 hours of compute per-day at our disposal.” Whether this tactic will work as well for real-time loads is a good question.
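
For readers who have not met the consistent hashing idea mentioned in the Scaling Redis item above, here is a minimal Python sketch of a hash ring with virtual nodes. It is a generic illustration, not Meshin's implementation: the `HashRing` class, the `vnodes` parameter, the MD5-based `_hash` helper, and the `redis-N` node names are all assumptions made up for this example. Only the node count of 14 comes from the article summary.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring (assumed scheme)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # ring points per physical node
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # hash collision with an existing point; skip it
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        """Drop every ring point owned by this node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        """Return the node owning the first ring point clockwise of the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    # 14 nodes, as in the article summary; the names are invented.
    ring = HashRing([f"redis-{i}" for i in range(14)])
    keys = [f"user:{i}" for i in range(10_000)]
    before = {k: ring.node_for(k) for k in keys}

    ring.add("redis-14")  # grow the cluster by one node
    moved = sum(1 for k in keys if ring.node_for(k) != before[k])
    print(f"{moved / len(keys):.1%} of keys moved after adding one node")
```

The point of the ring is that growing the cluster only remaps the keys that land on the new node; every other key stays where it was, which is what makes this kind of sharding scheme growable.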