Stuff The Internet Says On Scalability For December 5, 2011

It's HighScalability Time!

  • Quotable quotes:
    • @jaykreps : Was wondering, How can I turn my boring, cachable, read-only traffic into random writes on mongodb? And lo! link
    • @marshallk : Google runs 100-200 experiments every day on UI, algorithm & product
    • @styggiti : The problem with companies like IBM and Oracle baking NoSQL "scalability" into their products isn't the tech, it's the $$ licensing.
  • Blazing fast node.js: 10 performance tips from LinkedIn Mobile. You may have thought that node.js made everything magically fast, but Shravya Garlapati has some great strategies for going even faster: Avoid synchronous code; Turn off socket pooling; Don't use Node.js for static assets; Render on the client-side; Use gzip; Go parallel; Go session-free; Use binary modules; Use standard V8 JavaScript instead of client-side libraries; Keep your code small and light.
  • Nice thread in NoSQL Databases on HBase and Consistency in CAP. The short summary of the article is that CAP isn't "C, A, or P, choose two," but rather "When P happens, choose A or C."
  • Fast, easy, realtime metrics using Redis bitmaps. Chandra Patni explains the innovative use of Redis bitmaps to handle problems like a daily unique user count for 128 million users in 50 ms. The advantages are speed and space efficiency. I like the idea of keeping a separate bitmap index for each account to facilitate keeping stats by bitmap.
  • BitTorrent’s µTP protocol routes around TCP's near-fatal resend-on-congestion algorithm by running over UDP and, when congestion is detected, yielding until the pipes have cleared up again. Good article by Janko Roettgers: How BitTorrent wants to save the Internet. IETF Proposal.
  • Lars George has written a great book on HBase: HBase: The Definitive Guide. The Architecture chapter especially has an illuminating explanation of hard disk Seek vs. Transfer tradeoffs in database design and how they relate to B+Trees vs. Log-Structured Merge-Trees.
  • Brandon Wirtz on Quantified Price Reduction through Optimizations, showing how he reduced his costs on GAE by 5x by: moving from Python 2.5 to Python 2.7; optimizing his use of Instance Memory, MemCache, and DataStore; and fixing Cache Headers. Also, MemCache Vs. EdgeCache.
  • Scaling with the Kindle Fire. Good article on how Pulse, a news reading app for iPhone, iPad, and Android, uses GAE to serve hundreds of millions of requests per day. Use memcache, set cache control headers, tune down instance creation, buy the Premier account, split load across different applications.
  • Cloud email service price comparison. Will compares a bunch of email services. Sending bulk email is a huge PITA, so it's good to see a comparison. On Hacker News. So which email cloud provider should you use? Use the graphs I made, but price is only going to be one factor, so check what each provider offers. I’ve linked to all the pricing pages below.
  • Programming language impact on the development of distributed systems by Debasish Ghosh, Justin Sheehy, Kresten Krab Thorup, Steve Vinoski.  In this paper, we first present a history of programming languages and distributed systems, and then explore several alternative languages along with modern systems built using them.
  • Intel Guide for Developing Multithreaded Applications. The Guide provides general advice on multithreaded performance. Hardware-specific optimizations have deliberately been kept to a minimum. Also,  Dmitry Vyukov talks about his fast AddressSanitizer for Go.
  • Netflix has released: Curator - The Netflix ZooKeeper Library. Looks like a great, easy-to-use wrapper on top of ZooKeeper. The article lists the common problems with ZooKeeper and how they handle them. At Netflix ZooKeeper is used for: locks for sequence ID generators; Cassandra Backups; TrackID Service; leader selection; locking 3rd party services; caching.
  • High Availability for Cloud Computing Database Systems. James Hamilton reviews 1) RemusDB: Database high availability using virtualization, and 2) DBECS: Database high availability and availability using eventually consistent cloud storage. Lots of good high availability talk. Also, Global Netflix Platform: Large Scale Java PaaS Running on AWS.
  • Cassandra at Gowalla. Adam Keys with a generous discussion of their Cassandra usage. It’s become our database of choice for applications with relatively fixed query patterns that, for us to succeed, need to handle a rapidly growing dataset.
  • Stale cache serving strategy with reactive flush. John Clarke Mills has achieved the goal of no cache misses by first building what I call a persistent cache: memcache backed by disk or database.
  • Nigel Poulton has Seen The Future of SSD Arrays! and it is going to look like an industry standard off the shelf x86 server, crammed full of industry standard form-factor hot-pluggable SSD drives, running SCSI over PCIe with all of the smarts and clevers in software.
  • If you are interested in Software Defined Networking there's a new SDN meetup you might be interested in. 
  • deviantArt shows Faster Web Development with Virtual Machines. Every developer gets their own virtual machine, which means: Fewer Commits, Reduced Contention for the Staging Server, Freedom to Experiment. Also, Chaos Gerbils: An Explanation - a gripping story of recovering from a data corruption bug with statement-based replication used for unique IDs.
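
To make the Redis bitmaps item above a little more concrete, here is a minimal sketch of the idea in plain Python. It does not talk to Redis and is not the code from Chandra Patni's article: each key holds one Python integer used as a bitset, and a user's numeric ID is the bit offset, which is roughly what Redis's SETBIT and BITCOUNT commands provide. The class name, method names, and the "event:date" key layout are all hypothetical. At one bit per user, 128 million users need 128,000,000 / 8 = 16,000,000 bytes, about 16 MB per day per metric.

```python
from datetime import date


class ActivityBitmaps:
    """In-memory stand-in for per-day bitmaps (hypothetical helper, not Redis)."""

    def __init__(self):
        # key -> int used as a bitset; bit N is set if user N was seen
        self._bits = {}

    @staticmethod
    def _key(event, day):
        # Assumed key layout: "<event>:<YYYY-MM-DD>"
        return f"{event}:{day.isoformat()}"

    def mark(self, event, day, user_id):
        """Record that user_id performed event on day (idempotent)."""
        key = self._key(event, day)
        self._bits[key] = self._bits.get(key, 0) | (1 << user_id)

    def unique_count(self, event, day):
        """Number of distinct users for one day: a population count."""
        return bin(self._bits.get(self._key(event, day), 0)).count("1")

    def unique_count_over(self, event, days):
        """Distinct users across several days: OR the bitmaps, then count."""
        combined = 0
        for day in days:
            combined |= self._bits.get(self._key(event, day), 0)
        return bin(combined).count("1")

    def retained_count(self, event, days):
        """Users active on every one of the given days: AND the bitmaps."""
        days = list(days)
        if not days:
            return 0
        combined = self._bits.get(self._key(event, days[0]), 0)
        for day in days[1:]:
            combined &= self._bits.get(self._key(event, day), 0)
        return bin(combined).count("1")


if __name__ == "__main__":
    stats = ActivityBitmaps()
    sunday, monday = date(2011, 12, 4), date(2011, 12, 5)
    stats.mark("login", sunday, 7)
    stats.mark("login", sunday, 7)    # repeat visit, still one bit
    stats.mark("login", sunday, 42)
    stats.mark("login", monday, 42)
    stats.mark("login", monday, 99)
    print(stats.unique_count("login", sunday))                 # 2
    print(stats.unique_count_over("login", [sunday, monday]))  # 3
    print(stats.retained_count("login", [sunday, monday]))     # 1
```

The appeal of the approach is that marking a user is a single bit write, and weekly or monthly uniques fall out of ORing the daily bitmaps rather than rescanning raw events.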