Stuff The Internet Says On Scalability For August 19, 2011

You may not scale often, but when you scale, please drink HighScalability:

  • Akamai: 95,811 Servers, 1,000 Networks, 70 Countries.
  • Quotably quotable quotes:
    • @segphault : Linus talking about the kernel's scalability. Beneficial to have one kernel used from embedded to high-end bc improvements span use cases.
    • suspended : I am sure that scalability is the future, there are just too many platforms and screen sizes out there
    • @russferriday : Just completed a proposal for a rare bird data gathering system using #CouchDB *and* #Cassandra. Nice project. #NoSQL
    • @drelu : Oracle - everything is very convenient until it fails. #nosql
  • How do you model Google+ circles with MongoDB? Some ideas in this Google Groups thread. More on MongoDB with Mat Wall explaining Why I Chose MongoDB for guardian.co.uk.
  • ACM SIGCOMM Test of Time Paper Award. Award winning papers through the years. A lot of good ones, worth a peruse. 
  • Read Amplification Factor. Mark Callaghan with an interesting discussion of the equivalent of the write amplification factor for reading, which covers the impact of different data storage choices: Variations of the log-structured merge tree have been used by many new storage servers including HBase, Bigtable, Cassandra and leveldb. These servers append changes (delete, insert, update) to the end of a file rather than in place. To find one row by key value with an LSM the server might have to read from multiple files or multiple locations within one file to find one. I have been calling this the read penalty because a workload is very likely to do more disk reads when using an LSM than when using an update-in-place engine.
  • Werner Vogels on how to create a staticish site: S3 for content, JavaScript-accessible services for the dynamic bits, CMS editing features with Jekyll, Dropbox to replicate templates and blog posts to all machines for local editing.
  • According to Conor O'Mahony, the NoSQL market serves areas not served well by current technologies, augmenting incumbents, and will likely stay small; in some cases SQL will augment NoSQL; SQL's low barrier to entry will keep it dominant; and SQL will innovate too.
  • Data Races at the Processor Level. Bartosz Milewski gives some insight into data races (two simultaneous accesses to the same memory location) at the processor level. Deep dive into x86, sequential consistency, spinlocks, and triangular data races.
  • Database coprocessors are not magic. You still have to figure out how to use them: Coprocessors and batch processing. The architecture of the database determines the granularity at which you can operate.
  • Surprisingly large improvement in CouchDB performance by using async libraries and setting the TCP socket option nodelay to false. Speaking of CouchDB, Curt Monash has a thorough Couchbase business update. 
  • Mutexes are the bane of scalability. Robert Haas recounts such a story in Linux and glibc Scalability: The problem turned out to be that pgbench calls random().  Since random() does not take a random seed as an argument, it has to rely on some kind of global state, and is therefore not inherently thread-safe.  glibc handles this by wrapping a mutex around it - on Linux.
  • Exclusive: How LinkedIn used Node.js and HTML5 to build a better, faster app. Jolie O'Dell interviews Kiran Prasad and we learn about using Node.js: The improvements the team saw were staggering. They went from running 15 servers with 15 instances (virtual servers) on each physical machine, to just four instances that can handle double the traffic.
  • Netflix talks about their continuous build system and deployment system. So does Google in Build in the Cloud: How the Build System works.
  • Urban scaling reveals patterns of growth within cities. This stuff is absolutely fascinating. Core patterns persist through nature and even man-made things like cities. How about the digital realm?
  • SimpleGeo talks about Building a Scalable Geospatial Database on top of Apache Cassandra.
  • ReadWriteWeb with an excellent Data Terminology guide. Good source if you want the big picture in one gulp.
  • Storage is Cheap, Don't Mutate - Copy on Write. Hashbo thinks storage is so cheap now we could avoid a lot of problems (blocking, conflict, caching, history) by simply copying on write. 
  • SmartOS: The Complete Modern Operating System from Joyent. Interesting packaging of Zones, ZFS, DTrace, and KVM for multi-tenant environments.
  • Iqbal Khan with a detailed look at Scalability Bottlenecks with Distributed Caching: Find Scalability Bottlenecks; Code for Performance; Choose the Right Communication Protocol; Use Caching to Improve Client Performance; Distributed Caching for Service Scalability; Storing Session State in a Distributed Cache; Managing Data Relationships in the Cache; Synchronizing the Cache with a Database; Enterprise Service Bus for SOA Scalability; Cache Scalability and High Availability.
  • ScaleBase utilizes two techniques for scaling: read-write splitting and transparent sharding (a technique for massively scaling out relational databases). Transparent sharding is the holy grail of sharding architectures.
  • How does MVCC handle locking? Paul Davis: Reads are never locked. They have concurrent access to the database file and can read to the heart's content. Writes on the other hand aren't locked either, they're just coordinated through a central writer process.
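
The MVCC idea in the last item (readers never lock, writes are funneled through one writer) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how CouchDB is implemented: the class name SnapshotStore and its methods are hypothetical, the "database file" is replaced by an in-memory dict, and the sketch relies on a single attribute read or assignment being atomic in CPython. Each write copies the current version and publishes a new immutable one, which also echoes the copy-on-write item above.

```python
import queue
import threading
from types import MappingProxyType


class SnapshotStore:
    """Hypothetical in-memory store: lock-free reads, one writer thread."""

    def __init__(self):
        # The current version is an immutable, read-only view of a dict.
        self._snapshot = MappingProxyType({})
        # All writes are queued and applied by a single writer thread,
        # so writers are coordinated rather than locked against readers.
        self._writes = queue.Queue()
        writer = threading.Thread(target=self._run_writer, daemon=True)
        writer.start()

    def snapshot(self):
        # Readers grab a reference to the current version; no lock taken.
        # The version they hold never changes underneath them.
        return self._snapshot

    def put(self, key, value):
        # Hand the write to the writer thread and wait until it is visible.
        done = threading.Event()
        self._writes.put((key, value, done))
        done.wait()

    def _run_writer(self):
        while True:
            key, value, done = self._writes.get()
            # Copy on write: build a new version instead of mutating.
            new_version = dict(self._snapshot)
            new_version[key] = value
            # Publishing the new version is a single reference assignment.
            self._snapshot = MappingProxyType(new_version)
            done.set()


store = SnapshotStore()
before = store.snapshot()      # a reader starts here
store.put("doc1", {"rev": 1})  # a write happens meanwhile
after = store.snapshot()       # a later reader sees the new version

print("doc1" in before)  # False: the old snapshot is unchanged
print(after["doc1"])     # {'rev': 1}
```

The trade-off in this sketch is that every write copies the whole dict, which is only acceptable for small data; real MVCC engines typically share unchanged structure between versions rather than copying everything.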