Stuff The Internet Says On Scalability For March 22, 2013

Hey, it's HighScalability time:

  • 1 Billion/Month : Active Mostly Mobile YouTube Users
  • Quotable Quotes:
    • @NeckbeardHacker: "Wait...he reimplemented swatch and rsync in chef, node and mongo?" "Yup." "......Why?" "Go easy on him...he's a Ninjipsterstar."
    • @b6n: Scale myth #1: Your service is a unique snowflake.
    • @polotek: If you don't care about bugs, design or scalability, it only takes 2 days to build anything.
    • @NasHope: Even with the rapid scalability of Rackspace, I've still waited for over an hour for Dominos to actually deliver my pizza.
    • George Dyson: Computers may turn out to be less important as an end product of technological evolution and more important as catalysts facilitating evolutionary processes through the incubation and propagation of self-replicating filaments of code.
    • George Dyson: Von Neumann believed that all fields of science, including pure mathematics, derive their sustenance through contact with real problems in the physical world.
  • For those of us on the outside looking in, Sebastian Stadil gives a rare view from the cool kids table with an early look at How Google Compute Engine stacks up to Amazon EC2. Is it as wonderful as we all imagined? The differences: AWS has a much richer set of services; GCE is on-demand only, so AWS can be cheaper; GCE has faster disk; GCE has faster network IO, especially between datacenters; GCE has faster boot times; GCE can mount read-only partitions on multiple machines; GCE shares images across regions. Sebastian ponders what new architectures Google's feature set will encourage. Interesting was the idea that because the inter-datacenter network IO is so fast, it will be possible to put read slaves in multiple datacenters and replicate in real time.

  • Maybe it's not just about the programmers? At about 25 minutes into DevOps Cafe Episode 39 there's an interesting counter-hype discussion on the downside of schemalessness (strange word), inspired by the article Why MongoDB Never Worked Out at Etsy. For a programmer, schemas suck, but operationally, a schema has surprising value. Field and tech support often rely on dumping tables for debugging. When records are blobs, nobody can inspect the data without a developer mind meld or direct application support.

  • Cool example of iterative performance optimization, with tradeoffs. Single Producer/Consumer lock free Queue step by step: Starting point: Lock free, single writer principle; Set capacity to power of 2, use mask instead of modulo; Use lazySet instead of volatile write; Minimize volatile reads by adding local cache fields; Pad all the hot fields. 
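
To make the "power of 2, use mask instead of modulo" step concrete, here is a minimal sketch in Python. It is not the linked article's code (that is Java), and the class and method names are made up for illustration. It only shows the single-writer layout (each index is advanced by exactly one side) and the mask trick. The lazySet, cached-field, and padding steps are JVM-level optimizations with no direct Python counterpart, so they are left out, and nothing here is a claim about memory ordering in Python.

```python
class SpscRingBuffer:
    """Bounded queue intended for exactly one producer and one consumer.

    Hypothetical illustration: names are not taken from the linked article.
    """

    def __init__(self, capacity):
        # A power-of-two capacity lets us replace (index % capacity)
        # with the cheaper (index & mask).
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._mask = capacity - 1
        self._buffer = [None] * capacity
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer

    def offer(self, item):
        """Producer side: return False if the queue is full."""
        if self._tail - self._head == len(self._buffer):
            return False
        self._buffer[self._tail & self._mask] = item
        self._tail += 1
        return True

    def poll(self):
        """Consumer side: return None if the queue is empty."""
        if self._head == self._tail:
            return None
        index = self._head & self._mask
        item = self._buffer[index]
        self._buffer[index] = None  # drop the reference so it can be collected
        self._head += 1
        return item
```

The indices grow without bound and are only masked when touching the array, so "full" is simply tail minus head equal to capacity.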

  • NuoDB took Google Compute Engine for a test drive on a 32-node cluster. Result on YCSB: 1.8 million transactions per second with one millisecond latencies (no journaling). Good description of the setup process. Looks straightforward enough.

  • Hey Bay Area people, there's a new meetup you might be interested in: Bay Area Systems-Programming Interest Group

  • The first Meetup? Arnold Schoenberg's Society for Private Musical Performances: pantonal, you never know what the topic will be, no critics, and no applause.

  • Scaling TextRazor in the Cloud with Nginx and Lua. They settled on a hybrid cloud/bare-metal approach because Amazon XL EC2 instances are expensive and underperform, yet they want to be in the cloud for testing, etc. Their routing system runs through EC2, with the routing logic built using Nginx scripted with Lua via the ngx_lua plugin. Excellent example configuration files for the DIYer.

  • Pinboard is thinking of charging for its API. Good Google Groups discussion. The idea is to create "a road tax that then pays for further construction, repair, and people in orange vests." Some benefits of tying income to resource use: clients will use resources more carefully; there's an incentive to expand the API. Scripts hammer APIs. Adding a search feature through the API, for example, would require a lot more resources (people, machines, etc) that need to be paid for. Sounds quite reasonable.

  • I've said to myself, self, I want a private cloud. How would I build one? No answer. Ryan Geyer has an answer: Building A Rackspace Private Cloud … In My Garage. Download some ISOs, use Chef to configure nodes for their roles, and plug into RightScale to manage it all. Is it easy and functional enough to replace the Public Cloud?

  • Combining uniqueness and order is always a problem: Guaranteeing globally unique TimeUUID's in a high throughput distributed system
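
As a rough illustration of the problem, and not necessarily the linked article's solution, here is a sketch of one common way to keep time-based (version 1) UUIDs unique within a process: never hand out the same 100 ns tick twice, and rely on a distinct node id per generator for uniqueness across machines. The class name, the `node` and `clock_seq` values, and the injectable `clock` parameter are all assumptions made for the example.

```python
import threading
import time
import uuid

# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


class TimeUUIDGenerator:
    """Hypothetical generator of version 1 UUIDs with strictly increasing timestamps."""

    def __init__(self, node, clock_seq=0, clock=time.time_ns):
        self._node = node & 0xFFFFFFFFFFFF   # 48-bit node id, assumed unique per generator
        self._clock_seq = clock_seq & 0x3FFF  # 14-bit clock sequence
        self._clock = clock                   # returns nanoseconds since the Unix epoch
        self._last_ticks = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            ticks = self._clock() // 100 + UUID_EPOCH_OFFSET
            # If the clock did not advance (or went backwards), bump by one
            # tick so that no timestamp is ever issued twice.
            if ticks <= self._last_ticks:
                ticks = self._last_ticks + 1
            self._last_ticks = ticks
        return uuid.UUID(
            fields=(
                ticks & 0xFFFFFFFF,          # time_low
                (ticks >> 32) & 0xFFFF,      # time_mid
                (ticks >> 48) & 0x0FFF,      # time_hi (version bits set below)
                self._clock_seq >> 8,        # clock_seq_hi (variant bits set below)
                self._clock_seq & 0xFF,      # clock_seq_low
                self._node,
            ),
            version=1,
        )
```

The tradeoff: under a sustained burst the issued timestamps drift ahead of the wall clock, which is usually acceptable when the timestamp is used for ordering rather than as an exact time.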

  • Even if the list is debatable, the lesson is clear: craft your alliances carefully. 5 Products That Should Fear Google’s Next Killing Spree

  • Mathematical Definition, Mapping, and Detection of (Anti)Fragility: We propose a detection of fragility, robustness, and antifragility using a single "fast-and-frugal", model-free, probability free heuristic that also picks up exposure to model error. The heuristic lends itself to immediate implementation, and uncovers hidden risks related to company size, forecasting problems, and bank tail exposures.

  • Facebook's Jay Parikh: Everybody is dealing with scale today, and it's getting to be a more difficult challenge in terms of the amount of data that people want to collect and analyze. Sometimes companies are collecting data and they don't know what to do with it yet, or they're collecting data that they don't even know they have. The fundamental problems are how do you store it, how do you process it and how do you derive useful insights?

  • Clear and useful exploration of replication lag: MongoDB: Replication Lag and the Facts of Life. Minimize lag by having enough horsepower, adjusting your write concern, delaying index creation on secondaries, using non-blocking backups, and checking for errors. Also remember that replication requires unique indexes.

  • Taba: Low Latency Event Aggregation. Good description of how 3 servers aggregate 10 million events per minute in support of monitoring, real-time feedback, and dashboards. Clients feed agents, which feed servers, which dump everything into Redis. A virtual-bucket type sharding layer is used in front of Redis. Events aren't stored directly, as that would take too much storage. Intermediate results are stored instead, which are smaller and faster to query.
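
For anyone who hasn't run into virtual-bucket sharding before, here is a minimal sketch of the general idea. It is not Taba's implementation; the class name, the server labels, the bucket count, and the choice of CRC32 as the hash are all assumptions for illustration. Keys hash to a fixed number of buckets, and a small table maps buckets to servers, so rebalancing means reassigning buckets rather than rehashing every key.

```python
import zlib


class VirtualBucketRouter:
    """Hypothetical key -> bucket -> server router."""

    def __init__(self, servers, num_buckets=1024):
        if not servers:
            raise ValueError("at least one server is required")
        self._num_buckets = num_buckets
        # Initial assignment: spread buckets round-robin over the servers.
        self._bucket_to_server = [
            servers[i % len(servers)] for i in range(num_buckets)
        ]

    def bucket_for(self, key):
        # CRC32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(key.encode("utf-8")) % self._num_buckets

    def server_for(self, key):
        return self._bucket_to_server[self.bucket_for(key)]

    def move_bucket(self, bucket, server):
        # Rebalancing touches only the keys that live in this one bucket.
        self._bucket_to_server[bucket] = server
```

Because the bucket count never changes, adding a server only requires moving some buckets (and their data) to it; keys in every other bucket keep routing exactly as before.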

  • NoDB: Efficient Query Execution on Raw Data Files: Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which do not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine.

  • Gumball: A race condition prevention technique for cache augmented SQL Database Management systems - Developed: GT is a simple technique that acknowledges that while caching system increase scalability, it adds an overhead of maintaining consistency in data between KVS and RDBMS. The technique is simple to implement and requires no specific feature from RDBMS or KVS, thus translating to better rate of returns via increased reduction in stale data.

  • petewarnock on Why is FastCGI /w Nginx so much faster than Apache /w mod_php?: The bigger issue is the memory footprint. If the server hits its swap, the performance will degrade. Nginx facilitates much larger scale on smaller, more economical hardware by using a smaller, more predictable amount of memory under load. It's a SWAG, but Nginx might be a little slower because each request has multiple asynchronous events; I think along the lines of send upstream and listen upstream.