Stuff The Internet Says On Scalability For February 14th, 2014

Hey, it's HighScalability time:


Climbing the World's Second Tallest Building

  • 5 billion: number of phone records the NSA collects per day.
  • Facebook: 1.23 billion users, 201.6 billion friend connections, 400 billion shared photos, and 7.8 trillion messages sent since the start of 2012.
  • Quotable Quotes:
    • @ShrikanthSS: people repeatedly underestimate the cost of busy waits
    • @mcclure111: Learning today java.net.URL.equals is a blocking operation that hits the network shook me badly. I don't know if I can trust the world now.
    • @hui_kenneth: @randybias: “3 ways 2 be market leader - be 1st, be best, or be cheapest. #AWS was all 3. Now #googlecloud may be best & is the cheapest.”
    • @thijs: The nice thing about Paper is that we can point out to clients that it took 18 experienced designers and developers two years to build.
    • @neil_conway: My guess is that the split between Spanner and F1 is a great example of Conway's Law.
  • How Facebook built the real-time posts search feature of Graph Search. It's a big problem: one billion new posts are added every day, and the posts index contains more than one trillion posts in total, comprising hundreds of terabytes of data.

  • Chartbeat Engineering shares some of their experiences in two excellent articles: Part 1, Part 2. Lessons: DNS is not a great means of load balancing traffic; modifying sysctl values from their defaults can be important for reliability; graphing metrics is your friend. Through TCP tuning and AWS Elastic Load Balancer they decreased response time by 98.5% and shrank their front-end server footprint by 20%; enabling cross-zone load balancing got their request count distribution extremely well balanced; and they plan to move from the m1.large instance type to the c3.large, which is almost 50% cheaper, provides more compute units, and yields slightly better response times.
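    Sysctl tuning usually starts with looking at what the kernel is currently set to. A read-only Python sketch of a few commonly tuned TCP knobs (these are illustrative choices, not necessarily the values Chartbeat changed; see their articles for the real list):

    ```python
    # Print a few commonly tuned Linux TCP sysctls (read-only sketch).
    from pathlib import Path

    SYSCTLS = [
        "net/core/somaxconn",            # cap on the listen() backlog
        "net/ipv4/tcp_max_syn_backlog",  # queue of half-open connections
        "net/ipv4/tcp_fin_timeout",      # lifetime of FIN-WAIT-2 sockets
    ]

    for name in SYSCTLS:
        path = Path("/proc/sys") / name
        try:
            print(name.replace("/", "."), "=", path.read_text().strip())
        except OSError:
            print(name.replace("/", "."), "is not available on this system")
    ```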

  • Creating a resilient organization is a little like getting an allergy shot: you have to take in a little of what ails you to boost your immune system. That's the idea behind DiRT, Google's Disaster Recovery Testing event. Weathering the Unexpected tells the story of how far Google goes to improve their corporate immune system with disaster scenarios, which can range from a walk-through of a backup restore to a company-wide zombie attack simulation. More here and here.

  • 37signals shows the power of focus by shedding all their products except Basecamp, even renaming the company Basecamp. A company can grow wild unless pruned and shaped to let in the maximum amount of sunlight, growing the most and ripest fruit. While a hard prune is common in the orchard, it's not so common in an organization. A very brave move.

  • When I suggested this I was laughed at. So there! Patch Panels in the Sky: A Case for Free-Space Optics in Data Centers: We explore the vision of an all-wireless inter-rack datacenter fabric.

  • Scaling Meteor: The Challenges of Real-time Apps. Highlights the distinction between traditional web apps, single-page web apps, and real-time web apps, where the server pushes data to the client. What to push, and how to push it, is the crux. The idea is to spot changes by tailing the database log and use a per-client query to decide which changes to send. Approaches like this have trouble finding the right balance between sending too much data to the client and putting so much complexity on the filter side, to send exactly the right thing, that it becomes hard to scale. Still, it's generic and does take complexity out of the application code.
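    The tail-and-filter loop is simple at its core. A toy sketch of the idea (the oplog source, the query predicate, and the push function are all hypothetical stand-ins; in Meteor the source is the MongoDB oplog):

    ```python
    # Tail a stream of change events, keep only those matching a client's
    # query, and push the matches to that client.
    from typing import Callable, Dict, Iterable

    Change = Dict  # e.g. {"collection": "posts", "op": "insert", "doc": {...}}

    def tail_and_push(oplog: Iterable[Change],
                      query: Callable[[Change], bool],
                      push_to_client: Callable[[Change], None]) -> None:
        for change in oplog:
            if query(change):          # server-side filter: send only matches
                push_to_client(change)

    # Usage: push only new posts in room 42.
    changes = [
        {"collection": "posts", "op": "insert", "doc": {"room": 42, "text": "hi"}},
        {"collection": "posts", "op": "insert", "doc": {"room": 7,  "text": "no"}},
    ]
    tail_and_push(changes,
                  query=lambda c: c["doc"].get("room") == 42,
                  push_to_client=lambda c: print("push:", c["doc"]["text"]))
    ```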

  • How Heroku made builds 40% faster. Instrument to see where you are slow; use parallel and pipelined HTTP GETs to improve file transfers; be more efficient about how dependencies are downloaded, make better use of the build cache, and pre-fetch common dependencies.
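    Parallel ranged GETs are easy to sketch: split an object into byte ranges and fetch them concurrently. A minimal Python version (the URL, chunk size, and worker count are illustrative; in practice the object size would come from a HEAD request's Content-Length):

    ```python
    # Fetch one object as several concurrent byte-range requests.
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch_range(url: str, start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:  # expects 206 Partial Content
            return resp.read()

    def parallel_get(url: str, size: int, chunk: int = 1 << 20) -> bytes:
        ranges = [(i, min(i + chunk, size) - 1) for i in range(0, size, chunk)]
        with ThreadPoolExecutor(max_workers=8) as pool:
            parts = pool.map(lambda r: fetch_range(url, *r), ranges)
        return b"".join(parts)  # map() preserves order, so parts line up
    ```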

  • Simulating a virtual game world with Web Actors. Looks relatively simple and clean. Nice example.

  • A crazy detailed look into the effects of "final" in Java: When I say final, I mean FINAL! The practical tip: make a local variable reference to get the best code generated. The result is a 2%-10% increase in performance.

  • Google's Secret Weapon Against Amazon: Blistering Fast Networks. Is that enough? Or is it just a niche, and will features win the day?

  • A Decade of Building Facebook. In 2004 thefacebook.com rents its first server for $85. 2007: the platform launches. 2009: first datacenter. 2010: HipHop. 2011: HHVM; the Open Compute Project is founded. 2012: code pushed twice a day by hundreds of engineers; native mobile apps. 2013: Graph Search.

  • Yet another TCP-in-UDP multiplexing protocol to rule them all. This time it's by Google and used by Chrome, so it just might work. QUIC – next-generation multiplexed transport over UDP (video, design). Implemented in user space, it supports encryption, implements retransmission, lowers latency, etc.
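    The "reliability in user space" idea can be gestured at in a few lines. A toy stop-and-wait retransmitter over UDP (real QUIC is vastly more sophisticated: streams, encryption, congestion control; the peer address, timeout, and framing here are illustrative):

    ```python
    # Send one payload over UDP, retransmitting until a matching ACK arrives.
    import socket

    def send_reliably(sock: socket.socket, peer, payload: bytes,
                      seq: int, timeout: float = 0.2, retries: int = 5) -> bool:
        packet = seq.to_bytes(4, "big") + payload
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(packet, peer)          # transmit (or retransmit)
            try:
                ack, _ = sock.recvfrom(4)      # wait for a 4-byte ACK
                if int.from_bytes(ack, "big") == seq:
                    return True                # acknowledged
            except socket.timeout:
                continue                       # lost packet or ACK: retry
        return False

    # sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # send_reliably(sock, ("127.0.0.1", 9999), b"hello", seq=1)
    ```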

  • Great list of wins and losses when using Elasticsearch as a NoSQL Database. No sense of transactions or rollback; it's fast; delayed index visibility for updates; flexible schema; it's a document database so you must denormalize your data; doesn't gracefully handle some errors, like out-of-memory errors; highly distributed and easy to scale out to massive data sets on commodity hardware; lacks security features. In short, there's no one database to rule them all, but if you're willing to trade traditional database virtues for search, it's a good solution.
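    Denormalization is the big mental shift. A sketch using the elasticsearch-py client (the index name, field names, and embedded-author shape are made up, and client details vary by version):

    ```python
    # No joins in Elasticsearch: embed the author inside each post document
    # instead of referencing a users table.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a local node on :9200

    post = {
        "title": "Scaling search",
        "body": "Lessons from the trenches...",
        "author": {          # embedded copy, not a foreign key: duplicated on
            "id": 42,        # every post by this author, and it must be
            "name": "Ada",   # re-written everywhere if the name changes
        },
    }
    es.index(index="posts", id=1, body=post)
    ```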

  • DataStax Enterprise Reference Architecture. Though this is specifically for Cassandra, the same sorts of issues are worth considering for anyone building out their architecture. Also, What is the story with AWS storage?

  • Interesting: Olympics video from Sochi is being served from Azure.

  • Good survey by Hakka Labs of different BigData setups at various companies as seen in the wild: Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline (Examples from Stripe, Tapad, Etsy & Square). Many different approaches to learn from.

  • Where to put all that data? Efficient and Scalable Off-Site Backup to Amazon Glacier. Found the pricing confusing; upload costs add up, so bundle the uploads; they have a cool restore system; getting files into Glacier is very fast; it's not good for daily backups because you're charged for three months of storage even if you delete a file a day later.
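    The "bundle the uploads" advice in sketch form: pack many small files into one archive so per-upload overhead is paid once. Paths are illustrative, and the actual Glacier upload (e.g. via boto) is left out:

    ```python
    # Pack many small files into one tar.gz so Glacier sees one big archive
    # instead of N tiny uploads.
    import tarfile
    from pathlib import Path

    def bundle(files: list[Path], archive: Path) -> Path:
        with tarfile.open(archive, "w:gz") as tar:
            for f in files:
                tar.add(f, arcname=f.name)  # one archive instead of N uploads
        return archive

    # bundle(list(Path("backups/2014-02-14").glob("*")), Path("2014-02-14.tar.gz"))
    ```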

  • How to put the zoom in Go. You'll find the usual suspects: batch work, remove locks, increase page sizes, reduce GC pressure, use lock-free algorithms. Looks like a good set of improvements.

  • Why is Your SQL Server Slow?: stop using index hints. If you have a query that is not using the optimal plan, don't just take a shortcut and use a hint. Figure out why it's not getting the optimal plan and fix that. Index hints don't address the core problem.

  • Handling Growth with Postgres: 5 Tips From Instagram: partial indexes; functional indexes; pg_reorg for compaction; WAL-E for WAL archiving and backups; autocommit mode and async mode in psycopg2.
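    Two of those tips fit one sketch: psycopg2's autocommit mode, needed here because CREATE INDEX CONCURRENTLY cannot run inside a transaction, plus a partial index that covers only the rows queries actually touch. The table and column names are hypothetical:

    ```python
    # Build a partial index without blocking writes.
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    conn.autocommit = True  # CONCURRENTLY can't run inside a transaction

    with conn.cursor() as cur:
        cur.execute("""
            CREATE INDEX CONCURRENTLY idx_photos_featured
            ON photos (created_at)
            WHERE is_featured  -- partial: index only the rows queries touch
        """)
    conn.close()
    ```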

  • WProf: a lightweight in-browser profiler that produces a detailed dependency graph of the activities that make up a pageload. We find that computation is a significant factor that makes up as much as 35% of the critical path, and that synchronous JavaScript plays a significant role in page load time by blocking HTML parsing. Caching reduces page load time, but the reduction is not proportional to the number of cached objects, because most object loads are not on the critical path. SPDY reduces page load time only for networks with high RTTs and mod_pagespeed helps little on an average page.

  • I heart this Post Mortem from Paperless Post. When most of your traffic comes in one period, like Valentine's Day, you have conflicting goals: get everything ready ahead of time, or keep fixing things right up to the event, which means you may not be prepared for the crush of users.

  • NPR shows how to add a little dynamicity to a static website in Complex But Not Dynamic: Using A Static Site To Crowdsource Playgrounds. Looks like a solid approach and the advantages are clear: Static sites with asynchronous architectures stay up under great load, cost very little to deploy, and have low maintenance burden.

  • Stefan Wrobel with a detailed experience report of moving from Heroku to OpsWorks: We had a positive experience moving Cult Cosmetics, a Rails app, from Heroku to OpsWorks, with response times reduced by ~40%. If you're thinking about making this move, DO IT, but realize that it will probably require more time & effort than you expected.

  • The Anatomy of a Great Stack Overflow Question (After Analyzing 10,000). Knowing how to ask a good question is often the difference between getting stuff done today and being stuck for a week, so there's some good advice here. Though it does bring up the age-old question: do great people make the times, or do the times make great people? The biggest surprise is that the time you ask a question doesn't seem to make a difference.

  • So selling stuff is the best way to make money? 7,000 app developers in 127 countries say e-commerce is now the best mobile monetization strategy

  • This is on the content side of things, not the backend; the challenge here is everyone contributing to a site. Scaling Asana.com: all content is in Markdown; the Markdown lives in Statamic, a file-based CMS; changes are made in GitHub; the staging server updates in realtime; the site is tested with CasperJS; deploys happen daily.

  • Causality is expensive (and what to do about it): In this post, I briefly motivate the use of causality in distributed systems, discuss (likely) fundamental lower bounds on metadata overheads required to capture it, and discuss four strategies for circumventing these overheads.
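    To see why the metadata is expensive, consider vector clocks, the classic way to capture causality: one counter per process, so the metadata grows with the number of participants. A minimal sketch of increment, merge-on-receive, and the happens-before check:

    ```python
    # Vector clocks: causality metadata that grows with the process count.
    class VectorClock:
        def __init__(self):
            self.clock = {}  # process id -> counter

        def tick(self, pid: str) -> None:
            self.clock[pid] = self.clock.get(pid, 0) + 1  # local event

        def merge(self, other: "VectorClock") -> None:
            for pid, n in other.clock.items():            # on message receive
                self.clock[pid] = max(self.clock.get(pid, 0), n)

        def happened_before(self, other: "VectorClock") -> bool:
            return (all(n <= other.clock.get(pid, 0)
                        for pid, n in self.clock.items())
                    and self.clock != other.clock)

    a, b = VectorClock(), VectorClock()
    a.tick("A")               # A does something...
    b.merge(a); b.tick("B")   # ...then tells B, which does something
    assert a.happened_before(b)
    ```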

  • The scalable commutativity rule: Designing scalable software for multicore processors.  Murat says: The paper is a tour de force. The authors developed a tool called COMMUTER that accepts high-level interface models and generates tests of operations that commute and hence could scale. The tool uses a combination of symbolic and concolic execution, and generates test cases for an arbitrary implementation based on a model of that implementation's interface.
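    A toy distillation of what a COMMUTER-style check does: two operations commute on a state if applying them in either order yields the same state and the same per-operation results, and commuting operations are the candidates for conflict-free scaling. The key/value examples are made up:

    ```python
    # Check whether two operations commute on a given state.
    import copy

    def commutes(state, op1, op2) -> bool:
        s1, s2 = copy.deepcopy(state), copy.deepcopy(state)
        r1 = (op1(s1), op2(s1))   # order: op1 then op2
        r2 = (op2(s2), op1(s2))   # order: op2 then op1
        return s1 == s2 and r1 == (r2[1], r2[0])

    set_ = lambda key, val: (lambda st: st.__setitem__(key, val))
    print(commutes({}, set_("a", 1), set_("b", 2)))  # True: different keys
    print(commutes({}, set_("a", 1), set_("a", 2)))  # False: last write wins
    ```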

  • Parallel Parking: performance increased almost 10x, simply by parking when we have a CAS failure due to parallel threads trying to write at the same time. So sleeping can actually make your program run faster?
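    Python has no user-level CAS, so this sketch substitutes a non-blocking lock acquire for the CAS; the pattern is the same: on contention, park briefly instead of spinning:

    ```python
    # Back off (park) on contention instead of spinning.
    import threading, time

    counter = 0
    lock = threading.Lock()

    def add(n: int) -> None:
        global counter
        for _ in range(n):
            while not lock.acquire(blocking=False):  # "CAS failed": contention
                time.sleep(0.000001)                 # park briefly, don't spin
            try:
                counter += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)  # 40000
    ```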

  • Data Efficiency at Scale: Essentially, this form of deduplication means trading a write of a duplicate chunk for a read. Depending on the design of the underlying block virtualization layer, duplicate chunks may be widely dispersed throughout the system. In that case, the bigger the system gets, the more expensive reads get, so processing of duplicate data becomes slower and slower as the storage system fills. This is why you won't find many 100 TB NetApp file systems with deduplication turned on, certainly not for primary storage applications: the system would be flooded with random read requests, and NetApp's deduplication process can end up taking months or years, or never completing.
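    The write-for-read trade at the heart of deduplication, as an in-memory toy: hash each chunk before writing, and if it's already stored, just reference it (in a real system that check, and the later reads, hit widely dispersed disk locations):

    ```python
    # Content-addressed chunk store: duplicate writes become lookups.
    import hashlib

    class ChunkStore:
        def __init__(self):
            self.chunks = {}  # sha256 digest -> chunk bytes

        def write(self, data: bytes) -> str:
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.chunks:   # new chunk: pay the write
                self.chunks[digest] = data
            return digest                   # duplicate: just reference it

    store = ChunkStore()
    a = store.write(b"same bytes")
    b = store.write(b"same bytes")  # deduplicated: no second copy stored
    assert a == b and len(store.chunks) == 1
    ```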

  • Coordination-Avoiding Database Systems: In this paper, we identify a necessary and sufficient condition for achieving coordination-free execution without violating application-level consistency: invariant confluence. By explicitly considering application-level invariants, invariant confluence analysis allows databases to coordinate between operations only when anomalies that might violate invariants are possible. This provides a formal basis for coordination-avoiding database systems, which coordinate only when it is necessary to do so. We demonstrate the utility of invariant confluence analysis on a subset of SQL and via a coordination-avoiding proof-of-concept database prototype that scales linearly to over a million TPC-C New-Order transactions per second on 100 servers.

  • Embassies: Radically Refactoring the Web: We reenvision the web interface based on the notion of a pico-datacenter, the client-side version of a shared server datacenter. Mutually untrusting vendors run their code on the user’s computer in low-level native code containers that communicate with the outside world only via IP. Just as in the cloud datacenter, the simple semantics makes isolation tractable, yet native code gives vendors the freedom to run any software stack. Since the datacenter model is designed to be robust to malicious tenants, it is never dangerous for the user to click a link and invite a possibly-hostile party onto the client.

  • Stochastic Forecasts Achieve High Throughput and Low Delay over Cellular Networks: This paper has many interesting design surprises. One of the surprising observations in this paper is that link speeds change dramatically with time, and current transport protocols build up multi-second queues in network gateways, thereby yielding low throughputs and highly variable packet delays. This paper proposes Sprout, an end-to-end transport protocol primarily designed for interactive applications to function well over highly variable cellular network channels. The second surprising design choice in Sprout is that it does not function anything like any known TCP-style congestion control protocol; rather, it learns a stochastic model of the cellular channel based on packet arrival times, and uses the model to quickly adapt the transmission rate of a flow. Sprout specifically aims to avoid the queue buildup problem experienced by existing transport protocols. The third surprising aspect about the paper is the evaluation. Across four different cellular networks, the authors show that Sprout achieved a factor of 7-9 reduction in the end-to-end self-inflicted delay, and a factor of 2-4 enhancement in the bit rate throughput, for 3 popular interactive applications: Google Hangout, Skype and Apple FaceTime. The authors also show that Sprout matches or outperforms TCP Cubic coupled with an active queue management technique that requires changes to the cellular network. These surprises put together make this paper a very interesting read.

  • BlinkDB - Queries with Bounded Errors and Bounded Response Times on Very Large Data: a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have evaluated BlinkDB on the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., and are in the process of deploying it at Facebook Inc. More here.
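    The sample-and-annotate idea in miniature: answer an aggregate from a random sample and attach a normal-approximation error bar. The data and sample size below are made up, and BlinkDB's stratified sampling is far more careful than this:

    ```python
    # Estimate a mean from a sample and report a 95% confidence interval.
    import random, statistics

    population = [random.gauss(100, 15) for _ in range(1_000_000)]

    sample = random.sample(population, 10_000)  # query the sample, not the data
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    print(f"mean ~ {mean:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
    ```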

  • Another thoughtful set of Quick Links from Geeking with Greg.