Stuff The Internet Says On Scalability For December 13th, 2013

High Scalability

13 Dec 2013 — 8 min read

Hey, it's HighScalability time:

Test your sense of scale. Is this image of something microscopic or macroscopic? Find out.

80 billion: Netflix logging events per day; 10 petabytes: Ancestry.com data; six million: Foursquare checkins per day;
Quotable Quotes:
- George Lakoff: Why can't all your thoughts be conscious? Because consciousness is linear and your brain is parallel. The linear structure of consciousness could never keep up.
- @peakscale: "Engineers like to solve problems. If there are no problems handily available, they will create their own problems" - Scott Adams
- @kiwipom: “Immutability is magic pixie dust that makes distributed systems work” - Adrian Cockcroft
- @LachM: Netflix: SPEED at SCALE = breaks EVERYTHING. #yow13
- Joe Landman: … you get really annoyed at the performance of grep on file IO (seriously folks? 32k or page size sized IO? What is this … 1992?) so you rewrite it in 20 minutes in Perl, and increase the performance by 5-8x or so.
- @rjrogers87: "Goldman Sachs has 36,000 employees, 6,000 are developers. They support these folks w/half-a million cores. #GartnerDC"
- @KentLangley: Dear Amazon AWS, Please STOP with the aggressive reserved instances sales push. I use on-demand ON PURPOSE.

Good story of how nearby.lk moved from Google App Engine to EC2, nodejs, and mongodb. The migration decision was based on ease of development, performance, and cost. Problems with GAE: slow database operations, costly over-provisioning of instances, slow startup times that lead to timeouts, a low 1MB memcache size limit, bulk loads and exports of data that are a nightmare, and slow search. Liked about nodejs: one programming language on client and server, portability, speed. Liked about GAE: everything is visible in one place in the management console, deployment is easy, and it is easy to create new code and test it out.

Wikipedia's Order of Magnitude page: The pressure of a human bite is about 1/9th of the atmospheric pressure on Venus. The fastest bacterium on earth is just outstripping the fastest glacier. A square meter of sunshine in the spring imparts about 1 horsepower.

Peter Bailis's On Consistency and Durability discusses some points from a Redis bashing thread on the Redis email list. To be on the Internet is to be bashed, so you can't worry about that too much, but illuminating discussions often result, and that is the case here.

As they say, when Bill Gates walks into a bar the average income of all the revelers becomes worth partying about. The Most Misleading Measure of Response Time: Average. Calculating the 99th percentile, they found a balanced CDN architecture with Akamai and EdgeCast resulted in a "major reduction in 99th percentile response times across the board." Resilience and better performance, ah, that feels good. Lots of pretty graphs. See also, jsDelivr.
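To make the Bill Gates effect concrete, here is a minimal sketch comparing the mean with the 99th percentile. The response times are made-up numbers, not data from the article, and the `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Ceiling division with integers avoids floating point surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: 98 fast requests, 2 slow ones.
response_times_ms = [100] * 98 + [5000] * 2

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)
p99_ms = percentile(response_times_ms, 99)

print(f"mean={mean_ms}ms median={median_ms}ms p99={p99_ms}ms")
```

With these numbers the mean describes a request that nobody actually experienced, while the 99th percentile shows how bad the slow tail is.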

You think all those silly pictures people take are worthless? Not so. When analyzed they reveal tremendous amounts of monetizable data about you, where you like to go, what you like to do, and who you like to be with. Given the size of pictures isn't this sort of analysis expensive? Maybe not: How to analyze 100 million images for $624. For $12 an m1.xlarge server can process two million Instagram-sized photos per day. It is built on open source: Hadoop, HIPI to efficiently process images in a Hadoop cluster, and OpenCV for object recognition. Very cool. The comments suggest using spot instances or SSD as an end around the local IO limit.
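A quick back-of-the-envelope check of those figures, assuming the $12 is the cost of one m1.xlarge server for one day (that reading is an assumption, as are the variable names):

```python
# Figures quoted in the summary above.
cost_per_server_day_usd = 12          # assumed to mean one m1.xlarge for one day
images_per_server_day = 2_000_000
total_images = 100_000_000

server_days = total_images / images_per_server_day
estimated_cost_usd = server_days * cost_per_server_day_usd

print(f"{server_days:.0f} server-days, about ${estimated_cost_usd:.0f}")
```

That lands close to, but not exactly on, the $624 in the title. The summary here does not say what accounts for the difference.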

In a nod towards Aspect Oriented Programming, Google is instrumenting key calls in the browser using the new Resource Timing API. "It allows you to retrieve and analyze a detailed profile of all the critical network timing information for each resource on the page." Ilya Grigorik dishes the details at Measuring network performance with Resource Timing API.

The always thoughtful Michael Bernstein with a nuanced exploration on Why We Need Explicit State. Most interesting to me is the linkage of time and state: complexity comes along with state and that state is a function of time. IMHO, for logic to have a field of action requires both time and space to operate within, so that complexity is inescapable and is only hidden by selectively emphasizing one over the other.

Yep, flash is magic sauce. OLTP goes from 20K TPS to over 80K TPS. Creating Time and Minting Money - Your Business on Flash.

DataStax tested Google Compute Engine across 100 nodes in two physical zones. Results: excellent.

Gorgeous pictures of Google's Data Centers. What a difference a hundred years makes.

Cassandra transactions go on a diet. New features in Cassandra 2.0 – More on Lightweight Transactions. Compare-and-set in a distributed quorum based architecture is far trickier than it may first appear. FoundationDB helpfully points out why these are not real transactions: If your application uses CAS for any common or important operation, it has effectively chosen CAP Consistency over CAP Availability and will not be available in minority partitions. You are getting, potentially, the worst of both worlds - most of your application (and the users thereof) has to deal with all the concurrency and consistency problems of an eventually consistent database, and no better availability than you could get with a fully transactional database.
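For readers who have not met compare-and-set before, here is a minimal single-process sketch of the semantics. It is not how Cassandra implements it: a lock stands in for the agreement that a distributed database has to reach across a quorum of replicas, which is exactly the tricky part. The `Register` class and the username scenario are hypothetical.

```python
import threading


class Register:
    """A single in-memory value with an atomic compare-and-set operation."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        """Write `new` only if the current value equals `expected`.

        Returns True if the write was applied, False otherwise.
        """
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


# Two hypothetical clients both saw "free" and now race to claim the name.
owner = Register("free")
first = owner.compare_and_set("free", "alice")
second = owner.compare_and_set("free", "bob")

print(first, second, owner.read())
```

The second caller loses because the value it expected is no longer there, so it has to re-read and decide what to do next.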

High bandwidth does not mean low latency. Ilya Grigorik: My LTE + Nexus 5 connection gets better throughput than my wired Comcast plan! I wish the connection was symmetric (upload vs. download), but 13Mbps uplink is nonetheless a good start - it's enough to stream HD video directly from my phone! On the other hand, latency is about 2x compared to wired connection. This isn't surprising, but nonetheless, as far as mobile carrier networks go, ~50ms is directly in the target zone for LTE.

LinkedIn is now HTTPS by default. Excellent description of their migration process and settings that might be useful if you are considering the big switch.

How to feed the beast. Data locality, or how I made a benchmark 50x faster just by rearranging some memory: We’ve been lied to. They keep showing us charts where CPU speed goes up and up every year as if Moore’s Law isn’t just a historical observation but some kind of divine right. Without lifting a finger, we software folks watch our programs magically accelerate just by virtue of new hardware. With today’s hardware, it can take hundreds of cycles to fetch a byte of data from RAM. If most instructions need data, and it takes hundreds of cycles to get it, how is it that our CPUs aren’t sitting idle 99% of the time waiting for data?

Why does performance always seem to be an afterthought with developers? Asked and answered in The Right Stuff: Breaking the PageSpeed Barrier with Bootstrap, an awesome breakdown of the steps required to make a page fast. The example takes a page render from 833ms down to 151ms. The process, however, is complex. If you are in a constant state of pumping out changes and you have a small team, how practical is it to do this sort of tuning continuously? Continuous deployment is not often paired with continuous performance optimization.

LevelDB versus MySQL: The numbers show a clear winner: MySQL. I guess that using a ‘traditional’ RDBMS for my historical data is not such a bad idea after all…

37signals with a concise overview of how they handle Ajax operations with Server-generated JavaScript Responses: Server creates or updates a model object; Server generates a JavaScript response that includes the updated HTML template for the model; Client evaluates the JavaScript returned by the server, which then updates the DOM. Server side templating is one of those hard design decisions. It reduces duplicate code but also seems a waste of a good browser. Good comments on the design tradeoffs.
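The three steps above can be sketched roughly as follows. This is an illustration of the pattern only, written in Python rather than the stack 37signals uses; the function names, the `#comments` selector, and the comment model are all hypothetical.

```python
import html
import json


def render_comment_html(comment):
    """Render the HTML fragment for one comment, escaping user-supplied text."""
    return '<li id="comment_{id}">{body}</li>'.format(
        id=int(comment["id"]),
        body=html.escape(comment["body"]),
    )


def javascript_response(comment):
    """Build the JavaScript the client will evaluate to insert the comment."""
    fragment = render_comment_html(comment)
    target = 'document.querySelector("#comments")'
    # json.dumps turns the fragment into a quoted, escaped string literal.
    literal = json.dumps(fragment)
    return target + '.insertAdjacentHTML("beforeend", ' + literal + ");"


# Step 1: the server creates a (hypothetical) model object.
comment = {"id": 7, "body": "Nice <b>post</b>"}
# Step 2: the server generates a JavaScript response containing the HTML.
script = javascript_response(comment)
# Step 3 happens in the browser, which evaluates `script` to update the DOM.
print(script)
```

The template lives only on the server, which is the duplicate-code saving mentioned above; the cost is that the browser does nothing but evaluate what it is sent.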

In the ever changing CDN landscape you can count on Dan Rayburn to help get your bearings and find your way: Here’s What The Current CDN Landscape Looks Like, With List Of Vendors.

Updated IaaS Pricing Patterns and Trends: The high correlation in available memory/hourly costs indicates a shared understanding of the importance of memory pricing; Google is the most aggressive in terms of the memory per hourly unit of cost; HP is apparently pegging itself to AWS; Softlayer and, to a lesser extent, Rackspace, are likely to be less competitive for memory focused buyers; Amazon remains the standard against which other programs are judged and/or judging themselves; Intentionally or not, providers are signaling their prioritizations from an infrastructure perspective.

Live at RICON|West 2013. How the world has changed. Now we have open source that does a lot of things well in production. Now you can build on that and keep moving forward. In the past you would build on some proprietary thing. We are getting to a place where we have good open source code that handles consensus, eventual consistency, CRDTs, that's done. So let's tackle harder problems. The hope is researchers and industry can ship stuff in a year or two, creating a compounding virtuous circle.

Domain sharding was once a key performance tip. Is it now a trap? Reducing Domain Sharding: the variant that sharded across two domains was the clear winner. 50-80ms faster page load times for image heavy pages (e.g. search), 30-50ms faster overall, up to 500ms faster load times on mobile, 0.27% increase in pages per visit.

LinkedIn promises this will work: We are extremely pleased with the use of promises in Javascript. Using promises has: Made it simple to implement and reason about the complex data flows present in the iPad server. Simplified our error handling significantly. Allowed us to provide more visibility into our system due to almost all error handling passing through shared functions that log encountered errors.

Markets, they aren't rational. Spot instance pricing fluctuations: It seems that over the past month, the spot prices of just about every EC2 instance size have been greatly fluctuating at levels much higher than On-Demand. Many of the instances are going for 5 to 20 times higher than their On-Demand prices - for example an m1.large instance that is normally $0.24 per hour is currently going for $1.20 in 1d and $1.60 in 1c.
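Working through the m1.large example from the quote, with the zone labels taken as written there and prices held in cents to avoid float rounding noise (variable names are just for illustration):

```python
# Prices from the quote above, in cents per hour.
on_demand_cents = 24
spot_cents = {"1d": 120, "1c": 160}

# How many times the On-Demand price each spot price is.
premium = {zone: price / on_demand_cents for zone, price in spot_cents.items()}

for zone, multiple in premium.items():
    print(f"{zone}: {multiple:.2f}x On-Demand")
```

Both zones in this particular example sit at the low end of the quoted 5 to 20 times range.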

Moving Persistent Notifications from MySQL to Cassandra at Zoosk: Even though Zoosk is a web scale property with millions of Daily Active Uniques, our 5 node Cassandra cluster can easily keep up. We do use db class servers with lots of RAM and SSDs. We use a single data center and single rack topology. Even though there are four times more writes than reads, write latency is two orders of magnitude faster than read latency. Our service cluster runs on three web class machines.

Not everyone is going AWS. TestingBot is moving out of AWS and into its own cloud. AWS was expensive and noisy neighbors slowed down the hood. VMware was found to be complicated and expensive. Settled on: GridCentric, KVM, Qemu.

WalmartLabs Discusses how Node.js Performed on Black Friday. Arnorhs with a good summary: They are a very distributed team and that seems to work well for them. Interesting to see at a large company like Walmart; The node.js framework they use (I believe they are the original authors as well) is http://spumko.github.io/; There's a ton of upside in the network centric nature of node.js for them; They sound like a very competent, small team, so some of the successes can also be attributed to that, rather than necessarily be all thanks to node.js (despite my own bias); It's great to have such high traffic installation of node.js out there, since it brings up the production quality of node.js.

BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.
