Stuff The Internet Says On Scalability For April 15th, 2016

High Scalability

15 Apr 2016 — 10 min read

Hey, it's HighScalability time:

What happens when Beyoncé meets eCommerce? Ring the alarm.

If you like this sort of Stuff then please consider offering your support on Patreon.

$14 billion: one day of purchases on Alibaba; 47 megawatts: Microsoft's new data center space for its MegaCloud; 50%: of Facebook users do not speak English; 70-80%: of all Intel servers shipped will be deployed in large scale datacenters by 2025; 1024 TB: of storage for 3D imagery currently in Google Earth; $7: WeChat average revenue per user; 1 trillion: new trees;

Quotable Quotes:
- @PicardTips: Picard management tip: Know your audience. Display strength to Klingons, logic to Vulcans, and opportunity to Ferengi.
- Mark Burgess: Microservices cannot be a panacea. What we see clearly from cities is that they can be semantically valuable, but they can be economically expensive, scaling with superlinear cost.
- ethanpil: I'm crying. Remember when messaging was built on open platforms and standards like XMPP and IRC? The golden year(s?) when Google Talk worked with AIM and anyone could choose whatever client they preferred?
- @acmurthy: @raghurwi from @Microsoft talking about scaling Hadoop YARN to 100K+ clusters. Yes, 100,000
- @ryanbigg: Took a Rails view rendering time from ~300ms to 50ms. Rewrote it in Elixir: it’s now 6-7ms. #MyElixirStatus
- Dmitriy Samovskiy: In the past, our [Operations] primary purpose in life was to build and babysit production. Today operations teams focus on scale.
- @Agandrau: Sir Tim Berners-Lee thinks that if we can predict what the internet will look like in 20 years, then we are not creative enough. #www2016
- @EconBizFin: Apple and Tesla are today’s most talked-about companies, and the most vertically integrated
- Kevin Fishner: Nomad was able to schedule one million containers across 5,000 hosts in Google Cloud in under five minutes.
- David Rosenthal: The Web we have is a huge success disaster. Whatever replaces it will be at least as big a success disaster. Let's not have the causes of the disaster be things we knew about all along.
- Kurt Marko: The days of homogeneous server farms with racks and racks of largely identical systems are over.
- Jonathan Eisen: This is humbling, we know virtually nothing right now about the biology of most of the tree of life.
- @adrianco: Google has a global network IP model (more convenient), AWS regional (more resilient). Choices...
- @jason_kint: Stupid scary stats in this. Ad tech responsible for 70% of server calls and 50% of your mobile data plan.
- apy: I found myself agreeing with many of Pike’s statements but then not understanding how he wound up at Go.
- @TomBirtwhistle: The head of Apple Music claims YouTube accounts for 40% of music consumption yet only 4% of online industry revenue
- @0x604: Immutable Laws of Software: Anyone slower than you is over-engineering, anyone faster is introducing technical debt
- surrealvortex: I'm currently using flame graphs at work. If your application hasn't been profiled recently, you'll usually get lots of improvement for very little effort. Some 15 minutes of work improved CPU usage of my team's biggest fleet by ~40%. Considering we scaled up to 1500 c3.4xlarge hosts at peak in NA alone on that fleet, those 15 minutes kinda made my month :)
- @cleverdevil: Did you know that Virtual Machines spin up in the opposite direction in the southern hemisphere? Little known fact.
- ksec: Yes, and I think Intel is not certain to win, just much more likely. The Power9 is here is targeting 2H 2017 release. Which is actually up against Intel Skylake/Kabylake Xeon Purley Platform in similar timeframe.
- @jon_moore: Platforms make promises; constraints are the contracts that allow platforms to do their jobs. #oreillysacon
- @CBILIEN: Scaling data platforms: compute and storage have to be scaled independently #HS16Dublin

A morning reverie. Chopped for programmers. Call it Crashed. You have three rounds with four competitors. Each round is an hour. The competitors must create a certain kind of program, say a game, or a productivity tool, anything really, using a basket of three selected technologies, say Google Cloud, wit.ai, and Twilio. Plus the programmer can choose to use any other technologies from the pantry that is the Internet. The program can take any form the programmer chooses. It could be a web app, iOS or Android app, an Alexa skill, a Slack bot, anything, it's up to the creativity of the programmer. The program is judged by an esteemed panel based on creativity, quality, and how well the basket technologies are highlighted. When a programmer loses a round they have been Crashed. The winner becomes the Crashed Champion. Sound fun?

Jeff Dean when talking about deep learning at Google makes it clear a big part of their secret sauce is being able to train neural nets at scale using their bespoke distributed infrastructure. Now Google has released TensorFlow with distributed computing support. It's not clear if this is the same infrastructure Google uses internally, but it seems to work: using the distributed trainer, we trained the Inception network to 78% accuracy in less than 65 hours using 100 GPUs. Also, the TensorFlow Playground is a cool way to visualize what's going on inside.

Christopher Meiklejohn with an interesting history of the Remote Procedure Call. It started way back in 1974: RFC 674, “Procedure Call Protocol Documents, Version 2”. RFC 674 attempts to define a general way to share resources across all 70 nodes of the Internet.

Dan Luu with an excellent set of Notes on Google's Site Reliability Engineering book: "this is the most valuable technical book I’ve read in the past year." Liked this: Going from 5 9s to 100% reliability isn’t noticeable to most users and requires tremendous effort. Ex: if a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability. Reliability isn’t linear in cost. It can easily cost 100x more to get one additional increment of reliability.
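The smartphone arithmetic is easy to check. Here's a minimal sketch, assuming the device and the service fail independently and sit in series, so the availability a user sees is the product of the two. The `end_to_end` helper is a made-up name for illustration, not something from the book or from Dan Luu's notes.

```python
def end_to_end(*availabilities):
    """Availability of independent components in series is their product."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


DEVICE = 0.99  # the smartphone from the example above

# What the user experiences with a four-nines versus a five-nines service.
four_nines = end_to_end(DEVICE, 0.9999)
five_nines = end_to_end(DEVICE, 0.99999)

for label, value in (("99.99% service", four_nines), ("99.999% service", five_nines)):
    print(f"{label}: user sees {value:.5%}")
print(f"gain from the extra nine: {five_nines - four_nines:.5%}")
```

Under that assumption the extra nine moves what the user sees by less than a hundredth of a percentage point, which is why the cost of buying it is so hard to justify.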

Billing by Millionths of Pennies, Cloud Computing’s Giants Take In Billions: “The secret to cloud economics is utilization — every minute unsold is money you don’t get back,” said Greg DeMichillie, the head of product for Google’s cloud. “There are very few of us who can work at this scale.”

It's almost always a configuration problem. Google Compute Engine Incident #16007: To maximize service performance, Google’s networking systems announce the same IP blocks from several different locations in our network, so that users can take the shortest available path through the internet to reach their Google service...on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal...One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step...decided to revert the most recent configuration changes made to the network even before knowing for sure what the problem was.

One key to rule them all. What could go wrong? Canadian Police Obtained BlackBerry’s Global Decryption Key.

Here's how Nylas scaled their primary datastore by over 20x after product launch. Hint: it involves sharding so as not to hit the InnoDB 2TB per table limit. They moved off RDS so they could have more flexibility at the cost of writing a lot of management code themselves. Growing up with MySQL: In the days following our launch of N1, we discovered dozens of individual and unique bottlenecks in our application code. We came face-to-face with the upper limits of Redis and MySQL’s insertion volume. And we even learned some interesting things about AWS network topologies.

Server Sizing Comments: the case for 1-socket in OLTP: It is time to consider the astonishing next step, that a single socket system is the best choice for transaction processing systems. First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements...transaction processing performance is heavily dependent on serialized round-trip memory access latency...In a single socket system, 100% of memory accesses are local node, because there is only 1 node...in transaction processing, it is memory round-trip latency that dominates, not memory bandwidth.

Thinking of a program as a material has a lot of power, but how far can the analogy be extended? Compressive Strength and Parameter Passing in the Physics of Software: Over time, thinking in terms of materials and forces could also help to conceive new solutions, either by borrowing from niche paradigms or by engineering a material / shape on purpose.

From side project to 250 million daily requests. Good story of starting small and simple and bootstrapping that into something. Also a testament to the power of StackOverflow. The stack: I moved everything to AWS, with the servers behind elastic load balancers... servers in 3 regions (US east coast, US west coast, and Frankfurt), and then made use of Route 53’s latency-based routing to route to the lowest latency server...I switched to Elastic Beanstalk...each server can independently answer every API request, there’s no shared database or anything...It’s super quick and well over 90% of our 250 million daily requests are handled in less than 10 milliseconds.

In the top quartile of introductions. Statistics for Software.

With Azure gaining ground, especially in the enterprise space, you might want to learn more of what's going on under the hood. Microservices Part 2: Introduction to Service Fabric with Mark Russinovich: Mark talks Service Fabric, Stateful Services and cloud scale with Seth as they explore how Azure Service Fabric helped Microsoft build global scale services like Cortana, Azure SQL Databases.

Programming as a form of capture, like an addiction. Capture: Unraveling the Mystery of Mental Suffering: The theory of capture is composed of three basic elements: narrowing of attention, perceived lack of control, and change in affect, or emotional state. Sometimes these elements are accompanied by an urge to act. When something commands our attention in a way that feels uncontrollable and, in turn, influences our behavior, we experience capture.

Evaluating Database Compression Methods: Update: If you’re looking for a fast compression algorithm that has decent compression, consider LZ4. It offers better performance as well as a better compression ratio, at least on the data sets we [Percona] tested.

CockroachDB puts themselves through the gauntlet of self-knowledge. DIY Jepsen Testing CockroachDB. What did they learn? As the story unfolds one is reminded the journey is its own reward, but they did find: two consistency-related bugs in CockroachDB: #4884 and #4393. This is exactly the kind of discovery we had expected from the ordeal, and we are thrilled to join the club of Jepsen advocates!

QCon London 2016 Report. At 86 pages a lot of talks are covered in a useful amount of detail. Also Ben Basson has his QCon 2016 report: This year it was obvious that unlike in previous years, microservices are now being used in production in a wide variety of situations, and are becoming a much more mainstream design pattern...there is an increasing focus on culture within development teams, and also the wider company culture.

Is the teleology of the Internet to move data or to copy data? It depends on who wants to be in control. Brewster Kahle's "Distributed Web" proposal: Another way of looking at this is that the current Internet is about moving data from one place to another, NDN (Named Data Networking) is about copying data. By making the basic operation in the net a copy, caching works properly (unlike in the current Internet).

So can we expect microservices to eventually die out? Most of the Tree of Life is a Complete Mystery: You see this pattern in bacteria that end up inside insect cells—their genomes tend to shrink and they lose genes that are important for a free-living existence.

Facebook's Applied Machine Learning (AML) group, like Google's team, also has a practical bent, helping create systems for translation, photo search, talking pictures, and real-time video classification. To do this Facebook built an AI backbone: that powers much of the Facebook experience and is used actively by more than 25 percent of all engineers across the company. Powered by a massive 40 PFLOPS GPU cluster that teams are using to train really large models with billions of parameters on huge data sets of trillions of examples, teams across the company are running 50x more AI experiments per day than a year ago, which means that research is going into production faster than ever.

Facebook’s Aquila Drone Creates a Laser-net In the Sky. Cool, but it's a little spooky to think of all these drones constantly circling and watching above us. Also, Introducing Facebook's new terrestrial connectivity systems — Terragraph and Project ARIES.

Do GPU optimized databases threaten the hegemony of Oracle, Splunk and Hadoop?: GPU accelerated databases are superficially analogous to in-memory DBs, however the availability of many more GPU cores, much faster GDDR5 memory and instruction sets optimized for combining, multiplying, summing and filtering complex data sets makes GPUs particularly adept at accelerating many database operations...a 12x speedup in performance, far faster than the 20-25% annual improvements seen with conventional CPUs. The slope of this performance curve will not be lost on enterprise application developers where databases represent the next frontier in GPU penetration.
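To put the 12x figure next to the 20-25% annual CPU improvement, here's a back-of-the-envelope sketch. It assumes the annual gains compound steadily, which is a simplification, and `years_to_reach` is a made-up helper name.

```python
import math


def years_to_reach(speedup, annual_gain):
    """Years of compounding at `annual_gain` (0.20 means 20%) to reach `speedup`."""
    return math.log(speedup) / math.log(1.0 + annual_gain)


for gain in (0.20, 0.25):
    print(f"{gain:.0%} per year -> {years_to_reach(12, gain):.1f} years to reach 12x")
```

On those assumptions a 12x speedup is worth roughly 11 to 14 years of conventional CPU improvement.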

Scaling the roof of the Milan Duomo – The huge cathedral that took 600 years to build! The Long Now Foundation builds things that will last longer than 600 years, far longer, but it's hard to imagine anything in tech being worked on continuously for over 600 years. Maybe Minecraft?

Authenticated Data Structures, as a Library, for Free!: In this post, I'll show that in a language with sufficiently powerful abstraction facilities (in this case OCaml), it is possible to implement Miller et al.’s solution as a library within the language, with no need to create a new language, or to alter an existing language’s implementation.

How to write a Bloom filter in C++. Short and to the point.
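If you haven't met a Bloom filter before, the idea fits in a few lines. This is a rough Python sketch of the general technique, not the linked article's C++ code. The sizes, the SHA-256 double-hashing scheme, and the `BloomFilter` class name are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive every bit position from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("highscalability.com")
print("highscalability.com" in seen)  # an added item is always reported present
print("example.org" in seen)  # an item never added can still be a false positive
```

The trade is memory for certainty: the filter never stores the items themselves, so a "no" is definite and a "yes" only means "probably".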

Building a 32-Thread Xeon Monster PC for Less Than the Price of a Haswell-E Core i7. You can do amazing things with "old" and now much cheaper technology.

youtube/doorman: a solution for Global Distributed Client Side Rate Limiting. Clients that talk to a shared resource (such as a database, a gRPC service, a RESTful API, or whatever) can use Doorman to voluntarily limit their use (usually in requests per second) of the resource. Doorman is written in Go and uses gRPC as its communication protocol. For some high-availability features it needs a distributed lock manager. The purpose of Doorman is to apportion and distribute capacity to clients based on some definition of fairness.
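What might "some definition of fairness" look like? One common choice is max-min fairness, sketched below. This is a generic illustration and not Doorman's code or API. The `fair_share` function and the client names are hypothetical.

```python
def fair_share(capacity, wants):
    """Max-min fair split of `capacity` among clients.

    `wants` maps client name -> requested rate. Clients asking for no more than
    an equal share get what they asked for; the leftover is re-split among the rest.
    """
    grants = {}
    remaining = float(capacity)
    pending = dict(wants)
    while pending:
        share = remaining / len(pending)
        satisfied = {c: w for c, w in pending.items() if w <= share}
        if not satisfied:
            # Everyone left wants more than an equal share: cap them all at it.
            for c in pending:
                grants[c] = share
            break
        for c, w in satisfied.items():
            grants[c] = float(w)
            remaining -= w
            del pending[c]
    return grants


# 100 requests per second shared by three hypothetical clients.
print(fair_share(100, {"a": 10, "b": 50, "c": 80}))
# → {'a': 10.0, 'b': 45.0, 'c': 45.0}
```

The small client gets everything it asked for, and the two greedy ones split what is left evenly, so nobody can starve the others by asking for more.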

twitter/pelikan: Twitter's unified cache backend. More information: Caching in datacenters.

mirage/jitsu: a forwarding DNS server that automatically boots unikernels on demand.

cmeiklejohn/PMLDC: This repository is a work-in-progress curriculum on models and languages for distributed computing.

An Implementation and Analysis of a Kernel Network Stack in Go with the CSP Style: Modules for the major networking protocols, including Ethernet, ARP, IPv4, ICMP, UDP, and TCP, were implemented. In this study, the implemented Go network stack, called GoNet, was compared to a representative network stack written in C. The GoNet code is more readable and generally performs better than that of its C stack counterparts. From this, it can be concluded that Go with CSP style is a viable alternative to C for the language of kernel implementations.
