
Stuff The Internet Says On Scalability For April 7th, 2017

High Scalability

07 Apr 2017 — 14 min read

Hey, it's HighScalability time:

Visualization of the magic system behind software infrastructure. (eyezmaze, @ThePracticalDev)
If you like this sort of Stuff then please support me on Patreon.

10-20: amino acids can be made per second; 64800x: faster DDL Aurora vs MySQL; 25 TFLOPS: cap for F1 simulations; 15x to 30x: Tensor Processing Unit faster than GPUs and CPUs; 100 Million: Intel transistors per square millimeter; 25%: Internet traffic generated by Google; $1 million: Tim Berners-Lee wins Turing Award; 43%: phones FBI couldn't open because of crypto;

Quotable Quotes:
- @adulau: To summarize the discussions of yesterday. All tor exit nodes are evil except the ones I operate.
- @sinavaziri: Let's say a data center costs $1-2B. Then the TPU saved Google $15-30B of capex?
- Vinton G. Cerf: While it would be a vast overstatement to ascribe all this innovation to genetic disposition, it seems to me inarguable that much of our profession was born in the fecund minds of emigrants coming to America and to the West over the past century.
- Alan Bundy: AI systems are not just narrowly focused by design, because we have yet to accomplish artificial general intelligence, a goal that still looks distant.
- JamesBarney: Soo much this, just worked on a project that sacrificed reliability, maintainability, and scalability to use a real time database to deal with loads that were on the order of 70 values or 7 writes a second.
- bobdole1234: 3.5x faster than CPU doesn't sound special, but when you're building inference capacity by the megawatt, you get a lot more of that 3.5x faster TPU inside that hard power constraint.
- Eugenio Culurciello: As we have been predicting for 10 years, in SoC you can achieve > 10x more performance than current GPUs and > 100x more performance per watt.
- Google: The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency.
- visarga: TPU excited me too at first, but when I realized that it is not related to training new networks (research) and is useful only for large scale deployment, I toned down my enthusiasm a little.
- Julian Friedman: Kube is being designed by system administrators who like distributed systems, not for programmers who want to focus on their apps.
- shadowmint: Given what I've seen, I'd argue that clojure has an inherent complexity that results in poor code quality outcomes during the software maintenance cycle.
- weberc2: I like Go, but it's not dramatically faster than Java. Any contest between the two of them will probably just be a back and forth of optimizations. They share pretty much the same upper bound.
- adrianratnapala: All this means is that we should stop thinking of this stuff as RAM. Only the L1 cache is really RAM. Everything else is just a kind of fast, volatile, solid state disk that just happens to share an address space with the RAM.
- pbreit: Getting a million users is infinitely harder than scaling a system to handle a million users. Most systems could run comfortably on a Raspberry Pi.
- @sustrik: If you want your protocol to be fully reliable in the face of either peer shutting down, the terminal handshake has to be asymmetric. As we've seen above, TCP protocol has symmetric termination algorithm and thus can't, by itself, guarantee full reliability.
- @damonedwards: Unit tests are critical for good dev, but aren't really ops concern. Integration tests are critical for good ops. Ops wants more int tests.
- mannigfaltig: the brain appears to spend about 4.7 bits per synapse (26 discernible states, given the noisy computation environment of the brain); so it seems to be plenty enough for general intelligence. This could, of course, merely be a biological limit and on silicon more fine-grained weights might be the optimum.
- marwanad: The main power of GraphQL is for client developers and lies in the decoupling it provides between the client and server and the ability to fulfill the client needs in a single round trip. This is great for mobile devices with slower networks.
- kyleschiller: As a pretty good rule of thumb, a system that fails 1/nth of the time and has n opportunities to fail has ~.63 probability of failure, where n is more than ~10.
- jjirsa: databases aren't where you want to have hipster tech. You want boring things that work. For me, Cassandra is the boring thing that works.
- @etherealmind: "rule #1 of Enterprise IT: easier to spend 10 million on equipment than 100k for a person. A third person would increase capacity by 30%"
- @SwiftOnSecurity: “Just pick a good VPN” is like telling thirsty people to “go to a store and drink clear liquid.” They drank bleach, but at least you helped.
- falsedan: There's 2 secrets to scaling to millions of users: 1. You aren't going to have millions of users so any work you do to support it is stopping you from delivering features that will make your existing 10 clients happier. 2. Write code that can be replaced (i.e. design for change).
- X86BSD: Have you tested running it on a FreeBSD box with ZFS? It has lz4 compression by default and makes such a great storage solution for PG. You get compression, snapshots, replication (not quite realtime but close), self healing, etc etc in a battled hardened and easy to manage filesystem and storage manager. I've found you can't beat ZFS and PG for most applications. Edge cases exist of course everywhere.
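kyleschiller's rule of thumb in the quotes above can be checked numerically. The sketch below assumes the n failure opportunities are independent (the quote doesn't say so), in which case the chance of at least one failure is 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows. The function name is made up for illustration.

```python
import math


def failure_probability(n: int) -> float:
    """Chance of at least one failure in n independent tries,
    each of which fails with probability 1/n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# The value the expression approaches as n grows.
limit = 1.0 - 1.0 / math.e

for n in (10, 100, 1000):
    print(n, round(failure_probability(n), 4))
print("limit", round(limit, 4))
```

At n = 10 the result is already within about 0.02 of the limit, which matches the "n is more than ~10" caveat.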

Worried about too much infrastructure? Only 2% of DNA codes for proteins, the other 98% codes for RNA. Harry Noller Lecture. Maybe lots of infrastructure is not a bad thing. One of the key differences between programming and biology is that in biology form completely determines function. Just amazing to watch in action: mRNA Translation (Advanced). Programming is the complete opposite.

Simulation technology could mean the end of the apprenticeship model for teaching surgeons. Lifelike simulations that make real-life surgery safer. Instead of practicing on real people, surgical teams are using a blend of Hollywood special effects and 3D printing to create amazingly lifelike reproductions of real patients. Surgeons can use the same strategy of progressive practice used by athletes and pilots to perform lifelike rehearsals prior to game time.

Awesome look at a different stack. Building a realtime server backend using the Orleans Actor system, Dotnet Core and Server-side Redux. It's love: Once you understand how an actor system works, the whole idea of gobbling data together to render a page on the webserver by executing database queries, mapping to objects, caching data, dealing with conflicts caused by data duplication, locking, and doing that over and over again for every web request, seems very cumbersome. It works: Updates can be initiated from the client and from the server. Page rendering works server-side and client-side, and the page is updated automatically when the state changes. The complete action history is saved in an Azure Storage Table. Not everything works: The only limitation of an Orleans-based architecture that I ran into is that it provides no built-in way of searching for data, or doing interactive analysis on data.

Another turn of the people-productivity-matters-more-than-language-speed wheel. Yes, Python is Slow, and I Don’t Care. Which is true, until it isn't. That's usually when you are paying too much for too many boxes, or a global mutex means you can't string a few service calls together fast enough to send a reply to a user within an ice age.

When Boring is Awesome: Building a scalable time-series database on PostgreSQL: a new open-source time-series database optimized for fast ingest and complex queries...Looks, feels, speaks just like PostgreSQL...introducing horizontal scale-out, automatic space/time partitioning, and distributed query optimizations for time-series data...benchmarks consistently show a greater than 15x improvement on inserts versus vanilla PostgreSQL...Our engine ensures that chunks are right sized and time-interval aligned to ensure that the multiple B-trees for a table’s indexes can reside in memory during inserts...When queries arrive, we avoid querying extraneous chunks via constraint exclusion analysis, and then employ different techniques to parallelize the query across the remaining chunks efficiently...All this complexity is hidden from the user behind an abstraction we call a “hypertable”, which provides the illusion of a single table across all space and time.
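To make the space/time partitioning idea concrete, here is a toy in-memory sketch, not the database's actual implementation. The class name borrows the post's "hypertable" term; the CRC32 hashing, the tuple row layout, and the method names are all assumptions for illustration. Rows are routed to a chunk by time slot and hashed partition key, and a time-range query skips chunks whose slot cannot overlap the range, which is the spirit of constraint exclusion.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str, float]  # (timestamp, device, value); hypothetical layout


class Hypertable:
    """Toy single-table facade over (time slot, space partition) chunks."""

    def __init__(self, interval: int, partitions: int):
        self.interval = interval      # width of one time slot, in seconds
        self.partitions = partitions  # number of space partitions
        self.chunks: Dict[Tuple[int, int], List[Row]] = defaultdict(list)

    def _chunk_key(self, ts: int, device: str) -> Tuple[int, int]:
        space = zlib.crc32(device.encode()) % self.partitions
        return (ts // self.interval, space)

    def insert(self, ts: int, device: str, value: float) -> None:
        self.chunks[self._chunk_key(ts, device)].append((ts, device, value))

    def query(self, start: int, end: int) -> Tuple[List[Row], int]:
        """Return rows with start <= ts < end and the number of chunks scanned."""
        first, last = start // self.interval, (end - 1) // self.interval
        rows: List[Row] = []
        scanned = 0
        for (slot, _space), chunk in self.chunks.items():
            if first <= slot <= last:  # skip chunks outside the time range
                scanned += 1
                rows.extend(r for r in chunk if start <= r[0] < end)
        return sorted(rows), scanned
```

With one-hour slots, a query for a single hour touches only that hour's chunks no matter how much older data the table holds.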

Scaling Unsplash with a small team. Unsplash handles 10M+ requests per day. How? It's a matter of principle: Build boring, obvious solutions; Focus on solving user problems, not technology problems; Throw money at technical problems. Heroku is used to simplify deployment, configuration, testing, maintenance, and scaling of our primary applications; In the application logic, we lean heavily on frameworks built by other people; We lean heavily on Redis, ElasticSearch, and Postgres for all production loads; We aggressively use worker queues, pushing as many operations into an asynchronous processing queue; data processing uses Snowplow; image hosting and infrastructure to Imgix; We push all of our user activities to Stream; we use TinEye for reverse image search and Google Vision for image understanding and classification; On the frontend our team uses React and Webpack.

Ascent on Scaling to Millions of Users: The first rule of scaling is keep your solutions as simple as possible, but not simpler; chief among the complexity injections are caching and denormalizing, which are not necessary before 100,000s of users; prioritize the bigger machines over more machines; think of scaling in order-of-magnitude chunks; do enough to keep the site high performing while keeping the team available to build actual value for customers.

Interesting idea, using virtualization in embedded systems. Xen and the art of embedded virtualization. When programming in C or C++, corruption by stomping over memory comes with the territory. Virtualization makes sense as a way of protecting components from each other.

ML can find your whales. Using machine learning for insurance pricing optimization: Approximately 7-10% of AXA’s customers cause a car accident every year. Most of them are small accidents involving insurance payments in the hundreds or thousands of dollars, but about 1% are so-called large-loss cases that require payouts over $10,000... after developing an experimental deep learning (neural-network) model using TensorFlow via Cloud Machine Learning Engine, the team achieved 78% accuracy in its predictions.

More in the do we really own our stuff anymore department? IoT garage door opener maker bricks customer’s product after bad review.

You can add capacity but at some point you need to protect yourself from bad actors by limiting the damage they can do. Stripe with an epic post on Scaling your API with rate limiters: A rate limiter is used to control the rate of traffic sent or received on the network...A load shedder makes its decisions based on the whole state of the system rather than the user who is making the request...At Stripe, we operate 4 different types of limiters in production...Request rate limiter...restricts each user to N requests per second...Concurrent requests limiter...It is completely reasonable to tune this limiter up so it rejects more often than the Request Rate Limiter....Fleet usage load shedder...this type of load shedder ensures that a certain percentage of your fleet will always be available for your most important API requests...Worker utilization load shedder...This load shedder is the final line of defense. If your workers start getting backed up with requests, then this will shed lower-priority traffic.
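For a feel of what a per-user request rate limiter involves, here is a minimal in-process sketch using a token bucket, one common technique for this job. The excerpt above doesn't say which algorithm Stripe runs, and a production limiter would keep its state in a shared store, not a local dict. All class and parameter names are illustrative.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Credit tokens earned since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RequestRateLimiter:
    """One bucket per user, created lazily on first request."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        bucket = self.buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[user_id] = bucket
        return bucket.allow()
```

A caller would typically answer a rejected request with HTTP 429. The injectable `clock` is there so the limiter can be tested without sleeping.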

You have mountains of data, how do you turn that into a business? After Years of Challenges, Foursquare Has Found its Purpose -- and Profits. Don't target consumers or small local businesses, target enterprises. Foursquare is now a location intelligence company for business. For example, Snapchat uses Foursquare to improve its geo-filtering. Also an interesting use as a way to determine that oh-so-difficult conversion metric: when does a digital stimulus result in an effect in the meat world?

Sounds like a fun way to learn. Wireshark Layer 2-3 pcap Analysis w/ Challenges (CCNP SWITCH): In this blogpost I am publishing the captured pcap file with all of these 22 protocols. I am further listing 45 CHALLENGES as an exercise for the reader. Feel free to download the pcap and to test your protocol skills with Wireshark.

A revolutionary solution to an important problem. No foolin'. Why Satellites Are Programmed Differently: Most engineers never consider the weight of the firmware in their designs...What are space agencies doing to address this issue? “We have tried to encourage our coders to write programs that compile to the fewest possible number of ones, but this has proven to be an extraordinarily daunting task.” said scientist Joe Snietzelberg, on condition of anonymity. “Even worse, they need to consider whether their code will increase the number of ones stored in the ECC bits.”

Perhaps we need a personality test to match us with our soul-VPN? This Massive VPN Comparison Spreadsheet Helps You Choose the Best for You.

James Hamilton on At Scale, Rare Events aren’t Rare. When you build it yourself you can pick a different point on the tradeoff curve: I’m lucky enough to work at a high-scale operator where custom engineering to avoid even a rare fault still makes excellent economic sense so we solved this particular fault mode some years back. In our approach, we implemented custom control firmware such that we can continue to multi-source industry switch gear but it is our firmware that makes the load transfer decisions and, consequently, we don’t lockout.

It's a piece of Cake. How do multiplayer games sync their state? Part 1: clients send updates in a fixed interval; prediction with reconciliation.
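Prediction with reconciliation is easier to see in code than in prose. The sketch below is a one-dimensional toy under assumed conventions (integer positions, sequence-numbered inputs, a server reply carrying its authoritative position plus the last input it processed). None of these names come from the linked post.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def apply_input(position: int, move: int) -> int:
    """Shared movement rule; client and server must agree on it."""
    return position + move


@dataclass
class Client:
    position: int = 0
    next_seq: int = 0
    # Inputs sent to the server but not yet acknowledged: (seq, move).
    pending: List[Tuple[int, int]] = field(default_factory=list)

    def send_input(self, move: int) -> Tuple[int, int]:
        """Predict locally right away and remember the input for replay."""
        seq = self.next_seq
        self.next_seq += 1
        self.position = apply_input(self.position, move)
        self.pending.append((seq, move))
        return seq, move

    def reconcile(self, server_position: int, last_acked_seq: int) -> None:
        """Adopt the server's state, then replay unacknowledged inputs."""
        self.pending = [(s, m) for s, m in self.pending if s > last_acked_seq]
        self.position = server_position
        for _seq, move in self.pending:
            self.position = apply_input(self.position, move)
```

If the server rejects the first of three moves (say a wall was in the way), the client snaps to the server's position and replays the two newer moves, so the player sees a small correction and not a full rewind.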

The truth is every method we use to remotely ask software to do something for us sucks in some deeply disturbing way. Is GraphQL the Next Frontier for Web APIs?: religious adherence to REST is overrated and its perceived advantages have never materialized as fully as its proponents hoped. Whatever we choose next should aim to be flexible and efficient, and GraphQL seems like a good candidate for that. We should move to GraphQL as a backend and combine it with great language-specific libraries that leverage good type systems to catch integration mistakes before the first HTTP call flies, and which allow developers to use their own tooling to auto-complete (in the sense of VS IntelliSense or Vim’s YouCompleteMe) to success.

Eve Online reports their hardware upgrade was a big success. Interesting database change: the database servers are two Lenovo x880 dual CPU servers with 768 GB RAM. We knew that Lenovo does have a special clip designed for the FLEX platform where you essentially dock together servers, much like GPU SLI, so when this clip was put in place and we fired up the server, the windows operating system see’s 4x CPUs and 1.5 TERABYTES of RAM!

Chip Overclock with a very personal meditation on time. My Stratum-1 Desk Clock.

Fabulous Incident Summary: 2017–03–16 from Square. Active-active in multiple datacenters isn't always enough. Services are deployed over 250 times per day to production. SOP: conference engineers across offices; rolled back all software changes that happened leading up to the incident; activated a “Crisis team”; updated issquareup.com to notify sellers of the disruption with continued updates; do not stop exploring solutions until we have a confirmed fix; hold a post-mortem meeting to discuss what went well and poorly in the response.

Amen. Please stop writing new serialization protocols: The result of all this is that, instead of having a computer ecosystem where anything can talk to anything else, we have a veritable tower of babel where nothing talks to anything else. Imagine if there were 40 competing and completely mutually unintelligible versions of html or text encodings: that’s how I see the state of serialization today.

Machine learning has gone distributed. Curious to see what kind of problems can be solved this way. Federated Learning: Collaborative Machine Learning without Centralized Training Data: Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud...It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model...Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. And this approach has another immediate benefit: in addition to providing an update to the shared model, the improved model on your phone can also be used immediately...Federated Learning can't solve all machine learning problems (for example, learning to recognize different dog breeds by training on carefully labeled examples)
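The download, improve locally, send an update, average loop described above can be sketched in a few lines. This is a toy under heavy assumptions: a one-parameter linear model, plain gradient descent on each device, and a sample-count-weighted average on the server. It leaves out the encryption and update compression the post mentions, and every name in it is illustrative.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave a device


def local_update(w: float, data: Dataset,
                 lr: float = 0.05, epochs: int = 5) -> float:
    """Improve the shared model on-device; only the new weight is returned."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w: float, devices: List[Dataset]) -> float:
    """Average the device results, weighted by how much data each one holds."""
    total = sum(len(d) for d in devices)
    return sum(local_update(w, d) * len(d) for d in devices) / total


# Three hypothetical devices whose private data all follow y = 3x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, devices)
print(round(w, 4))
```

The server only ever sees weights, never the (x, y) pairs, which is the decoupling the post is excited about.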

Lord of the Flies datacenter style. Faster page loads: The system, dubbed Flowtune, essentially adopts a market-based solution to bandwidth allocation. Operators assign different values to increases in the transmission rates of data sent by different programs. For instance, doubling the transmission rate of the image at the center of a webpage might be worth 50 points, while doubling the transmission rate of analytics data that’s reviewed only once or twice a day might be worth only 5 points.

Does this ever happen in programming, this kind of deep unexpected similarity? Pasta spirals link neutron stars and the machinery of your cells: The insides of neutron stars and the membranes inside our cells can form strikingly similar structures resembling cavatappi pasta spirals.

Preslav Mihaylov has an excellent series of posts on how numbers are represented in computers. How does the binary nature of computers affect our data types, Floating point numbers, and so on.

How do you handle multi-terabyte size range ad-hoc segmentation queries? Solving Our Slow Query Problem: sharding would carry a high development price tag, increase our hosting cost by an order of magnitude, and introduce a high degree of vendor lock-in; Partitioning large tables would carry similar development costs; we realized that it is possible to keep the cached results fresh in realtime, provided that we “recheck” segment membership anytime a subscriber event occurs.
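The "recheck membership on every subscriber event" idea can be sketched as a small cache that trades a little work per event for instant segment reads. This is an illustrative toy, not the post's implementation. The class name, the dict-based subscriber record, and the predicate-per-segment design are all assumptions.

```python
from typing import Callable, Dict, Set

Subscriber = Dict[str, int]               # hypothetical attribute record
Predicate = Callable[[Subscriber], bool]  # a segment's membership rule


class SegmentCache:
    """Keeps cached segment membership fresh by rechecking on each event."""

    def __init__(self, segments: Dict[str, Predicate]):
        self.segments = segments
        self.members: Dict[str, Set[str]] = {name: set() for name in segments}

    def on_event(self, subscriber_id: str, subscriber: Subscriber) -> None:
        # Re-evaluate only the subscriber that changed, for every segment.
        for name, matches in self.segments.items():
            if matches(subscriber):
                self.members[name].add(subscriber_id)
            else:
                self.members[name].discard(subscriber_id)

    def query(self, name: str) -> Set[str]:
        """Answer from the cache; no scan over subscriber data."""
        return self.members[name]
```

Each event costs one predicate check per segment, while a query becomes a set lookup and avoids an ad-hoc scan over terabytes.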

Nice high level view of several architectures. Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest.

antirez/rax: a radix tree implementation initially written to be used in a specific place of Redis in order to solve a performance problem, but immediately converted into a stand alone project to make it reusable for Redis itself, outside the initial intended application, and for other projects as well.

rudyrucker/chaos: This is a free release of the source, manual, and executables of a 1991 Autodesk DOS program that was called "James Gleick's CHAOS: The Software."

Killzone's AI: dynamic procedural combat tactics: In this document we describe the design of 'dynamic procedural combat tactics' AI for Killzone.

Application-Level Consensus: This article explores the benefits of using a consensus algorithm, such as Raft, to build clustered services. The core of this type of system is deterministic execution, replicated consensus log, and snapshotting of state to avoid replay from the beginning of time. Such a consensus approach offers simplicity, debug-ability, fault tolerance and scalability.
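The three ingredients named above (deterministic execution, a replicated log, and snapshots) fit in a short sketch. The consensus step itself is out of scope here: the log is assumed to be already agreed on by something like Raft. The command format and class are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Command = Tuple[str, str, int]  # (op, key, amount); hypothetical format


@dataclass
class Replica:
    state: Dict[str, int] = field(default_factory=dict)
    applied_index: int = 0  # number of log entries already applied

    def apply(self, cmd: Command) -> None:
        """Deterministic: the same command on the same state gives the same result."""
        op, key, amount = cmd
        if op == "set":
            self.state[key] = amount
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + amount
        else:
            raise ValueError(f"unknown op: {op}")
        self.applied_index += 1

    def catch_up(self, log: List[Command]) -> None:
        """Apply only the entries this replica has not seen yet."""
        for cmd in log[self.applied_index:]:
            self.apply(cmd)

    def snapshot(self) -> Tuple[Dict[str, int], int]:
        return dict(self.state), self.applied_index

    @classmethod
    def from_snapshot(cls, snap: Tuple[Dict[str, int], int]) -> "Replica":
        state, index = snap
        return cls(state=dict(state), applied_index=index)
```

A replica restored from a snapshot replays only the log tail, yet ends up in the same state as one that replayed everything from index zero. That equivalence is what makes debugging by replay possible.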

In-Datacenter Performance Analysis of a Tensor Processing Unit: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC—called a Tensor Processing Unit (TPU)— deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
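The 92 TOPS peak figure can be sanity-checked from the MAC count. The arithmetic below counts each multiply-accumulate as two operations and assumes a 700 MHz clock; the clock rate is the figure the paper reports for this chip and does not appear in the abstract quoted here.

```python
MACS = 256 * 256   # 65,536 8-bit multiply-accumulate units
OPS_PER_MAC = 2    # one multiply plus one add per cycle
CLOCK_HZ = 700e6   # assumption: 700 MHz clock, not stated in the abstract

peak_tops = MACS * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"{peak_tops:.2f} TOPS")
```

The result rounds to the 92 TOPS quoted in the abstract.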

Communication-Efficient Learning of Deep Networks from Decentralized Data: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.

Greg has a few Quick links that might be of interest.
