Stuff The Internet Says On Scalability For September 16th, 2016

Hey, it's HighScalability time:

The struggle for life that kills. Stunning video of bacteria mutating to defeat antibiotics.

If you like this sort of Stuff then please support me on Patreon.

  • 60%: time spent cleaning dirty dirty BigData; 10 million: that's a lot of Raspberry Pi; 365: days living in a Mars simulation; 100M: monthly League of Legends players; 1.75 billion: copyright takedowns by Google; 3.5 petabytes: data Evernote has to move to Google cloud; 11%: YoY growth in time spent on mobile apps; 4 hours: time between Lambda cold starts.

  • Quotable Quotes:
    • Camille Fournier: humans struggle to tangibly understand domains that are theoretically separate when they are presented as colocated by the source code.
    • @songcarver: The better example: iPhone 7 is showing 115% of 2016 Macbook single core performance, 88% of multi-core.
    • ex3ndr: We (actor.im) also moved from Google cloud to our own servers + k8s. Shared persistent storage is a huge pain. We eventually stopped trying to do this, and will try again when PetSets is in Beta and we can update its images.
    • @mcclure111: "Well maybe you should get your spaceship working before you try to implant nanites in your brain, DUDE"
    • IOpipe: Organizations I’ve spoken to have expressed an average of 10x cost savings over microservices-based infrastructure for the code they’ve moved to AWS Lambda.
    • avitzurel: Kube is winning for the same reason React/Redux (and now Mobx) is winning and why Rails was winning at the time. Community.
    • @etherealmind: Evernote is moving to public cloud. A strong sign that it's in financial trouble, or lacking product direction.
    • @codinghorror: In 8 years of colocating servers I have seen multiple spinning rust disks fail, and one PSU, but zero SSDs failed from 2013-on.
    • Caltech: Now, with the new simulation—which used a network of thousands of computers running in parallel for 700,000 central processing unit (CPU) hours—Caltech astronomers have created a galaxy that looks like the one we live in today, with the correct, smaller number of dwarf galaxies.
    • Andy Grove: Rust is gearing up to be particularly suitable for building scalable asynchronous io and getting Rust onto servers is a great way to drive adoption of the language. 
    • James Hamilton: We have long believed that 80% of operations issues originate in design and development… When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.
    • Google: even the possibility of a future quantum computer is something that we should be thinking about today.
    • Alan Kay: This doesn’t mean that “objects are now hidden”, but that they should be part of the “modeling and designing of ideas and processes” that is the center of what programming needs to be.
    • Packet Pushers: In the future the world will be made of clouds and users. The user will be sitting in Starbucks and accessing the cloud and your network will be totally irrelevant.
    • StorageMojo: Our current system for the diffusion of knowledge is breaking down. How are we going to fix it?
    • Ron Miller: Flywheel Effect is the idea that once you have your core tech pieces in place, they have an energy of their own that drives other positive changes and innovations.
    • stonogo: Intel needs everything to be NUMA-aware. They're betting a lot of money on Xeon Phi, and once the self-booting KNL machines are out nobody will want to deal with the pcie cards any more.
    • @Fruzenshtein: It's strange to listen to a talk about microservices when you have already heard about serverless architecture💩
    • Morgan Housel: There’s often a big gap between changing the world and convincing people that you changed the world.
    • @JoeEmison: Another under-reported aspect of moving from VMware to AWS: almost everyone is getting a massive performance improvement.
    • Dan Rayburn: Twitter’s NFL stream, taking place Thursday Sept 15th, will be delivered by Akamai and Level 3 and I do not expect it to have a large simultaneous audience. My estimate is under 2M simultaneous streams.
    • @johngirvin: Running serverless infrastructure this morning. In that the servers are all down.
    • Vlad Ilyushchenko: QuestdbWorker is a pure worker implementation, insofar as worker consumers don't necessarily process the same number of queue items. It is slower due to constant interaction with memory barriers and is at a heavy disadvantage in this particular benchmark because it can't benefit from batching. Despite that, workers can be useful when queue item processing cost is non-uniform.
    • ArkyBeagle: The path to concurrency is paved with a mix of finite state machines and event-driven programs. IMO, neither FP nor OO have all that much to say about that.
    • matt_oriordan: Having static servers handling load is only part of the problem in our experience. The true complexity and scalability of a system comes when you consider how it copes under load with unexpected failures (network, hardware), but more importantly expected maintenance such as regular deploys, scaling up and scaling down events
    • matthieum: I think my biggest complaint about the try-with-resources pattern is that... it just doesn't work. RAII just works, without effort on the client part, no matter how she uses the class.
    • Brandon Beck: I remember we had something like 20 folding chairs and, without knowing if anyone would watch, decided to stream the games. We ended up getting over 100,000 concurrent viewers, which just blew our minds. It was there we realized this was something League players loved and started to really take it seriously.
    • jandrewrogers: The weakness of GPU databases is that while they have fantastic internal bandwidth, their network to the rest of the hardware in a server system is over PCIe, which generally isn't going to be as good as what a CPU has and databases tend to be bandwidth bound. This is a real bottleneck and trying to work around it makes the entire software stack clunky.
    • @pmarca: 1 Software eats the world, 2 Every company becomes a software company, and 3 Software people run every company:
    • @benalexau: Benchmarked @mjpt777's Aeron w/ SBE and @grpcio for bulk xfers between JVMs. While different sweet spots, Aeron ~200 times higher throughput
    • Freeman Dyson: So, anyway, that’s sort of my view about the brain. That we won’t really understand the brain until we can make models of it which are analog rather than digital, which nobody seems to be trying very much.

  • Drivers and users turn out to be relatively price insensitive to Uber fares. As Uber approaches a monopoly position there's a lot of consumer surplus that can be turned into profits (if the war chest lasts). Why Uber Is an Economist’s Dream: if you extrapolate to the whole U.S., we found that the overall consumer surplus added up to almost $7 billion. So people spent about $4 billion on Ubers, but they actually would have been willing to spend about $11 billion.

  • Making money in Apple's app store ain't what it used to be, at least for developers. Here's a thoughtful discussion on the transition from a charge-up-front model to advertising-supported apps by long-time app developers David Smith and Marco Arment: Overcast trying ads, dark theme now free and Under the Radar #45. You may lament advertising as the go-to model allowing developers to make a decent living, but it turns out advertising within apps nicely aligns developer incentives with user goals in a way that doesn't happen for content. For content the drive to increase page views encourages a race to the bottom. Click-bait dominates as CPMs tumble. For apps the incentive is to provide a good user experience for every interaction. You want to encourage the user to use your app because that's when you get paid. Individually the payoff isn't so great that it warps the incentives to "encourage" a user to use your app, but over a whole installed base the more users use your app the more you get paid, so as a developer you have an incentive to keep developing features and making nice little improvements to the app. In the charge-up-front model the developer is disincentivized from making changes because there's always a well-founded fear that changes won't be rewarded by increased sales. If efforts aren't rewarded there's no point in efforting...and make no mistake, programming does take a lot of effort. And if a user really doesn't want ads they can pay to have them removed; there are no app ad blockers. Everyone wins. This has been your moment of Zen.

  • Evernote’s Future Is in the [Google] Cloud. Is this a big get for Google? It's an interesting one. Presumably Evernote, as a service, is cost sensitive, so if running your own infrastructure were really that cheap this move wouldn't make sense. Why is Evernote making the move? Improvements in performance, security, efficiency, and scalability; resources freed to accelerate current development and improve core products; feature updates rolled out in less time. Evernote obviously has a potential big data play, and being inside Google's cloud means they'll have access to Google's best-of-class machine learning and big data tools.

  • We have ourselves an oligopoly. Microsoft Azure boss Scott Guthrie: Cloud price war with Amazon Web Services is winding down. Now there's a real opportunity for private cloud solutions to compete on price because they'll never compete on features.

  • If "Going to microservices is like going from Newton’s physics to Einstein’s physics" then maybe going from microservices to serverless is like going to quantum mechanics? Data on the Outside versus Data on the Inside

  • It sounds like an episode from Mr Robot, but it's actually an unexpectedly twisty self-hack: really loud sounds can damage hard drives. A Loud Sound Just Shut Down a Bank's Data Center for 10 Hours. The sound made by gas expelling from an inert gas extinguishing system can reach over 130dB, which causes enough vibration to move hard drive read/write heads by the 1/1,000,000th of an inch necessary to cause damage. Brendan Gregg demonstrates with a rebel yell.

  • The evolution and scaling of Airbnb's payment platform: Scaling Airbnb’s Payment Platform. Airbnb started out as a minimum viable product: it matched hosts to guests, payments were handled offline directly between the two parties, and Airbnb was out of the loop. But Airbnb had a "problem": as a global platform it needed a global payments system that people could trust; US-centric modalities would not work. An early version was based on Rails and ActiveRecord, which could only last so long. Version two features a sophisticated Billing API, Payment Gateway, and Financial Pipeline with Spark-based reporting. In the future they plan on: creating something that sounds like their own currency to spend within the system; near real-time financial reporting; and leveraging machine learning, consuming signals from both their production system and their processors, to make dynamic decisions about how to route a transaction, optimizing for cost, acceptance, or speed.

  • The future never comes as fast or as clean as we hope. Intel’s Xpoint is pretty much broken: we have three direct claims by Intel about their upcoming NVRAM technology called Xpoint. They claimed 10x the density of DRAM; it is now 4x, a 2.5x shortfall. That is a stunning deliverable, but sadly it is the best performing of any of their claims. Latency missed by 100x, yes one hundred times: on their claim of 1000x faster, 10x is now promised and backed up by tests. More troubling is endurance, probably the main selling point of this technology over NAND. Again the claim was a 1000x improvement; Intel delivered 1/333rd of that, with 3x the endurance.

  • The Internet Scale Services Checklist, which started as a take on James Hamilton's awesome 2007 paper On Designing and Deploying Internet-Scale Services, has been respun to reflect a Serverless Architecture scenario, based on something like a stack of: AWS Lambda + API Gateway + DynamoDB. The changes aren't as dramatic as you might think, that's why Serverless != NoOps.    

  • Oh well, DNA storage is decades away. Nature's DNA storage clickbait: I actually believe that, decades from now, DNA will be an important archival medium. But I've been criticizing the level of hype around the cost of DNA storage for years. 

  • rdsubhas: You need to be this tall to use [micro] services: Basic Monitoring, instrumentation, health checks; Distributed logging, tracing; Ready to isolate not just code, but whole build+test+package+promote for every service; Can define upstream/downstream/compile-time/runtime dependencies clearly for each service; Know how to build, expose and maintain good APIs and contracts; Ready to honor b/w and f/w compatibility, even if you're the same person consuming this service on the other side; Good unit testing skills and readiness to do more...and a few more.

  • When it gets right down to it, do we really understand why anything works? The Extraordinary Link Between Deep Neural Networks and the Nature of the Universe: the universe is governed by a tiny subset of all possible functions. In other words, when the laws of physics are written down mathematically, they can all be described by functions that have a remarkable set of simple properties...We have shown that the success of deep and cheap learning depends not only on mathematics but also on physics, which favors certain classes of exceptionally simple probability distributions that deep learning is uniquely suited to model

  • Redshift vs BigQuery: Google BigQuery is a stellar option when you need a pay-as-you-go pricing model on gigantic datasets that can lead to huge cost savings, but Redshift holds the top spot for always-on cloud-service data warehouses.

  • An actually understandable explanation of Restricted Boltzmann Machines on Talking Machines. 

  • Good story on Yemeksepeti: Our Shift to Serverless Architecture: By leveraging serverless architecture, we didn’t have to install and manage an operating system and dependencies. Using Amazon API Gateway, AWS Lambda, Amazon S3, and Amazon RDS, our architecture runs in a highly available environment. We don’t need to learn and manage any master-slave replication features or third-party tools. As our service gets more requests, AWS Lambda adds more Lambda instances, so it runs at any scale. We are able to copy our service to another region using the features of AWS services as we did before going into production. Finally, we don’t run any servers, so we benefit from the cost advantage of serverless architecture.

  • A Funny Thing Happened on the Way to Java. Switching to JRE 8 in production caused the system load to rise from 5 (a normal level) to 20 (an abnormal level). Finding out why makes for an interesting detective story, full of twists and turns. The culprit: The default codecache size for JRE 8 is about 250MB, about five times larger than the 48MB default for JRE 7. Our experience is that JRE 8 needs that extra codecache. We have switched about ten services to JRE 8 so far, and all of them use about four times more codecache than before.

  • How LaunchDarkly Serves Over 4 Billion Feature Flags Daily: We quickly learned that the need for fault tolerance and isolation trumps the conceptual neatness of having a service per concern. With fault tolerance in mind, we sliced our services along a different axis: separating high-throughput analytics writes from the lower-volume read requests coming from the site. This shift dramatically improved the performance of our site, as well as our ability to evolve and scale the huge write load we see on the analytics side.

  • 4ad with a great explanation of how thread scheduling works in Go: But there is a problem now. If we set GOMAXPROCS too high, performance is bad. We don't get the speed-up we expect. The problem is that many goroutines now compete for the same scheduler lock. Lock contention is bad and prevents scalability. So we introduce Ps, and split up the G:M relation into G:P:M. P stands for processor.
    There is an n:1 relation between Gs and Ps. When a Go program starts, it creates exactly GOMAXPROCS Ps. When Go code wants to run, it first has to acquire a P. You can think of this as "G needs to acquire processor time". When a new goroutine is created, it's placed in a per-P run queue. Most of the scheduling is done through per-P run queues. There is still a global run queue, but the idea is that it is seldom used; in general the per-P run queue is preferred, which allows the use of a per-P lock instead of a global lock.
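
    From user code, GOMAXPROCS is the only part of the G:P:M model that's directly visible. Here's a minimal Go sketch of the moving parts described above (parallelSquares is a made-up name for illustration, not from the original explanation):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSquares fans work out across goroutines (Gs). The runtime
// hands each runnable G to one of the GOMAXPROCS Ps, each of which
// runs on an OS thread (M).
func parallelSquares(n int) []int {
	out := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) { // each new goroutine lands on a per-P run queue
			defer wg.Done()
			out[i] = i * i
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	// GOMAXPROCS sets the number of Ps and returns the previous value.
	runtime.GOMAXPROCS(4)
	fmt.Println(parallelSquares(8)) // [0 1 4 9 16 25 36 49]
}
```

    Raising GOMAXPROCS adds Ps (and potential parallelism) but not goroutines; the per-goroutine bookkeeping stays in the runtime's per-P run queues.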

  • Shopify is getting good results after moving to an SRE model: Why Shopify Moved to The Production Engineering Model. So it doesn't just work for Google and Facebook. They now release software changes into production about 150 times a day on average; the core back-end Shopify commerce platform deploys new releases 30-40 times every day; and dedicated teams are now free to focus on building out long-term projects such as next generation networking infrastructure, massive scale data storage sharding, and automating everything from simple deployments to full data center failover scenarios.

  • You might find this podcast interesting: NASA in Silicon Valley Podcast

  • alfalfasprout: our market data stream via ZeroMQ using UDP multicast gets around 20 million msg/s (~15 gb/s) with <10 microseconds of latency. But we also buy very high end switches and NICs (that guarantee near-zero packet loss) with built-in redundancies. We use quorum and hardware fencing for fast failover. Even then our total network hardware cost is < $20k for 8 nodes. Our stream aggregator/producer and clients use fast serialization formats like flatbuffers that minimize dynamic memory allocations. JSON is suuuuper slow in comparison.

  • zlib vs zstd for MyRocks running Linkbench: zstandard reduces CPU by 45% vs zlib level 1 for the load test; zstandard reduces CPU by 11% vs zlib level 1 for the query test; zstandard gets 8% more TPS vs zlib level 1 for the query test

  • How clever. An Algorithmic Tale of Crime, Conspiracy, and Computation: “In this novel-meets-computer-science-textbook, private eye Frank Runtime hunts for the thieves who stole a trove of documents from the capital’s police station. He’ll use search algorithms to solve the mystery—and explain high-level computational concepts along the way.” —The Wall Street Journal

  • zalando/typhoon: a stress and load testing tool for distributed systems that simulates traffic from a test cluster toward a system-under-test (SUT) and visualizes infrastructure-, protocol- and application-related latencies. It provides an out-of-the-box, cross-platform solution for investigating protocols and microservice latencies, and is operable as a standalone application. For scalability and accuracy, its runtime environment is Erlang.

  • baidu/Paddle: PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible and scalable deep learning platform, originally developed by Baidu scientists and engineers for the purpose of applying deep learning to many products at Baidu.

  • lyft.github.io/envoy (article): a high performance C++ L7 distributed proxy and communication bus designed for large service oriented architectures. Envoy is a self-contained process that is designed to run alongside every application server. All of the Envoys form a transparent communication mesh in which each application sends and receives messages to and from localhost and is unaware of the network topology. 

  • facebookincubator/dhcplb (article): Facebook's implementation of a DHCP v4/v6 relayer with load balancing capabilities. Facebook currently uses it in production, and it's deployed at global scale across all of our data centers.

  • circonus-labs/libcircmetrics: A C library for tracking metrics. High performance, low overhead, real histograms.

  • facebook/infer: a static analysis tool for Java, Objective-C and C, written in OCaml.

  • WaveNet: A Generative Model for Raw Audio: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity.

  • LEOPARD: Lightweight Edge-Oriented Partitioning and Replication for Dynamic Graphs (slides): This paper introduces a dynamic graph partitioning algorithm, designed for large, constantly changing graphs. We propose a partitioning framework that adjusts on the fly as the graph structure changes. We also introduce a replication algorithm that is tightly integrated with the partitioning algorithm, which further reduces the number of edges cut by the partitioning algorithm.

  • Optimizing Indirect Memory References with milk (article, article): In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the MILK compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3x.

  • Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++: Thrill uses template meta-programming to compile chains of subsequent local operations into a single binary routine without intermediate buffering and with minimal indirections. Second, Thrill uses arrays rather than multisets as its primary data structure which enables additional operations like sorting, prefix sums, window scans, or combining corresponding fields of several arrays (zipping). We compare Thrill with Apache Spark and Apache Flink using five kernels from the HiBench suite. Thrill is consistently faster and often several times faster than the other frameworks. At the same time, the source codes have a similar level of simplicity and abstraction.