
Stuff The Internet Says On Scalability For August 4th, 2017

Hey, it's HighScalability time:

Hands down the best ever 25,000 year old selfie from Pech Merle cave in southern France. (The Ice Age)

If you like this sort of Stuff then please support me on Patreon.

35%: US traffic is now IPv6; 10^161: decision points in no-limit Texas hold’em; 4.5 billion: Facebook translations per day; 90%: savings by moving to Lambda; 330TB: IBM's tiny tape cartridge, enough to store 330 million books; $108.9 billion: game revenues in 2017; 85%: of all research papers are on Sci-Hub; 1270x: iPhone 5 vs Apollo guidance computer; 16 zettabytes: 2017 growth in digital universe;

Quotable Quotes:
- Andrew Roberts: [On Napoleon] No aspect of his command was too small to escape notice.
- Jason Calacanis: The world has trillions of dollars sitting in bonds, cash, stocks, and real estate, which is all really “dead money.” It sits there and grows slowly and safely, taking no risk and not changing the world at all. Wouldn’t it be more interesting if we put that money to work on crazy experiments like the next Tesla, Google, Uber, Cafe X, or SpaceX?
- @icecrime: The plural of “it’s not a bug, it’s a feature” is “it’s not a bug tracker, it’s a backlog”.
- Jeff Darcy: When greater redundancy drives greater dependency, it’s time to take a good hard look at whether the net result is still a good one.
- uhnuhnuhn: "They ran their business into the ground, but they did it with such great tech!"
- Anglés-Alcázar: It’s very interesting to think of our galaxy not as some isolated entity, but to think of the galaxy as being surrounded by gas which may come from many different sources. We are connected to other galaxies via these galactic winds.
- @ojiidotch: Main app now running Python 3.6 (was 2.7 until yesterday). CPU usage 40% down, avg latency 30% down, p95 60% down.
- Nemanja Mijailovic: It’s really difficult to catch all bugs without fuzzing, no matter how hard you try to test your software.
- SandwichTeeth: a lot of companies have security teams solely to meet audit requirements. If you find yourself on a team like that, you'll be spending a lot of time just gathering evidence for audits, remediating findings and writing policy. I really loved security intellectually, but in practice, the blue-team side of things wasn't my cup of tea.
- jph: security is needed to gradually escalate a user's own identity verification -- think of things like two-factor auth and multi-factor auth, that can phase in (or ramp up) when a user's actions enter a gray area of risk. Some examples: when a user signs in from a new location, or a user does an especially large money transfer, or a user resumes an account that's been dormant for years, etc.
- @hichaelmart: So while Google is doubling down on gRPC it seems that Amazon is going all in with CBOR. DDB DAX uses some sort of CBOR-over-sockets AFAICT
- Wysopal: I’d like to see someone fixing this broken market [insecure software and hardware market]. Profiting off of that fix seems like the best approach for a capitalism-based economy.
- Matthias Käppler: Microservices are often intermediate nodes in a graph of services, acting as façades where an incoming request translates to N outgoing requests upstream, the responses to which are then combined into a single response back downstream to the client.
- Jack Fennimore: EA Play 2017 was watchable the same way Olive Garden is edible.
- erikb: [On SoundCloud] TL;DR Top Management started too late to think about making actual money. They also hired an asshole for their US offices. When they got an opportunity to be bought by Twitter they asked for way too much money. And the CEO is basically on a constant holidays trip since 2014, while not failing to rub it in everybody's face via Instagram photos.
- Jennifer Mendez: If you don’t have the games people want to play, you can wave goodbye to return on investment on a powerful console. Does hardware matter? Of course it does! But it doesn’t matter if you don’t have anything to play on it.
- Alex Miller: The utility of a blockchain breaks down in a private or consortium setting and should, in my opinion, be replaced by a more performant engine like Apache Kafka.
- Krish: most of the multi-cloud usecases I am seeing are about using different cloud for different workloads. It could change and I would expect them to embrace the eventual consistency model initially
- Ian Cutress: Then there is the Ryzen 3 1300X. Compared to the Core i3-7300/7320 and the Core i5-7400, it clearly wins on performance per dollar all around. Compared to the Core i3-7100 though, it offers almost 5% more performance for around $10-15 more, which is just under 10% of the cost.
- throw2016: Just from an year ago the cpu market has changed completely. The sheer amount of choice at all levels is staggering. For the mid level user the 1600 especially is a formidable offering, and the 1700 with 8 cores just ups the ante.
- danmaz74: the main reason Rails is declining in relevance isn't microservices or the productivity (!) of Java, but the fact that more and more development effort for web applications is moving into JS front-end coding.
- Rohit Karlupia: we can deal with [S3] eventual consistency in file listing operations by repeating the listing operation, detecting ghost and conceived files and modifying our work queues to take our new knowledge about the listing status into account.
- tboyd47: It's the end of an era. From 2005 to 2007, the "Web 2.0" craze, the release of Ruby on Rails, and the rise of Agile methods all happened at once. These ideas all fed into and supported each other, resulting in a cohesive movement with a lot of momentum. The long-term fact turned out to be that this movement didn't benefit large corporations that have always been and usually still are the main source of employment for software developers. So we have returned to our pre-Rails, pre-agile world of high specialization and high bureaucratic control, even if Rails and "Agile" still exist with some popularity.
- @reneritchie: Only beginning to see the advantages of Apple making everything from atom to bit. Everything will be computational.
- Vasiliy Zukanov: switching to Kotlin will NOT have any appreciable positive gains on the cost, the effort or the schedule of software projects
- visarga: Over the years I have seen astronomy become more like biology - diverse both in the kinds of objects it describes and their behavior.
- Jaana B. Dogan: I think the industry needs a breakdown between product and infra engineering and start talking how we staff infra teams and support product development teams with SRE. The “DevOps” conversation is often not complete without this breakdown and assuming everyone is self serving their infra and ops all the times.
- David Rosenthal: Does anybody believe we'll be using Bitcoin or Ethereum 80 years from now?
- Richard Jones: There is a physical lower limit on how much energy it takes to carry out a computation – the Landauer limit. The plot above shows that our current technology for computing consumes energy at a rate which is many orders of magnitude greater than this theoretical limit (and for that matter, it is much more energy intensive than biological computing). There is huge room for improvement – the only question is whether we can deploy R&D resources to pursue this goal on the scale that’s gone into computing as we know it today.

Million WebSockets and Go at Mail.ru. This kind of thing has been done a million times outside of Go, so it was interesting to see the steps they needed to make it work. Lots of great code examples with detailed explanations. And yep, they pretty much had to rewrite everything. The problem: Mail polling involves about 50,000 HTTP queries per second, 60% of which return the 304 status meaning there are no changes in the mailbox. The solution: reduce the load on the servers and speed up mail delivery to users by writing a publisher-subscriber server that would receive notifications about state changes and subscriptions for notifications. The changes: A read goroutine with a buffer inside is expensive. Solution: netpoll (epoll, kqueue), reuse the buffers; A write goroutine with a buffer inside is expensive. Solution: start the goroutine when necessary, reuse the buffers; With a storm of connections, netpoll won’t work. Solution: reuse the goroutines with the limit on their number; net/http is not the fastest way to handle Upgrade to WebSocket. Solution: use the zero-copy upgrade on bare TCP connection.
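The "reuse the goroutines with the limit on their number" idea can be sketched outside Go as well. What follows is a minimal Python analogy, not Mail.ru's code: threads stand in for goroutines, and the `BoundedPool` class, its `try_schedule` method, and the `workers` and `backlog` parameters are hypothetical names chosen for illustration. The point is only that a fixed set of workers is reused, and that new work is refused rather than queued without bound when a storm of connections arrives.

```python
import queue
import threading


class BoundedPool:
    """A fixed number of reusable worker threads fed by a bounded queue."""

    def __init__(self, workers: int, backlog: int):
        # The queue bound is the limit on pending work (hypothetical parameter).
        self._tasks = queue.Queue(maxsize=backlog)
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def _run(self) -> None:
        # Each worker loops forever, so threads are reused across tasks.
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def try_schedule(self, task) -> bool:
        """Queue the task, or return False immediately if the backlog is full."""
        try:
            self._tasks.put_nowait(task)
            return True
        except queue.Full:
            return False

    def join(self) -> None:
        """Block until every accepted task has finished."""
        self._tasks.join()
```

A caller that gets `False` back can close the connection or retry later, which keeps memory use bounded no matter how many clients connect at once.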

Gophercon 2017 videos, slides, and code are now available.

Uber has an interesting way of composing applications from isolated plugins. Engineering Scalable, Isolated Mobile Features with Plugins at Uber. It's a holy grail many projects shoot for, but usually fail to achieve as more pragmatic concerns dominate. There's a lot of upfront architecture work, but it pays off. BrianAttwell: we prioritized mobile app architecture and saw big wins as a result. Usually its hard to commit time upfront to app architecture. We've made that the norm (ex, 80% of our app code lives inside plugins). And we've seen eng productivity measurably increase over the last year because of these investments. In many cases doubled. Plus build times are way better :) See also, Uber Mobility: Deep Scope Hierarchies, Scope Hierarchies, Uber Mobility: RIB (Router Interactor Builder)

Do you know what your AI is doing today? You get what you train for. daly: These various techniques that currently work by training, either supervised or self-training, can have fatal flaws. Take, for example, some high-tech camera technology. Use it on a drone to take pictures of warships from thousands of angles. You take pictures of U.S. warships, Russian warships, and Chinese warships. You achieve 100% accuracy of identifying each ship using some Neural Net technology. One day this net decides that an approaching warship is Chinese and sinks it. But it turns out to be a U.S. warship. Clearly a mistake was made. Deep investigation reveals that what the Neural Network "learned" happened to be related to the area of the ocean, based on sunlight details, rather than the shape of the warship or other features. Since the Chinese ships were photographed in Chinese waters, and the U.S. warship that was sunk was IN those waters at the moment, the Neural Net worked perfectly. Recognition and action are only part of the intelligence problem. Analysis is also needed.

The topic that will not die, Exactly-once or not, atomic broadcast is still impossible in Kafka – or anywhere, so if you are looking for a nice short explanation of terms, LgWoodenBadger has got your back: "At most once" has the implication of "you may never get it" (if the one and only attempt failed anywhere along the way); "Effectively once" implies "you got the same thing 1000 times but ignored 999 of them because you already got it."; With an "exactly once" guarantee, I could send you a stream of integers (1,2,3,4,5) and you could blindly/naively/simplistically add them with no special concern and be confident that your answer of 15 was correct.; With "effectively once," you'd have to keep track of what you've seen before so you know not to add 4 an extra 6 times and come up with the wrong sum of 39; With "at most once" you may be sitting around with a sum of 0 and think that's correct.
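The integer example in that quote is easy to make concrete. Here is a minimal sketch, assuming each message carries a unique ID that the consumer can remember; the `(msg_id, value)` tuple shape and both function names are made up for illustration and are not Kafka's API.

```python
from typing import Iterable, Tuple

Delivery = Tuple[int, int]  # (msg_id, value); hypothetical message shape


def naive_sum(deliveries: Iterable[Delivery]) -> int:
    """Adds every delivery blindly; only correct under an exactly-once guarantee."""
    return sum(value for _msg_id, value in deliveries)


def effectively_once_sum(deliveries: Iterable[Delivery]) -> int:
    """Ignores any message ID that has already been processed."""
    seen = set()
    total = 0
    for msg_id, value in deliveries:
        if msg_id in seen:
            continue  # duplicate delivery, already counted
        seen.add(msg_id)
        total += value
    return total


# The stream 1..5, with message 4 redelivered an extra 6 times.
deliveries = [(n, n) for n in range(1, 6)] + [(4, 4)] * 6
print(naive_sum(deliveries))             # 39
print(effectively_once_sum(deliveries))  # 15
```

The cost of "effectively once" is the `seen` set: the consumer has to store what it has processed, and in a real system that state has to survive restarts.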

In case you're wondering, two jets flying close together at 35,000 feet can AirDrop photos to each other. Jamaica Aviation Spotters.

Facebook would like you to know Google is not the only company that can use those neural net thingys to translate text. Transitioning entirely to neural machine translation. A lot of people were unhappy with Facebook's old translation system, so this is likely a big improvement. Facebook: recently switched from using phrase-based machine translation models to neural networks to power all of our backend translation systems, which account for more than 2,000 translation directions and 4.5 billion translations each day...we started with a type of recurrent neural network known as sequence-to-sequence LSTM (long short-term memory) with attention...average relative increase of 11 percent in BLEU...we saw a relative improvement of 3.7 percent BLEU for English to Spanish, based only on tuning model hyperparameters...We implemented our translation systems in the deep learning framework Caffe2. Its down-to-the-metal and flexible nature allowed us to tune the performance of our translation models during both training and inference on our GPU and CPU platforms.

Netflix has the coolest tools. ChAP: Chaos Automation Platform: we run our experiments in production. To do that, we have to put some requests at risk for the sake of protecting our overall availability...To limit this blast radius, in ChAP we take a small subset of traffic and distribute it evenly between a control and an experimental cluster...We designed a circuit breaker for the experiment that would automatically end the experiment if we exceeded a predefined error budget...Let’s say we want to explore how API handles the failure of the Ratings system...We then override the request routing for the control and experimental populations to direct just that traffic to the new clusters instead of the regular production cluster...With ChAP, we have safely identified mistuned retry policies, CPU-intensive fallbacks, and unexpected interactions between circuit breakers and load balancers.
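The experiment circuit breaker is the piece that keeps the blast radius small. A rough sketch of the idea follows, assuming the budget is expressed as how many more errors the experimental cluster may have than the control cluster. The post does not describe ChAP's actual implementation, so the `ExperimentBreaker` class and its `max_excess_errors` field are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExperimentBreaker:
    """Ends an experiment once the experimental group exceeds its error budget."""

    max_excess_errors: int  # hypothetical budget: extra errors allowed vs. control
    control_errors: int = 0
    experiment_errors: int = 0
    tripped: bool = False

    def record(self, group: str, ok: bool) -> None:
        """Record one request outcome for the 'control' or 'experiment' group."""
        if group not in ("control", "experiment"):
            raise ValueError(f"unknown group: {group}")
        if not ok:
            if group == "control":
                self.control_errors += 1
            else:
                self.experiment_errors += 1
        # Latch open: once tripped, the breaker never resets on its own.
        if self.experiment_errors - self.control_errors > self.max_excess_errors:
            self.tripped = True
```

Comparing against the control group rather than an absolute count means ordinary background errors, which hit both clusters, do not end the experiment early.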

Excellent description in Trying to Understand Tries. Why are they important?: Tries often show up in white boarding or technical interview questions, often in some variation of a question like “search for a string or substring from this sentence”. Given their unique ability to retrieve elements in constant time, they are often a great tool to use
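For anyone who wants to see the structure itself, here is a minimal trie sketch with dictionary-based children. The class and method names are illustrative and not taken from the linked article. Note that a lookup walks one node per character, so its cost depends on the length of the key rather than on how many words are stored.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, prefix: str):
        """Follow prefix from the root; return the final node or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix: str) -> bool:
        return self._walk(prefix) is not None


trie = Trie()
for word in "search for a string or substring from this sentence".split():
    trie.insert(word)
print(trie.contains("string"))     # True
print(trie.contains("str"))        # False
print(trie.starts_with("sub"))     # True
```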

The future of deep learning: Models will be more like programs, and will have capabilities that go far beyond the continuous geometric transformations of the input data that we currently work with...In particular, models will blend algorithmic modules providing formal reasoning, search, and abstraction capabilities, with geometric modules providing informal intuition and pattern recognition capabilities...They will be grown automatically rather than handcrafted by human engineers...As such, this perpetually-learning model-growing system could be interpreted as an AGI—an Artificial General Intelligence

Everyone loves a free party, but they don't stay to clean up after. The Inside Story Of SoundCloud's Collapse: SoundCloud’s downfall, according to many former employees, was largely the result of a strategic misstep — a move to compete head-on with the giants of the music-streaming world. With the March 2016 launch of SoundCloud Go, a $9.99 per month subscription service, SoundCloud was a late entrant to a ferociously competitive streaming music space and with an array of services that offered no differentiation from incumbents like Spotify and Apple Music.

A beautiful meditation on not doing what is unnecessary. Solving Imaginary Scaling Issues (at scale). Do you really need to shard that database and serve the core application from multiple regions across the world to cut down on the latency of database reads when requests hit the origin server? Nah. Don't solve problems you don't have.

Watch the Gravity Sort, it's so cool looking.

Apple’s Adoption Of HEVC Will Drive A Massive Increase In Encoding Costs Requiring Cloud Hardware Acceleration: with Apple adopting H.265/HEVC in iOS 11 and Google heavily supporting VP9 in Android, a change is on the horizon...New codecs like H.265 and VP9 need 5x the servers costs because of their complexity. Currently, AV1 needs over 20x the server costs...it’s not hard to do the math and see that for some, encoding costs could increase by 500x over the next few years as new codecs, higher quality video, 360 video and general demand increases...This is why over the past year, a new type of accelerator in public clouds called Field Programmable Gate Array (FPGA) is growing in the market...to deliver 60fps 20x c4.8xlarge instances would be required which would cost around $33 an hour...deliver over 60 fps on a single f1.2xlarge instance. The total cost would be around $3. Also, a lot of good comments.

Not the usual FP vs. OO melee. What kinds of questions are more interesting than is FP better than OO?: What concepts and ideas help me to write better code? Do classes help to structure your code? Can namespaces offer the same benefit? Is inheritance helpful or harmful? What about roles / mixins / traits? Which problems does the actor model solve? Which ones are better solved by single-threaded concurrency? What about Communicating Sequential Processes? Do we need encapsulation if we have immutable state? Does immutability help me write safer multi-threaded code? If so, why? Which advantages do persistent data structures have? What are the disadvantages? Is my team more productive in a dynamic programming language or in a language with dependent types? Where do they produce more bugs? How do concepts like type inference, gradual typing, or clojure.specs change that?

I've coded this very same bug. In a timer context you always have to limit the amount of work that is done. Production postmortem: The lightly loaded trashing server: The expiration timer is hit, and we now have a lot of items that need to be expired. RavenDB expiration is coarse, and it runs every few minutes, so each run we had a lot of stuff to delete. Most of it was on disk, and we needed to access all of it so we can delete it. And that caused us to trash, affecting the overall server performance...The solution was to remove the expiration usage and handle the cache invalidation in the client, when you fetched a cached value, you checked its age, and then you can apply a policy decision if you wanted to update it or not.
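The fix described in the postmortem, checking a value's age when it is read instead of expiring everything on a timer, might look roughly like this. It is a sketch under assumptions: the `AgeCheckedCache` class, the `max_age_seconds` policy, and the `loader` callback are invented for illustration and are not RavenDB's client API.

```python
import time


class AgeCheckedCache:
    """Checks an entry's age on read rather than running a bulk expiration timer."""

    def __init__(self, max_age_seconds: float, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so the policy can be tested
        self._entries = {}           # key -> (value, stored_at)

    def get(self, key, loader):
        """Return the cached value if it is fresh enough, otherwise reload it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            value, stored_at = entry
            if now - stored_at <= self._max_age:
                return value
        # Missing or too old: refresh just this one key.
        value = loader(key)
        self._entries[key] = (value, now)
        return value
```

Each read does a small, bounded amount of work, so there is never a moment when thousands of stale items have to be touched at once.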

Do you need to generate a random medieval fantasy city? Here you go.

Docker vs. Kubernetes vs. Apache Mesos: Why What You Think You Know is Probably Wrong: If you are a dev/devops team and want to build a system dedicated exclusively to Docker container orchestration, and are willing to get your hands dirty integrating your solution with the underlying infrastructure Kubernetes is a good technology for you to consider...If you want to build a reliable platform that runs multiple mission critical workloads including Docker containers, legacy applications (e.g., Java), and distributed data services (e.g., Spark, Kafka, Cassandra, Elastic), and want all of this portable across cloud providers and/or datacenters, then Mesos (or our own Mesos distribution, Mesosphere DC/OS) is the right fit for you.

Sounds like what they need is a church. Decentralized Long-Term Preservation: Trying by technical means to remove the need to have viable economics and governance is doomed to fail in the medium- let alone the long-term. What is needed is a solution to the economic and governance problems. Then a technology can be designed to work in that framework. Blockchain is a technology in search of a problem to solve, being pushed by ideology into areas where the unsolved problems are not technological.

Application amplification attacks do make sense. Starting the Avalanche. The problem: We’d [Netflix] like to introduce you to one of the most devastating ways to cause service instability in modern micro-service architectures: application DDoS. A specially crafted application DDoS attack can cause cascading system failures often for a fraction of the resources needed to conduct a more traditional DDoS attack. This is due to the complex, interconnected relationships between applications. Traditional DDoS attacks focus on exhausting system resources at the network level. In contrast, application layer attacks focus on expensive API calls, using their complex interconnected relationships to cause the system to attack itself — sometimes with a massive effect. Some fixes: reduce inter-dependencies on services; understand which microservices impact each aspect of the customer experience; putting a limit on the allowable work per request can significantly reduce the likelihood of exploitation; enabling a feedback loop to provide alerts from the middle tier and backend service to your WAF (web application firewall); prioritize authenticated traffic over unauthenticated traffic; have reasonable client library timeouts and circuit breakers.
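Of the fixes listed, "a limit on the allowable work per request" is the easiest to picture in code. Below is a minimal sketch, assuming work is counted as the number of backend calls one incoming request may trigger. `WorkBudget`, `handle_request`, and the `fetch` callback are hypothetical names, not anything from the Netflix post.

```python
class WorkBudgetExceeded(Exception):
    """Raised when a single request tries to do more work than it is allowed."""


class WorkBudget:
    def __init__(self, max_units: int):
        self.max_units = max_units
        self.used = 0

    def spend(self, units: int = 1) -> None:
        if self.used + units > self.max_units:
            raise WorkBudgetExceeded(
                f"request needs more than {self.max_units} units of work"
            )
        self.used += units


def handle_request(item_ids, fetch, budget: WorkBudget):
    """Fan out one backend call per item, charging the budget before each call."""
    results = []
    for item_id in item_ids:
        budget.spend()  # refuse before making the expensive call
        results.append(fetch(item_id))
    return results
```

With the budget in place, a crafted request asking for thousands of items fails after a fixed number of backend calls instead of multiplying load across the middle tier.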

Has the counter-reformation begun? The Case Against Kotlin. Not really, the article is pro Kotlin, it just details a lot of real-world problems--learning curve, build times, development stability, static analysis--that will hopefully all get better over time. I found the lack of improvement in programmer productivity most damning. Why switch then? Just because you like it better? Is that really a good reason?

Storage will not be as free as it used to be. Disk media market update: Seagate's poor performance poses a real problem for the IT industry, similar to problems it has faced in other two-vendor areas, such as AMD's historically poor performance against Intel, and ATI's historically poor performance against Nvidia. The record shows that big customers, reluctant to end up with a single viable supplier of critical components, will support the weaker player by strategic purchases of less-competitive product...The even bigger problem for the IT industry is that flash vendors cannot manufacture enough exabytes to completely displace disk...So the industry needs disk vendors to stay in business and continue to invest in increasing density, despite falling unit shipments. Because hard disk is a volume manufacturing business, falling unit shipments tend to put economies of scale into reverse, and reduce profit margins significantly.

caffe2.ai: A New Lightweight, Modular, and Scalable Deep Learning Framework.

gobwas/ws: Tiny WebSocket library for Go.

Metaclasses: Generative C++: The only way to make a language more powerful, but also make its programs simpler, is by abstraction: adding well-chosen abstractions that let programmers replace manual code patterns with saying directly what they mean. There are two major categories: Elevate coding patterns/idioms into new abstractions built into the language; Provide a new abstraction authoring mechanism so programmers can write new kinds of user-defined abstractions that encapsulate behavior.

A spatially localized architecture for fast and modular DNA computing: We create logic gates and signal transmission lines by spatially arranging reactive DNA hairpins on a DNA origami.

Gray Failure: The Achilles’ Heel of Cloud-Scale Systems: Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: that the system’s failure detectors may not notice problems even when applications are afflicted by them

Toward a DNA-Based Archival Storage System: We [Microsoft] think the time is ripe to seriously consider DNA-based storage and explore system designs and architectural implications. Our ASPLOS paper was the first to address two fundamental challenges in building a viable DNA-based storage system. First, how should such a storage medium be organized? We demonstrate the tradeoffs between density, reliability, and performance by envisioning DNA storage as a key-value store. Multiple key-value pairs are stored in the same pool, and multiple such pools are physically arranged into a library. Second, how can data be recovered efficiently from a DNA storage system? We show for the first time that random access to DNA-based storage pools is feasible by using a polymerase chain reaction (PCR) to amplify selected molecules for sequencing

Hey, just letting you know I've written a novella: The Strange Trial of Ciri: The First Sentient AI. It explores the idea of how a sentient AI might arise as ripped from the headlines deep learning techniques are applied to large social networks. Anyway, I like the story. If you do too please consider giving it a review on Amazon. Thanks for your support!
