
Stuff The Internet Says On Scalability For May 12th, 2017

High Scalability

12 May 2017 — 14 min read

Hey, it's HighScalability time:

Earth's surface is covered with accidental hidden letters. Can you find them? (ABC: The Alphabet from the Sky) If you like this sort of Stuff then please support me on Patreon.

1 million: cord cutters in Q1; 500 billion: FINRA validations of stock trades every day on Lambda; 100k: messages sent per hour at Airbnb; 21.1 billion: transistors in GV100 GPU; 11,500: crashes to train a drone; 84,469: Backblaze hard drives; 8,000: questions per day asked on StackOverflow;

Quotable Quotes:
- Jonathan Taplin: Google Is as Close to a Natural Monopoly as the Bell System Was in 1956
- Tom Goldenberg: more companies on the site [StackShare.io] use JavaScript on the back-end (6,000) than Python (4,100) or Java (3,900).
- Andrew Shafer: The dark ages of the relational database and the Java middleware stack paused everything for a decade.
- @Taytus: "We are early stage investors. Call me when you hit 1 million monthly active users"
- @chrisjrn: "At this point I was drunk on Perl" @bradfitz #tweetsincontext #oscon
- Bryan Cantrill: AWS is underwriting a war on big box retail.
- Paul Gilster: You’re reading that right — one-tenth of a milliwatt is enough to create error-free communications between the Sun and Alpha Centauri through two FOCAL antennas [gravitational lens].
- Vadim Markovtsev: There is a productivity peak between 2 pm and 5 pm for all the languages, when the commit frequency is the highest. This is the industry’s golden time. Managers should never distract coders during this interval.
- Patrick Tucker: The goal, one day, is a neural net that can learn instantaneously, continuously, and in real-time, by observing the brainwaves and eye movement of highly trained soldiers doing their jobs.
- @alicegoldfuss: it is incredibly difficult to balance "don't burn out and become a statistic" with "get as far as you can fast so they can't take it away"
- Jonathan Taplin: With the advent of YouTube and other streaming services, revenue for musicians has fallen 70%. If you had a song that had a million downloads on iTunes, you would get $900,000. On YouTube, you’d get $900.
- David Robinson: In short, if we had to summarize the average story [after analyzing 100,000 stories] that humans tell, it would go something like Things get worse and worse until at the last minute they get better.
- Confucius: He who cannot describe the problem will never find the solution to that problem
- Peter Thiel: competition is for losers
- Jason McGee~ Serverless adoption is moving 10x faster than Container adoption.
- Max Ehrenfreund: On average, workers born in 1942 earned as much or more over their careers than workers born in any year since
- Michael Elad: To put it bluntly, your grandchild is likely to have a robot spouse. And here is the punch line: much of the technology behind this bizarre future is likely to emerge from deep learning and its descendant fields.
- aliostad: We just did a benchmarking for a PoC on DocumentDB side-by-side Cassandra. It does the job, I have not yet seen anything revolutionary. Cassandra benchmarks seemed better.
- AWS Lambda Engineer: When you develop a Lambda function that uses SQS, SNS, Dynamo and other stuff in the cloud.. you can’t really debug it on your local. People just need to change their mindset
- sbuttgereit: What looks compelling about the PostgreSQL offering as compared to AWS RDS is that it looks like you get a PostgreSQL cluster rather than a single database in a shared cluster.
- Warren Toomey: Simulated hardware is infinitely easier to obtain, configure and diagnose than real hardware.
- Kate Kaye: The mistake companies have made, he says, is to rely too much on targeted advertising, cutting too far back on broader advertising that builds brand awareness with people outside the existing customer base and eventually leads to new sales.
- cbanek: I've had to work on mission critical projects with 100% code coverage (or people striving for it). The real tragedy isn't mentioned though - even if you do all the work, and cover every line in a test, unless you cover 100% of your underlying dependencies, and cover all your inputs, you're still not covering all the cases.
- jpfr: These are the exact same complaints [re: Rust] of students learning to program for the first time. Have you tried functional programming? Lisp? Ocaml? The complaints of newbie functional programmers are also nearly the same. The "struggle" is necessary. If there is no struggle, there is no learning of fundamentally new approaches you are not yet comfortable with. See it as part of the training regimen that lets people emerge as stronger programmers on the other side.
- x2398dh1: The central thesis of the article [The world’s most valuable resource is no longer oil, but data] is that these companies have become large and the barriers to entry which come from this size may be perceived as being against the public good, as in the case of Standard Oil. That is what is meant by, "Data is the new oil," - it has nothing to do with the actual perceived or actual value of oil or data, but rather the idea that it is impossible to compete with these companies.
- Charlie Demerjian: Since Haswell made the 2-3% a year price increase cadence history, things haven’t gotten better for customers. The price increases have gone steadily up from the 8% to heights unseen since the last bump.
- Storage Mojo: Bottom line: 4k will be a bonanza for the storage industry. Add to that the decreasing costs of other production inputs, and it's clear that the impact of video on how we produce, share, and consume our stories will only accelerate.
- throwasehasdwi: CoDel is different from the packet scheduling algorithm even though both fight bufferbloat in different ways. CoDel is a congestion control algorithm for controlling what happens when outgoing buffers start overflowing. This is on a lower level than TCP and happens to any type of packet. The scheduling algorithm, like VEGAS or BBR, controls the transmit rate of only the TCP protocol.
- Sam Schechner: These new systems crunch mountains of historical and real-time data to predict how customers and competitors will react to any price change under different scenarios, giving them an almost superhuman insight into market dynamics. Programmed to meet a certain goal—such as boosting sales—the algorithms constantly update tactics after learning from experience.
- Veronique Greenwood: This paradox — that some of the most important proteins seem to be the most delicate — may reflect how evolution has shaped them to do their jobs. If a protein has many roles to play, it might gain an advantage from being somewhat unstable and prone to unfolding and refolding, since this could allow it to assume various shapes appropriate to whatever its next target might be.
- Jack Clark: Norvig claims that the unpredictability and emergent behavior endemic to machine learning approaches means computer science is becoming an empirical science where work is defined by experimentation as well as theory. This feels true to me – most AI researchers spend an inordinate amount of time studying various graphs that read out the state of networks as they’re training, and then use those graphs to help them mentally navigate the high-dimensional spaghetti-matrices of the resulting systems.
- bilog78: My fear is that they are producing hardware [re: Volta GPU] which is a little too over-specialized...What we are starting to see now is a renewed differentiation with associated specialization which indicates a loss of disruptive power...For developers, this is not a good thing: it makes it much harder to target hardware classes, and it completely reverts one of the leading pillars of the GPGPU revolution.
- Bryan Catanzaro: It’s worth noting that Volta has the biggest change to the GPU threading model basically since I can remember and I’ve been programming GPUs for a while. With Volta we can actually have forward progress guarantees for threads inside the same warp even if they need to synchronize, which we have never been able to do before. This is going to enable a lot more interesting algorithms to be written using the GPU, so a lot of code that you just couldn’t write before because it potentially would hang the GPU based on that thread scheduling model is now possible.

Is bundling a race to the bottom for content creators? What's the future of game monetization?: the value of games seems to keep falling...The fact that we want everything free now because it costs less (not 'nothing', remember) to produce each additional unit is a fairly entitled view and, I suggest, it would lead to the destruction of the games industry in the same way that it's gutted the music industry...The success of Spotify and Netflix's models in other industries worries me and we see a bit of a move in that direction with things like Humble Bundles...If we're not careful, we'll get to where there's no money to be made in games and only the most trite, generic, relatively low cost and mass-appealing titles (the Call of Duties and FIFAs) will be financially viable...it's worth noting that these titans are resorting to F2P to try and shore up their player numbers. Will we ever see subscription models in new games again?

A 10,000+ phone Chinese click farm looks a lot like Facebook's mobile device testing lab.

Todd Montgomery in Async or Bust!? makes an insightful observation that applies to more than just computers: sequentiality is an illusion. When something happens sequentially, energy was expended to create that order. In computers sequentiality is imposed: think network packets, disk writes, CPU instructions, and requests/responses. That reminded me how chemical reactions are typically driven by thermal fluctuations, but energy must be expended to sequentially ratchet mRNA through a ribosome to create amino acids. That got me thinking about time and how everything appears to us to happen sequentially. Then shouldn't time require energy? On that topic there's a discussion, or two, or three, that says no, but I'm afraid I still don't get it.

A lot of lessons were learned at ServerlessConf. Amiram Shachar with an excellent set of Takeaways From the ServerlessConf 2017. Will James pens his State of Serverless based on the conference, also excellent. There's also Three Things we Learned at Serverless Conf Austin. And there's 5 Takeaways from ServerlessConf. Ryland Goldstein weighed in with his Serverless conference 2017 tl;da (Too long didn’t attend).

What You Need to Know About The Intel AMT Vulnerability. Great detective work by the folks at Embedi. You will not believe this bug. You can log in without a password because the login handling code compares two hashes incorrectly. Take everything you know about comparing two strings and do the opposite. It's not just the coding: how could this have passed any sort of test suite? Remember, this is software that can remotely control your computer even if it’s powered off.
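To make the bug class concrete, here is a minimal Python sketch of a comparison whose length is taken from the attacker-supplied value, which is how public write-ups describe the flaw. This is an illustration, not Intel's code: the function names `buggy_check` and `safe_check` and the sample hash are hypothetical.

```python
import hmac


def buggy_check(expected_hash: str, supplied: str) -> bool:
    # Mimics a strncmp-style call bounded by len(supplied): only the first
    # len(supplied) characters are compared, so an empty string matches.
    n = len(supplied)
    return expected_hash[:n] == supplied[:n]


def safe_check(expected_hash: str, supplied: str) -> bool:
    # Compares the full values in constant time; a length mismatch fails.
    return hmac.compare_digest(expected_hash.encode(), supplied.encode())


# Hypothetical digest value, used only for illustration.
EXPECTED = "6629fae49393a05397450978507c4ef1"

empty_passes_buggy = buggy_check(EXPECTED, "")
empty_passes_safe = safe_check(EXPECTED, "")
```

A single negative test with an empty or truncated response is enough to catch the first version, which is why the test suite question is a fair one.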

At their best software systems are allostatic: anticipating the body’s energy needs [and] preparing to meet those needs before they arise; rather than homeostatic: efficiently maintaining energy regulation in the body. Is “Allostasis” The Brain’s Essential Function?

The ServerlessCast #6 - Event-Driven Design Thinking with Paul Johnston~ Events are core to Serverless: your Lambda function is a response to something happening. If you think about it like a normal monolithic application it's not how we do things. Most people go procedural. I need four things to happen so I'm going to make a function to do all four. Even if it's OO underneath they are still thinking procedurally. Developers need to start thinking distributed and asynchronously. If you go down an event driven thinking process you realize you don't need everything in the same place. You realize you can use a service from any 3rd party service in the world on the basis of an event. One of the problems with Lambda is the infinite scale. Your data must scale with it. You can't have relational database being a single point of failure. Push logic to an orchestration layer: put only one thing in a function. The problem when you do three or four things in a function is what if you want to do the last two steps of the function? You may pass a flag to bypass the first few steps, but it would be easier to separate all the steps out. You want it asynchronously separated out. You need some way of creating an event to call them. You start putting a lot more events into your system. With Serverless hiring less experienced developers can work out better: Instead of hiring high end cloud developers look for people who can think distributed. You don't need typical developer skills, they are more sysadmin skills. Someone with a sysadmin background is more likely to understand distributed thinking that goes with building an entire system of events than a framework developer. Had good success hiring a person with two years of votech training because they didn't have the baggage of working with frameworks and servers and all of those kind of things. That baggage does seem to get in the way. So hire younger hungrier developers who don't have that experience behind them. 
(He didn't mention cheaper, but we know that's part of it.)
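The "one thing per function, connected by events" idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `EventBus` is an in-memory stand-in for a cloud event stream, and the event names and the `validate`, `charge`, and `notify` steps are hypothetical.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for a cloud event stream (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Deliver an event plus any follow-up events; return the types seen."""
        seen = []
        queue = deque([(event_type, payload)])
        while queue:
            etype, data = queue.popleft()
            seen.append(etype)
            for handler in self._handlers[etype]:
                # Each handler returns the events it wants emitted next.
                queue.extend(handler(data))
        return seen


# Each function does exactly one thing and announces the result as an event.
def validate(order):
    return [("order.validated", order)]


def charge(order):
    return [("order.charged", {**order, "charged": True})]


def notify(order):
    return []


bus = EventBus()
bus.subscribe("order.placed", validate)
bus.subscribe("order.validated", charge)
bus.subscribe("order.charged", notify)

# The whole flow, started from the first event.
full_run = bus.publish("order.placed", {"id": 1})
# Only the last step, with no bypass flag needed.
last_step_only = bus.publish("order.charged", {"id": 2, "charged": True})
```

Because the steps are separated, re-running just the tail of the flow means publishing a later event rather than passing a flag into one big function.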

Great detailed description of how to run your game server on Kubernetes. Scaling Dedicated Game Servers with Kubernetes: Part 1 – Containerising and Deploying: In this example, a relatively small amount of custom code (~500 loc) was able to deploy, create, and manage game servers across a large cluster of machines by leveraging the power of software containers and Kubernetes.

Poki Going for Go and Sticking with SQL. They're moving to microservices. Though they were using PHP, they didn't think PHP and microservices were a good fit because: PHP has high startup costs, for every request, database connections and classes have to be instantiated, which adds unnecessary latency; Containerized PHP is a minefield, PHP requires Nginx and PHP-FPM (or similar) for process management and connection pooling. Went with Go for all the usual reasons. They didn't go NoSQL; they were more comfortable with the idea of building small, independent services that get something done and are written to be easily upgraded or replaced whenever needed.

Observations re packet socket exploit: Now, as regards attack surface, they can’t be more different. A sandbox specifically reduces the attack surface of the kernel. All the fun features that an attacker would like to use are locked up. Even if there’s an exploit in say sysctl, if it can’t be called by a sandboxed process, that’s one less thing to worry about. But containers and namespaces do the opposite. They expose all this new attack surface, previously only accessible to root, and let even regular users poke and prod it. And thus what might be a root to root exploit (kinda boring) becomes a privilege escalation.

Messaging Sync — Scaling Mobile Messaging at Airbnb: new messages and thread updates are fetched only when data change, which greatly reduces the number of network requests. This means navigation between the inbox and message thread screens is much faster, hitting the local mobile storage most of the time, instead of issuing a network request with each screen change. Messaging sync also reduces the response size for each network fetch, which results in a 2x improvement in terms of API latency. These user experience gains are magnified in areas with slow networks.

As we reach peak processor it makes sense that an open source royalty-free microprocessor might take root. Open-source chip mimics Linux's path to take on closed x86, ARM CPUs. indolering: The primary advantage is that RISC-V is truly RISC: they have a core ISA that is frozen but extendable. This means that they can have application-specific CPUs with intelligent fallback and full compatibility. Also, Proposal for an ideal extensible instruction set.

Useful set of instructions for How to create microservices and set-up a microservice architecture with MariaDB, Docker and Go. Uses an example of a photo-sharing application.

State of Serverless: Another big theme that came out was that serverless applications enable architectures that are designed around events rather than around data. Subscribing applications to an event queue is a nice way to manage service communication, since you can easily add new services onto an existing queue to modify or add functionality, as opposed to baking data flow directly into applications, which introduces strong coupling...It is extremely interesting how both at architectural and application levels, using first class events is a powerful way of decoupling logic from state: one way data flow makes things simpler to reason about. At the application level we have ELM architecture and React/Redux as core examples, and now in the cloud we can use cloud functions combined with a core event stream to create functional cloud applications that operate at scale.

Great introduction to how the internet works. BBR, the new kid on the TCP block: The Internet was built using an architectural principle of a simple network and agile edges...But while TCP might look like a single protocol, that is not the case. TCP is a common transport protocol header format...For many years, the so-called “Reno” flow control algorithm appears to have become the mainstay of TCP data flow control...BBR is very similar to TCP Vegas, in that it is attempting to operate the TCP session at the point of onset of queuing at the path bottleneck...Like TCP Vegas, BBR calculates a continuous estimate of the flow’s RTT and the flow’s sustainable bottleneck capacity.
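The two estimates mentioned above combine into a bandwidth-delay product, the amount of data that can be in flight before a queue starts to build. Here is a heavily simplified Python sketch of that bookkeeping, assuming a max filter over recent delivery rates and a min filter over recent RTT samples. The class name `PathModel` and the fixed sample window are assumptions for illustration; real BBR uses time-based windows and much more state.

```python
from collections import deque


class PathModel:
    """Toy model of a path: bottleneck bandwidth and minimum RTT."""

    def __init__(self, window=10):
        # Keep only the most recent `window` samples of each measurement.
        self.rates = deque(maxlen=window)
        self.rtts = deque(maxlen=window)

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        # Delivery rate observed over the interval covered by this ACK.
        self.rates.append(delivered_bytes / interval_s)
        self.rtts.append(rtt_s)

    def bottleneck_bw(self):
        # The highest recent delivery rate approximates the bottleneck capacity.
        return max(self.rates)

    def min_rtt(self):
        # The lowest recent RTT approximates the path delay without queuing.
        return min(self.rtts)

    def bdp_bytes(self):
        # Bandwidth-delay product: data in flight at the onset of queuing.
        return self.bottleneck_bw() * self.min_rtt()


model = PathModel()
model.on_ack(delivered_bytes=125_000, interval_s=0.1, rtt_s=0.05)
model.on_ack(delivered_bytes=100_000, interval_s=0.1, rtt_s=0.04)
bdp = model.bdp_bytes()  # roughly 1.25 MB/s * 0.04 s
```

Sending faster than the estimated bandwidth, or keeping much more than the BDP in flight, only fills the bottleneck queue, which is the condition BBR tries to sit just below.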

CPU Utilization is Wrong: CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads. You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC). An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound. Excellent discussion on HackerNews.
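The rule of thumb is simple enough to write down. This is a minimal Python sketch that assumes you already have instruction and cycle counts from hardware performance counters (on Linux, `perf stat` reports both). The function names and the sample numbers are made up for illustration.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle from two hardware counter readings."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def classify(ipc_value: float) -> str:
    # Thresholds follow the rule of thumb quoted above; treat them as a
    # starting point for investigation, not a verdict.
    if ipc_value < 1.0:
        return "likely memory bound"
    if ipc_value > 1.0:
        return "likely instruction bound"
    return "inconclusive"


# Hypothetical counter readings for one workload.
sample_ipc = ipc(instructions=780_000_000, cycles=1_000_000_000)
verdict = classify(sample_ipc)
```

A box showing 90% CPU with an IPC like this one is mostly waiting on memory, so a faster clock would help less than better data locality.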

Be the first person in your neighborhood to run their own atomic clock. Chip Overclock, true to his name, shows how in My Stratum-0 Atomic Clock. You'll learn lots about clocks, spend several thousand dollars, and be able to serve as a stable source of highly accurate time when the timeocolypse hits.

Due to a temporary buying opportunity Backblaze is now using some enterprise drives, not just consumer drives. Hard Drive Stats for Q1 2017. Not enough data yet to determine reliability, but they've found: The enterprise drives load data faster; The enterprise drives use more power; Enterprise drives have some nice features. Faster doesn't seem to matter: drive speed has never been a bottleneck in our system. A system that can load data faster will just “get in line” more often and fill up faster. There is always extra capacity when it comes to accepting data from customers.

Build or buy your chat service? If you choose build here's A simple chat architecture for your MVP: For Pub/Sub, we chose Redis — an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker ...messages should be saved in PostgreSQL and published in Redis...The connection from the WebSocket to the Client should not be responsibility of the web server...The client could only receive new messages via WebSockets, the insertion of new messages would still occur via HTTP.
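The write path described above (save, then publish, with delivery handled separately) can be sketched without any infrastructure. In this Python sketch `MessageStore` stands in for PostgreSQL and `PubSub` stands in for Redis pub/sub; both classes, the `post_message` handler, and the channel naming are assumptions for illustration, not the article's code.

```python
from collections import defaultdict


class MessageStore:
    """In-memory stand-in for the PostgreSQL messages table."""

    def __init__(self):
        self.rows = []

    def insert(self, room, author, text):
        row = {"id": len(self.rows) + 1, "room": room,
               "author": author, "text": text}
        self.rows.append(row)
        return row


class PubSub:
    """In-memory stand-in for Redis publish/subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers[channel]:
            callback(message)
        # Number of subscribers that received the message.
        return len(self._subscribers[channel])


def post_message(store, pubsub, room, author, text):
    """HTTP insert path: persist first, then fan out to listeners."""
    row = store.insert(room, author, text)
    pubsub.publish(f"room:{room}", row)
    return row


store, pubsub = MessageStore(), PubSub()

# The WebSocket process only listens; here a list plays the client socket.
client_inbox = []
pubsub.subscribe("room:lobby", client_inbox.append)

saved = post_message(store, pubsub, "lobby", "ana", "hello")
```

Persisting before publishing means a client that misses the live push can still fetch the message over HTTP later.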

Microsoft officially unveils Project Neon, the 'Fluent Design System' for Windows 10: The goal of Fluent Design is to deliver harmonious, intuitive, inclusive and responsive cross-device experiences and interactions. Developers will be able to build beautiful, expressive apps with the Fluent Design language, with animations, blur and fluidity, according to Microsoft.

Bulat-Ziganshin/FastECC: provides O(N*log(N)) Reed-Solomon coder, running at 1.2 GB/s on i7-4770 with 2^20 blocks. Version 0.1 implements only encoding, so it isn't yet ready for real use.

Nordstrom/hello-retail: an open-source, mobile-first, 100% serverless, and event-driven functional proof-of-concept showcasing a central unified log approach as applied to the retail platform space.

pubkey/rxdb: Client-Side Database for Browsers, NodeJS, electron, cordova, react-native and every other javascript-runtime.

Deterministic Components for Interactive Distributed Systems – with Transcript: First of all, let’s start with defining determinism. Our very first definition will go along the following lines: “A program is deterministic if and only if its outputs are 100% defined by its inputs”. There are several important observations which follow from our Definition 1. The first one is that “Non-deterministic program cannot be fully testable using only deterministic testing”.
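One common way to satisfy that definition is to treat every non-deterministic source (clock, random numbers) as an explicit input that can be recorded and replayed. The Python sketch below shows the idea under that assumption; the `step`, `capture_event`, and `run` functions and the toy state update are hypothetical.

```python
import random
import time


def step(state: int, event: dict) -> int:
    """Pure logic: the new state depends only on (state, event)."""
    return state + event["roll"] * event["dt_ms"]


def capture_event() -> dict:
    # All non-determinism lives here, outside the logic under test.
    return {"roll": random.randint(1, 6),
            "dt_ms": int(time.time() * 1000) % 100}


def run(events, state: int = 0) -> int:
    for event in events:
        state = step(state, event)
    return state


# Record the inputs once, then replay them as often as needed.
recorded = [capture_event() for _ in range(5)]
first = run(recorded)
replayed = run(recorded)

# A hand-written input log makes a repeatable test case.
fixed = run([{"roll": 2, "dt_ms": 10}, {"roll": 3, "dt_ms": 5}])
```

Replaying a recorded input log through `run` always lands on the same state, which is what makes the logic testable with ordinary deterministic tests.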

Privacy Threats through Ultrasonic Side Channels on Mobile Devices: Device tracking is a serious threat to the privacy of users, as it enables spying on their habits and activities. A recent practice embeds ultrasonic beacons in audio and tracks them using the microphone of mobile devices. This side channel allows an adversary to identify a user's current location, spy on her TV viewing habits or link together her different mobile devices.

The Restoration of Early UNIX Artifacts: The history of the development of UNIX has been well documented, and over the past decade or so, efforts have been made to find and conserve the software and documentation artifacts from the earliest period of UNIX history. This paper details the work that has been done to restore the artifacts from this time to working order and the lessons learned from this work.
