Stuff The Internet Says On Scalability For December 23rd, 2016

Hey, it's HighScalability time:

A wondrous ethereal mix of technology and art. Experience of "VOID"
If you like this sort of Stuff then please support me on Patreon.

  • 2+ billion: Google lines of code distributed over 9+ million source files; $3.6 bn: lower Google taxes using Dutch Sandwich; $14.6 billion: aggregate value of all cryptocurrencies; 2x: graphene-fed silkworms produce silk that conducts electricity; < 100: scientists looking for extraterrestrial life; 48: core Qualcomm server SoC; 455: original TV series in 2016;

  • Quotable Quotes:
    • Ben Thompson~ It's so easy to think of tech with an 80s mindset with all the upstarts. We still glorify people in garages. The garage is gone...Our position in the world is not the scrappy upstart. It is the establishment.
    • The Attention Merchants: True brand advertising is therefore an effort not so much to persuade as to convert. At its most successful, it creates a product cult, whose loyalists cannot be influenced by mere information
    • @seldo: Speed of development always wins. Performance problems will (eventually) get engineered away. This is nearly always how technology changes.
    • @evgenymorozov: How Silicon Valley can support basic income: give everyone a bot farm so that we can make advertising $ from fake traffic to their platforms
    • @avdi: Apple has 33 Github repos and 56 contributors. Microsoft now has ~1,200 repos and 2,893 contributors.
    • Peter Norvig: Understanding the brain is a fascinating problem but I think it’s important to keep it separate from the goal of AI which is solving problems ... If you conflate the two it’s like aiming at two mountain peaks at the same time—you usually end up in the valley between them .... We don’t need to duplicate humans ... We want humans and machines to partner and do something that they cannot do on their own.
    • Brave New Geek: Unbounded anything—whether it's queues, message sizes, queries, or traffic—is a resilience engineering anti-pattern. Without explicit limits, things fail in unexpected and unpredictable ways. Remember, the limits exist, they’re just hidden. By making them explicit, we restrict the failure domain giving us more predictability, longer mean time between failures, and shorter mean time to recovery at the cost of more upfront work or slightly more complexity.
    • Naren Shankar (Expanse): Everybody feels like they can look at the show and find parts of themselves in it. When you can give people collective ownership of the creative product you get the best from people. At the end of the day it shows. People work their asses off and accomplish the impossible.
    • Richard Jones: a corollary of Moore’s law (sometimes called Rock’s Law). This states that the capital cost of new generations of semiconductor fabs is also growing exponentially
    • Waterloo: His [Napoleon] strategy was simple. It was to divide his enemies, then pin one down while the other was attacked hard and, like a boxing match, the harder he punched the quicker the result. Then, once one enemy was destroyed, he would turn on the next. The best defense for Napoleon in 1815 was attack, and the obvious enemy to attack was the closest.
    • Daniel Lemire: beyond a certain point, reducing the possibility of a fault becomes tremendously complicated and expensive… and it becomes far more economical to minimize the harm due to expected faults
    • @greglinden: “For some products at Baidu, the main purpose is to acquire data from users, not revenue.” — @stuhlmueller
    • strebler:  Deep Learning has made some very fundamental advances, but that doesn't mean it's going to make money just as magically!
    • sulam: Twitter clearly doesn't have growth magic (or they'd be growing faster) -- but is that an engineer's fault? At the end of the day, any user facing engineering is beholden to the product team. Engineers at Twitter can run experiments, but they can't get those experiments shipped unless a PM is behind it.
    • Gil Tene: The right way to read "99%'ile latency of a" is "1 out of 100 occurrences of 'a' took longer than this. And we have no idea how long". That is the only information captured by that metric. It can be used to roughly deduce "what is the likelihood that 'a' will take longer than that?". But deducing other stuff from it usually simply doesn't work.
    • @esh: Unheralded tiny features like AWS Lambda inside Kinesis Firehose streams replace infrastructure monstrosities with a few lines of code
    • @postwait: Listening to this twitter caching talk... *so* glad my OS doesn't even contemplate OOMs. How is that shit still in Linux? A literal WTF.
    • SomeStupidPoint: Mostly, it was just a choice to save $1-2k on a laptop (every 1-2 years) and spend the money on cellphone data and lattes.
    • @timbray: Oracle trying to monetize Java... Golang/Rust/Elixir all looking better. Assume all JVM langs are potential targets.
    • Kathryn S. McKinley: In programming languages research, the most revolutionary change on the horizon is probabilistic programming, in which developers produce models that estimate the real world and explicitly reason about uncertainty in data and computations. 
    • cindy sridharan: Four Golden Signals 1) Latency 2) Traffic 3) Errors 4) Saturation
    • @FioraAeterna: as a tech company grows in size, the probability of it developing its own in-house bug tracking system approaches 1
    • The Attention Merchants: In 1928, Paley made a bold offer to the nation’s many independent radio stations. The CBS network would provide any of them all of its sustaining content for free—on the sole condition that they agree to carry the sponsored content as well
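
The Brave New Geek quote above about unbounded queues is easy to make concrete. Here's a minimal sketch, using only Python's standard `queue` module, of what an explicit limit looks like. The `MAX_PENDING` value and the `submit` helper are made up for illustration; the right limit depends on your own memory and latency budgets.

```python
import queue

# Hypothetical limit chosen for the example, not a recommendation.
MAX_PENDING = 3


def submit(q, item):
    """Try to enqueue without blocking; reject instead of growing forever."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # The limit is now explicit: the caller can shed load, retry later,
        # or return an error to its own client.
        return False


pending = queue.Queue(maxsize=MAX_PENDING)
results = [submit(pending, n) for n in range(5)]
print(results)
```

The first three submissions are accepted and the rest are rejected at the edge, which is the "restrict the failure domain" part of the quote: the failure shows up as a visible rejection rather than as memory exhaustion somewhere else.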

  • philips: Essentially I see the world broken down into four potential application types: 1) Stateless applications: trivial to scale at a click of a button with no coordination. These can take advantage of Kubernetes deployments directly and work great behind Kubernetes Services or Ingress Services. 2) Stateful applications: postgres, mysql, etc which generally exist as single processes and persist to disks. These systems generally should be pinned to a single machine and use a single Kubernetes persistent disk. These systems can be served by static configuration of pods, persistent disks, etc or utilize StatefulSets. 3) Static distributed applications: zookeeper, cassandra, etc which are hard to reconfigure at runtime but do replicate data around for data safety. These systems have configuration files that are hard to update consistently and are well-served by StatefulSets. 4) Clustered applications: etcd, redis, prometheus, vitess, rethinkdb, etc are built for dynamic reconfiguration and modern infrastructure where things are often changing. They have APIs to reconfigure members in the cluster and just need glue to be operated natively and seamlessly on Kubernetes, and thus the Kubernetes Operator concept

  • Top 5 uses for Redis: content caching; user session store; job & queue management; high speed transactions; notifications.

  • Is machine learning being used in the wild? The answer appears to be yes. Ask HN: Where is AI/ML actually adding value at your company? Many uses you might expect and some unexpected: predicting if a part scanned with an acoustic microscope has internal defects; find duplicate entries in a large, unclean data set; product recommendations; course recommendations; topic detection; pattern clustering; understand the 3D spaces scanned by customers; dynamic selection of throttle threshold; EEG interpretation; predict which end users are likely to churn for our customers; automatic data extraction from web pages; model complex interactions in electrical grids in order to make decisions that improve grid efficiency; sentiment classification; detecting fraud; credit risk modeling; Spend prediction; Loss prediction; Fraud and AML detection; Intrusion detection; Email routing; Bandit testing; Optimizing planning/task scheduling; Customer segmentation; Face- and document detection; Search/analytics; Chat bots; Topic analysis; Churn detection; phenotype adjudication in electronic health records; asset replacement modeling; lead scoring; semantic segmentation to identify objects in the user's environment to build better recommendation systems and to identify planes (floor, wall, ceiling) to give us better localization of the camera pose for height estimates; classify bittorrent filenames into media categories; predict how effective a given CRISPR target site will be; check volume, average ticket $, credit score and things of that nature to determine the quality and lifetime of a new merchant account; anomaly detection; identify available space in kit from images; optimize email marketing campaigns; investigate & correlate events, initially for security logs; moderate comments; building models of human behavior to provide interactive intelligent agents with a conversational interface; automatically grading kids' essays; predict the probability of car accidents based on the sensors of your smartphone; predict how long JIRA tickets are going to take to resolve; voice keyword recognition; produce digital documents in legal proceedings; PCB autorouting.

  • Videos from Systems We Love are now available. Especially liked Lessons from the Cell: What Software Developers Can Learn From Biochemical Systems.

  • Modern Garbage Collection generated a lot of garbage, er discussion. Great threads on mechanical-sympathy, on Hacker News, on Reddit, and on Golang. Konstantin Khomoutov: The article is a typical case of the "sofa theorist" write-up: take a couple of phrases out of several Go announces, do no research, read no source code, be not familiar with any developer working on the Go's GC, have not to hear their reasoning, lump together whatever you know about your pet platform (Java), point at decades of academic research. Done.

  • A handy dandy Big-O Complexity Chart that also categorizes Common Data Structure Operations and Array Sorting Algorithms.

  • American Pickers with a fun episode picking one of the first computer stores in NY that is finally closing its doors. If you like early Apple computernalia this will be a trip down memory lane for you. The owners said they want to "clear a path to the future," which is a phrase I can't get out of my mind. The business has changed, they said. They used to repair Macs and now there's no need. There are 6 Apple stores within walking distance of their store. Interesting how pictures of the early days of the store show it packed with people just hanging out. At one time the computer store was a 3rd place, like book stores used to be, like coffee shops are now, and like ???? will be.

  • So cool! Pixar in a Box: a behind-the-scenes look at how Pixar artists do their jobs. You will be able to animate bouncing balls, build a swarm of robots, and make virtual fireworks explode. 

  • In Flight Hacking System. Even if you aren't interested in the hacking part this is an awesome explanation of how in-flight entertainment and airplane networks work. They found vulnerabilities in the PHP seat-to-seat chat system, the credit card check could be bypassed, the file system could be accessed, and most telling of all, a simple SQL injection attack worked.
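
For anyone who hasn't seen why "a simple SQL injection attack" is such a telling finding, here's a small sketch using Python's built-in `sqlite3` module. The `passengers` table, the seat numbers, and both lookup functions are invented for the example; nothing here reflects the actual in-flight system, which the source only says was written in PHP.

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (seat TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [("12A", "alice"), ("12B", "bob")],
)


def lookup_unsafe(seat):
    # Vulnerable: user input is pasted straight into the SQL text.
    sql = f"SELECT name FROM passengers WHERE seat = '{seat}'"
    return conn.execute(sql).fetchall()


def lookup_safe(seat):
    # Parameterized: the driver treats the input as a value, never as SQL.
    return conn.execute(
        "SELECT name FROM passengers WHERE seat = ?", (seat,)
    ).fetchall()


attack = "x' OR '1'='1"
unsafe_rows = lookup_unsafe(attack)  # the WHERE clause is now always true
safe_rows = lookup_safe(attack)      # no seat has that literal value
print(len(unsafe_rows), len(safe_rows))
```

The unsafe version hands back every row because the attacker's quote characters rewrite the query. The parameterized version returns nothing, since no seat is literally named `x' OR '1'='1`.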

  • Brendan Gregg is so convincing you'll probably only need 5 minutes. Give me 15 minutes and I'll change your view of Linux tracing. Pretty impressive. For more tracing goodness there's DTrace at Home, which shows how to solve an out of file descriptor problem.

  • Because there's always a manifesto. The Serverless Compute Manifesto: Functions are the unit of deployment and scaling; No machines, VMs, or containers visible in the programming model; Permanent storage lives elsewhere; Scales per request; Never pay for idle; Implicitly fault-tolerant; Metrics and logging are a universal right. All sounds good, though in practice groups of functions along with supporting libraries are usually deployed.

  • Here's a Practical Deep Learning For Coders you may find useful. And it's free! From the author jph00: the expectation is that people put in at least 10 hours a week. So the amount of material is designed to be what you can learn in 10 hours of study, with the support of the wiki and forums. The lessons are not designed to stand alone, or to be watched just once straight through. The details around setting up an effective environment are a key part of the course, not a diversion. We really want to show all the pieces of the applied deep learning puzzle.

  • The drone wars are close they are. U.S. Navy's Drone Boat Swarm Practices Harbor Defense: Drone boats belonging to the U.S. Navy have begun learning to work together like a swarm with a shared hive mind...Four drone boats showed off their improved control and navigation software by patrolling an area of 4 nautical miles by 4 nautical miles...If they spotted a possible threat, the swarm of roboboats would collectively decide which of them would go track and trail the intruder vessel...The system can be installed on many common boats already being used by the U.S. Navy.

  • Nebula as a Storage Platform to Build Airbnb’s Search Backends. A low latency platform "to keep rapidly growing history of user behavior. It requires real-time user actions to be recorded and available immediately to help personalize search results (and improve other products). A data snapshot needs to be provided so that other applications can use it (e.g. for analytics or validation). It needs periodic compactions to aggregate and potentially truncate the older history, plus bulk load a new batch of features (computed offline) back to the system." They used Nebula, a schema-less, versioned data store service with both real-time random data access and offline batch data management. The data model for personalization maps a user id to a column each for searches, bookings, listing_views, and other user interactions. Each column can accumulate a large number of events with different timestamps.
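
To picture the data model described above (user id, a column per interaction type, many timestamped events per column), here's a toy in-memory sketch. It is not Nebula's API, which the source doesn't show; the `UserHistoryStore` class and its method names are assumptions made up for the example, and "compaction" here is reduced to simply truncating old events.

```python
from collections import defaultdict

# Column names taken from the description; everything else is hypothetical.
COLUMNS = ("searches", "bookings", "listing_views")


class UserHistoryStore:
    """Toy model: user_id -> column -> list of (timestamp, event)."""

    def __init__(self):
        self._rows = defaultdict(lambda: {c: [] for c in COLUMNS})

    def record(self, user_id, column, timestamp, event):
        # Real-time write path: raises KeyError for an unknown column.
        self._rows[user_id][column].append((timestamp, event))

    def read(self, user_id, column):
        # Events come back ordered by timestamp, oldest first.
        return sorted(self._rows[user_id][column], key=lambda pair: pair[0])

    def compact(self, cutoff):
        # Stand-in for periodic compaction: drop events older than cutoff.
        for columns in self._rows.values():
            for name in columns:
                columns[name] = [
                    (ts, ev) for ts, ev in columns[name] if ts >= cutoff
                ]


store = UserHistoryStore()
store.record(42, "searches", 100, "paris")
store.record(42, "searches", 50, "rome")
store.record(42, "bookings", 120, "listing-7")
store.compact(cutoff=80)
print(store.read(42, "searches"))
```

After compaction only the newer search survives, which is the "truncate the older history" behavior in miniature. A real system would also aggregate, snapshot, and bulk load offline features, none of which is modeled here.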

  • Peel-and-stick is coming to sensors. Sensor Net to Run on RF Power: PARC hopes to stage demonstrations within 18 months of peel-and-stick temperature and humidity sensors and an RF hub to power them. The sensors target costs of less than $10 while the hub would send micro-joules of energy distances initially up to 10 meters and cost less than $100.

  • OK, this React code from @thomasfuchs  may really hurt your eyes: ~15 years trying to make everyone separate HTML, JS & CSS. And then suddenly everything went south and we’re writing code like this. Lots of strong discussion in this HackerNews thread. Reactors make a good case for their favorite framework, but the problem is there's probably no way to do this right. For a great intro to React there's React: Facebook's Functional Turn on Writing Javascript.

  • JSConf.Asia 2016 videos are now available.

  • Stop hating the pull. Prometheus loves it. Pull doesn't scale - or does it?. It's at scale: a single big Prometheus server can easily store millions of time series, with a record of 800,000 incoming samples per second. How? For each target, the Prometheus server simply fetches the current state of all metrics of that target over HTTP (in a highly parallel way, using goroutines) and has no other execution overhead that would be pull-related...TCP/HTTP overhead in Prometheus is still negligible compared to the other work that the Prometheus server has to do to ingest data...For scaling purposes, it doesn't matter who initiates the TCP connection over which metrics are then transferred...Prometheus is not an event-based system...We would argue that you cannot escape this configuration effort for serious monitoring setups in any case.
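
The pull loop described above is simple enough to sketch. Prometheus itself is written in Go and scrapes over HTTP with goroutines; the version below is only a Python analogy, with a thread pool standing in for goroutines and plain callables standing in for HTTP metrics endpoints. The `make_target` and `scrape_all` names are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_target(name, value):
    """Hypothetical target: a callable that returns its current metrics."""
    def current_metrics():
        return {"requests_total": value}
    return name, current_metrics


TARGETS = dict(make_target(f"app-{i}", i * 10) for i in range(4))


def scrape_all(targets, now):
    """Fetch the current state of every target in parallel."""
    def scrape(item):
        name, fetch = item
        return [(name, metric, now, value) for metric, value in fetch().items()]

    samples = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() preserves target order even though fetches run concurrently.
        for batch in pool.map(scrape, targets.items()):
            samples.extend(batch)
    return samples


samples = scrape_all(TARGETS, now=time.time())
print(len(samples))
```

The server holds the target list and does the fetching, so the only per-target work is one request per scrape interval, which is the article's point about pull having no extra execution overhead.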

  • NIPS is a conference about Neural computation, learning theory, algorithms and architectures, neuroscience, vision, speech, control and diverse applications. Here's a list of All Code Implementations for NIPS 2016 papers. And if that's not enough, here are 50 things I learned at NIPS 2016, which is a fantastic review. There's a lot of stuff going on, but then there's always a lot of stuff going on.


  • To save money on S3: Delete files after a certain date that are no longer relevant; Delete unused files which can be recreated; When using S3 versioned bucket, use “lifecycle” feature to delete old versions; Clean up incomplete multipart uploads; Compress Data Before You Send Them to S3; Your Data Format Matters; Use Infrequent Access Storage Class; API calls cost the same irrespective of the data size; Batch objects whenever it makes sense to do so; if you do a lot of cross region S3 transfers it may be cheaper to replicate your S3 bucket to a different region than download each between regions each time.
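
Several of the tips above (expire old files, delete old versions, clean up incomplete multipart uploads, move data to Infrequent Access) can be expressed as a single S3 lifecycle rule. Here's a sketch that only builds the configuration as a Python dict. The rule ID, the `logs/` prefix, and all the day counts are made-up values for illustration. To my knowledge the dict matches the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but check the current AWS docs before applying it to a real bucket.

```python
import json


def lifecycle_rules(prefix="", ia_after_days=30, expire_after_days=365,
                    old_version_days=30, abort_upload_days=7):
    """Build a lifecycle configuration; all defaults are illustrative."""
    return {
        "Rules": [{
            "ID": "cost-control",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            # Move objects to the Infrequent Access storage class.
            "Transitions": [
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"}
            ],
            # Delete objects that are no longer relevant.
            "Expiration": {"Days": expire_after_days},
            # In a versioned bucket, delete old versions.
            "NoncurrentVersionExpiration": {"NoncurrentDays": old_version_days},
            # Clean up incomplete multipart uploads.
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": abort_upload_days
            },
        }]
    }


config = lifecycle_rules(prefix="logs/")
print(json.dumps(config, indent=2))
```

Applying it would look something like `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="my-bucket", LifecycleConfiguration=config)`, where the bucket name is a placeholder.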

  • Which browser is fastest? Edge first, Chrome second, and Firefox third. Browser benchmark battle October 2016: Chrome vs. Firefox vs. Edge.

  • We have a fake benchmark problem. Benedikt Meurer with an epic post on The truth about traditional JavaScript benchmarks: If we are serious about performance for the web, we need to start judging browser by real world performance and not their ability to game four year old benchmarks. We need to start educating the (tech) press, or failing that, at least ignore them.

  • The interesting process Riot Games went through to reduce memory usage in their game. ELEMENTALIST LUX: 10 SKINS IN 30 MEGABYTES

  • On Bitcoins, Tulips And IRS Tax Compliance: If you were using bitcoins to hide income on a grand scale, this shot across the bow by the IRS should make you reflect that it probably was not such a good idea.  

  • heathermiller/dist-prog-book: This is a book about the programming constructs we use to build distributed systems. These range from the small, RPC, futures, actors, to the large; systems built up of these components like MapReduce and Spark. We explore issues and concerns central to distributed systems like consistency, availability, and fault tolerance, from the lens of the programming models and frameworks that the programmer uses to build these systems.

  • Catalyzing Cloud-Fog Interoperation in 5G Wireless Networks: An SDN Approach: In this article, we consider fog computing as an ideal complement rather than a substitute of cloud computing, and we propose a software defined networking (SDN) enabled framework for cloud-fog interoperation, aiming at improving quality of experience and optimizing network resource usage.

  • Greg geeks with more Quick links.