
Stuff The Internet Says On Scalability For December 7th, 2018

Wake up! It's HighScalability time:

This is your 1500ms latency in real life situations - pic.twitter.com/guot8khIPX November 27, 2018

Do you like this sort of Stuff? Please support me on Patreon. I'd really appreciate it. Know anyone looking for a simple book explaining the cloud? Then please recommend my well reviewed (31 reviews on Amazon and 72 on Goodreads!) book: Explain the Cloud Like I'm 10. They'll love it and you'll be their hero forever. And if you know someone with hearing problems they might find Live CC very useful.

$181.5M: top 10 YouTube star earnings; 153 million/1.2 billion/27 billion: Reddit posts/comments/votes; 12%: cut for Epic's new Steam competitor; 2,368: Baidu AI patents; 360: Walmart cleaning robots; 2:1: AMD CPUs outselling Intel; 11157: images in Disguised Faces in the Wild dataset; 7,518: FCC approved SpaceX LEO satellites; $40 million: saved Down Under by Tesla’s giant battery; intense beam of ultra-high energy heavy ions: testing AI chips for space; $100 million: bot heist; 16TB: HAMR hard drives; 12.8Tbps: new Tofino ASIC; 85%: cloud-hosted TensorFlow workloads run on AWS; 39%: tech workers depressed; 16%: drop in GPU shipments; 4 x 10⁸⁴: photons ever emitted in the universe; 0.00%: blockchain success rate;

Quotable Quotes:
- @math_rachel: About half of > 200 people seeking visas for @black_in_ai #NeurIPS2018 were denied. This is a huge issue harming our field.
- @heipei: It's Friday, I've been in a jumpsuit doing manual labor all day (crazy, I know) and weighing my options between passing out on the couch over some Youtube videos, reading the Friday @highscal blog post or writing code. Currently favouring the path of least resistance 😛
- Paul Tassi: Again, it’s hard to overstate just what a big deal this is, and it’s a massive undertaking for Epic, but one they believe they’re able to handle given how massive Fortnite has scaled to be. Epic famously dodged the Apple store entirely when releasing Fortnite on iOS, and it’s clear those were the first warning shots for something like this.
- IBM: Our device uses phase-change memory (PCM) for in-memory computing. PCM records synaptic weights in its physical state along a gradient between amorphous and crystalline. The conductance of the material changes along with its physical state and can be modified using electrical pulses. This is how PCM is able to perform calculations. Because the state can be anywhere along the continuum between 0 and 1, it is considered an analog value, as opposed to a digital value, which is either a 0 or a 1, nothing in between.
- @ivanveram: World R&D leading companies 2017 in US$. No.1: Amazon; No.2: Alphabet (Google); No.3: Samsung. :) That´s the way it is!!!. :)
- 40acres: Full disclosure, I work for Intel. We have a fabrication plant in Chengdu, it's public knowledge that this fab is helping to manufacture products built on the latest process technology. As I've come to learn more about China's tactics when dealing with foreign companies it's become of great concern to me what this plant means for our future. I don't think it would be far-fetched to assume that some very protected and valuable IP has leeched through our doors and into China's hands. In all honesty I really can't fathom how the American government let this deal occur.
- @SimonSapin: “[…] still had a buffer overrun discovered in 2016 (in code added in the 2001 and 2002). This shows that the notion that code written in a memory-unsafe language becomes safe by being “battle-hardened” if it has been broadly deployed for an extended period of time is a myth.”
- charity.wtf: There actually seems to be a direct link between teams that give engineers lots of leeway to own their technical decisions and that team’s ability to hire and retain top-tier talent, particularly senior talent. Everything is a tradeoff, obviously, but accepting somewhat more chaos in exchange for a stronger sense of individual ownership is usually the right one, and leads to higher-performing teams in the long run.
- @davidgerard: The purchasing power of Bitcoin and most cryptos is one-fifth of what it was in December. That's 400% inflation. Ether's is one-tenth - that's 900% inflation. Dunno if you count that as "hyper-".
- James Hamilton: I’ve long predicted that machine learning workloads will require more server resources than all current forms of server computing combined.
- @swardley: X : What's the difference between AWS QL DB and Blockchain? Me : Not having to burn the planet in pursuit of proof of work that most use cases don't actually need? X : But it's centralised. Me : Is that an argument for burning the planet? Where are we upto in global energy use?
- Eric Elliott: Good MOP means that instead of all of these systems reaching out and directly manipulating each other’s state, the system communicates with other components via message dispatch.
- @mweagle: HN: I could get a puppy in a weekend. World: puppies grow up and are dogs for a long time. Will you take care of them? HN: That's a dog problem. World:
- chubot: You need a new business model to go along with the new tech. What is that for crypto?
- @QuinnyPig: Proposal: all AWS service names get a 30 second review from me before they ship. “Route53?” “Love it.” “Snowball Edge?” “Urban Dictionary both of those words.” “Systems Manager Session Mana—wow that sounds terrible now that I hear it spoken aloud.” I’m here to help. “Fargate?” “Sounds clever, but nobody will know what it does.” “Timestream?” “I have a giant repo of Doctor Who references.” “We’ll risk it.” “The repo is bigger on the inside.”
- @swardley: ... this is the future of software engineering. A combination of development, architectural and finance skills. Monitoring, operating, refactoring and developing based upon capital flow. Even risk management will utilise this. The future practices are very different from today.
- Matt Klein: The best case for open source has been with software infrastructure, rather than software application projects. If cloud computing companies become the infrastructure providers for software, their market control might allow them to take over open-source projects, and sell those software services at a lower price point than companies who market open source services. If this scenario comes to pass, is there any future for open source companies?
- Elizabeth Woyke: [Audi] plans to build its own, private wireless network at its production sites using 5G, the next generation of cellular technology. Audi believes this private 5G network, which the company would be responsible for managing, will enable it to connect manufacturing robots and other devices faster and more securely than existing Ethernet, Wi-Fi, or 4G LTE options
- Netflix: The largest cache that we have warmed up is about 700 TB and 46 billion items. This cache had 380 nodes replicas. The cache copy took about 24 hours with 570 populator instances.
- @JeffDean: In April, '17, @jsomers of @NewYorker reached out & said he wanted to do a small profile of me & my longtime colleague Sanjay Ghemawat, watch us work for a few hours, maybe dinner, etc. It came out today. I think it captures our working style really well.
- TheGRS: My take is that if we want both indie games to be a thing and have good indie games, then maybe Valve should pay attention to this sort of feedback and make sure the hard-working, full-time indie devs are getting enough attention to be sustainable. Otherwise you basically get the Youtube model of lowest common denominator drivel.
- Martin Griggs: Taking risks and adopting new technology is fine. What you should avoid though is jumping the technology bandwagon. If that new shiny piece of tech is really as fantastic as people claim it is, it will still be great in a year or so. Keep an eye out for anything that could improve your stack, but make sure to vigilantly guard your stack and your sanity.
- Daniel Henrique: We can handle 501,120,000 votes, scattered throughout the day, with $0.16. That is almost the sum of the populations of the USA and Indonesia, the 3rd and 4th biggest populations of the world.
- @troyhunt: Just rolled over past 2.2M verified @haveibeenpwned subscribers last night, tracking at 1,261 new ones per day for the last 4 weeks. Interesting watching how this is impacting notifications post data load, a huge number of breach emails going out these days.
- @neilcybart: Given Apple's current buyback pace ($20B per quarter), for every $10 price drop in AAPL shares, management can repurchase an additional one percent of the company over two years.
- Timothy B. Lee: Fares are based on time and distance, and customers can expect fares to be roughly on par with what you'd pay for an Uber or Lyft trip—perhaps even a bit lower. The above Waymo-provided screenshot shows a customer booking this trip, which is 4.6 miles long and takes about 12 minutes. Waymo charges $7.32 for the trip. I punched the same route into Lyft and Uber apps on Tuesday afternoon and got quotes of $8.29 and $9.38, respectively.
- @mijustin: MailChimp was a side-project for 6 years. Todoist was a side-project for 4 years. Basecamp took 2 years before it was paying their salaries. Maybe the rest of us shouldn't be in such a hurry.
- @codinghorror: Microsoft drops Edge rendering engine due to poor adoption rates. Cool, cool. I suggest Microsoft should similarly drop Bing, which also has poor adoption after many years, and switch to Google search.
- @Carnage4Life: Bing is a profitable business. What's the value of a browser virtually no one uses and isn't even supported on the most popular versions of Windows in the wild (Win7 + Win8)? I've been having this debate for 2.5 years at Microsoft. Come at me bro.
- Ahmed Kabil: The Equation of Time Cam was built to enable the Clock to convert from local solar time to absolute time while accounting for these predicted long-term variations over the next 10,000 years
- @qconlondon: Insider view on the #softwarearchitecture of @StarlingBank from @jasonmaude : "The system is composed of around 19 applications hosted on AWS and running Java and backed by a PostgreSQL database."
- Joanne Itow: In 2017 the United States maintained a semiconductor trade surplus of $2.1 billion with China. In addition, most of the U.S. imports of semiconductors from China are products from U.S. semiconductor companies that are either designed and/or have undergone front-end fabrication outside of China.
- Robin Marx: I still think QUIC might fail. I don’t think the chance is high, but it exists. Conversely, I also don’t think there is a big chance it will succeed from the start and immediately gain a huge piece of the pie with a broader audience outside of the bigger companies. I think the chance is much higher that it fails to find a large uptake at the start, and that it instead has to gain a broad deployment share more slowly, over a few years. I think this will be slower than what we’ve seen with HTTP/2, but (hopefully) faster than IPv6. I personally still believe strongly in QUIC (I should, I’m betting my PhD on it…). It’s the first major proposed change on the transport layer that might actually work in practice (the arguments in this post are several times worse and more extensive for many previous options).
- Charlotte Jee: Facebook found ways to access users’ call history without alerting them, in order to make “People You May Know” suggestions and tweak news-feed rankings. Facebook planned to make it as hard as possible for users to know that this was happening.
- ralish: I’m an Australian and am in something close to a state of despair between this legislation passing with the help of our main and utterly useless opposition, and alongside several unrelated non-technology political issues. Against seemingly all expert advice, both legal and technical (excluding of course the LEOs lobbying for it), the legislation was passed without the time for the vast majority of the politicians voting on it to even read it, spurred by a scare campaign by the government that these laws are necessary to prevent terrorism over the Christmas period.
- Mark LaPedus: The report forecasts predict that the 3D XPoint memory’s sub-DRAM prices will drive revenues to over $5 billion by 2028, while stand-alone MRAM and STT-RAM revenues will exceed $4 billion, reaching one hundred times the 2017 level. “Emerging memory technologies are just now gaining traction, but soon process shrinks and improving economies of scale will reduce prices, and these technologies will replace today’s volatile and nonvolatile memories.”
- @xaprb: “A programmer does not primarily write code; rather, he primarily writes to another programmer about his problem solution.” There’s no better argument for observability. Even a great programmer’s mental models, and thus writing, is wrong; and observability is key to finding out.
- @chrismunns: Friendly reminder: Firecracker is super awesome, but you probably never need to touch it and shouldn't be too concerned by it. Runtime API is similar, you should try and use managed runtimes as much as possible and save yourself the work involved. #serverless #awsLambda
- @mipsytipsy: This is why I care so much about investing in various throttles, blacklists and filters; by user, component, service and db: to protect your platform for the 99% who are good actors from the 1% who aren't. And it's why your platform should be written in a multithreaded language. Otherwise, if you're stuck using the request-per-process model, any time ANY component behind that service gets slow the pool of workers will begin to fill up with requests waiting for that slow component. It can take your entire system down in seconds flat... for *everyone.*
- @davidgerard: "My favourite terrible blockchain idea is real estate on the blockchain. Because presumably that means that if someone guesses or steals my password, they now own my house." - simonw
- Sendeos18: we ended up implementing similar solution for our "search requests". Have an elasticsearch cluster being updated by triggers in the database. Main reason we went this route was so that none of the other 50+ services that were modifying the product data would need to be changed. Again this is not ideal but in our retroactive situation of "why do i have to wait 9 seconds for a search" it worked really well.
- Alexis Ducastel: For every other common usage, I would recommend Calico. Just like WeaveNet, don’t forget to set the MTU in the ConfigMap if you are using jumbo frames. It has proven to be multipurpose and efficient in terms of resources consumption, performance and security. Calico already works on very large clusters and has very interesting BGP features.
- @StephenPOwens: +1. Seeing this happening. We’re already choosing languages based on cold start time differences in serverless, DB based on cost per outcome specific unit of value, in-house vs service based on time to market vs service cost...
- Ayende Rahien: Let’s assume a simple application using CosmosDB, as an example. With 200 page views / sec on the site, and each page view generating 80 requests to the database, that gives us a total of 16,000 requests a second, which translates to an end of the month bill of about 10,000$.
- @jmspool: It’s not Big Design Up Front if you do in-depth research to understand the user’s problem. It’s not BDUF if you spend detailed time learning who needs this thing and why they need it. It’s not BDUF if you help every team member know what success looks like.
- @randfish: In the US: - Google properties have 93% of the market - YouTube gets almost 3X as many searches as Facebook - DuckDuckGo is one of the only non-Google properties growing its market share - Google web search alone earns ~22X the searches of Bing+Yahoo
- @ChrisHaggstrom: This @swardley thread is outstanding, as is learning about Wardley Maps. Perhaps my favorite q/a pair: X : Then why are so many enterprises investing in containers? Me : I mostly assume it's executives with an eye on retirement and no eyes on the landscape.
- @rabimba: Meanwhile I realized it might be fun to see this data in a more visual way in #WebVR. This is not yet complete. But you can already see a small amount in a Oculus Device (or mobile)
- Alvy Ray Smith: Everybody says, “Steve Jobs ripped off Xerox PARC.” He didn’t. Bill Gates did. He took Simonyi, who didn’t just know the look of the Alto, he knew how it worked. And Simonyi came up there and wrote Word and Excel based on stuff he had done at Xerox PARC. That was the end. That was the real taking of the guts of the place.
- pbailis: Reflecting on its promise, it’s surprising that Big Data isn’t considered more of a failure. Why deep learning isn't a panacea, and what's missing from today's analytics stack:
- @davidgerard: whatever the proximate causes, the crypto crash was fundamentally from a shortage of actual money coming into the system. they couldn't fake the price as being 6000+ any more, tether or no. someone broke ranks mid-Nov and sold up while the selling was good. it's been chaos since.
- Richard Sanders: Above all, articles laid out the rules for the division of booty. Captain Lowther’s were typical: The captain is to have two full shares; the master is to have one share and a half; the doctor, mate, gunner, and boatswain, one share and a quarter. The rest of the men received a single share.
- jartelt: It is beginning to happen in the US. Battery adoption on the grid has been slow because the utility companies like solving problems by building new generation plants (natural gas plants mostly) or by building new transmission lines. They have historically always done this and they make a guaranteed rate of return on this type of infrastructure. This has led to lots of un-needed generation capacity and expensive transmission projects passed on to rate payers.
  
  Regulators in many states are now finally forcing utilities to examine the use of alternative solutions to fix grid issues. They are finding that battery installations can do a better job than new generation or transmission in many cases and the batteries can be cheaper over the long run. It's just a matter of getting the utilities to try something new and get out of their typical playbook.
- Adrian Bejan: Hierarchy arises because it is good for every component of the global flow system. The big need the small just as surely as the small need the big. The individual sustains the crowd—and vice versa. The big river sustains the many tiny streams of the river basin, just as those tiny streams feed the river basin. Citizens (the rivulets of politics) sustain the governments that serve them; workers (the rivulets of business) sustain the companies that employ and, in turn, sustain
- @CraigSilverman: Google has kicked two popular apps out of the Play store after confirming our report that they engaged in app-install fraud. One of the apps, Kika Keyboard, was the top keyboard in the *entire* Play store, with +200 million downloads:
- @parismartineau: i spoke to 20 people in the influencer marketing industry & what i learned has ruined YouTube/Instagram for me forever little is real, dark money is everywhere, & the only ones with real influence are the brands shelling out $60K+ for a spot in yr feed. that cup of coffee on your favorite youtuber’s desk? probably paid for. the same goes for that cute sweater your fave fashion blogger tagged on Instagram. and things just get worse from there...
- Chris Hacken: I always knew running fiber was expensive. As an entrepreneur your job is to figure out how to do something better, faster, and for less. I thought we'd be able to do that. I was wrong. Not wrong in the sense that we can't do it, but wrong that we could do it on our first try. We made a number of mistakes. The good news is we learned from them and I'm confident we can avoid them as we move forward. The bad news is, those mistakes cost money; a lot of it. When we started this project I budgeted around $30,000. I've honestly lost track of the costs because we've been working to complete it under whatever means necessary due to our time constraints. I think when it's all said and done that it'll end up costing around twice that.
- Rob Kirberich: GraphQL is great for us, but we also made plenty of mistakes. We sometimes struggle with keeping our API truly backwards-compatible and had to invest in extra monitoring for deprecated fields in addition to performance monitoring and better error reporting. Updating the GraphQL schema with every API change manually can be tedious, and by making it extremely easy to communicate between backend services via GraphQL we accidentally broke some of our own service boundaries. But in the end, it has helped us develop faster while keeping the mental model of our infrastructure easy and giving our teams significantly more autonomy.
- Rachel Rose O'Leary: Infura handles around 13 billion code requests per day and provides a way for developers to connect to ethereum without having to run a full node. And while the exact usage stats aren’t public, by creating a simpler method for interfacing with the network, it’s said to underpin the majority of decentralized applications in the ethereum ecosystem. But here’s the thing: Infura is operated by a single provider – the ethereum development studio ConsenSys – and relies on cloud servers hosted by Amazon. As such, concerns exist that the service represents a single point of failure for the entire network.
- Geoff Huston: Most of the Internet runs to the same clock, but what is “same” admits to a rather broad approximation. Some 53% of clocks on Internet connected devices run within 2 seconds of a server’s reference clock, which is close to the limit of accuracy within this experimental method. Of the remainder of the set of tested devices, some 38% of clocks were slow and 9% were fast. What can we say about time and the Internet? The safest assumption is that most systems will be in sync with a UTC reference clock source as long as the definition of “in sync” is a window of 24 hours! If the ‘correct’ behaviour of an Internet application relies on a tighter level of time convergence with UTC time than this rather large window, then it’s likely that a set of clients will fall outside of the application’s view of what’s acceptable.
- Jordana Cepelewicz: Zenil hopes to explore whether biological evolution operates according to the same computational rules, but most experts have their doubts. It’s unclear what natural mechanism could be responsible for approximating algorithmic complexity or putting that kind of mutational bias to work. Moreover, “thinking of life totally encoded in four letters is wrong,” said Giuseppe Longo, a mathematician at the National Center for Scientific Research in France. “DNA is extremely important, but it makes no sense if [it is] not in a cell, in an organism, in an ecosystem.” Other interactions are at play, and this application of algorithmic information cannot capture the extent of that complexity.

How do you level up in China?

Is AWS running out of ideas? Not yet, though the pace may be slowing, as must eventually happen when you've reached Total Addressable Feature Space. You can watch re:Invent videos from a curated list or get the whole enchilada on YouTube. Don't want to spend the rest of your life watching video? Recaps to the rescue! The Cloudcast podcast covered All of the 2018 AWS reInvent Announcements. re:Invent 2018 Security Review. Comic Relief's Takeaways from AWS re:Invent. InfoQ's Recap of AWS re:Invent 2018 Announcements. Jeremy Daly with his re:Capping re:Invent. James Beswick with What I learned from AWS re:Invent 2018. There's The somewhat different AWS re:Invent recap. Jennine Townsend with Notable AWS re:invent Sessions. And of course there's Netflix at AWS re:Invent 2018.

Scaling has always meant specialization. Bleeding edge hyperscalers throughout history have built custom solutions and their ideas have trickled down to become industry standard practice. In a fascinating twist, James Hamilton talks about how a similar process works for hardware in the cloud. AWS Inferentia Machine Learning Processor: Whereas in the past it was nearly impossible for an enterprise to financially justify hardware specialization in all but fairly exotic workloads, in the cloud there are thousands to possibly tens of thousands of even fairly rare workloads. Suddenly, not only is it possible to use hardware optimized for a specific workload type, but it would be crazy not to. In many cases it can deliver an order of magnitude of cost savings, consume as little as 1/10th the power, and these specialized solutions can allow you to give your customers better service at lower latency. Hardware specialization is the future. Believing that hardware specialization is going to be a big part of server-side computing going forward, Amazon has had a custom ASIC team focused on AWS since early 2015 and, prior to that, we worked with partners to build specialized solutions. Two years ago at re:Invent 2016, I showed the AWS custom ASIC that has been installed in all AWS servers for many years (Tuesday Night Live with James Hamilton). Even though this is a very specialized ASIC, we install more than a million of these ASIC annually and that number continues to accelerate. In the server world, it’s actually a fairly high volume ASIC.

Videos from All Things Open are now available.

Uber has an interesting way of dealing with network lag and offline mode. How Uber’s New Driver App Overcomes Network Lag: Any component of the driver app capable of operating optimistically begins the flow by submitting an optimistic request. An optimistic request has the ability to serialize and deserialize to disk, very similar to a regular network request, and every optimistic request is paired with an optimistic transform. When an optimistic request is submitted to the client, the transform associated with the request is applied immediately to move the app into an optimistic state, making it appear that the request has completed. The optimistic state outputted from the transform will be maintained until a response from the server is received with the actual state, syncing app and server...we have observed that the average time saved per optimistic operation is about 13.5 seconds. Even at this early stage in the new driver app’s life we are totaling over a year’s worth of continuous driver time saved in aggregate each and every day.
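The optimistic request and transform pairing described above can be sketched roughly as follows. This is a minimal illustration in Python, not Uber's implementation (their driver app is a mobile client). The names `OptimisticRequest`, `OptimisticClient`, and `complete_trip` are hypothetical. Re-applying the transforms of still-pending requests after a server response is an assumption of this sketch, not something the article states.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class OptimisticRequest:
    """A request that can round-trip through disk, like a regular network request."""
    name: str
    payload: Dict[str, str]

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "OptimisticRequest":
        return OptimisticRequest(**json.loads(raw))


Transform = Callable[[State, OptimisticRequest], State]


class OptimisticClient:
    def __init__(self, state: State, transforms: Dict[str, Transform]):
        self.confirmed: State = dict(state)  # last state confirmed by the server
        self.visible: State = dict(state)    # optimistic state shown to the user
        self.transforms = transforms
        self.pending: List[str] = []         # serialized requests awaiting a response

    def submit(self, request: OptimisticRequest) -> None:
        # Persist the request in serialized form, then apply its paired
        # transform immediately so the app appears to have completed it.
        self.pending.append(request.to_json())
        self.visible = self.transforms[request.name](self.visible, request)

    def on_server_response(self, server_state: State) -> None:
        # The server's actual state replaces the optimistic one.
        self.pending.pop(0)
        self.confirmed = dict(server_state)
        self.visible = dict(server_state)
        # Assumption: requests still in flight keep their optimistic effect.
        for raw in self.pending:
            request = OptimisticRequest.from_json(raw)
            self.visible = self.transforms[request.name](self.visible, request)


def complete_trip(state: State, request: OptimisticRequest) -> State:
    """Hypothetical transform: mark the trip as completed right away."""
    new_state = dict(state)
    new_state["trip_status"] = "completed"
    return new_state
```

Keeping `confirmed` separate from `visible` means the client can always fall back to the last server-acknowledged state if a request is rejected.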

Analysis showed that it was 90% cheaper to run the video transcoders in-house. Here's how Egnyte serves video at scale. The first thing they did was look at the kind of video users actually consumed and tailor their system to fit. For their global messaging system they started with the latest Google PubSub client but reverted to a previous version, since the latest one would not renew leases for a longer duration. FFmpeg is by far the best free software available for transcoding videos and it has very good HLS support. Their first preference was to leverage a serverless architecture for deploying video transcoders. Video transcoding is a CPU-intensive operation and needs specific hardware like dedicated CPUs (or even GPUs) with enough memory and native ffmpeg installed on it. It was challenging to build this on serverless. They determined Kubernetes suited them better, so they created Alpine docker containers of FFmpeg + Python and deployed these within Kubernetes. They found that video transcoding jobs run faster on GPUs, but doing this isn’t cost effective. The best trade-off between speed and cost was allocating 4 CPUs to each video transcoder job. At 4 CPUs, they were able to process videos at about 25-40% of the video play time. In other words, a 1-hour video would take about 15-25 minutes to transcode. Adding more CPUs to a video transcoder job did not produce linear benefits so it was best to stick to 4 CPUs per job and instead provision more jobs. They set up a bulkhead between the video service and their regular services because video can go viral. Their video service is based on OpenResty, with all authentication and video discovery written in Lua. It's deployed on a cache on a dedicated infrastructure fronted by a dedicated domain name such as media.egnyte.com. This video service does not share any infrastructure components like firewalls, switches, and ISP links with their primary services, which allows them to scale the video service and rate limit users purely on video needs.
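As a rough illustration of the numbers above, here is a small Python sketch that estimates transcode time from the 25-40% figure, works out how many 4-CPU jobs fit in a CPU budget, and builds an FFmpeg HLS command line. The function names, the 6-second segment length, and the choice of `libx264` and `aac` codecs are assumptions for the example; the article does not give Egnyte's actual FFmpeg settings.

```python
from typing import List, Tuple

CPUS_PER_JOB = 4                 # the trade-off point reported in the article
TRANSCODE_RATIO = (0.25, 0.40)   # transcode time as a fraction of play time


def transcode_minutes(play_minutes: float) -> Tuple[float, float]:
    """Estimated (best, worst) transcode time for one job at 4 CPUs."""
    low, high = TRANSCODE_RATIO
    return play_minutes * low, play_minutes * high


def concurrent_jobs(total_cpus: int) -> int:
    """Scale by provisioning more 4-CPU jobs rather than bigger jobs."""
    return total_cpus // CPUS_PER_JOB


def hls_command(src: str, playlist: str, segment_seconds: int = 6) -> List[str]:
    """Build (but do not run) an FFmpeg command that writes an HLS VOD playlist."""
    return [
        "ffmpeg", "-i", src,
        "-threads", str(CPUS_PER_JOB),      # cap encoder threads at the job's CPU count
        "-c:v", "libx264", "-c:a", "aac",   # assumed codecs, widely supported by HLS players
        "-hls_time", str(segment_seconds),  # target segment length in seconds
        "-hls_playlist_type", "vod",
        "-f", "hls", playlist,
    ]
```

For a 60-minute video, `transcode_minutes(60)` gives a range of 15 to 24 minutes, which matches the article's rounded "15-25 minutes". A hypothetical 64-CPU node pool would fit `concurrent_jobs(64)`, that is 16 jobs at a time.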

IMHO academics don't need to dumb down content to reach a broader audience, they just need to write better. The Myth of ‘Dumbing Down’.

Put on those shades, the future looks k8s. The State of K8s 2018: Kubernetes has crossed the chasm. About 60% of respondents are using Kubernetes today, and 65% expect to be using the technology in the next year...Half of the organizations running Kubernetes are doing so in production. The bigger and more complex the organization, the more likely they’re already in production; 77% of organizations with more than 1,000 developers and 88% of organizations with more than 1,000 containers...63% are running stateful apps, 53% have entrusted data analytics to the platform and 31% operate IoT apps on Kubernetes... 63% of organizations that have deployed Kubernetes are immediately using their resources more efficiently. And 58% have shortened their software development cycles.

Good example of evolving a small system to a more complex system. Distributed Systems: When you should build them, and how to scale. A step-by-step guide: My main point is: don’t try to build the perfect system when you start your product. Most of your design choices will be driven by what your product does and who is using it...Focus on figuring out what people need, and try to come up with a solution to their problem, even if it has a lot of manual steps. Then think about ways to automate, spend your time coding and destroying, and use third parties where it makes sense...Don’t scale but always think, code, and plan for scaling. Build your system step by step, don’t address system design issues based on features that are not mature yet, and finally always try to find the best trade-off between the time you will spend and the gain in performance, money, and lowered risk.

Facebook's Mobile @Scale — Tel Aviv recap. Titles include: Learnings for scaling mobile dev at Facebook; Building for emerging markets.

Bringing the operational infrastructure in-house is a huge undertaking. It may be less efficient early on, but it has opened the gates for scaling Periscope Data to what it is today and has laid the groundwork for future growth and optimization. 9 Lessons Learned Migrating From Heroku To Kubernetes with Zero Downtime: As we grew, so did our infrastructure requirements. Heroku wasn't able to keep up with all of those requirements and that's why we moved to hosting Kube ourselves...Kubernetes offers an excellent set of tools to manage containerized applications. You can think of it as managing a desired state for the containers...Lesson #1: A reverse proxy app can be a powerful tool to manage HTTP requests...Lesson #2: Horizontally scalable services don't always scale across different deployments...Lesson #3: A carefully chosen ratio of concurrency resources can lead to a highly optimized setup...Lesson #4: Achieve zero downtime during migration by using a reverse proxy app...Lesson #5: Managing releases to two production environments is highly error prone...Lesson #6: Database connection pooling is a great optimization in conserving database resources...Lesson #7: Make sure to explicitly specify CPU and memory, request and limit value...Lesson #8: CPU throttling looks like 5xx errors...Lesson #9: Be proactive with training and enablement.
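Lesson #6 is easy to picture with a toy example. The sketch below is a minimal in-process connection pool in Python, built on the standard library's `queue.Queue`. It is not what Periscope Data deployed (the summary above does not say which pooler they used); `ConnectionPool` and the `factory` argument are hypothetical names. The point is only that a fixed number of connections get reused across many requests instead of each request opening its own.

```python
import queue
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class ConnectionPool:
    """Holds a fixed set of connections and lends them out one at a time."""

    def __init__(self, factory: Callable[[], Any], size: int):
        self._idle: "queue.Queue[Any]" = queue.Queue(maxsize=size)
        for _ in range(size):
            # All connections are opened once, up front.
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout: float = 1.0) -> Iterator[Any]:
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # which acts as back-pressure instead of opening more connections.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._idle.put(conn)
```

With a pool of size 2, any number of sequential requests shares the same two connections, so the database sees a bounded connection count no matter how many workers sit in front of it.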

Didn't we know this three years ago? 'Low code' and 'no code' products is the hottest trend in enterprise startups. Also, The Cloud Is the New OS - A Developer's Perspective.

Playing games is no game. It takes a lot of work. Riot Games built several competing authentication and authorization implementations running in parallel. But there can be only one. The one is OpenID Connect, which they made their standard for handling authentication and authorization. How they did it is covered in a richly detailed article on Globalizing Player Accounts. There's also a Re:Invent talk. The context: "We deploy League of Legends to 12 disparate game shards in Riot-operated regions, and many more in China and southeast Asia via our publishing partners Tencent and Garena. With 10 clustered databases storing hundreds of millions of player account records, hundreds of thousands of valid logins and failed authentication requests, and over a million account lookups per minute, we have our work cut out for us." They chose the Continuent Tungsten Clustering suite, which consists of a number of processes that live alongside MySQL and wrap around and manage the cluster. Database instances are deployed using Terraform and allocated static IP addresses when provisioned. The services are all containerized and launched via docker-compose, which is written through userdata startup scripts. All general maintenance, restarts, and upgrades are managed by Ansible. It's deployed to a multi-region composite cluster with the intended primary residing in us-west-2. There are three nodes per region: a primary or relay and two secondaries. A relay functions as a local read-only primary that replicates off of the current global primary node. The secondaries local to each relay replicate off of their local relay node. The connector is configured to send read requests to the most up-to-date node in the same AWS region, but if there is a local outage, requests will be proxied off to the other regions. This global cluster has a single write primary, and each of our backend services that writes to the database connects to the appropriate primary over the DirectConnect backend by leveraging a connector. There's also an interesting story about how high CPU consumption can cause havoc through failed health checks.
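
The read-routing rule described above (prefer the most up-to-date node in the caller's region, fall back to other regions during a local outage) can be sketched roughly as follows. This is only an illustration, not the Tungsten connector's API: the Node class, the replication_lag_s field, and route_read are hypothetical names, and "most up-to-date" is assumed here to mean the lowest replication lag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    healthy: bool
    replication_lag_s: float  # hypothetical freshness metric, lower is fresher


def route_read(nodes: List[Node], client_region: str) -> Optional[Node]:
    """Pick a node for a read request.

    Prefer healthy nodes in the client's region; if none are available
    (a local outage), fall back to healthy nodes in any other region.
    """
    healthy = [n for n in nodes if n.healthy]
    local = [n for n in healthy if n.region == client_region]
    candidates = local or healthy
    if not candidates:
        return None  # nothing can serve the read
    return min(candidates, key=lambda n: n.replication_lag_s)
```

With one healthy local secondary, reads stay in-region even if a remote node is fresher; only when every local node is unhealthy does the request leave the region.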

Videos from Code Mesh LDN 2018 are now available. You might love: From quadcopters to helicopters: formal verification for safer vehicles.

What if all that magic neural net dust isn't necessary? What to do with Big Data? Making ML useful is a platform problem: Despite the widespread collection of data at scale, there’s little evidence that most enterprises are successful in efficiently realizing this value...on this structured data, much simpler models often perform nearly as well. Instead, the bottleneck is in simply putting the data to use...Buried on page 12 of the Supplemental Materials, we see that logistic regression (appearing in lecture 3 of our intro ML class at Stanford) “essentially performs just as well as Deep Nets” for these predictive tasks, coming within 2-3% accuracy without any manual feature engineering...For many use cases, putting data to work doesn’t require a new deep network, or more efficient neural architecture search. Instead, it requires new software tools...Help navigate organizations’ existing data at scale...Provide results users can trust...Work alongside users.

AWS Aurora vs Google Cloud SQL Pricing. AWS Aurora (Reserved Instance): $145.23/mo. Google Cloud SQL: $322.62/mo. Also, How to Figure Out AWS Pricing for a Mobile Application.

We fight against it, but every force is aligning in the direction of removing humans from the hunt-kill and other OODA loops. And we know how humans love to take the easier path. In the military context, it’s going to be manned and unmanned teaming. No one believes that all pilots will leave all the airplanes. It’s not going to happen. There’s only so many variables that you can program into any piece of software. A retired Navy captain explains how drones will shape the future of war: It used to be that a warrior prepares, trains, deploys to a foreign location where he is face-to-face with an enemy, he may or may not survive, and at the end, he comes home...One of the other things that’s important for drones is not only that there is no pilot or crew aboard, but they also have the ability to stay over the target for 24 hours or more...there’s an unmanned surface ship that’s in sea trials right now. It’s 132 feet and called the Sea Hunter, and it’s designed to go off and do missions of up to 10,000 miles on a single tank of fuel with no one on board...There’s another interesting project that attempts to take relatively small unmanned aerial vehicles, launch them out of the back of a cargo plane, have them do their mission and recover them in midair and bring them back into the airplane...Another focus will be transportation on roads. The biggest killer of soldiers is improvised explosive devices, so if the work is being done on automated convoys...when you talk about unmanned submarines. You just can’t talk to them when they’re below the water. So you have to make sure you have secure communications. That’s a big vulnerability. Also, Artificial Intelligence, China And The U.S. - How The U.S. Is Losing The Technology War. Also also, Chip wars: China, America and silicon supremacy.

The good thing about starting over is you get to pick a new stack. React Native at Picnic: we decided to build this app in React Native instead...an important requirement for the new application was that there should be no device or operating system lock-in...needs to operate well under uncertain networking conditions. Hence, offline support is very important...we decided to use [TypeScript] instead of Flow...For navigation, we use React Navigation...we use Microsoft CodePush...For state persistence, we use redux combined with redux-persist for offline support...axios as our HTTP client...On the UI-side, we use styled components for styling and storybook to document our UI components. Snapshots are automatically generated for each story by using StoryShots and React Native Storybook Loader...Any argument about syntax, we defer to Prettier. Finally, as the cherry on the cake, we use husky to run pre-commit and pre-push hooks that verify that all code that we check in is up to the standards that we have set for ourselves.

The ETL pipeline operates a micro-batching window of one minute and processes a few billion events per day. The pipeline runs on our YARN cluster and uses 64 single core containers with 8 GB of memory. Sessionizing Uber Trips in Real Time: We refer to the data underlying each trip as a session, which begins when a user opens the Uber app...A typical trip lifecycle like this might span across six distinct event streams, with events generated by the rider app, driver app, and Uber’s back-end dispatch server. These distinct event streams thread into a single Uber trip...How do we contextualize these event streams so they can be logically grouped together and quickly surface useful information to downstream data applications? The answer lies in defining a time-bounded state machine modeling the flow of different user and server-generated events towards completion of a single task. We refer to this type of state machine, consisting of raw actions, as a “session.”...Putting all the relevant events for our session lifecycle in one place unlocks a wide variety of use cases, such as: Our Demand Modeling team can compare app impressions, Our Forecasting team can see how many sessions are in the Shopping state within a given area during a particular time window...We used Spark Streaming to implement the Rider Session State Machine...we’re looking at moving to Flink due to its deeper support for out-of-box event time processing and wider support at Uber...Clock synchronization: Given the wide array of handsets and variations of mobile operating systems, not to mention user settings, you can never really trust the timestamps sent from mobile clients...Back-pressure and rate limit: It uses a PID rate estimator to control the input rate of subsequent batches.
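
To make the "time-bounded state machine" idea concrete, here is a minimal sketch in plain Python rather than Spark Streaming. Everything in it is an assumption for illustration: the state names (other than the Shopping state mentioned above), the event names, the 30-minute timeout, and the Sessionizer class are hypothetical and are not taken from Uber's implementation. Timestamps are assumed to be assigned server-side, since client clocks can't be trusted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical (state, event) -> next-state table for a rider session.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("shopping", "request_ride"): "requested",
    ("requested", "driver_accept"): "en_route",
    ("en_route", "pickup"): "on_trip",
    ("on_trip", "dropoff"): "completed",
}
SESSION_TIMEOUT_S = 30 * 60  # assumed bound on inactivity


@dataclass
class Session:
    user_id: str
    state: str = "shopping"
    last_seen: float = 0.0
    events: List[str] = field(default_factory=list)


class Sessionizer:
    def __init__(self, timeout_s: float = SESSION_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.sessions: Dict[str, Session] = {}

    def process(self, user_id: str, event: str, server_ts: float) -> Session:
        """Fold one event from any stream into the user's current session."""
        session = self.sessions.get(user_id)
        expired = (
            session is None
            or session.state == "completed"
            or server_ts - session.last_seen > self.timeout_s
        )
        if expired:
            # Time bound exceeded or previous trip finished: start fresh.
            session = Session(user_id=user_id, last_seen=server_ts)
            self.sessions[user_id] = session
        # Events with no matching transition leave the state unchanged.
        session.state = TRANSITIONS.get((session.state, event), session.state)
        session.last_seen = server_ts
        session.events.append(event)
        return session
```

Keeping all of a session's events in one record is what lets a downstream consumer ask questions like "how many sessions are currently in the shopping state".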

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (article): GPipe is a scalable pipeline parallelism library that enables learning of giant deep neural networks. It partitions network layers across accelerators and pipelines execution to achieve high hardware utilization. It leverages recomputation to minimize activation memory usage. For example, using partitions over 8 accelerators, it is able to train networks that are 25× larger, demonstrating its scalability. It also guarantees that the computed gradients remain consistent regardless of the number of partitions. It achieves an almost linear speedup without any changes in the model parameters: when using 4× more accelerators, training the same model is up to 3.5× faster. We train a 557 million parameters AmoebaNet model and achieve a new state-of-the-art 84.3% top-1 / 97.0% top-5 accuracy on ImageNet.
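
A toy sketch of the partition-and-pipeline idea, forward pass only. This is not the GPipe library: the function names are hypothetical, layers are plain Python functions on floats, and the schedule assumes inputs are split into micro-batches so that different stages can work on different micro-batches in the same clock tick. It only illustrates that the result does not depend on how many stages the layers are split into.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def partition(layers: Sequence[Layer], num_stages: int) -> List[List[Layer]]:
    """Split layers into num_stages contiguous, near-equal groups."""
    if not 1 <= num_stages <= len(layers):
        raise ValueError("num_stages must be between 1 and len(layers)")
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)
        stages.append(list(layers[start:start + size]))
        start += size
    return stages


def run_stage(stage: Sequence[Layer], x: float) -> float:
    for layer in stage:
        x = layer(x)
    return x


def pipeline_forward(stages: List[List[Layer]],
                     micro_batches: Sequence[float]) -> List[float]:
    """Simulate a pipelined schedule: at tick t, stage s handles micro-batch t - s."""
    num_stages, num_mb = len(stages), len(micro_batches)
    values = list(micro_batches)
    for tick in range(num_stages + num_mb - 1):
        for s in range(num_stages):
            mb = tick - s
            if 0 <= mb < num_mb:
                values[mb] = run_stage(stages[s], values[mb])
    return values


layers: List[Layer] = [
    lambda x: x * 2,
    lambda x: x + 1,
    lambda x: x * x,
    lambda x: x - 3,
]
# Same outputs whether the four layers run as 1, 2, 3, or 4 stages.
outputs = [pipeline_forward(partition(layers, k), [1.0, 2.0]) for k in (1, 2, 3, 4)]
```

The schedule takes num_stages + num_micro_batches - 1 ticks, so more micro-batches per batch means a smaller fraction of ticks where some stage sits idle.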

FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network: FastGRNN then extends the residual connection to a gate by reusing the RNN matrices to match state-of-the-art gated RNN accuracies but with a 2-4x smaller model. Enforcing FastGRNN’s matrices to be low-rank, sparse and quantized resulted in accurate models that could be up to 35x smaller than leading gated and unitary RNNs. This allowed FastGRNN to accurately recognize the “Hey Cortana” wakeword with a 1 KB model and to be deployed on severely resource-constrained IoT microcontrollers too tiny to store other RNN models.
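
A back-of-the-envelope sketch of why low-rank and quantized matrices shrink a model. The dimensions, rank, and bit widths below are made up for illustration and are not FastGRNN's actual settings; the only fact relied on is that a rows × cols matrix factored as the product of a rows × rank and a rank × cols matrix stores rank × (rows + cols) weights instead of rows × cols.

```python
def dense_params(rows: int, cols: int) -> int:
    """Weights stored by a full rows x cols matrix."""
    return rows * cols


def low_rank_params(rows: int, cols: int, rank: int) -> int:
    """Weights stored when the matrix is factored as (rows x rank) @ (rank x cols)."""
    return rank * (rows + cols)


def size_bytes(params: int, bits_per_weight: int) -> int:
    """Storage needed for params weights, rounded up to whole bytes."""
    return (params * bits_per_weight + 7) // 8


# Hypothetical 64 x 64 recurrent matrix.
full = size_bytes(dense_params(64, 64), bits_per_weight=32)        # 32-bit floats
small = size_bytes(low_rank_params(64, 64, rank=8), bits_per_weight=8)  # rank 8, 8-bit
```

Here the full matrix needs 16,384 bytes and the rank-8, 8-bit version needs 1,024 bytes; sparsity, which the paper also enforces, would reduce it further.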
