
Stuff The Internet Says On Scalability For April 26th, 2019

High Scalability

26 Apr 2019 — 33 min read

Wake up! It's HighScalability time:

Found! The One Ring. In space!

Do you like this sort of Stuff? I'd greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I'm 10 for people who need to understand the cloud. And who doesn't these days? On Amazon it has 45 mostly 5 star reviews (103 on Goodreads). They'll learn a lot and hold you in even greater awe.

$30 million: Apple's per month AWS bill (a ~50% reduction); 73%: Azure YoY growth; 3,500: times per day andon cords are pulled at Toyota; $1 trillion: size of micromobility market; $1 billion: cryptopiracy is the new sea piracy; $702 million: Tesla first quarter loss; $5.0 billion: FTC Facebook fine; 1.56 billion: Facebook DAUs, 8% growth; 93%: Facebook mobile advertising revenue out of total; 40%: internet traffic driven by bots; one litre per hour: required by Roman galley oarsmen; 1200: Fortnite World Cup cheaters; 26: states ban community broadband; $50M: Slack yearly AWS spend; 575: companies paying Slack $100k/year;

Quotable Quotes:
- Claude Shannon: Then there’s the idea of dissatisfaction. By this I don’t mean a pessimistic dissatisfaction of the world — we don’t like the way things are — I mean a constructive dissatisfaction. The idea could be expressed in the words, This is OK, but I think things could be done better. I think there is a neater way to do this. I think things could be improved a little. In other words, there is continually a slight irritation when things don’t look quite right; and I think that dissatisfaction in present days is a key driving force in good scientists.
- Albert Kao: A feature of modular structure is that there’s always information loss, but the effect of that information loss on accuracy depends on the environment. Surprisingly, in complex environments, the information loss even helps accuracy in a lot of situations.
- Eduards Sizovs: Be the company that says: we are hiring mentoring.
- ???: There Is No Shortage of Talent. There's a Shortage of Suckers.
- Andrew Leonard: AWS is striking at the Achilles’ heel of open source: lifting the work of others, and renting access to it. Observers of the clash between AWS and open source worry that the room for further innovation may be rapidly shrinking.
- V8: Like many with a background in programming languages and their implementations, the idea that safe languages enforce a proper abstraction boundary, not allowing well-typed programs to read arbitrary memory, has been a guarantee upon which our mental models have been built. It is a depressing conclusion that our models were wrong — this guarantee is not true on today’s hardware.
- John Allspaw: resilience is something that a system does, not what it has
- Valantar: Denser than DRAM, not NAND. Speed claims are against NAND, price/density claims against DRAM - where they might not be 1/10th the price, but definitely cheaper. The entire argument for 3D Xpoint is "faster than NAND, cheaper than DRAM (while persistent and closer to the former than the latter in capacity)", after all.
- @JoeEmison: Don’t need luck when I have serverless.
- bzbarsky: Caches, say. Again as a concrete example, on Mac the OS font library (CoreText) has a multi-megabyte glyph cache that is per-process. Assuming you do your text painting directly in the web renderer process (which is an assumption that is getting revisited as a result of this problem), you now end up with multiple copies of this glyph cache. And since it's in a system library, you can't easily share it (even if you ignore the complication about it not being readonly data, since it's a cache). Just to make the numbers clear, the number of distinct origins on a "typical" web page is in the dozens because of all the ads. So a 3MB per-process overhead corresponds to something like an extra 100MB of RAM usage...
- T-Mobile: millimeter wave (mmWave) spectrum has great potential in terms of speed and capacity, but it doesn’t travel far from the cell site and doesn’t penetrate materials at all. It will never materially scale beyond small pockets of 5G hotspots in dense urban environments.
- @StegerPatrick: I think it's a bit of both honestly. True an idle instance makes AWS coin but at the same time they have to build out datacenters to support those idle boxes. My thinking is that margins are higher with lambda per box. 0 numbers to back it up, just my gut.
- @jamesurquhart: I remember the question being asked in an AWS ops meeting: “Do we know mathematically that we detected that error as early as statistically reasonable?”
- Herman Narula: Video games are the most important technological change happening in the world right now. Just look at the scale: a full third of the world’s population (2.6 billion people) find the time to game, plugging into massive networks of interaction. These networks let people exercise a social muscle they might not otherwise exercise. While social media can amplify our differences, could games create a space for us to empathize?
- George Dyson: There are two kinds of creation myths: those where life arises out of the mud, and those where life falls from the sky. In this creation myth, computers arose from the mud, and code fell from the sky.
- returnofthecityboy: As someone who has worked in both fields, often it's nigh impossible in big corporations, and I imagine the military and aerospace is no exception, to improve something that you're not immediately tasked with. There are 0.1% of things that are "your job", and 99.9% of things are "not your job". If you see something wrong in "not your job", deep hierarchies, leadership that is selected for seniority over intelligence, and lots of layers of bureaucratic crap make it impossible for you to change anything about it.
- Fortnite Source: The executives keep reacting and changing things. Everything has to be done immediately. We’re not allowed to spend time on anything. If something breaks — a weapon, say — then we can’t just turn it off and fix it with the next patch. It has to be fixed immediately, and all the while, we’re still working on next week’s patch. It’s brutal. I hardly sleep. I’m grumpy at home. I have no energy to go out. Getting a weekend away from work is a major achievement. If I take a Saturday off, I feel guilty. I’m not being forced to work this way, but if I don’t, then the job won’t get done. I know some people who just refused to work weekends, and then we missed a deadline because their part of the package wasn’t completed, and they were fired. People are losing their jobs because they don’t want to work these hours.
- @hillelogram: Catastrophic accidents like the 737 are signs of deep systemic issues at all layers of the system and cannot be isolated to things like "good practices". This is something we see in all major accidents: people scapegoat the technicians and miss the errors of the C-levels.
- @Electric_Genie: Exclusive: For efficiency, Boeing wanted huge engines on the 737 Max. However, thoroughly redesigning the plane to accommodate the huge engines would have been very costly. So Boeing took the inexpensive route, relying primarily on software. Bad idea.
- @deadprogrammer: AWS is not about paying for what you use, it’s about paying for what you forgot to turn off.
- mdbm: I've seen a lot of friends publish apps to different ecosystems (e.g. marketplace.atlassian.com, https://apps.shopify.com, etc.) and make a steady profit. Seems like this is a good way to tap into an existing user base (although you share a portion of the revenue).
- Peter Rüegg: The researchers took it a step further: they created a biological dual-core processor, similar to those in the digital world, by integrating two cores into a cell. To do so, they used CRISPR-Cas9 components from two different bacteria. Fussenegger was delighted with the result, saying: “We have created the first cell computer with more than one core processor.” This biological computer is not only extremely small, but in theory can be scaled up to any conceivable size. “Imagine a microtissue with billions of cells, each equipped with its own dual-core processor. Such ‘computational organs’ could theoretically attain computing power that far outstrips that of a digital supercomputer – and using just a fraction of the energy,”
- Laurence Scott: Our ongoing challenge, then, will be to negotiate the inherent inauthenticity and cynicism of an influence economy while preserving our ability to be occupied, and perhaps changed for the better, by the alien ideas of other people.
- Oscar Schwartz: Perhaps we can take a lesson from the author Edgar Allan Poe. When he viewed von Kempelen’s Mechanical Turk, he was not fooled by the illusion. Instead, he wondered what it would be like for the chess player trapped inside, the concealed laborer “tightly compressed” among cogs and levers in “exceedingly painful and awkward positions.”
- Polina Marinova: Once the very embodiment of Silicon Valley venture capital, the storied firm has suffered a two-decade losing streak. It missed the era’s hottest companies, took a disastrous detour into renewable energy, and failed to groom its next-generation leadership. Can it ever regain the old Kleiner magic?
- torpfactory: Let's just all pause for a moment of silence to ponder just how amazing even the PX Xavier is compared to somewhat recent history: ASCI Red was the first 1 TOP computer. It consumed 850kW of power and cost $46m. It was the fastest supercomputer in the world 19 years ago this June. Now we're arguing over whether all our cars will be driving around with a computer either 144 or 300x as fast using either 8500x or 1700x less power and costing in the neighborhood of 100,000x less.
- xibbie: With the shuttle disasters, NASA was at the forefront of science, and the pilots signed up knowing the risks. NASA may have been under time pressure (I’m not familiar with the root causes here, just echoing your reasons), but the deadlines didn’t carry commercial penalties. With the 737 crashes, commercial interests have clearly compromised safety objectives, and the passengers and pilots were BY DESIGN kept unaware of the increased risks.
- KillerCodeMonky: The tendency to want to "fix" things is sometimes part of the problem. It's called lava flow pattern, and it's when your codebase is programmed in waves of different guiding design and architecture. One of the best things you can do in an old codebase is just follow existing patterns and design, even if you don't agree with them. If you're going to refactor anyway, make it very obvious where the line is between old and new, so that others know which design to follow where. Preferably putting the new stuff into a separate namespace or even library.
- OpenAI Five: We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale.
- david gerard: It’s 2019, and, of course, no systems using blockchains for access control of patient data are in production. Because this is snake oil that claims to work around political, social and legal issues using impossible and nonexistent technological magic. The only beneficiaries will be blockchain consultancies. Patient outcomes — which is a number that you have to provide for literally every exciting new medical initiative — will only be negative.
- Geoff Huston: To understand just how dramatic the change could be for the DNS, it has been observed that some 70% of all queries to the root name servers relate to non-existent top-level names. If all recursive resolvers performed DNSSEC validation and aggressive NSEC caching of just the root zone by default, then we could drop the query rate to the root servers by most of this 70%, which is a dramatic change.
- Pascal Luban: So that’s the lesson to draw for game designers: When a game features characters we can identify with and situations we can relate to, the intensity of the gamer experience is multiplied. This BAFTA reward illustrates a trend I identified several years ago: The growing role of emotions in games and their consequences on their design and the part of a good narrative.
- Merv Adrian: The DBMS market returned to double digit growth in 2017 (12.7% year over year in Gartner’s estimate) to $38.8 billion. Over 73% of that growth was attributable to two vendors: Amazon Web Services and Microsoft, reflecting the enormous shift to new spending going to cloud and hybrid-capable offerings. In 2018, the trend grew, and the erosion of share for vendors like Oracle, IBM and Teradata continued. We don’t have our 2018 data completed yet, but I suspect we will see a similar ballpark for overall growth, with the same players up and down as last year. Competition from Chinese cloud vendors, such as Alibaba Cloud and Tencent, is emerging, especially outside North America.
- @stevecheney: The “getting multiple term sheets” is a myth in startup / VC land. Only about 10% of startups get more than 2 term sheets and the vast majority get just 1.
- @stevecheney: VCs spend around 1/3 of their time on portfolio co’s, 1/3 sourcing deals and the remaining on misc & fund management. That means your VC - split among say 10 boards - spends about 3% of their day job thinking about you... that’s not a lot, so make it impactful.
- John DiLullo: Today’s security experts recognize that the typical enterprise is in retreat and not winning the battle. Losses due to cybercrime hit a record in 2018 and it is largely believed that another record will be broken in 2019. People often believe that they are next in line, if they have not already been breached. No one feels totally safe. No one has that much hubris.
- Peter Newman: Samsung is attempting to take on the likes of Qualcomm and Intel, as well as third-party semiconductor manufacturers like TSMC, with its costly chip plans [$116 billion through 2030]
- steve cheney: But one thing is very clear: we are entering an era where cars will become autonomous, navigate by themselves and prevent catastrophes from happening at unprecedented scale. And just as smartphones spawned a different set of winners every decade, you can bet the next set of car makers will look much different than today’s. Google, Uber and Tesla will all be involved… And although we know almost nothing of Tesla’s future chip plans, we know they did something uncharacteristic and deeply impressive. They built a hardware and software platform foundation from the ground up, funding and subsidizing it – just barely – based on the incredible dream they are selling today.
- turlockmike: A nice to have feature would be an option on the lambda to set a "Keep-Warm" option. The ability to set a minimum number of warmed instances, as well as being able to set a specific schedule (similar to a cron job schedule) all in exchange for some fee. This would help developers not have to write workarounds for services that need a certain level of responsiveness during cold periods. (Think people doing uncommon things late at night).
- Sarah Jackson: We must also recognise [for Mayans] that personhood is a dynamic state. An entity isn’t always or inherently a person. This is kind of wild – not only do we have to keep our eye out for the various persons who might surround us on a daily basis, but we have to be aware that things might be entering or exiting this state.
- Andy Greenberg: Kaspersky found evidence that the Asus and videogame attacks, which it collectively calls ShadowHammer, are likely linked to an older, sophisticated spying campaign, one that it dubbed ShadowPad in 2017. In those earlier incidents, hackers hijacked server management software distributed by the firm Netsarang, and then used a similar supply chain attack to piggyback on CCleaner software installed on 700,000 computers. But just 40 specific companies' computers received the hackers' second-stage malware infection, including Asus.
- @GossiTheDog: Amusing oops from Facebook - Facebook Marketplace included the exact GPS location of sellers in their public Json data by mistake. When told about it through bug bounty, they closed it twice as ‘not an issue’.
- Joseph Trevithick: Just a little over a month after MD Helicopters unveiled its latest armed helicopter, the MD 969, the company has revealed a new option for the chopper, a seven-round launcher that sits inside the main cabin and pops precision-guided munitions out through a hatch in the rear of the fuselage. The system uses the increasingly popular Common Launch Tube, or CLT, which can already accommodate a wide variety of payloads, including small drones.
- 013a: Today, if you're not locked in, you're leaving business value on the table. I hope that changes in the future, and maybe Kube will be the standard platform we've needed to push it forward (for example, I wish I could tell kube "give me a queue with FIFO and exactly once delivery", it knows what cloud provider you're on, if you're on AWS it provisions an SQS queue, if you're on GCloud it errors because they don't have one of those yet, and in either case I communicate with it from the app using a standard Kube API, not the aws-sdk). But for now, lean in. Don't fight the lock-in; every minute you spend fighting it is a minute that you should be spending fighting your competition.
- BENJAMIN SEIBOLD: The transition from uniform traffic flow to jamiton-dominated flow is similar to water turning from a liquid state into a gas state. In traffic, this phase transition occurs once traffic density reaches a particular, critical threshold at which the drivers’ anticipation exactly balances the delay effect in their velocity adjustment. The most fascinating aspect of this phase transition is that the character of the traffic changes dramatically while individual drivers do not change their driving behavior at all.
- DSHR: I've shown that, whatever consensus mechanism they use, permissionless blockchains are not sustainable for very fundamental economic reasons. These include the need for speculative inflows and mining pools, security linear in cost, economies of scale, and fixed supply vs. variable demand. Proof-of-work blockchains are also environmentally unsustainable. The top 5 cryptocurrencies are estimated to use as much energy as The Netherlands. This isn't to take away from Nakamoto's ingenuity; proof-of-work is the only consensus system shown to work well for permissionless blockchains. The consensus mechanism works, but energy consumption and emergent behaviors at higher levels of the system make it unsustainable.
- @allafarce: Hertz saw they did not have the internal capabilities in tech and digital experience, so they went to a vendor. But they also made that vendor the "product owner". Seems like a pretty critical misstep. If you build ONE internal capability, product ownership is a good one.
- Reed Albergotti: Then Thomas realized her daughter’s nightmares were real. In August, she walked into the room and heard pornography playing through the Nest Cam, which she had used for years as a baby monitor in their Novato, Calif., home. Hackers, whose voices could be heard faintly in the background, were playing the recording, using the intercom feature in the software. “I’m really sad I doubted my daughter,” she said.
- Xavin: Millimeter-wave is really going to only ever be used for wireless connections inside one room where extremely high bandwidth is needed. Stuff like a TV on the wall with the receiver not connected over to the side, or wireless VR headsets. It can also be used for outdoor point to point connections, but the drawbacks mean it won't ever work well for devices you move around or hold, like cellphones. Part of the confusion with 5G is that in addition to the millimeter-wave stuff that's all smoke and mirrors as far as cellular service, it's also just the next improved spec for cellular hardware. So there will be benefits to switching to it, lower latency, higher bandwidth, more security, etc, it's just going to be evolutionary and not anything most people will notice. Certainly nothing that will sell handsets, which is why they are hyping the millimeter-wave stuff even though everyone with any domain knowledge knows it's useless for cellular in 99.9% of cases.
- jdleesmiller: 1. Multiple Rates of Change: Why must changing one module in a monolith take longer than changing the same module in its own service? Perhaps a better use of time would be to improve CI+CD on the monolith. 2. Independent Life Cycles: Ditto. If the tests partition neatly across modules, why not just run the tests for those modules? If they don't, not running all your tests seems more likely to let bugs through, wasting the investment in writing said tests. 3. Independent Scalability: How bad is it to have a few more copies of the code for account administration module loaded even if it's not serving a lot of requests? The hot endpoints determine how many servers you need; the cold ones don't matter much. And load is stochastic: what is hot and what is cold will change over time and is not trivial to predict. If you separate the services, you have to over-provision every service to account for its variability individually, with no opportunity to take advantage of inversely correlated loads on different modules in the monolith. 4. Isolated Failure: Why not wrap the flaky external service in a library within the monolith? And if another persistence store is needed to cache, it is no harder to hook that up to a monolith. 5. Simplify Interactions: Ditto. The Façade pattern works just as well in a library. 6. Freedom to choose the right tech: This is true, but as they say having all these extra technologies comes with a lot of extra dev training, dev hiring and ops costs. Maybe it would have been better to use a 'second best' technology that the team already has rather than the best technology that it doesn't, once those costs are accounted for. The cultural points are largely valid for large organisations, but I feel like it positions microservices as an antidote to waterfall methods and long release cycles, which I think could be more effectively addressed with agile practices and CI+CD.
- Josh Waitzkin: And a lot of what I work on with guys is creating rhythms in their life that really are based on feeding the unconscious mind, which is the wellspring of creativity information and then tapping it. So for example, ending the workday with high quality focus on a certain area of complexity where you can use an insight and then waking up first thing in the morning creating input and applying your mind to it, journaling on it. Not so much to do a big brainstorm, but to tap what you've been working on unconsciously overnight. Which of course, is a principle that Hemingway wrote about when he spoke about the two core principles in his creative writing process, number one ending the workday with something left to write and -- Tim Ferriss: Yeah, often in mid-sentence even. Josh Waitzkin: Right. So not doing everything he had to do. Which most people do, but they feel this sense of guilt if they're not working. You and I have discussed this at length, but leaving something left to write and then the second principle, release your mind from it. Don't think about it all night. Really let go. Have a glass of wine. Then wake up first thing in the morning and reapply your mind to it. And it's amazing because you're basically feeding the mind complexity and then tapping that complexity or tapping what you've done with it. This rhythm, the large variation of it is overnight, and then you can have microbursts of it throughout the day. Before workouts pose a question, do a workout, release your mind after workout, return to it, and do creative bursts. Before you go to the bathroom, before you go to lunch, before anything. And in that way you're systematically training yourself to generate the crystallization experience, that ah-ha moment that can happen once a month or once a year. 
A lot of what I do is work on systems to help it happen once a day or four times a day, and when we're talking about guys who run financial groups of $20 to $30 billion, for example, if they have a huge insight that can have unbelievable value. If you can really train people to get systematic about nurturing their creative process, it's unbelievable what can happen and most of that work relates to getting out of your own way. It's unloading. It's the constant practice of subtraction, reducing friction.
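
jdleesmiller's third point above (independent scalability) can be made concrete with a little arithmetic. What follows is a minimal sketch, not a capacity model: the two load curves are made up, and the function names are hypothetical. It compares sizing each service for its own peak against sizing one shared pool for the combined peak when the two modules are busy at opposite times of day.

```python
import math
from typing import List


def hourly_load(peak_hour: int, base: float, amplitude: float) -> List[float]:
    """Made-up daily load curve: a flat base plus a half-cosine bump around peak_hour."""
    return [
        base + amplitude * max(0.0, math.cos((hour - peak_hour) * math.pi / 12))
        for hour in range(24)
    ]


def separate_capacity(loads: List[List[float]]) -> float:
    """Each service is sized for its own peak, so the peaks add up."""
    return sum(max(load) for load in loads)


def pooled_capacity(loads: List[List[float]]) -> float:
    """One shared pool (the monolith case) is sized for the peak of the combined load."""
    return max(sum(hour_loads) for hour_loads in zip(*loads))


# Two hypothetical modules with inversely correlated load: one busy at noon,
# the other busy at midnight. Units are arbitrary "servers' worth" of traffic.
daytime_module = hourly_load(peak_hour=12, base=10.0, amplitude=90.0)
nighttime_module = hourly_load(peak_hour=0, base=10.0, amplitude=90.0)
loads = [daytime_module, nighttime_module]

print("separate:", separate_capacity(loads))
print("pooled:  ", pooled_capacity(loads))
```

With these invented curves the separately provisioned services need 200 units between them while the shared pool needs 110. Real workloads are rarely this neatly anti-correlated, so the gap will usually be smaller, but the direction of the effect is the one the comment describes.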

Why does Uber test in production? Testing in Production at Scale.
- 600 cities. 64 countries. 75m active riders. 3m active drivers. 15m trips per day. 10b cumulative trips. 1000s of microservices. 1000s of commits per day.
- Less operational cost of maintaining a parallel stack. One knob to control capacity. No synchronization required. More accurate end-to-end capacity planning. Ensures stack can handle load. Test traffic takes same code path as production traffic. Enables other use cases like Canary, Shadowing, A/B testing.
- Tenancy Oriented Architecture. How to make different tenants, testing and production, coexist. Want isolation between test & production, tenancy-based access control, and minimal deviation between test and production environments. Want to use the same stack for both. Must be able to support multiple architectures at the same time.
- Tenancy Building Blocks. Attach a tenancy context to everything. Tenancy aware infrastructure. Tenancy aware environments. Tenancy aware routing.
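
The building blocks above are easier to picture with a toy example. This is a minimal sketch under stated assumptions, not Uber's implementation: `Request`, `TenancyRouter`, and `create_trip` are hypothetical names, and an in-memory dict stands in for real storage. It shows a tenancy context attached to every request, one shared code path for both tenants, and routing that keeps test data isolated from production data.

```python
from dataclasses import dataclass
from typing import Dict

PRODUCTION = "production"
TEST = "test"


@dataclass(frozen=True)
class Request:
    """A request that always carries its tenancy context."""
    path: str
    payload: dict
    tenancy: str = PRODUCTION


class TenancyRouter:
    """Tenancy-aware routing: picks an isolated data store per tenant."""

    def __init__(self) -> None:
        # Hypothetical stand-ins for real, physically or logically separate stores.
        self._stores: Dict[str, Dict[str, dict]] = {PRODUCTION: {}, TEST: {}}

    def store_for(self, request: Request) -> Dict[str, dict]:
        if request.tenancy not in self._stores:
            raise PermissionError(f"unknown tenancy: {request.tenancy!r}")
        return self._stores[request.tenancy]

    def count(self, tenancy: str) -> int:
        return len(self._stores[tenancy])


def create_trip(router: TenancyRouter, request: Request) -> Request:
    """Same handler for test and production; only the routed store differs."""
    store = router.store_for(request)
    trip_id = f"trip-{len(store) + 1}"
    store[trip_id] = dict(request.payload)
    # Any downstream call inherits the caller's tenancy context.
    return Request(path="/pricing", payload={"trip_id": trip_id}, tenancy=request.tenancy)


router = TenancyRouter()
downstream = create_trip(router, Request("/trips", {"rider": "synthetic-1"}, tenancy=TEST))
print(downstream.tenancy, router.count(TEST), router.count(PRODUCTION))
```

The design choice worth noticing is that the handler never branches on tenancy. Only the routing layer does, which is what lets test traffic exercise the same code path as production traffic.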

Usenix Fast'19 and NSDI'19 videos are now available.

Can Luminary become the Netflix of podcasting? Unlikely. The production and distribution costs for podcasts are simply too low for an aggregator to gain leverage. YouTube worked because streaming video is technically hard (and expensive). Audio isn't. Netflix works because creating high quality video content is hard. We already have a near endless supply of high quality niche podcast content. Where can an aggregator add value? One area is in the interface/platform. Podcasting is stuck in the past because nobody is in a position to drive the platform forward. Podcasts have not enriched the underlying audio data type at all. Ebooks are similarly stunted. The Kindle should be capable of so much more. We have supercomputers in our pockets! What do they do? Display content at the level of stone knives and bear skins. Multimedia apps in the 1990s were far more advanced than what we have today. This is the Apple play. Make a user experience that is so much better that consumers simply want to use it more than any other podcast platform. For example, we can't do a simple thing like share an audio snippet. Podcasts have zero viral loops. And we still can't read the text of a podcast. That's been a solved problem for several years now. Apple is just an RSS feed, they don't care about podcasting as a platform. If someone seriously made the underlying podcast platform better, that might do it. Just aggregating content won't be enough. Also, Epic Games Boss Says They'll Stop Doing Exclusives If Steam Gives Developers More Money.

This. IAM Is The Real Cloud Lock-In. And this. Your CS Degree Won’t Prepare You For Angry Users, Legacy Code, or the Whims of Other Engineers.

Which tech stack(s) is AWS built on? in0pinatus: I am ex-AWS and therefore qualified to comment. Everyone knows that almost everything runs on EC2, but it's a little-known fact that most of the control plane is actually a collection of PHP scripts. However for backwards internal compat reasons they are stuck on PHP4.3. The hardware itself is bought ad-hoc from Fry's whenever needed and AWS uses spare Warehouse Fulfilment Associates to run it over to the DCs. When people say that "Logistics is at the heart of Amazon" this is what they mean. S3 isn't in PHP though, it's actually written in Perl and like all Perl code is littered with Star Wars in-jokes and data structures invented especially for the purpose. I never want to see another Classified Cumulative Purple Tree as long as I live. Glacier's name is in jest, it's actually a massive array of vinyl records and the name is a pun about the heatsinks used for data erasure; the whole thing is actually designed to help cool the AZs, which is why Glacier launched in hot climate regions first. Jeff did indeed buy a chip factory, but bought it in the UK by mistake and the resulting megashipment of potato crisp products is the reason he had to buy Wholefoods as well. James Hamilton is completely absent from AWS these days, having taken the leadership principle of Own A Ship to the ultimate extreme, and will only be upstaged when Bezos succeeds in his plan - hatched after he misread a six-pager about LP revisions during a particularly gruelling OP session - to Own A Space Ship and Earn Thrust. Finally, and most horrifyingly of all, the ghastly secret behind "Serverless" lambda is that it really doesn't use servers. Lambda actually uses the collective spare brain capacity of every Java programmer trying to fix mysterious Gradle issues on misconfigured ECS containers. When they say "AWS runs on Java" this is what is really meant.
Andy Jassy got the idea after a broken Echo read the whole of the Hyperion Cantos to him and wouldn't respond to "ALEXA STOP". Nice!

BBC iPlayer: Architecting for TV: we basically moved to the idea of using server side render wherever we could. This meant that we have a hybrid app where some of the logic is in the client, but then a lot of it is HTML, it's JavaScript chunks, it's CSS built by the back end and then swapped out in the right places on the client. Performance massively increased, our lower performing devices loved it. Memory usage really went down. The garbage collection seemed to kick in and manage better with that. There was less need for the TAL abstractions. And importantly, we had way less logic in the front end, so it was much easier to test and reason with that. This is before and after. We reduced memory usage quite significantly despite actually tripling the number of images and the complexity of the UI that we had. We were using significantly less memory, performance was better and we were motoring again.

Develop Hundreds of Kubernetes Services at Scale with Airbnb: As of a week ago, 50% of our services are now in Kubernetes. Our goal is by the end of H1 this year, for the rest of our services to be in Kubernetes. This is about the services we know, so several hundred services. About 250 of those are in the critical production path. Some of these are more critical than others. But for example, when you go to airbnb.com, and you look at that homepage, that's a Kubernetes service. When you type into the search bar, "I am looking for homes in London this weekend to stay at," that's also a Kubernetes service. Similarly, when you're booking that Airbnb and making those payments, that's also going through Kubernetes services. The point of this talk: Kubernetes out-of-the-box worked for us, but not without a lot of investment. There were a lot of issues that we noticed right away and we had to figure out how to solve them. These are the 10 takeaways that I wrote up individually at different points in the talk. We started with the configuration itself, abstracting away that complex Kubernetes configuration. Then one strong opinion we took is standardizing on environments and namespaces, and that really helped us at different levels, especially with our tooling. Everything about a service should be in one place in Git. Once we store it in Git, we can get all these other things for free. So we can make best practices a default by generating configuration and storing that in Git. We can version it and refactor it automatically, which just reduces developer toil and also gets important security fixes, etc., in on a schedule. We can create an opinionated tool that basically automates common workflows, and we can also distribute this tool as a kubectl plugin. We can integrate with Kubernetes for the tooling. CI/CD should run the same commands engineers run locally in a containerized environment. You can validate configuration as part of CI and CD.
Code and configuration should be deployed with the same process. In that case, that's our deploy process, custom controllers to deploy custom configuration. And then you can just use custom resources and custom controllers to integrate with your infrastructure.

Bug Juice, Amazon Envy: Inside Airbnb’s Engineering Challenges: Airbnb is shifting its code to what is known as a services architecture. Like many startups, Airbnb originally grew its business off a single, interconnected code base written in the coding language Ruby on Rails. That meant that as the company grew, software engineers would have to make updates to the site one at a time, slowing down how quickly the work could progress. The new technical infrastructure will allow engineers to do things like make changes to the code powering messaging between hosts and guests without affecting engineers working on the listing results page. But the technical work is expensive. The company plans to grow its 1,000-person engineering staff by 40% to 50% this year, people briefed on the matter said. The infrastructure changes also are driving up Airbnb’s spending on AWS compute and storage. Airbnb sharply increased its AWS usage in the early part of this year, a person familiar with the matter said. Last year, Airbnb’s AWS bill was more than $170 million, which was in line with what the company projected to spend for the year, the person said. AWS didn’t comment for this article. “We’re just under-resourced. Compared to Uber, we’re completely under-penetrated on engineers,” a person close to the company said.

Tinder’s move to Kubernetes: It took nearly two years, but we finalized our migration in March 2019. The Tinder Platform runs exclusively on a Kubernetes cluster consisting of 200 services, 1,000 nodes, 15,000 pods, and 48,000 running containers. Infrastructure is no longer a task reserved for our operations teams. Instead, engineers throughout the organization share in this responsibility and have control over how their applications are built and deployed with everything as code.

James Hamilton has the description of Tesla's new ASIC you've been looking for: Overall, it’s a nice design. They have adopted a conservative process node and frequency. They have taken a pretty much standard approach to inference by mostly leaning on a fairly large multiply/add array. In this case a 96×96 unit. What’s particularly interesting to me is what’s around the multiply/add array and their approach to extracting good performance from a conventional low-cost memory subsystem. I was also interested in their use of two redundant inference chips per car with each exchanging results each iteration to detect errors before passing the final plan (the actuator instructions) to a safety system for validation before the commands are sent to the actuators. Performance and price/performance look quite good.

The GRAND stack is GraphQL, React, Apollo, Neo4j. And there's also Elixir, Phoenix, Absinthe, GraphQL, React, and Apollo. benwilson-512: We've also felt the complexity of React and Apollo. It works best when, as others mentioned, you've got distinct teams that can focus on each part. In situations where that isn't the case the same decouplings that make it easier for teams to operate independently just add overhead and complexity. We're in a similar boat these days so in fact our latest projects are back to simple server side rendering, but we're still making the data retrieval calls with GraphQL. It ensures that the mobile app and reporting tools we're also developing will have parity, and we don't need to write ad hoc query logic for each use case. The built in docs and validations have simply proven too useful to pass up, and you really don't need a heavy weight client to make requests.

ThoughtWorks Technology Radar Vol. 20. It's a list of different technologies and whether they think you should adopt, trial, assess, or hold on them. For example, in the techniques section you should adopt: Four key metrics; Micro frontends; Opinionated and automated code formatting; Polyglot programming; Secrets as a service. trial: Chaos Engineering; Container security scanning; Continuous delivery for machine learning (CD4ML) models; Crypto shredding; Infrastructure configuration scanner; Service mesh. assess: Ethical OS; Smart contracts; Transfer learning for NLP; Wardley mapping. hold: Productionizing Jupyter Notebooks; Puncturing encapsulation with change data capture; Release train; Templating in YAML. They have sections on Platforms, Languages & Frameworks, and Tools.

Good advice on when you should use a ledger vs a blockchain. AmazonWebServices: QLDB is aimed at applications that require a complete and verifiable record of all changes to the database. Amazon Managed Blockchain is aimed at applications where you have multiple parties that wish to interact through a blockchain. Customers building on QLDB will trust that AWS is faithfully executing their SQL statements to update the current and history views of their data. But once the journal transactions are published, they cannot be changed even by AWS without detection. An example of a customer that might benefit from QLDB is a logistics company. When they receive a shipment from a supplier and forward it to the receiver, they can record the relevant information in QLDB and publish periodic journal digests to all of their customers. Later, if the supplier or receiver claims that the updates didn’t happen in a timely manner and wants to audit the update trail, the company can give them direct access to the journal from AWS and they can verify that the transactions were executed at the time. A value for them is that QLDB can help them prove to their customers that history hasn’t changed.

Stanford has The Silicon Genesis collection, which gathers together roughly 100 oral histories and interviews with the people who conceived, built and worked in the semiconductor industry centered in Silicon Valley since the 1950s.

While important to avoid overoptimism, cynics about the lack of real-world applications are missing a fundamental point. We're starting to see AI solve a class of "unsolvable" problems and then solved in a weekend. I'm nervously excited about AI. Nervous about the disruption to society, but excited about the improvements to health and climate change. Change doesn't happen overnight, but when a team has the right algorithms, parameters and scale, it can happen in two days. Takeaways from OpenAI Five (2019): 2019. Rough year for professional gamers. Great year for AI research. OpenAI's Dota 2 and AlphaStar have outright or almost beat the best gamers with limited handicaps...Deep Reinforcement Learning Scales on Some Grand Challenges...OpenAI's 2018 analysis showed the amount of AI compute is doubling every 3.5 months. TI9 is no exception; it had 8x more training compute than TI8. TI9 consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months...AlphaGo and AlphaStar's breakthroughs are also attributed to scaling existing algorithms...self-play has a large benefit: human biases are removed. It comes with expensive tradeoffs: increased training times and difficulty in finding proper reward signals can prevent model convergence...Impact on Human Psyche - Humans play with hesitation. Bots don't. This is a jarring experience. Professional Go player described "it is like looking at yourself in the mirror, naked". This often results in humans playing slightly from his/her normal play styles (likely sub-optimal)....

An in-depth summary of Key Takeaway Points and Lessons Learned from QCon London 2019. @danielbryantuk: "Accidental microservices" at Google, via @el_bhs at #qconlondon "Microservices here largely emerged from the requirement of running applications at planetary scale" @lizthegrey: Because it's so much more expensive to do a remote RPC than do a function call in the same process, we're making two steps back performance-wise when we make one step forward decoupling things. Read Hellerstein's paper for more. #QConLondon @lizthegrey: It's a 1000x performance hit to move things across process boundaries. And it magnifies market dominance of proprietary solutions. Think about when it makes sense to deploy functions to the edge, but proceed with caution for your core performance critical services. #QConLondon

How is Fauna different from CockroachDB or Spanner? Learn the secrets in Unpacking Fauna: A Global Scale Cloud Native Database - Episode 78.

Before TCP/IP a babel of different networking protocols made it so computers from different lands could not talk to each other. Shocking, isn't it? Relive those barbaric times in this excellent History of DECnet with Dave Oran podcast episode. Crave more history? Can you imagine a time when you would call someone up because they were cussing on the internet? That and more in The History of the ISP Industry With Sonic's Dane Jasper.

True, but it shouldn't be. Storing HD photos in a relational database: recipe for an epic fail.

Before you turn on HTTP/2, make sure that your application can handle the differences. HTTP/2 has different traffic patterns than HTTP/1.1, and your application may have been designed specifically for HTTP/1.1 patterns. Why Turning on HTTP/2 Was a Mistake: with HTTP/2, the browser can now send all HTTP requests concurrently over a single connection. Multiplexing substantially increased the strain on our servers. First, because they received requests in large batches instead of smaller, more spread-out batches. And secondly, because with HTTP/2, the requests were all sent together—instead of staggered like they were with HTTP/1.1—so their start times were closer together, which meant they were all likely to time out...server support for HTTP prioritization is spotty at best. Many CDNs and load balancers either don’t support it at all, or have buggy implementations, and even if they do, buffering can limit its effectiveness.

Beating round-trip latency with Redis pipelining: What is actually different here? Instead of sequentially doing the requests, we pipeline them. This means we send the requests without waiting on the responses. We send as many requests as our machine can, and the responses come in whenever they're ready from the server. This means if we take the same latency figures as above, we can assume the time to retrieve all these ttls is going to be around 100ms! For 40 keys, the latency will probably be a bit higher, but closer to 100ms than 400ms for sure! 400 keys? Much closer to 100ms than 4000ms!
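To make the arithmetic concrete, here is a minimal sketch that simulates the effect with `asyncio` rather than talking to a real Redis server. `fake_ttl`, `ROUND_TRIP_SECONDS`, and the key names are hypothetical stand-ins invented for illustration; a real client would use its own pipelining support (redis-py, for example, exposes a `pipeline()` object), and real timings depend on the network and server.

```python
import asyncio
import time

# Hypothetical round-trip latency, shortened so the demo runs quickly.
ROUND_TRIP_SECONDS = 0.02


async def fake_ttl(key: str) -> int:
    """Stand-in for a TTL command: each call costs one simulated round trip."""
    await asyncio.sleep(ROUND_TRIP_SECONDS)
    return len(key)  # dummy value, not a real TTL


async def sequential(keys):
    # Wait for each response before sending the next request.
    return [await fake_ttl(key) for key in keys]


async def pipelined(keys):
    # Send every request up front; the round trips overlap.
    return await asyncio.gather(*(fake_ttl(key) for key in keys))


def timed(coro_fn, keys):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(keys))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    demo_keys = [f"key-{i}" for i in range(10)]
    _, seq_elapsed = timed(sequential, demo_keys)
    _, pipe_elapsed = timed(pipelined, demo_keys)
    print(f"sequential: {seq_elapsed * 1000:.0f} ms")
    print(f"pipelined:  {pipe_elapsed * 1000:.0f} ms")
```

In the simulation the sequential cost grows with the number of keys, while the pipelined cost stays near a single round trip, which is the shape of the claim in the quote.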

Just how expensive is the full AWS SDK?: WebPack improves the Initialization time across the board. Without any dependencies, Initialization time averages only 1.72ms without WebPack and 0.97ms with WebPack. Adding AWS SDK as the only dependency adds an average of 245ms without WebPack. This is fairly significant. Adding WebPack doesn’t improve things significantly either. Requiring only the DynamoDB client (the one-liner change discussed earlier) saves up to 176ms! In 90% of the cases, the saving was over 130ms. With WebPack, the saving is even more dramatic.

It's nice to see the use of queues to solve a queue problem instead of using a database. Copy Millions of S3 Objects in minutes: All-in-all copying a million files from one bucket to another, took 8 minutes if they’re in the same region. For cross-region testing, it took 25 minutes to transfer the same million files between us-west-1 and us-east-2. Not bad for a solution that would fit easily into your free AWS tier...Using an SQS we’re able to scale our lambda invocations up in an orderly fashion, which allows our back-ends (e.g. DynamoDB or S3) to also scale up in tandem. Instantly invoking 800 lambdas to read an S3 bucket is a recipe for Throttling and the dreaded ClientError.
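The general idea, a queue sitting between the work and a bounded set of workers, can be sketched locally. This is not the article's code and uses no AWS APIs: `queue.Queue` stands in for SQS, threads stand in for Lambda invocations, and `drain_with_workers` and `copy_fn` are hypothetical names chosen for illustration.

```python
import queue
import threading
import time


def drain_with_workers(keys, max_workers, copy_fn):
    """Copy every key using at most max_workers concurrent workers.

    Returns (copied_keys, peak_concurrency).
    """
    work = queue.Queue()
    for key in keys:
        work.put(key)

    lock = threading.Lock()
    copied = []
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        while True:
            try:
                key = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                copy_fn(key)  # the back-end call that could be throttled
            finally:
                with lock:
                    active -= 1
                    copied.append(key)

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return copied, peak


if __name__ == "__main__":
    object_keys = [f"object-{i}" for i in range(50)]
    done, peak_seen = drain_with_workers(
        object_keys, max_workers=5, copy_fn=lambda key: time.sleep(0.001)
    )
    print(f"copied {len(done)} objects, peak concurrency {peak_seen}")
```

However many keys are queued, the back end never sees more than `max_workers` requests in flight, which is the property that avoids a burst of hundreds of simultaneous calls.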

I wouldn't have thought of this case either. Facebook's AI missed Christchurch shooting videos filmed in first-person. Also, Applied machine learning at Facebook - Kim Hazelwood (Facebook)

Interested in side channel analysis? This ChipWhisperer wiki page is for you. ChipWhisperer is an open source toolchain dedicated to hardware security research.

The test is a medium.com clone. But really, how hard is it to display "You read a lot. We like that."? A RealWorld Comparison of Front-End Frameworks with Benchmarks (2019 update). Small footprint: Svelte, Stencil, and AppRun. Small code base: ClojureScript with re-frame, AppRun and Svelte. Fastest: AppRun, Elm, Hyperapp.

rancher/k3os: a linux distribution designed to remove as much as possible OS maintenance in a Kubernetes cluster. It is specifically designed to only have what is needed to run k3s. Additionally the OS is designed to be managed by kubectl once a cluster is bootstrapped.

facebook/folly (article): a 14-way probing hash table that resolves collisions by double hashing. Up to 14 keys are stored in a chunk at a single hash table position. Folly’s F14 is widely used inside Facebook. F14 works well because its core algorithm leverages vector instructions to increase the load factor while reducing collisions, because it supports multiple memory layouts for different scenarios, and because we have paid attention to C++ overheads near the API. F14 is a good default choice — it delivers CPU and RAM efficiency that is robust across a wide variety of use cases.

Microsoft/BosqueLanguage: The Bosque programming language is a Microsoft Research project that is investigating language designs for writing code that is simple, obvious, and easy to reason about for both humans and machines. The key design features of the language provide ways to avoid accidental complexity in the development and coding process. The result is improved developer productivity and increased software quality, and it enables a range of new compilers and developer tooling experiences.

FAST '19 - Reaping the performance of fast NVM storage with uDepot. Optane can provide ~.06 Mops/sec at ~10 microseconds of latency. Storage is no longer the bottleneck, the network is. The problem is that existing KV stores are built for slower devices: they use sync IO, rely on data structures with inherent IO amplification, cache data in DRAM, and carry a rich feature set. RocksDB is 7x slower than necessary.

FAST '19 - Optimizing Systems for Byte-Addressable NVM by Reducing Bit Flipping. We were able to reduce the number of bits flipped by up to 3.56× over standard implementations of the same data structures with negligible overhead. We measured the number of bits flipped by memory allocation and stack frame saves and found that careful data placement in the stack can reduce bit flips significantly. These changes require no hardware modifications and neither significantly reduce performance nor increase code complexity, making them attractive for designing systems optimized for NVM.
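As a rough illustration of the metric being discussed, the number of bits flipped by overwriting one value with another is the population count of their XOR. The sketch below is an assumption about how one might count flips in software, not the paper's measurement method; `bits_flipped` is a hypothetical helper name.

```python
def bits_flipped(old: bytes, new: bytes) -> int:
    """Count the bit positions that differ between two equal-length buffers."""
    if len(old) != len(new):
        raise ValueError("buffers must have the same length")
    # XOR leaves a 1 wherever a bit changes; count those 1 bits per byte.
    return sum(bin(a ^ b).count("1") for a, b in zip(old, new))


if __name__ == "__main__":
    seven = (7).to_bytes(8, "little")
    eight = (8).to_bytes(8, "little")
    six = (6).to_bytes(8, "little")
    # 0b0111 -> 0b1000 changes four bits; 0b0111 -> 0b0110 changes one.
    print(bits_flipped(seven, eight))  # 4
    print(bits_flipped(seven, six))    # 1
```

The example shows why the choice of what gets written where matters: incrementing 7 to 8 flips four bits, while moving from 7 to 6 flips only one.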

Haxe (video). A language well known in the game industry. Haxe is a strictly-typed (and type-inferring) programming language with a diverse set of influences, including OCaml, Java and ActionScript. Its syntax will be familiar to anyone who’s worked with modern OO languages, however it has features you’d expect in a meta language, such as: everything’s-an-expression, compile-time code manipulation and pattern matching. In addition, it boasts an unusual talent; it can generate code in other programming languages.

apache/pulsar: a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.

apache/incubator-hudi: manages storage of large analytical datasets on HDFS and serves them out via two types of tables.

DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching: We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching.
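The key idea in the abstract can be sketched as a toy simulation. This is an illustration under simplifying assumptions, not the paper's implementation: two cache layers with the same number of nodes, salted SHA-256 standing in for the independent hash functions, a made-up skewed workload, and a greedy rule that sends each query to the less loaded of its two candidate nodes. All function names are hypothetical.

```python
import hashlib
from collections import Counter


def node_for(key: str, salt: str, num_nodes: int) -> int:
    """Map a key to a node; a different salt acts as an independent hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def route_single_layer(queries, num_nodes):
    """Baseline: every query for a key lands on its one upper-layer node."""
    load = Counter()
    for key in queries:
        load[node_for(key, "upper", num_nodes)] += 1
    return load


def route_two_choices(queries, num_nodes):
    """Each key has one candidate node per layer; pick the less loaded one."""
    upper, lower = Counter(), Counter()
    for key in queries:
        u = node_for(key, "upper", num_nodes)
        l = node_for(key, "lower", num_nodes)
        if lower[l] < upper[u]:
            lower[l] += 1
        else:  # ties stay on the upper layer
            upper[u] += 1
    return upper, lower


def skewed_queries(num_objects=32, base=200):
    """Hypothetical workload: object i is queried base // (i + 1) times."""
    queries = []
    for i in range(num_objects):
        queries.extend([f"obj-{i}"] * (base // (i + 1)))
    return queries


if __name__ == "__main__":
    workload = skewed_queries()
    single = route_single_layer(workload, num_nodes=8)
    upper, lower = route_two_choices(workload, num_nodes=8)
    print("max load, one layer: ", max(single.values()))
    print("max load, two layers:", max(list(upper.values()) + list(lower.values())))
```

Because a hot object's two candidate nodes are chosen by independent hashes, its queries can spill onto whichever layer has spare capacity, so the busiest node in the two-layer setup is never busier than in the single-layer baseline.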

Liquid brains, solid brains: How distributed cognitive architectures process information: Other systems are formed by sets of agents that exchange, store and process information but without persistent connections or move relative to each other in physical space. We refer to these networks that lack stable connections and static elements as ‘liquid’ brains, a category that includes ant and termite colonies, immune systems and some microbiomes and slime moulds. What are the key differences between solid and liquid brains, particularly in their cognitive potential, ability to solve particular problems and environments, and information-processing strategies? To answer this question requires a new, integrative framework.

Modular structure within groups causes information loss but can improve decision accuracy: We find that modular structure necessarily causes a loss of information, effectively silencing the input from a fraction of the group. However, the effect of this information loss on collective accuracy depends on the informational environment in which the decision is made. In simple environments, the information loss is detrimental to collective accuracy. By contrast, in complex environments, modularity tends to improve accuracy. This is because small group sizes typically maximize collective accuracy in such environments, and modular structure allows a large group to behave like a smaller group (in terms of its decision-making). These results suggest that in naturalistic environments containing correlated information, large animal groups may be able to exploit modular structure to improve decision accuracy while retaining other benefits of large group size.

NEEXP in MIP*: The main result of this work is the inclusion of NEEXP (nondeterministic doubly exponential time) in MIP*. This is an exponential improvement over the prior lower bound and shows that proof systems with entangled provers are at least exponentially more powerful than classical provers.

Maybe storing important stuff in public is a bad unfuture proofable idea? The Blockchain Bandit: Finding Over 700 Active Private Keys On Ethereum's Blockchain: In this paper we examine how, even when faced with this statistical improbability, ISE discovered 732 private keys as well as their corresponding public keys that committed 49,060 transactions to the Ethereum blockchain. Additionally, we identified 13,319 Ethereum that was transferred to either invalid destination addresses, or wallets derived from weak keys that at the height of the Ethereum market had a combined total value of $18,899,969. In the process, we discovered that funds from these weak-key addresses are being pilfered and sent to a destination address belonging to an individual or group that is running active campaigns to compromise/gather private keys and obtain these funds. On January 13, 2018, this “blockchainbandit” held a balance of 37,926 ETH valued at $54,343,407.

Compress Objects, Not Cache Lines: An Object-Based Compressed Memory Hierarchy: Zippads transparently compresses variable-sized objects and stores them compactly. As a result, Zippads consistently outperforms a state-of-the-art compressed memory hierarchy: on a mix of array- and object-dominated workloads, Zippads achieves 1.63× higher compression ratio and improves performance by 17%.

Using Fault-Injection to Evolve a Reliable Broadcast Protocol: This is the first article in a series about building reliable fault-tolerant applications with Partisan, our high-performance, distributed runtime for the Erlang programming language. As part of this project, we will start with some pretty simple protocols and show how our system will guide you in adjusting the protocol for fault-tolerance issues

Seven Sketches in Compositionality: An Invitation to Applied Category Theory: The purpose of this book is to offer a self-contained tour of applied category theory. It is an invitation to discover advanced topics in category theory through concrete real-world examples. Rather than try to give a comprehensive treatment of these topics—which include adjoint functors, enriched categories, proarrow equipments, toposes, and much more—we merely provide a taste of each. We want to give readers some insight into how it feels to work with these structures as well as some ideas about how they might show up in practice.
