Stuff The Internet Says On Scalability For May 14th, 2020
Hey, it's HighScalability time!
LOL. Who knew a birthday service could lead to an existential crisis?
Do you like this sort of Stuff? Without your support on Patreon this kind of Stuff can't happen. You are that important to the fate of the smart and thoughtful world.
Know someone who wants to understand the cloud? I wrote Explain the Cloud Like I'm 10 just for them. On Amazon it has 111 mostly 5 star reviews. Here's a recent non-randomly selected review:
Number Stuff:
- 4.8 billion: plays of Drake - Toosie Slide on TikTok. 1.6M on YouTube. As of Saturday May 9.
- 80%: of enterprises will shut down their traditional datacenters by 2025, versus 10% today.
- 32%: of US CFOs anticipate layoffs in the next 6 months. 49% plan on making remote work a permanent option.
- 50%+: expected drop in NY Times advertising next quarter. Added 587,000 net new digital subscriptions compared to the end of the fourth quarter of 2019.
- 50%: of online ad spending goes to industry middlemen. Out of a total of 267m of ads placed online, it was only possible to match the end-to-end process for 31m.
- 22K: new mobile games on Apple app store. Down from 285K in 2016.
- 2 billion: TikTok downloads. 5 months ago they were at 1.5 billion. Generated the most downloads for any app ever in a quarter, accumulating more than 315 million installs across the App Store and Google Play.
- 14.7%: US unemployment rate. 20.5 million people lost their jobs in April.
- >0: Dropbox’s first quarterly profit
- $4,800: monthly revenue from the 2 million subscriber TierZoo YouTube channel, down from $11,000 a month. Advertising is down all around. Two-thirds of income comes from sponsorship deals inked separately with advertisers.
- 3.3: workers replaced by each additional robot. That increased use of robots in the workplace also lowered wages by roughly 0.4 percent. It turns out robots are good at competing against people.
- 65%: Apple's gross margin on services. 515 million subscriptions. 1B+ active iPhones.
- $10 billion: AWS quarterly revenue. 33% growth on an annualized basis. AWS is now bigger than all of Oracle.
- 59%: Growth in Azure cloud.
- 1,000 light-years: closest black hole to Earth.
- 30,000: bugs generated each month by 47,000 Microsoft programmers. An AI system can find 97 percent of critical and non-critical security bugs.
- 29: day cluster trace — a record of every job submission, scheduling decision, and resource usage data for all the jobs in a Google Borg compute cluster, from May 2011.
- $27 Billion: cost of cybercrime damages by 2025.
- 70%: of @DynamoDB IOPS and storage is being consumed by attribute names. Short names will eliminate 14K WCU at peak and reduce table size by 3.5TB.
- ~2x: traffic at Shopify.
- $6.9 billion: raised by AI startups in Q1 2020.
- more: zero-day exploits in 2019 than any of the previous three years. Private companies are likely creating and supplying a larger proportion of zero-days than they have in the past.
- 200 million: packages delivered every day by pypi.org.
- $2.7 billion: total cybercrime loss in 2018 in the US according to the FBI. Investment scams were #1.
Quotable Quotes:
- @sfiscience: Every computation must produce heat, but both energy consumption and heat production can be outsourced — by, say, human leadership, distributed computing, or a simple virus. SFI Prof David Wolpert on Landauer's Bound
- Urs Hölzle: Efficiency improvements have kept data center energy usage almost flat across the globe—even as demand for cloud computing has skyrocketed. Compute capacity up 550%, energy use up 6% (!) according to a paper in Science. Glad we've [Google] done our part!
- Lotem Finkelstein: The Naikon group has been running a longstanding operation, during which it has updated its new cyberweapon time and time again, built an extensive offensive infrastructure and worked to penetrate many governments across Asia and the Pacific
- latch: We recently migrated a few small systems to CockroachDB (as a stepping stone). Overall, the experience was positive. The hassle free HA is a huge peace of mind. I know people say this is easy to do in PG. I have recently set up 2ndQuadrant's pglogical for another system. That was also easy (though the documentation was pretty bad). The end result is quite different though and CockroachDB is just simpler to reason about and manage and, I think, more generally applicable.
- @johncutlefish: OH: “At Acme, we’re really good at spending $2 million+ on something. Less than $500k not so much. Less than $100k is credit card money almost.” Good reminder for “enterprise #SaaS startups”. The problem may be that you’re not charging enough for them to even pay attention.
- @rakyll: The goal was to abstract away the infrastructure. We instead converted every application developer into an infrastructure developer.
- @iamwcr: Our actual use-case is a little complex to go into in tweets. But suffice to say, the PUT costs alone to S3 if we did 1-to-1 would end up being just under half our total running costs when factoring in DDB, Lambda, SQS, APIG, etc.
- @paulg: Mere rate of shipping new features is a surprisingly accurate predictor of startup success. In this domain, at least, slowness is way more likely to be due to inability than prudence. The startups that do things slowly don't do them any better. Just slower.
- @lowellheddings: Publications that depend on extremely overpriced direct ads are having serious problems. Vox, Buzzfeed, etc, were able to command 10-20x ad prices compared to everybody else despite having low ROI. It's never going back
- Lev Grossman: He [William Gibson] said once that he was wrong about cyberspace and the internet when he first conceived it, he thought it was a place that we would all leave the world and go to. Whereas in fact, it came here.
- @Molson_Hart: Amazon just wiped out the affiliate advertising game. Lowered commissions from 10% to 3%, effective 7 days from today. This is what it is like to dance with the beast.
- @chamath: A case study will be written on how Microsoft allowed Zoom to eat their lunch. They spent millions on subterfuge trying to paint Slack as an inferior enemy when MSFT Teams actually can't do what Slack does and Teams' real competitor was Zoom. Now Zoom has 300M Daily Users. Lol.
- @tmclaughbos: Serverless will be the biggest LOL f*uck you on developers since full stack. They think they‘re buying into only having to focus on code and not infrastructure, but instead we are secretly turning them into distributed systems engineers more than developers.
- Rob Pike: So although it was gratifying and important to see Docker, Kubernetes, and many other components of cloud computing written in Go, it's perhaps not too surprising. Go has indeed become the language of cloud infrastructure.
- stanmancan: I run a website that gets decent ad revenue (usually around $2,000/month) and so far month to date in April is down 54% versus the same day last month, and March was already down 14% from February. CPC for March was $0.60 but April is only $0.30 so far. Traffic, CTR, and fill rate haven’t changed, just drastically lower CPC.
- John Conway: You know in my early twenties, let's say, people always thought that I would, you know, be a great mathematician. And be good at various things and so on. And in my late twenties, I hadn't achieved any of the things that people were predicting, and so I call it my black period. I started to wonder you know whether it was all nonsense. Whether I was not a good mathematician after all and so on, and then I made a certain discovery and was shot into international prominence. As a mathematician when you become a prominent mathematician, in that sense, it doesn't mean that many people know your name. It means that many mathematicians know your name, and there aren't many mathematicians in the world anyway, you know. So it doesn't count very much, but it suddenly released me from feeling that I had to live up to my promise. You know, I had lived up to my promise. I sort of made a vow to myself. It was so nice not worrying anymore that I thought I'm not going to worry anymore ever again, I was going to study whatever I thought was interesting and not worry whether this was serious enough. And most of the time I've kept to that.
- @Carnage4Life: 25% of advertisers have paused their ad campaigns while 46% reduced budgets. 75% of advertisers expect this to be worse impact than the housing crisis. Every business that relies on advertising is going to have the same story. More traffic, less money.
- @houlihan_rick: Here are the access patterns and #NoSQL schema for the Amazon Kindle Collection Rights Service. Running 9 distinct queries millions of times a day on RDBMS was a massive waste of CPU. Switching to @DynamoDB saved big $$.
- David Rosenthal: Productivity is the ratio of outputs to inputs. It is easy for companies or institutions to believe that by outsourcing, replacing internal opex and capex by payments to suppliers, they will get the same output for less input. But experience tends to show that in most cases the result is less output for more input. The Gadarene rush to outsource may be a major contributor to the notorious decay of productivity
- @pitdes: CEO of a 400+ employee business says WFH is working well, he may not renew his SF lease to save $10m a year (office + lunches etc) and instead do a couple of all-hands offsites a year (much cheaper). Lots of others must be thinking similarly- CRE could be permanently altered.
- @Grady_Booch: "Good design means that when I make a change, it’s as if the entire program was crafted in anticipation of it. I can solve a task with just a few choice function calls that slot in perfectly, leaving not the slightest ripple on the placid surface of the code." @munificentbob
- @_joemag: In recent design reviews, I'm observing more and more teams plan to write their new data planes in Rust. It looks like the tide has turned, and Rust is becoming the default choice for new data plane software.
- Giff Constable: I joined Meetup having been an organizer for many years. I had lived the frustration of being a customer. By the late-2010s, I felt like the main product had gotten worse over the years, not better. One cause was the overwhelming amount of technical debt that had built up in the 17-year old company. As my former colleague Yvette Pasqua, a brilliant CTO and a big reason why I joined in the first place, once said, “Meetup is carrying four to five times the amount of code that it should for what it does.” But, as is often the case, the problem was not solely technical. There were enormous amounts of design debt that increased the complexity of every change, as well as a need to improve how the teams were actually setting goals and going about their work.
- @levie: How IT strategies will change overnight: Some cloud -> 100% cloud Trusted devices -> Any device Protecting perimeter -> No perimeter Monolithic tools -> Best-of-breed apps UX secondary -> UX above all Employees only -> Extended enterprise
- @timbray: Just now an SQS team chat flurry caught my eye. Seems there was a queue with 253,477,653,099 unread messages. Owner was reading at only 1.56M messages/second. It’s gonna take them a while to clear the backlog. But they say it’s a backfill op, so OK. Everything worked fine.
- @sfiscience: "If you look at science over the last 50 years ask, what are the drivers of progress?, it's the ability of computers to simulate complex systems." J. Doyne Farmer (@Oxford)
- dredmorbius: Hierarchies appear where there is a clear gradient or flow. The canonical examples are river networks and trees (from which: dendritic). Even here, flows are often not entirely directed -- a river system experiences evaporation, precipitation, ground seepage, and aquifer flows, in addition to its dominant gravity-induced current. A tree has flows which originate at leaves (photosynthesis) and roots (water, mineral sourcing), as well as elsewhere (flowers, parasites and symbiotes, community flora and fauna). Similarly, evolutionary trees represent not only ancestral inheritance, but cross-species gene transfers -- bacteria, viruses, mitochondria, even other higher-order organisms (generally through an intermediary). A true ontology to me is a largely nondirected graph. This may be a union of several directed graphs, a region of low-directedness within a larger directed space, or (fairly rarely) a near-universal truly nondirected system. Given that even cosmological megastructures are defined in terms of attractors and repellers, that is: gradients, truly nondirected spaces seem likely rare. Though that's describing physical entities, conceptual spaces seem to me similar -- at least analogous, perhaps even more than that. A tremendous risk in conceptual spaces is to equate some gradient with a moral, ethical, or social value gradient.
- Hannu Rajaniemi: Tools always break. She should have remembered that.
- @houlihan_rick: Need operational analytics in #NoSQL? Maintain time bound rollups in @DynamoDB with Streams/Lambda then query relevant items by date range and aggregate client side for fast reporting on scaled out data. Turn complex ad hoc queries into simple select statements and save $$$
- @snowded: In the past every village had an idiot, and we could all deal with that. Now the internet is allowing idiots to connect and it is normalising idiocy.
- @DAlexForce: AWS DAX (DynamoDB Accelerator) is WICKED fast ⚡.Regular caching (left) is much different from DAX (right). Improving response from milliseconds to microseconds. Now that is caching on steroids.
- @tmclaughbos: Any critique of company's architectural choices needs to also include their business metrics too. X: That architecture is terrible! Me: Yet looking at their revenue, growth, and stock price it appears to be working just fine for them.
- @dabit3: 8 things that have made me a better programmer: - Be ok with not being the best - Small steps > no steps - Look for negative feedback - Don't be embarrassed to ask questions - Learn new tech early - Talk less, do more - A student / growth mindset - Master *something*
- @tqbf: The boy just asked if he should learn C++. It’s long past due for THAT conversation, and bad parenting that I waited for him to ask. I’m glad we caught him before he started experimenting on his own.
- @dvassallo: 2000 customers @ $39/month is almost $1M/year. If your software can handle 2000 customers, you can worry about scalability once the $1M/year is flowing in. Scalability doesn’t get you customers. First have customers, then worry about scalability. The order matters.
- @houlihan_rick: Data modeling in #NoSQL is about replacing JOINs with indexes. Store all objects in one table, decorate them with common attributes, and index those attributes to produce groupings of objects needed by the app. Time complexity of an index query is far less than a JOIN. @dynamodb
- @benedictevans: Google Meet and Microsoft teams combined are doing close to 5bn minutes of calling each day. The entire UK telecoms system does about 600m.
- @randompunter: Serverless does not imply microservices. We run full web applications in a single lambda (with static content off S3/Cloudfront).
- @timfox: Currently trying to "cloud first development" from home. Incredibly painful experience. This model seems fundamentally broken for the new normal of working from home as it assumes you're in an office with a fast upload connection so you can push images. Give me local dev any day.
- @igvolow: I’ve found that creating new microservices for a new project is the epitome of and posterboy for premature optimization. Make your code loosely coupled enough so any functionality can be easily extracted out into a microservice if need be. Embrace the modulith!
- @ajaynairthinks: OH: “a customer will always choose a generic alternative you pitch over their incumbent (tool) because it could fix everything, but rarely pick the specific alternative you end up building over the same incumbent unless it solves their specific problem”. #PollingFallacy
- @mikeb2701: I've literally seen latency spikes caused by the system page faulting in rarely used code. In some cases requiring mlocking the pages containing the shared objects to mitigate.
- @anshelsag: Anyone look at the @Steam hardware survey for April? @AMD now has 21.89% CPU market share to Intel's 78.28%. Up from 17.5% in November 2018. AMD gained over half a percent just from March to April which is pretty significant.
- David Rosenthal: Although there are significant technological risks to data stored for the long term, its most important vulnerability is to interruptions in the money supply.
- @troyhunt: There's been a significant uptick in @haveibeenpwned usage since COVID-19 restrictions. Way more registrations (previous spikes after major breaches), daily unique users up from about 150k to 200k (Cloudflare) and monthly users up 41% on previous 12 month high (Google Analytics)
- Hannu Rajaniemi: As you are no doubt aware, things are a little bit … restless on Earth at the moment.’ ‘If by restless, you mean eaten by recursively self-improving non-eudaimonistic agents, then, yes,
- John Gruber: So, yes, a $400 iPhone SE bests a $3,000 top-of-the-line MacBook Pro in single-core CPU performance.
- @txase: #Serverless is how all apps will be built in the future. The benefits are simply too great. 90% reduction in cost, for example, is what our customers see as well when they modernize apps.
- @joshelman: Zoom vs Google Meet is the difference of a singular company and product focus vs just thinking you are building a feature as part of an enterprise suite.
- @ShortJared: Fun fact... the entire content of the classic "Moby-Dick" fits into a single DynamoDB row with xz compression. Therefore, I conclude that @dynamodb is the ultimate docker container.
- Chip Overclock: This article is about the iterations I went through to establish my own permanent Differential GNSS base station, and to cobble together a Rover that I could use in the field for precision geolocation. I leveraged and significantly modified my existing Hazer GPS software (com-diag-hazer on GitHub), but most of that was just bookkeeping and glue code; the firmware in the u-blox chip does all of the heavy lifting, including sending and receiving the RTCM messages.
- @randompunter: There is an inflection point where EC2 is more cost effective but I feel it's much higher than what people realise. Lambda I can deploy some code and forget about it for a year. For EC2 I have to plan a 30 day machine re-image/rebuild maintenance schedule.
- @mjpt777: I know the math yet it always surprises me to see the difference in performance when buffers are appropriately sized. Flow is key to distributed systems yet so few people give it any consideration.
- Nestor: This is what they do to people! They put a machine on the floor, and if it has programming that doesn't take your money and you win on their machine, they will throw you in jail!
- Corey Quinn: Their “region-pairing” strategy indicates that you can think of each Azure region as an AWS or GCP Availability Zone. That’s great right up until it isn’t, and you wind up with a bunch of small regions rather than fewer more robust ones, and a sudden influx of demand causes those regions to run out of headroom.
- @AngieMaxwell1: Found the kid playing with her dog instead of Zooming with her teacher. She told me not to worry. She took a screenshot of herself “paying attention,” then cut her video & replaced it with the picture. “It’s a gallery view of 20 kids, mom. They can’t tell.” She is 10. #COVID19
- @simonw: SQL is a better API language than GraphQL. Convince me otherwise! To counter some obvious arguments...You don't have to expose your entire schema, instead expose carefully designed SQL views (so you can refactor your tables without breaking your API) Read-only, obviously! Use time limits to cut off expensive queries (GraphQL needs this too)
- @m_a_y_o_w_a: As a Software Engineer, the more you gain professional experience, the more you realize that there is no 'one size fits all' solution to all Engineering problems. Every potential solution comes with its pros and cons, the key decision is in what you are willing to tradeoff.
- L. B. Lewis: It’s an open secret in Silicon Valley that the tech industry suffers from ageism — “the stereotyping, prejudice, and discrimination against people on the basis of their age.” In a recent survey, it’s been shown that tech companies hire on the young side. And, by the age of 29, tech workers already start to feel the effects of ageism.
- Bastian Spanneberg: When we recently did a review of our cost structure with Mike Julian and Corey Quinn from Duckbill Group, they pointed out that we have a lot of volumes that could be actually switched from GP2 to ST1. This would be a significant cost reduction as ST1 volumes are roughly half the cost!
- datenwolf: Gaaah, please stop advertising optical computers as the technology that will overcome Moore's law. It makes no effing sense. Wavelength of the light emitted by these devices: ~4000nm. Latest generation commodity CPU transistor structure size: 7nm. Add to that that photons really don't like being trapped; you essentially need a delay line and optical amplifier to hold them indefinitely (that's essentially the core technology my whole PhD thesis centers around), it makes them a really impractical thing to store bits with. Things with a rest mass can be stored easily, though. Things like, say, electrons!
- @BrianRoemmele: I looked at the code of a fintech startup that had some of the best programmers with the best degrees and asked about their precision on floating point with their “agile back end”. Beautiful code. Brilliant. But it also cost the company ~4 million. They think different now. I looked at an “upgrade” to Python at a major insurance company. The company was losing $2 million dollars per year since moving off the COBOL platform and did not know why. It used floating point math with rounding errors that were not anticipated. They went back to COBOL.
- @JoeEmison: Running a serverless insurance startup, selling home, auto, renters, and umbrella in five states (http://ourbranch.com) on AWS. April AWS bill was just under $740. DynamoDB - $202 CodeBuild - $116 Cloudwatch - $100 S3 - $66 AWS Directory Service - $36. Our back end is AppSync, Cognito, Lambda. Combined charges for all three of those is $32...Oh, yes—algolia, cloudinary, launchdarkly, LogRocket, sentry, segment, http://customer.io—they add up to more than AWS. But overall so much cheaper/faster/better than anything else.
- JoeAltmaier: So I took 2 days, made a chart of every path and pattern of events (restarting a timer from a timer callback; having another time interval expire while processing the previous; restarting while processing and another interval expires; restarted timer expires before completing previous expiration and on and on). Then writing exhaustive code to deal with every case. Then running every degenerate case in test code until it survived for hours. It never had to be addressed again. But it did have to be addressed. So many folks are unwilling to face the music with complexity.
- Joel Hruska: The history of computing is the history of function integration. The very name integrated circuit recalls the long history of improving computer performance by building circuit components closer together. FPUs, CPU caches, memory controllers, GPUs, PCIe lanes, and I/O controllers are just some of the once-separate components that are now commonly integrated on-die. Chiplets fundamentally reverse this trend by breaking once-monolithic chips into separate functional blocks based on how amenable these blocks are to further scaling...The most exciting thing about chiplets, in my opinion, isn’t that they offer a way to keep packing transistors. It’s that they may give companies more latitude to experiment with new materials and engineering processes that will accelerate performance or improve power efficiency without requiring them to deploy these technologies across an entire SoC simultaneously.
- Steven Levy: This encryption function was only part of Diffie’s revolutionary concept, and not necessarily its most important feature. Public key crypto also provided the first effective means of truly authenticating the sender of an electronic message. As Diffie conceived it, the trapdoor works in two directions. Yes, if a sender scrambles a message with someone’s public key, only the intended recipient can read it. But if the process is inverted—if someone scrambles some text with his or her own private key—the resulting ciphertext can be unscrambled only by using the single public key that matches its mate.
- @Carnage4Life: The idea that a bad quarter for Google means they ONLY grew revenues 13% year over year due to #COVID19 is a mind blowing statistic around how much headroom there still was in online advertising until the lockdowns.
- @rakyll: Google Cloud's Go and Node support was originated by a small group of people including me. Today, we were reflecting that there was a time in cloud it was extremely easy to make such huge impact. Today, the scale is immense. I can't easily keep up with minor launches anymore.
- @copyconstruct: An ontology of “stateful” systems ... Or at least my attempt at it. All stateless systems are stateless in no more than 4 ways. All stateful systems are stateful in ... 21 possible ways?
- npunt: Magic Leap made one of the classic mistakes that other before-their-time products make: they tried to create a general purpose product because they didn't have a killer app that could focus their efforts. When you're building a product without a focused use case, you are pulled in a ton of different directions. In AR, this means focusing on fidelity, embodied in high resolution, wide field of view visuals, powerful processing, and compelling input methods.
- goatherders: I ran an AWS practice for a large hosting company for two years. We had a number of situations where our prospect would say "annual spend with your managed services and our expected AWS bill would be $1M. Google is offering us a 2 million dollar credit to choose their cloud. What do you say to that?" "You should take them up on that offer." The incentives available from other cloud providers are MASSIVE if your business has the chance to grow in coming years. They will literally buy your business for years on the bet that at some point they will make it back.
- ram_rar: I worked in a startup that was eventually acquired by Cisco. We had the same dilemma back then. AWS and GCP were great, but also fairly expensive until you get locked in. Oracle's bare metal cloud sweetened the deal so much that it was a no brainer to go with them. We were very heavy on using all open source tech stuff, but didn't rely on any cloud service like S3 etc. So the transition was a no brainer. If your tech stack is not reliant on cloud services like S3 etc, you're better off with a cloud provider who can give you those sweet deals. But you'll need in house expertise to deal with big data.
- Rimantas Ragainis: The first two — fio and sysbench — test scenarios indicated that GCP NVMes are slower than on AWS side.
- Tim Bray~ Event driven and asynchronous application design should be used by almost everyone who is trying to build something big that has good scaling characteristics, because you can't build a super cloud scale application without having some asynchronous buffering in it to deal with load surges. That typically means being message or event driven. Eventing is at the center of everything. It all has to do not so much with the classical paradigm of software that handles requests and looks something up in the database, but with state that comes in from the outside and flows through the system, which is the way to go for large heavily loaded applications. The state of the art is not that well defined. There aren't textbooks that say how to do this.
- pachico: One of our applications receives more than 10m hits a day through Kong, which uses Redis for its rate limit plugin. We put a t3.micro for that and never had any issue. In reality, during our performance tests we got to much higher volumes and it always worked fine.
- throwaway_aws: Throwaway account for obvious reasons. In the past, AWS has used the data from third party hosted services on AWS to build a similar service and in fact start poaching their customers. Source: I used to be at AWS and know the PM & his manager who built a service this way. I was hired on that team.
- 1_person: The CDN will be fronting most of the load, behind that 10 decently specced servers running sanely architected code can scale to millions, if not tens of millions of requests per second. Drop the servers in HA sets of 2-3 nodes across 3-4 regions, anycast your service endpoint from each cluster. The hardest thing to replicate without AWS is the 6-7 figure bills.
- Cloudflare: For a small test page of 15KB, HTTP/3 takes an average of 443ms to load compared to 458ms for HTTP/2. However, once we increase the page size to 1MB that advantage disappears: HTTP/3 is just slightly slower than HTTP/2 on our network today, taking 2.33s to load versus 2.30s.
- @lynncyrin: I feel like writing "production grade" python code is much much harder than doing so for golang
- @lizthegrey: This is exactly why the SRE team at Google mandated "no new Python projects" and shifted all new automation to golang.
- gregdoesit: I joined Uber in 2016, right around when at every conference you'd hear a talk along the lines of "Lessons learned at Uber on scaling to thousands of microservices" [1]. After a year or two, those talks stopped. Why? Turns out, having thousands of microservices is something to flex about, and makes for good conference talks. But the cons start to weigh after a while - and when addressing those cons, you take a step back towards fewer, and bigger services. I predict Monzo will see the same cons in a year or two, and move to a more pragmatic, fewer, better-sized services approach that I've seen at Uber. In 2020, Uber probably has fewer microservices than in 2015. Microservices are fantastic for autonomy. However, that autonomy also comes with drawbacks. Integration testing becomes hard. The root cause of most outages becomes parallel deployments of two services that cause issues. Ownership becomes problematic when a person leaves who owned a microservice that was on the critical path. And that no one else knew about. Oncall load becomes tough: you literally have people own 4-5 microservices that they launched. Small ones, sure, but when they go down, they still cause issues. To make many services work at scale, you need to solve all of these problems. You need to introduce tiering: ensuring the most critical (micro)services have the right amount of monitoring, alerting, proper oncall and strong ownership. Integration testing needs to be solved for critical services - often meaning merging multiple smaller services that relate to each other. Services need to have oncall owners: and a healthy oncall usually needs at least 5-6 engineers in a rotation, making the case for larger services as well.
- Nelson Elhage: For the solutions, a big theme — also cited by several of the people I link to — is moving to empiricism and experiment, instead of abstract reasoning. I think this need additionally explains some of the recent observability movement; as we rely more and more on empirical observation of our systems, we need better and better tools for actually making and analyzing observations and characterizing the empirical behaviors.
- AusIV: With much of the economy moving online, spot instance termination rates have gone up a lot. We've had to switch our autoscaler's SpotAllocationStrategy from the default to "capacity-optimized", then added a few more instance types, and our termination rates seem to have dropped off a bit. The SpotAllocationStrategy setting wasn't a thing when we set up our autoscalers in the first place, or we probably would have done it then.
Useful Stuff:
- Zoom has had to scale from 20 million to 300 million users virtually overnight. What's incredible is that from the outside they've shown little in the way of apparent growing pains. On the inside a lot of craziness must be going on. Sure, they've made some design decisions that made sense as a small spunky startup but don't make a lot of sense as a de facto standard, but that's to be expected. It's not a sign of bad architecture, as many have said. It's just realistically how products evolve, especially when they have to uplift over hours, days, and weeks.
- Everyone wants to know how Zoom works. Here's what we know. There's Here’s How Zoom Provides Industry-Leading Video Capacity. A recent post, A Message to Our Users. A year-old marketing video, How Zoom's Unique Architecture Powers Your Video First UC Future. A 2016 document on Global Infrastructure and Security Guide. A year-old press release, Zoom Expands with Equinix to Future-Proof and Scale Its Video-First, Cloud-Native Architecture. And Zoom CFO explains how the company is grappling with increased demand. And an online Q&A, Ask Eric Anything.
- And there's the usual grasping for reflected glory. Most of Zoom runs on AWS, not Oracle - says AWS: the service has moved a large quantity of real-time video-conferencing traffic to AWS since the pandemic struck, and has also placed a lesser amount of capacity on the Oracle Cloud...CEO Eric Yuan clarified this further, explaining that Zoom historically handled real-time video conferencing traffic in "its own data centers"...Our real-time traffic always stayed inside our own data center[s] for our paid customers...During this pandemic crisis, every day is a new record. Our own existing data center[s] really cannot handle this traffic...This meant that AWS spun up thousands of new servers for Zoom every day...So ultimately, our own data center[s], and primarily Amazon, and also the Oracle cloud, those three together to serve all the unprecedented traffic.
- Zoom sees their architecture as a competitive advantage. Everyone will be using video, so how do we scale to everyone? Competitors trombone traffic through a datacenter, transcode into a normal view for everybody else, and then send a mixed video out to every individual participant. That introduces latency, uses a lot of CPU resources, and it's hard to scale and deploy datacenters to meet increased load.
- Zoom chose SVC (Scalable Video Coding) over AVC. With AVC you send a single stream and that stream has a single bitrate. If you want to send multiple bitrates you have to send multiple streams, which increases bandwidth utilization.
- SVC is a single stream with multiple layers. That allows sending a 1.2 Mbps stream that contains every resolution and bitrate you may need to scale down to, given network conditions. In the past you could only do SVC with an ASIC. Now, thanks to Moore’s law, SVC can be done in software on the desktop.
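- To make the layered idea concrete, here's a minimal sketch of how a receiver (or the routing tier on its behalf) might pick the richest SVC layer that fits the currently estimated downlink bandwidth. This is not Zoom's code; the layer names and bitrates are made-up illustrations.

```python
# Hypothetical SVC layers: (name, resolution, bitrate in kbps).
# The single encoded stream carries all of these; a receiver
# subscribes only to the layers it can afford.
LAYERS = [
    ("base", "180p", 150),
    ("low", "360p", 400),
    ("mid", "720p", 800),
    ("high", "1080p", 1200),
]

def pick_layer(estimated_downlink_kbps: int, headroom: float = 0.8) -> str:
    """Return the richest layer whose bitrate fits the estimated
    downlink bandwidth, leaving some headroom for audio and jitter."""
    budget = estimated_downlink_kbps * headroom
    chosen = LAYERS[0][0]  # always fall back to the base layer
    for name, _resolution, kbps in LAYERS:
        if kbps <= budget:
            chosen = name
    return chosen

print(pick_layer(1000))  # -> "mid": 800 kbps fits within 80% of 1000 kbps
```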
- Zoom created Multimedia Routing to solve the problems traditional vendors have with AVC. Cutting out transcoding can get rid of latency and increase scale. Multimedia routing takes user content into their cloud and when you run into issues they switch a different video stream to you. When you want a different resolution you subscribe to a different layer of that person’s resolution. They don’t transcode or mix anything or form any views. You are literally pulling multiple streams from multiple people directly from routing with zero processing. This is why you see such a great user switching and voice switching experience and low latency.
- Developed application layer QoS that works between the cloud and the client. It detects network conditions. Telemetry data determines which stream is switched to a client. It looks at CPU, jitter, packet loss, etc. The client talks to the cloud and the cloud knows when it doesn't get certain packets back it will switch a different stream down to you. The client can automatically downsize your own send video if in a bad network environment so you're not killing your own downstream bandwidth. They work in tandem to deliver the right audio stream, the right video stream, across the right network, so the user experience is as good as it can be.
- Network aware means trying for the best experience first, which is UDP. If UDP doesn’t work it tries HTTPS. If HTTPS doesn’t work it falls back to HTTP. The client negotiates that. Telemetry shows why the connection was bad. The worst thing you can do is give the user an inconsistent experience. It just works. They really focus on making the system simple and intuitive, which may explain some of the earlier design decisions.
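- A rough sketch of that negotiation order follows. The probe callables and the function name are assumptions for illustration, not Zoom's implementation.

```python
def connect_with_fallback(probe_udp, probe_https, probe_http):
    """Try transports from best to worst and return the first that works.
    Each probe_* argument is a callable returning True if that transport
    is usable from the current network."""
    for name, probe in (("udp", probe_udp),
                        ("https", probe_https),
                        ("http", probe_http)):
        if probe():
            return name  # best available transport wins
    raise ConnectionError("no usable transport")

# Example: UDP blocked by a corporate firewall, HTTPS allowed.
print(connect_with_fallback(lambda: False, lambda: True, lambda: True))  # -> "https"
```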
- Zoom disrupted the market with 40-minute meetings with video and chat. They added free dial-in conferencing. They deliver the best VOIP experience in the market. Competitors' average VOIP adoption is less than 30%; Zoom is at 89%. $3 billion a year is spent on audio conferencing and Zoom gives it away for free. Delivered a software based video conferencing room experience. Delivered one button push for competitors. Gave away digital signage and room displays. Zoom's competitors are sunk in revenue models they can't get out of. They can't innovate because they'll disrupt their revenue model. Zoom disrupted the meetings market, disrupted the audio market, disrupted the rooms market, and now they want to disrupt telephony. Though this was 2019, now with the pandemic that strategy may be being revisited.
- Goal is to create the largest network of connected collaboration. Deliver on the promise of VOIP from twenty years ago. Tear down every pay wall for people to collaborate with each other. Roll out PSTN connectivity. Connect everyone through chat, meetings, and phone, all across IP, at the lowest rate on any network.
- Also Netflix: What Happens When You Press Play?
- kelp: When they say "running a datacenter" they almost certainly mean "buying servers to put into rented colocation space". Just about anyone who has significant network connectivity has a footprint in an Equinix datacenter. In the Bay Area you want to be in Equinix SV1 or SV5, at 11, and 9 Great Oaks, San Jose. If you're there, you can order a cross connect to basically any telco you can imagine, and any other large company. You can also get on the Equinix exchange and connect to many more. But, Equinix charges you a huge premium for this, typically 2 - 3x other providers for space and power. Also they charge about $300 per month per cross connect. So your network backbone tends to have a POP here, and maybe you put some CDN nodes here, but you don't build out significant compute. It's too expensive. On the cheaper, but still high-ish quality end you have companies like CoreSite, and I'm pretty sure AWS has an entire building leased out at the CoreSite Santa Clara campus for portions of us-west-1. (Pretty sure because people are always cagey about this kind of thing.) I also know that Oracle cloud has been well known for taking lots of retail and wholesale datacenter space from the likes of CoreSite and Digital Realty Trust, because it was faster to get to market. This is compared to purpose-built datacenters, which is what the larger players typically do. In the case of AWS, I know they generally do a leaseback, where they contract with another company who owns the building shell, and then AWS brings in all their own equipment. But all these players are also going to have some footprint in various retail datacenters like Equinix and CoreSite for the connectivity, and some extra capacity. Zoom is probably doing a mix of various colocation providers, and just getting the best deal / quality for the given local market they want to have a PoP in. Seems like they are also making Oracle Cloud part of that story.
- How is the internet doing? Very well thank you very much. The web may sometimes falter under load, but that's not an internet problem.
- Here's a huge deck on Global Network Traffic (COVID-19). It contains traffic graphs and data from networks around the world — Internet Service Providers, Internet Exchange Points, etc.
- Internet Usage Measurements in the Time of Corona: All these measurements show that the Internet is resilient enough to keep functioning under the additional usage, even if many markers of the performance have been influenced: download speeds have dropped, and there are more outages.
- From the trenches. Heavy Networking 513: How The Internet Is Handling The Covid-19 Load
- The BT (British Telecom) network is experiencing higher average network traffic, but nowhere near peak.
- Only peak traffic matters. The peak on the BT network was 17 Tbps during a football match.
- During the crisis traffic has grown wider, not higher. Usually the network is quiet during the day and when people get home traffic goes up. Now it's busy all day. Nighttime traffic drops earlier as more people get up earlier than usual. Maybe there's also a little TV fatigue as people get tired of TV.
- At Netflix people are watching throughout the day and traffic is less peaky; it goes up and stays up.
- The average is up, but peaks have not increased. Enterprises don't really generate that much traffic. It's tiny in comparison to consumer style broadband traffic.
- Enterprise traffic on BT is way down, but their at home VPN system is way up.
- A lot more people are using video.
- At Netflix the network is operating great. They've seen large amounts of growth across the world. Netflix builds capacity ahead, so what they ended up doing is pulling in capacity augments that were planned for later this year. Thanks to their acquisition team they already had datacenter space for the servers they needed. They are keeping up. They are talking with ISP partners all the time to make sure they aren't adding to their stress. ISPs have already started building capacity too.
- Netflix is seeing slowdowns in remote hands services because datacenters are short on people.
- From a supply chain perspective there's uncertainty. Netflix knows where they are going to get their next tranche of hardware from, but not sure after that. Much of the international cargo shipping capacity is dedicated to humanitarian efforts. A lot of cargo normally goes on airlines, which aren't flying right now. But nothing is on fire right now, it's a little warm.
- The EU looked at problems in Italy and Spain and played it safe by asking Netflix to reduce video traffic across the EU. Netflix is still serving 4k, it's just a lower bitrate 4k.
- The internet is really a small place. There are a limited number of networks. They talk to each other and were able to react quickly to problems. They were able to work with networks instead of regulators. There was a light touch by the regulators.
- The internet is a fantastic platform. Handling this crisis has proven that. We can finally see what the internet can be. We're at the very early stages of collaboration on the internet.
- Ironically, the PSTN network has busy signals because it hasn't been upgraded.
- Mobile network traffic is down slightly, but mobile traffic is dimensioned around being in cities, and people aren't coming into cities. BT was worried how the network would perform, but it has performed well. People are using wifi calling, Facetime, WhatsApp video, Facebook Portal, etc. which goes over wifi and reduces mobile traffic.
- 10 years ago we would have seen wide scale BGP related outages, but recent changes to clean up routing tables have helped stabilize the network.
- Everyone is in a world they aren't used to, operating a network, while everyone is at home and people are getting sick.
- Capacity is not a zero or one. We've spent decades making the internet resilient to service interruptions and service contention. There won't be a moment where things suddenly break. The average consumer won't notice problems even if there are slow downs.
- We've transitioned to a point where pretty much every industry is reliant on the internet because the internet is reliable and easy to use. Now how can we take this technology one step forward?
- Looking at the stats. Keeping the Internet up and running in times of crisis
- 60%: increase in internet traffic experienced by some operators.
- IXPs report record net increases of up to 60% in total bandwidth handled per country from December to March 2020.
- Individual IXPs have also reached new records of peak traffic. DE-CIX Frankfurt, one of the largest IXPs in the world, is now regularly peaking over 9.1 terabits per second (Tbps) data, which equals a simultaneous transmission of up to 2 million high-definition videos. During the COVID-19 health emergency, the exchange has seen a 120% increase in video conferencing traffic and a 30% increase in online and cloud gaming
- In Korea, operators have reported traffic increases of 13%, reaching 45% to 60% of their deployed capacity. In Japan, NTT Communications reports an increase in data usage of 30% to 40%. In the United Kingdom, BT reports a 35% to 60% increase in daytime weekday fixed broadband usage. Telefónica reports nearly 40% more bandwidth in Spain, with mobile traffic growth of 50% and 25% in voice and data, respectively. In Italy, Telecom Italia has experienced a traffic increase of 63% and 36% in the fixed and mobile network, respectively. In France, Orange reports that its international infrastructure has been in high demand with 80% of the traffic generated by users in France going to the United States, where a good part of the entertainment and content is located.
- In the United States, Verizon reported a 47% increase in use of collaboration tools and a 52% increase of virtual private network traffic. AT&T has seen mobile voice and Wi-Fi call minutes up 33% and 75% respectively, while consumer voice minutes were up 64% on fixed lines: a reversal of previous trends. AT&T also reported that its core network traffic was up 23%. The content and application industries report similar surges.
- Cisco Webex, the most prevalent cloud-based video conferencing application, is peaking at 24 times higher volume.
- According to the New York Times, Facebook experienced increases of 100% on voice calls and 50% on text messaging over its WhatsApp, Facebook Messenger and Instagram platforms, while group calls in Italy increased tenfold.
- I have dozens more links on this subject, but that should cover it. The state of the internet is good. Let's not f*ck it up. I'm looking at you China. China and Huawei propose reinvention of the internet
- 13 Days of GCP is a clever series of shorts on different GCP topics. Short, but you do actually learn something. I especially liked Day 9 Serverless Microservices and Day 3 Mobile Backend Architecture.
- Netflix Images Enhanced With AWS Lambda. Netflix is famous for creating their own sophisticated cloud native infrastructure on top of AWS, so this is a fascinating test case. Lambda or custom? Who will win? After using Lambda in production since the beginning of 2020 Netflix found Lambda was much faster at handling the first request for a cold EC2 cluster and Lambda was 300ms slower on a warm cluster. The biggest difference was the cost. The cost for Lambda was <$100 per day and the EC2 solution cost $1000 per day. Lambda is also much better at handling 15x-20x load volumes.
- Technium Unbound. Kevin Kelly defines technium as a web of inter-dependent technologies.
- Super-organism: a collection of agents which can act in concert to produce phenomena governed by the collective.
- Any complex invention is a network of dependent and self-sustaining technologies. A new idea is just the last little bit added to a network that already exists. That's why simultaneous independent inventions are the norm.
- A cell has thousands of parts and pathways. None of those parts are living, yet the cell is alive.
- As an example he uses the near impossibility of recreating a toaster from scratch. You quickly learn that recreating a toaster depends on an entire web of technologies that are nearly impossible to recreate.
- If you're trying to recreate a cloud, isn't that a lot like trying to recreate a toaster? There's a whole deep network of technologies underlying a cloud and the cost of recreating those technologies is onerous. Just buy a toaster and make some toast...if you're into carbs and wheat that is.
- Internet economics means free can be profitable. I gave away my books for free, and sales increased 4x: I sold, on average, 541 copies of both books per month prior to March. In the months of March and April, that average shot up to 32,450, or a 60x increase in cumulative sales. I was able to give away almost 65,000 copies of my books...I averaged 328 paid book sales per month in all months prior to March, and then sold 1,910 copies in March and April (almost a 4x increase!).
- The talk for this isn't available yet, but I'm looking forward to it. To Microservices and Back Again - Why Segment Went Back to a Monolith. The key here is the monorepo. Having a repo per microservice sounds a little crazy to me. That's a lot of process overhead for very little if any gain. The article's key takeaway: spending a few days or weeks to do more analysis could avoid a situation that takes years to correct.
- The cloud is different and Jeremy is one of those people who is great at showing just how different it is. Everything you need to know about How to fail with Serverless • Jeremy Daly. What do you do when an error happens and your event is lost? It can happen.
- Jeremy's ripple of event loss evil: an event on failure can't be written to a log because the network is down; the function container crashes; the function never runs; function timeout; out-of-memory error; throttling errors.
- The idea is the cloud is better than you at handling errors, so let the cloud handle the error for you by failing up the stack. Don't swallow try-catch errors in your lambda function, fail the function. Return an error back to the invoking service.
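- In handler code that means letting exceptions escape instead of catching, logging, and returning success, so the invoking service sees the failure and the platform's retry machinery can kick in. A minimal sketch; process_record is a hypothetical bit of business logic.

```python
def handler(event, context):
    # Do NOT wrap this in a blanket try/except that logs and returns 200.
    # If process_record throws, let the exception propagate: the invocation
    # is marked as failed and the built-in retry/DLQ machinery takes over.
    for record in event.get("Records", []):
        process_record(record)
    return {"status": "ok"}

def process_record(record):
    # Hypothetical business logic that fails loudly on bad input.
    if "body" not in record:
        raise ValueError("malformed record")
```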
- Use the built-in retry mechanism to retry events, and if that fails use a dead-letter queue (DLQ) to make sure errors are captured somewhere instead of lost.
- Jeremy favors a lambda-native architecture where you compose your system out of little functions that execute a single piece of business logic, instead of larger aggregates like the Mighty Lambdalith or the Fat Lambda.
- The advantage of the lambda-native approach is: it can be invoked synchronously or asynchronously; failures are total and can be passed back to the invoker; can be reused as parts of workflows like Step Functions; can use least privilege IAM roles.
- I don't think it matters because even with larger aggregates you need a way to only run a selected function and as long as you don't handle partial failures in the code then you can make sure errors are pushed back up the stack.
- Most cloud services guarantee at least once delivery, which means you have to handle duplicate events by making functions idempotent.
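- One common way to get idempotency is a conditional write keyed on a unique event id, so a redelivered event becomes a no-op. A sketch using boto3; the table name and the idea of an event_id field are assumptions.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-events")  # assumed table

def handle_once(event_id: str, do_work) -> bool:
    """Run do_work() only if this event_id hasn't been processed before."""
    try:
        table.put_item(
            Item={"pk": event_id},
            ConditionExpression="attribute_not_exists(pk)",  # reject duplicates
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery, work was already done
        raise
    do_work()
    return True
```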
- DLQs mean you can inspect events, alarm on them, and even retry from there. SNS, SQS, and Lambda support DLQs.
- Four ways to invoke Lambda functions: synchronously, asynchronously via an event, from a stream, and using a polling service. For synchronous invocations you need to retry yourself. For async invocations Lambda now has built-in retry behaviour and DLQs. You can also use SQS, which has retries built-in. Lambda also has Destinations, so a destination can be triggered on failure. For stream processing there's a bisect batch option that filters out poison pill messages.
- For async invocations using Lambda Destinations you can route a function's result to a handler based on success or failure. Destinations should be favored over DLQs because they capture the invocation context and not just the payload. A destination can be SQS, SNS, another Lambda function, or EventBridge.
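- Destinations hang off the function's async invoke configuration. A sketch with boto3; the function name and queue ARNs are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Retry failed async invocations twice, then route the failure (with its
# full request context) to an SQS queue; route successes elsewhere.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-function",  # placeholder
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:failed-events"},
        "OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:123456789012:succeeded-events"},
    },
)
```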
- Throttling. Use SQS to throttle events through the system. SQS invokes a Lambda function subject to the Lambda concurrency limit, and if there's a failure it will retry and eventually deposit the event in a DLQ. Tricky to configure but powerful.
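- The moving parts are an SQS event source mapping plus a concurrency cap on the consuming function. A boto3 sketch with placeholder names; the DLQ itself would be configured as a redrive policy on the queue.

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the consumer so a flood of queued events can't overwhelm downstream systems.
lambda_client.put_function_concurrency(
    FunctionName="order-processor",  # placeholder
    ReservedConcurrentExecutions=10,
)

# Wire the queue to the function; messages that keep failing end up in the
# DLQ named by the queue's redrive policy.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders",  # placeholder
    FunctionName="order-processor",
    BatchSize=10,
)
```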
- This part of the talk went fast, so I'm not sure I got it right. Circuit breaker. Introduce a status check that detects if a failure count is over a certain threshold and then refuses to move a request downstream. This can protect an API that's under stress.
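- If I have that right, the core is just a failure counter with a threshold consulted before calling downstream. A toy sketch; the threshold, cooldown, and in-memory counter are my assumptions, and in a Lambda fleet the counter would have to live in shared storage such as DynamoDB or ElastiCache.

```python
import time

class CircuitBreaker:
    """Open the circuit after too many failures; refuse calls while open."""

    def __init__(self, threshold=5, cooldown_seconds=30):
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open, request refused")  # protect downstream
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()  # stop sending traffic downstream
            raise
        self.failures = 0  # a success resets the breaker
        self.opened_at = None
        return result
```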
- Key takeaways: be prepared for failure because everything fails all the time; utilize the built in retry mechanisms of the cloud because they are better than you can write; understand failure modes to protect against data loss; buffer and throttle events to distributed systems so you don't overwhelm downstream systems; embrace async processes to decouple components.
- Also, Your Worst-case Serverless Scenario Part I: Invocation Hell.
- Videos from the Failover Conf 2020 are now available. Lots of good content.
- Nice writeup. Postmortem of a Scaleway related not-too-critical downtime. When moving to a different instance type make sure you have one on reserve in case you have to fail back. Make sure your backups include everything, even the secret stuff. Remember, nobody cares about backups, what people care about are restores.
- Damn, that's quick progress. If only Google was as good at the platform stuff. This should be everywhere and in everything. An All-Neural On-Device Speech Recognizer (paper): Today, we're [Google] happy to announce the rollout of an end-to-end, all-neural, on-device speech recognizer to power speech input in Gboard. This means no more network latency or spottiness — the new recognizer is always available, even when you are offline...Model quantization delivered a 4x compression with respect to the trained floating point models and a 4x speedup at run-time, enabling our RNN-T to run faster than real time speech on a single core. After compression, the final model is 80MB.
- Server side rendering may give way to edge side rendering. But there's a lot of special casing going on that seems a bit tricky. Using lambda@Edge for Server-Side Rendering:
- Thanks to AWS Lambda@Edge, we were able to create a lambda function associated with each CloudFront edge that hosts the static content to handle the SSR...So we used this feature to differentiate between a page request that requires SSR and an asset request that goes directly to S3...We implemented the server WebPack config to use the generated client-build from React scripts. And there's a lot more fiddly stuff to make it all work (a rough sketch of the routing idea follows below).
- The result: In the end, we were pleased with the performance results of the Lambda@Edge and its high availability.
- There's a graph that shows the response time in Paris at 88ms and 44ms in Frankfurt. Average of 168ms with a p95 of 449ms and a p99 of 671ms. Which is great.
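- The routing idea boils down to an origin-request handler that looks at the URI: asset-looking paths fall through to S3, everything else gets rendered. A hedged sketch in Python (Lambda@Edge also supports Python runtimes); render_page and the asset extension list are assumptions based on the article's description, not their actual code.

```python
ASSET_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".svg", ".ico", ".json")

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]

    # Asset requests pass through to the S3 origin untouched.
    if request["uri"].endswith(ASSET_EXTENSIONS):
        return request

    # Page requests are server-side rendered at the edge instead.
    html = render_page(request["uri"])  # hypothetical call into the React server build
    return {
        "status": "200",
        "statusDescription": "OK",
        "headers": {"content-type": [{"key": "Content-Type", "value": "text/html"}]},
        "body": html,
    }

def render_page(uri: str) -> str:
    # Stand-in for the real renderer.
    return f"<html><body>Rendered {uri}</body></html>"
```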
- Also Evaluating new languages for Compute@Edge, Platforms for Serverless at Edge: A comparison
- How do you go about building a 10,000 node cluster? Fix one bug at a time. It turns out basing a design on a solid theory only gets you so far, and even then it may not be enough. Great story. Sad ending. 10000 nodes and beyond with Akka Cluster and Rapid.
- One of the motivations for Rapid was to be faster at scale than existing consensus protocols. The Rapid paper shows that Rapid can form a 2000 node cluster 2-5.8 times faster than Memberlist (Hashicorp’s Lifeguard implementation) and Zookeeper.
- One of the problems with membership services that rely entirely on random gossip is that random gossip leads to higher tail latencies for convergence. This is because messages in larger clusters can go round and round and bump into conflicting versions, especially in the presence of unstable nodes.
- Another one of the big contrasts that in my opinion sets Rapid ahead of other membership protocols is the strong stability provided by the multi-node cut detector. A flaky node can be quite problematic in protocols where the failure detection mechanism doesn’t provide this type of confidence.
- AWS autoscaling groups are expensive! Cloudwatch costs are almost as high as EC2 costs, and they’re not much faster than doing this via RunInstances. Back to launching EC2 instances directly.
- Dissemination of membership information in Rapid happens via broadcast. In the version of Rapid used in the paper, the broadcast is based on gossip that takes advantage of the expander graph topology according to which nodes are organized in Rapid. These types of graphs have strong connectivity properties and it is possible to know how many times a gossip message needs to be relayed before it has traveled the entire graph.
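- The back-of-the-envelope intuition: if each relay round reaches a fixed fanout of new nodes, the number of rounds needed to cover the cluster grows only logarithmically with cluster size. A toy illustration with made-up numbers, not Rapid's actual parameters.

```python
import math

def broadcast_rounds(n_nodes: int, fanout: int) -> int:
    """Rough lower bound on relay rounds for a message to reach every node
    when each round multiplies coverage by `fanout`, as in a well-connected
    expander-style topology."""
    return math.ceil(math.log(n_nodes) / math.log(fanout))

print(broadcast_rounds(10_000, 8))  # -> 5 rounds for a 10,000 node cluster
```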
- How YouTube Stars Are Getting Paid During the Pandemic. Down from $11,000 a month to around $4,800. Anything depending on advertising has taken a hit during Covid-19. Sponsorship deals are more stable sources of income.
- Interesting biological model for malware protection. Rethinking Our Assumptions During the COVID-19 Crisis: How does our immune system mount a successful attack against a pathogen it has never seen before? It generates random variation in its antibodies, enabled by a process called somatic hypermutation. Since it's random, the immune system can end up attacking itself. To counter that it developed a method of negative selection, or clonal deletion, whereby any element of the immune system that attacks self is tagged and deleted. What you are left with is the negative complement of the self, which is everything that might infect you. You generate random responses trying to cover as much of the space as possible, take out anything that attacks you, and what remains is hopefully an adaptive response to a novel infection. It's important because you are anonymizing yourself. You are not signaling what you are. The self is not present in that set, only the infection. For a computer virus you don't know in advance which bit of code the virus is going to use or what it's going to attack. Applying the same principles, you generate a random set of possible interventions and eliminate any that attack your own code, leaving only the responses that attack foreign code. This approach anonymizes the self, removing it from the picture, so that only non-self gets targeted.
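- The scheme above is essentially a negative selection algorithm. Here's a toy sketch of the idea; the tiny alphabet and k-gram matching are purely illustrative, and a real system would sample detectors randomly rather than enumerating them.

```python
# Toy negative selection: build detectors that match anything *except* self,
# then flag data that any surviving detector matches.
import itertools

ALPHABET = "abcdef"   # tiny alphabet so the detector space is enumerable
K = 3                 # detector length

def kgrams(s):
    return {s[i:i + K] for i in range(len(s) - K + 1)}

# "Self": the code we want to protect, reduced to the patterns it contains.
self_code = "abcdefabcdef" * 3
self_patterns = kgrams(self_code)

# Clonal deletion: delete every candidate detector that reacts to self.
# What survives is the negative complement of self.
all_detectors = {"".join(p) for p in itertools.product(ALPHABET, repeat=K)}
detectors = all_detectors - self_patterns

def is_foreign(data):
    # Anything a surviving detector matches is, by construction, not self.
    return bool(kgrams(data) & detectors)

print(is_foreign("abcdefabc"))     # False: looks like our own code
print(is_foreign("feedfacecafe"))  # True: contains patterns self never produces
```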
- What's Next in Gaming. John Riccitiello, CEO of the game software development company Unity Technologies.
- A big difference between traditional sports and e-sports is that in an e-sport the game is owned by one single property. The NFL, NBA, and MLB are owned by a set of teams that collaborate, compete, and cooperate. The economics of e-sports is different because Riot, Tencent, EA, etc. completely control everything. It's not an unfettered competitive marketplace. It's controlled by individual capital that thinks of e-sports as a marketing program.
- Individual gamers on YouTube have over 100 million subscribers. A good fraction of Netflix's user base. There's something about gaming that monetizes better than Hollywood.
- The app store model where you pay to download something is in conflict with the multi-platform model that has developed. The music industry has transformed into a Spotify model, labels have lost a lot of value, and the consumer relationship is with the platform companies. Netflix, Disney, Amazon and Apple are spending a lot of money generating content hoping to create a direct relationship to the consumer and disintermediate the producer of the content. In music and film content is King, but distribution is God. There's a notion that the distribution layer is more valuable. A lot of people believe the game industry will end up with a similar model over time. Will someone invest enough to become the Netflix of games?
- It's unlikely games can be streamed the same way music and video are streamed because you don't know what's coming. For a lot of game types the consumer notices high latencies, so games can't be streamed as easily as in the music and video industries: you can't compress them to the same degree, and latency is a controlling factor. We won't see one big streaming solution; in gaming we'll see different solutions depending on the content type, and they will live behind a brand like Google or Microsoft or Netflix. The gaming industry will organize differently around the distribution mechanism because of performance issues, cost issues, and because some things simply don't work. We'll see edge of network, central cloud, on device; a blend of things will come together.
- Our digital lives in games will become more important and more valid than our meat lives. Increasingly people are living in that space.
- Industries will shift to real-time 3D for content creation and dissemination. Why evaluate a car with 2D pictures instead of immersing yourself in a 3D world? On a construction site you can see a 3D world in real-time which radically reduces communication overhead. It's a cloud service that is a better, faster, cheaper way of creating and distributing content.
- In 3 years you'll be able to freeze a frame on Netflix and step into the scene. You can mix real and non-real things in ways that defy your imagination.
- Obviously, Unity is positioning itself as having the tools to make this real-time 3D world a reality.
- Adrian Cockcroft on Why are services slow sometimes? The secret is to run barefoot and train at high altitudes.
- Be careful to measure Response Time at the user, and Service Time at the service itself.
- Waiting in a Queue is the main reason why Response Time increases.
- Throughput is the number of completed successful requests. Look to see if Arrival Rate is different, and to be sure you know what is actually being measured.
- For infrequent requests, pick a time period that averages at least 20 requests together to get useful measurements.
- For spiky workloads, use high resolution one-second average measurements.
- For each step in the processing of a request, record or estimate the Concurrency being used to process it.
- Little's Law: Average Queue = Average Throughput * Average Residence Time. (A small simulation after this list makes this and the think-time points concrete.)
- Constant rate loop tests don't generate queues, they simulate a conveyor belt.
- The proper random think time calculation is needed to generate more realistic queues.
- Truly real world workloads are more bursty and will have higher queues and long tail response times than common load test tools generate by default.
- Plan to keep network utilization below 50% for good latency.
- Inflation of average residence time as utilization increases is reduced in multi-processor systems but "hits the wall" harder.
- Rule of thumb is to keep inflation of average residence time below 2-3x throughout the system to maintain a good average user visible response time.
- The problem with running a multiprocessor system at high average utilization is that small changes in workload levels have increasingly large impacts.
- Load Average doesn't measure load, and isn't an average. Best ignored.
- Systems that hit a sustained average utilization of 100% will become unresponsive, and build up large queues of work.
- Timeouts should never be set the same across a system, they need to be much longer at the front end edge, and much shorter deep inside the system.
- Think about how to shed incoming work if the system reaches 100% average utilization, and what configuration limits you should set.
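- A quick simulation of the Little's Law and think-time points above (a sketch with illustrative numbers, not from the post): at 80% utilization a constant-rate load test sees essentially no queueing, while random arrivals at the same average rate roughly triple the residence time, and Little's Law recovers the average number of requests in the system.

```python
# Single FIFO server: compare constant-rate arrivals with random (Poisson)
# arrivals at the same average rate, then apply Little's Law to the result.
import random

def simulate(inter_arrival, service_time=0.08, n=50_000):
    """Returns the mean residence (wait + service) time per request."""
    clock = free_at = 0.0
    total_residence = 0.0
    for _ in range(n):
        clock += inter_arrival()        # next request arrives
        start = max(clock, free_at)     # it waits if the server is busy
        free_at = start + service_time
        total_residence += free_at - clock
    return total_residence / n

rate = 10.0   # 10 requests/sec against an 80 ms service time = 80% utilization
constant = simulate(lambda: 1 / rate)
poisson  = simulate(lambda: random.expovariate(rate))

print(f"constant-rate arrivals: mean residence ~{constant * 1000:.0f} ms")    # ~= service time
print(f"random (Poisson) arrivals: mean residence ~{poisson * 1000:.0f} ms")  # several times higher
print(f"Little's Law: ~{rate * poisson:.1f} requests in the system on average")
```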
- Lesson learned. If you're going to cheat be smart about it. Spread it out. Don't win too much too fast. Oh, and there's almost always a cheat path somewhere in a system. Finding a Video Poker Bug Made These Guys Rich—Then Vegas Made Them Pay: but the new Game King code had one feature that wasn't in the brochure—a series of subtle errors in program number G0001640 that evaded laboratory testing and source code review...The key to the glitch was that under just the right circumstances, you could switch denomination levels retroactively. That meant you could play at 1 cent per credit for hours, losing pocket change, until you finally got a good hand—like four aces or a royal flush. Then you could change to 50 cents a credit and fool the machine into re-awarding your payout at the new, higher denomination...But after seven hours rooted to their seats, Kane and Nestor boiled it down to a step-by-step recipe that would work every time...The “Double Up bug” lurking in the software of Game King video poker machines survived undetected for nearly seven years, in part because the steps to reproduce it were so complex.
- What we call simplification is often just shifting responsibility to someone else. Complexity Has to Live Somewhere: A common trap we have in software design comes from focusing on how "simple" we find it to read and interpret a given piece of code. Focusing on simplicity is fraught with peril because complexity can't be removed: it can just be shifted around. If you move it out of your code, where does it go?...With nowhere to go, it has to roam everywhere in your system, both in your code and in people's heads. And as people shift around and leave, our understanding of it erodes. Complexity has to live somewhere. If you embrace it, give it the place it deserves, design your system and organisation knowing it exists, and focus on adapting, it might just become a strength.
- Lesson learned. Anonymous means stay anonymous. Beware of social engineering attacks. The Confessions of Marcus Hutchins, the Hacker Who Saved the Internet: So Vinny asked for Hutchins' address—and his date of birth. He wanted to send him a birthday present, he said. Hutchins, in a moment he would come to regret, supplied both.
- Optimizing the HTTP/2 tail: high packet-loss scenarios became the design goal of HTTP/3. This is the unfinished work of HTTP/2. How HTTP/3 and QUIC aim to help the connections that need it most: Using HTTP/3 and QUIC to target packet loss performance isn’t a scenario where a “rising tide lifts all boats.” While it is targeted at the worst performing connections, HTTP/3 might not significantly affect performance for the average connection — or even affect it at all. This is especially true if the comparison HTTP/2 session has almost no packet loss and is already using best practice recommendations for TLS and TCP configurations...Focusing effort on protocols that create even a marginal improvement for the tail of connections may mean the difference between a person using video on a conference call, supporting HD video, or even, in the case of managing audio dropouts, making a workable phone call.
- Scalability Testing of a Production Kubernetes Cluster: K8s itself is quite robust. We could scale up to 4x the number of nodes and workloads of our current production cluster. But user experience starts to degrade without performance tuning.
- Great example of how AWS hooks you in. Scaling Our AWS Infrastructure. Start with a simple lift and shift and pretty soon you're in the AWS crack den mainlining services right and left.
- Few things are more satisfying than removing a layer. How we [LinkedIn] reduced latency and cost-to-serve by merging two systems: In this blog post, we’ll share how we merged two layers of the identity services that handle more than half a million queries per second (QPS) that drove a 10% reduction in latency and reduced our annual cost-to-serve significantly.... In summary, the merger improved p50, p90, and p99 by 14%, 6.9%, and 9.6%, respectively...The memory allocation rate in the identity midtier after the merger is about 100MB/s (or 28.6%) lower per host than would have been required to serve the same amount of traffic before the merger...After decommissioning the entire data service cluster, the physical resources that we saved added up to over 12,000 cores and over 13,000 GB of memory, which translated into significant annual savings.
- Sharding is not dead. It's nice to see an architecture where AWS services aren't nearly every block on the diagram. But this is a lot of work to make work. MySQL sharding at Quora
- 10 Lessons Learned from Start-ups on their AWS Journey: in order to meet the demands of the financial institutions, we recommended a multi-account structure; use AWS Compute Optimizer to determine the provisioning level for their instances. We then reviewed the EC2 dashboard and looked at the cost analysis in cost explorer. We reduced instance sizes where possible and then implemented a savings plan; Consolidated Billing reduced the bill by 1% by combining storage tiers across multiple accounts and also simplified the billing process; We created an S3 endpoint in the VPC, which reduced latency, removed the cost of data transfer, and improved this start-up’s security. The second item was more severe than the first in terms of impact on cost. By correctly configuring the S3 lifecycle policy, we reduced the client’s AWS bill by ~$40k per year; We implemented Blue-Green deployments which reduced the application downtime by 30-60 minutes per month, which translated to $180-360k in downtime savings per month; We implemented a pre-signed URL that provided the content to the customers but ensured a higher level of security for the files.
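- For the last point, a pre-signed URL is close to a one-liner with boto3. A minimal sketch, not the article's code; the bucket and key names are made up:

```python
# Serve private S3 content to customers without making the bucket public:
# the URL embeds a signature that grants time-limited access to one object.
import boto3

s3 = boto3.client("s3")

def content_url(key, expires_in=900):
    # 15 minutes of access to an otherwise private object.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-customer-content", "Key": key},
        ExpiresIn=expires_in,
    )

print(content_url("reports/2020-05/statement.pdf"))
```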
- Ever dream of living off the grid, your will matched against nature? Here's what that life could have looked like—if you were a better person. Of course, I'm talking for myself. 50 years off-grid: architect-maker paradise amid NorCal redwoods. Some great wisdom quotes:
- The limitations aren't necessarily bad; they actually inspire creation.
- When you go from concept to completion without external elements there's a personal sort of expression that you will not get if you have to have other people involved.
- I don't design, I follow logic.
- Simplicity is difficult. It's easy to make things complex.
- Human beings go to extremes in whatever they do.
- Rather than be in a commune with a hundred people I have backhoes and D7 cats.
- It was built to be a temporary building. It's lasted 51 years.
- We've had a number of events pivot to an online format. Not an easy task, and some were more successful than others. We know a strategy tax is bad, but a partnership tax can be even worse.
- The NFL is a very successful example. AWS partners with the NFL for the first-ever remote NFL Draft: the NFL sent over 150 smart phones to top prospects, coaches, teams, and personnel with instructions on how to set them up and keep them running during the Draft. AWS is hosting the "always-on" streaming with capacity to support the additional demand happening all at once—making sure that when ESPN and NFL Network cut to the live feed of any of the 150-plus available shots, there are no interruptions.
- American Idol sent contestants a package of equipment. Contestants used iPhones to capture the video and send it back for rebroadcasting. The result was excellent. The picture was clear. Sound was good. And people were obviously comfortable using the iPhone.
- The Voice used Microsoft Teams and Surface Pro 7. The result was horrible. The picture and sound were bad, the whole thing looked unprofessional. Maybe pick your stack based on the result, not who will pay you the most money?
- Yep, serverless is great for embarrassingly parallel intermittent jobs. How Serverless Saved Us For $2: You could setup a cron job on one of your servers and it will take care of launching a command to generate the PDF. The main problem: it doesn’t scale. ~20 seconds (pdf generation) x 2,000 PDF = 11 hours of PDF generation...Up to 3,000 lambda instances are started. Each lambda receives one message containing the data to generate 1 PDF... it's stored in a S3 bucket...All the 2,000 PDFs are generated within 2 minutes.
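- The pattern is simple enough to sketch. This is my own illustration of the fan-out shape, not the article's code; render_pdf, the queue URL, and the bucket name are made up:

```python
# Fan-out PDF generation: enqueue one SQS message per PDF, and let Lambda
# scale out to drain the queue in parallel instead of grinding through
# thousands of PDFs sequentially on a cron box.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pdf-jobs"  # made up

def enqueue_jobs(invoices):
    # Producer (e.g., a scheduled job): one message per PDF, sent in batches of 10.
    for batch_start in range(0, len(invoices), 10):
        entries = [
            {"Id": str(i), "MessageBody": json.dumps(inv)}
            for i, inv in enumerate(invoices[batch_start:batch_start + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

def handler(event, context):
    # Lambda consumer, triggered by SQS: each record is one PDF's worth of data.
    for record in event["Records"]:
        invoice = json.loads(record["body"])
        pdf_bytes = render_pdf(invoice)  # hypothetical ~20s rendering step
        s3.put_object(
            Bucket="example-generated-pdfs",
            Key=f"{invoice['id']}.pdf",
            Body=pdf_bytes,
        )
```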
- I wouldn't say give up, but like the Hegelian dialectic, people will never understand databases. 17 Things I Wished More Developers Knew About Databases: You are lucky if 99.999% of the time network is not a problem; ACID has many meanings; There are anomalies other than dirty reads and data loss; My database and I don’t always agree on ordering; Significant database growth introduces unpredictability...
- Real World Serverless with theburningmonk. Got acquired for $400 million. ~2 people in a year delivered 25 microservices and 170 lambda functions. The cloud makes it easier to go multi-region in the case of an acquisition. If you start out in one region and your new customer base is in another then you can move that functionality over. For example, latency went from 1.5 seconds to 150ms in a multi-region setup. Used CI to deliver to four regions in parallel. Needed to go to four regions because the data was required to stay in country. They chose Apollo over AppSync because AppSync wasn't in all regions and it only worked with Cognito. Apollo also has a cool feature called Apollo Federation that allows you to federate your graph across microservices. Clients see a single GraphQL endpoint but it stitches all the remote schemas together. From a client perspective it feels like one graph. The Assessments API would own the assessments part of the graph and the Challenges API would own the challenges part of the graph and they are stitched together in one public endpoint.
- Is multi-cloud really the dominant strategy? That just doesn't sound quite right. Flexera 2020 State of the Cloud Report: 93 percent of enterprises have a multi-cloud strategy; 87 percent have a hybrid cloud strategy; Respondents use an average of 2.2 public and 2.2 private clouds; 20 percent of enterprises spend more than $12 million per year on public clouds; More than 50 percent of enterprise workloads and data are expected to be in a public cloud within 12 months; 73 percent of organizations plan to optimize existing use of cloud (cost savings); Azure is narrowing the gap with AWS in both the percentage of enterprises using it and the number of virtual machines (VMs) enterprises are running in it; 40 percent of enterprise AWS users spend at least $1.2 million annually versus 36 percent for Azure; Organizations are over budget for cloud spend by an average of 23 percent and expect cloud spend to increase by 47 percent next year; 65 percent of organizations are using Docker for containers, and 58 percent use Kubernetes.
- Behind the Scenes at PlanetScale Engineering: Why is Multi-Cloud a Hard Problem?: Why use true multi-cloud clusters? Two reasons: Disaster recovery and freedom from vendor lock in. dkhenry: Here is what we found, previously when people talked about going to the cloud the state of the art was everything is done targeting a specific cloud provider, you put your software in immutable AMI's and you use ASG, and ELB, along with S3 and EBS to have really robust systems. You instrument everything with CloudWatch and make sure everything is locked down with IAM and security groups. What we have seen lately is that because of Kubernetes that has all changed. Most systems being designed today are being done very much provider agnostic, and the only time you want to be locked into a specific technology is when the vendor provided solution doesn't really have an alternative in a truly vendor agnostic stack. Part of what this service is doing is taking the last true bit of Gravity for a cloud provider and removing it, you can now run in both clouds just as easily as if you were all in on one of them. There are some additional costs if you are transferring all your data across the wire, but that is where the power of Vitess's sharding comes in. You can run your service across two clouds, while minimizing the amount of cross talk, until you want to migrate off.
- Wide distribution is another form of resilience. Cordkillers 309 - Star Wars: What-If? Interesting idea that Netflix is robust against the pandemic because it's global. Netflix shoots in every market because they have local markets. Now that there's high quality production in multiple parts of the world, when countries come back online at different times, they can go into production while others are down. Disney can't do that.
- Good example of a glue layer that uses events and state machines to coordinate between systems. But man is that a complex workflow. Building an automated knowledge repo with Amazon EventBridge and Zendesk. Also, Decoupling larger applications with Amazon EventBridge
- Learn How Dubsmash Powers Millions of Users with Stream: Dubsmash uses a hybrid approach to hosting their application. Because the app is fundamentally a mobile application for iOS and Android devices, only the backend, and backend related services need to be hosted. For this, Dubsmash relies heavily on Heroku’s Container Service – all Docker images are stored in Quay, allowing for near real-time rollbacks in the event that something goes wrong. Heroku allows Dubsmash to scale both horizontally and vertically in an infinite way without having to worry about employing a large number of backend infrastructure engineers. Simply build, push, and deploy. Aside from Heroku, Dubsmash makes heavy use of AWS as their primary service provider with GCP coming in second due to their robust cloud data warehouse – BigQuery.
- Is something that runs in a single container really a microservice? How does Monzo keep 1,600 microservices spinning? Go, clean code, and a strong team: Each Monzo microservice runs in a Docker container. "One of our biggest decisions was our approach to writing microservices," said Patel. There is a shared core library, which is available in every service; this is essentially copied in every container, though the build process will strip out unused code. This means that "engineers are not rewriting core abstractions like marshalling of data". It also enables metrics for every service so that after deployment it immediately shows up in a dashboard with analysis of CPU usage, network calls and so on. Automated alerting will identify degraded services. A lot of thought goes into the interface or API that each service exposes. The team favours writing many small services, each dedicated to a single purpose, rather than fewer more complex services.
- Why Serverless Apps Fail and How to Design Resilient Architectures: Throughput and concurrency limitations; Increased latency; Timeout errors. Instead of sending requests directly from API Endpoint 1 to the Lambda function 1, we first store all requests in a highly-scalable SQS queue. The API can immediately return a 200 message to clients. The Lambda function 1 will later pull messages from the queue in a rate that is manageable for its own concurrency limits and the RDS instance capabilities.
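- A minimal sketch of that buffering pattern, with made-up names (the worker side is the usual SQS-triggered Lambda, capped by its reserved concurrency):

```python
# The API handler only enqueues the request and acknowledges immediately,
# so traffic spikes land in SQS instead of overwhelming the worker Lambda
# or the RDS instance behind it.
import json
import uuid
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-work"  # illustrative

def api_handler(event, context):
    request_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"id": request_id, "payload": event.get("body")}),
    )
    # Client gets an immediate acknowledgement; the worker drains the queue
    # later at a rate it (and the database) can actually sustain.
    return {"statusCode": 200, "body": json.dumps({"accepted": request_id})}
```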
- Amen. Lesson Learned: No matter how small the change is, validate changes in a test environment first. This requires using a thorough test script for the changes. Loss of Automatic Generation Control During Routine Update.
- In 2018, we hit a point where deploying as fast as possible was hurting the stability of our product. Deploys at Slack: Every pull request at Slack requires a code review and all tests to pass. Once those conditions are met, an engineer can merge their code into master. However, merged code is only deployed during North America business hours to make sure we are fully staffed for any unexpected problems. Every day, we do about 12 scheduled deploys. During each deploy, an engineer is designated as the deploy commander in charge of rolling out the new build to production. This is a multistep process that ensures builds are rolled out slowly so that we can detect errors before they affect everyone. These builds can be rolled back if there is a spike in errors and easily hotfixed if we detect a problem after release.
- 500MB of Memory Saved Us ~60% in our DynamoDB Bill: with just a few days of work, using a simple in-memory cache, we were able to reduce its price tag by around 60%...Using said monitoring tools, we were able to identify that the write throughput of the index tables was very high...Combining this knowledge with the AWS cost calculator, we assumed those writes were the main reason our DynamoDB bill sky-rocketed...We decided to add an in-memory write-through cache in front of each index table...With just 250MB given to the cache, we reached an amazing cache hit-rate of around 75%, and over time it climbed up to 80%...Our DynamoDB write throughput dropped by more than 80%...And the crown jewel — our DynamoDB bill started to shrink after 2 years in which it had just climbed unstoppably.
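- The gist of the fix is easy to sketch. This is an illustration of a write-through cache that skips repeated identical writes, not the authors' code; the table name, key layout, and sizing are made up:

```python
# Write-through cache in front of a DynamoDB index table: remember what was
# last written for each key and skip the write when nothing changed, since
# repeated identical writes were what drove the bill.
from collections import OrderedDict
import boto3

table = boto3.resource("dynamodb").Table("example-index-table")

class WriteThroughCache:
    def __init__(self, max_items=250_000):          # sized toward a ~250MB budget
        self.items = OrderedDict()
        self.max_items = max_items

    def put(self, key, value):
        if self.items.get(key) == value:
            return False                             # cache hit: skip the write
        table.put_item(Item={"pk": key, **value})    # write-through on a miss
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)           # evict the oldest entry
        return True
```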
- How we reduced our Google Maps API cost by 94%: This approach made the API calls independent of the number of vehicles and dependent only on our stops, which helped us in scaling up our fleet with no additional cost. However, the cost here is dependent on the number of stops in the network. After multiple iterations of tuning and tailoring for our use case we observed at least 94% reduction in our cost for Google Directions API.
- Software is growing, but engineering not so much. State of Software Engineering in 2020:
- Software engineering has seen explosive growth over the last 20 years, and it seems to be keeping that momentum up. According to Fortune data, total revenue of the top 15 technology companies in the world was a record 1.67 trillion US dollars in 2019, up 2% from 2018. There are more software companies than ever now.
- 10M new developers joined GitHub in 2019. These new developers contributed to 44M+ repositories from all countries around the world. 80% of all code commits were made from outside the US.
- On average, each open source project on GitHub had contributors from 41 different countries and regions.
- It is no shock that the programming language that is powering most of Web, JavaScript, is still number one.
- I always wondered how Cloudflare made updates. They handle fourteen million HTTP requests per second, pushing changes within seconds to 200 cities in 90 countries for over 26 million customers. The key to this architecture is Cloudflare understands their traffic patterns. That means they could build a custom system that matched their needs perfectly. Introducing Quicksilver: Configuration Distribution at Internet Scale.
- The first part of the article is why Kyoto Tycoon (KT) no longer works for them.
- They selected LMDB, primarily because it allows snapshotting with little read degradation. They service tens of millions of reads per second across thousands of machines, but only change values relatively infrequently.
- Using LMDB, the DNS service saw the 99th percentile of reads drop by two orders of magnitude.
- LMDB also allows multiple processes to concurrently access the same datastore. This is very useful for implementing zero downtime upgrades for Quicksilver.
- LMDB is also append-only, meaning it only writes new data and doesn't overwrite existing data. Beyond that, nothing is ever written to disk in a state which could be considered corrupted. This makes it crash-proof: after any termination it can immediately be restarted without issue.
- LMDB does a great job of allowing us to query Quicksilver from each of our edge servers, but it alone doesn’t give us a distributed database. We also needed to develop a way to distribute the changes made to customer configurations into the thousands of instances of LMDB we now have around the world. We quickly settled on a fan-out type distribution where nodes would query master-nodes, who would in turn query top-masters, for the latest updates.
- When users make changes to their Cloudflare configuration it is critical that they propagate accurately whatever the condition of the network. To ensure this, we used one of the oldest tricks in the book and included a monotonically increasing sequence number in our Quicksilver protocol. It is now easily possible to detect whether an update was lost, by comparing the sequence number and making sure it is exactly one higher than the last message we have seen. The astute reader will notice that this is simply a log. (A toy version of the gap check is sketched below.)
- LMDB stability has been exceptional. It has been running in production for over three years, with only a single bug and zero data corruption, while serving over 2.5 trillion read requests and 30 million write requests a day on over 90,000 database instances across thousands of servers.
- Also, How databases scale writes: The power of the log
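- The sequence-number trick is worth internalizing. A toy version of the gap check, not Quicksilver's actual protocol:

```python
# Every update carries a monotonically increasing sequence number, so a
# replica can tell the moment an update went missing or arrived out of order.
class ReplicaLog:
    def __init__(self):
        self.last_seq = 0
        self.entries = []

    def apply(self, seq, update):
        if seq != self.last_seq + 1:
            # A gap means at least one update was lost in transit; the replica
            # should re-sync from its upstream before proceeding.
            raise RuntimeError(f"missing updates {self.last_seq + 1}..{seq - 1}")
        self.entries.append(update)
        self.last_seq = seq

log = ReplicaLog()
log.apply(1, "set zone A")
log.apply(2, "set zone B")
log.apply(4, "set zone C")   # raises: update 3 never arrived
```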
Soft Stuff:
- facebookincubator/ntp (article): Collection of Facebook's NTP libraries.
- liftbridge-io/liftbridge: provides lightweight, fault-tolerant message streams by implementing a durable stream augmentation for the NATS messaging system. It extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable. The goal of Liftbridge is to provide a message-streaming solution with a focus on simplicity and usability.
- What a great idea! Database schema templates: a collection of real world database schemas from open-source packages and real-world apps that you can use as inspiration when architecting your app.
- hse-project/hse: an embeddable key-value store designed for SSDs based on NAND flash or persistent memory. HSE optimizes performance and endurance by orchestrating data placement across DRAM and multiple classes of SSDs or other solid-state storage. Scales to terabytes of data and hundreds of billions of keys per store. For these YCSB workloads, MongoDB/HSE delivered up to nearly 8x more throughput than MongoDB/WiredTiger. For these YCSB workloads, HSE delivered up to nearly 6x more throughput than RocksDB.
- Serverless Redis: Run Redis without thinking about the servers. ~2msec latency. ~6x cheaper than Elasticache and RedisLabs. mattiabi: We started with the most used commands, and we are planning to add missing features gradually.
- Every little bit matters. How Netflix brings safer and faster streaming experiences to the living room on crowded networks using TLS 1.3. Play delay improvement ranged from 3.5% to 8.2%. 7.4% improvement in media rebuffers. Can reduce the CPU load.
- TerminusDB: an open source (GPLv3) full featured in-memory graph database management system with a rich query language: WOQL (the Web Object Query Language).
- asyncapi/asyncapi: The AsyncAPI specification allows you to create machine-readable definitions of your asynchronous APIs
Pub Stuff:
- Turbine: Facebook’s Service Management Platform for Stream Processing: Turbine, a management platform designed to bridge the gap between the capabilities of the existing general-purpose cluster management frameworks and Facebook’s stream processing requirements. Specifically, Turbine features a fast and scalable task scheduler; an efficient predictive auto scaler; and an application update mechanism that provides fault-tolerance, atomicity, consistency, isolation and durability. Turbine has been in production for over three years, and one of the core technologies that enabled a booming growth of stream processing at Facebook. It is currently deployed on clusters spanning tens of thousands of machines, managing several thousands of streaming pipelines processing terabytes of data per second in real time. Our production experience has validated Turbine’s effectiveness: its task scheduler evenly balances workload fluctuation across clusters; its auto scaler effectively and predictively handles unplanned load spikes; and the application update mechanism consistently and efficiently completes high scale updates within minutes.
- Goods: Organizing Google’s Datasets: In this paper, we describe Google Dataset Search (Goods), such a post-hoc system that we built in order to organize the datasets that are generated and used within Google. Specifically, Goods collects and aggregates metadata about datasets after the datasets were created, accessed, or updated by various pipelines, without interfering with dataset owners or users. Put differently, teams and engineers continue to generate and access datasets using the tools of their choice, and Goods works in the background, in a nonintrusive manner, to gather the metadata about datasets and their usage. Goods then uses this metadata to power services that enable Google engineers to organize and find their datasets in a more principled manner.
- Why is Maxwell's Theory so hard to understand? Its ultimate importance is to be the prototype for all the great triumphs of twentieth-century physics. It is the prototype for Einstein's theories of relativity, for quantum mechanics, for the Yang-Mills theory of generalised gauge invariance, and for the unified theory of fields and particles that is known as the Standard Model of particle physics. All these theories are based on the concept of dynamical fields, introduced by Maxwell in 1865. All of them have the same two-layer structure, separating the world of simple dynamical equations from the world of human observation. All of them embody the same quality of mathematical abstraction that made Maxwell's theory difficult for his contemporaries to grasp. We may hope that a deep understanding of Maxwell's theory will result in dispersal of the fog of misunderstanding that still surrounds the interpretation of quantum mechanics. And we may hope that a deep understanding of Maxwell's theory will help to lead the way toward further triumphs of physics in the twenty-first century.
- Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log: The main contributions in Scalog are that: it allows applications to customize data placement, it supports reconfiguration with no loss in availability, it recovers quickly from failures
- Gray Failure: The Achilles’ Heel of Cloud-Scale Systems (article): In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: that the system’s failure detectors may not notice problems even when applications are afflicted by them. This realization leads us to believe that, to best deal with them, we should focus on bridging the gap between different components’ perceptions of what constitutes failure.
- CSE138 (Distributed Systems) lectures, Spring 2020: Lecture videos from an undergrad distributed systems course at UC Santa Cruz.
- ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs: This paper addresses the need for in-memory computation with little to no change to DRAM designs. It is the first work to demonstrate in-memory computation with off-the-shelf, unmodified, commercial, DRAM. This is accomplished by violating the nominal timing specification and activating multiple rows in rapid succession, which happens to leave multiple rows open simultaneously, thereby enabling bit-line charge sharing. We use a constraint-violating command sequence to implement and demonstrate row copy, logical OR, and logical AND in unmodified, commodity, DRAM. Subsequently, we employ these primitives to develop an architecture for arbitrary, massively-parallel, computation
- WormSpace: A modular foundation for simple, verifiable distributed systems: The paper introduces the Write-Once Register (WOR) abstraction, and argues that the WOR should be a first-class system-building abstraction. By providing single-shot consensus via a simple data-centric API, the WOR acts as a building block for providing distributed systems durability, concurrency control, and failure atomicity.