Date: 2020-05-14

Hey, it's HighScalability time!

LOL. Who knew a birthday service could lead to an existential crisis?

4.8 billion: plays of Drake - Toosie Slide on TikTok. 1.6M on YouTube. As of Saturday May 9.

80%: of enterprises will shut down their traditional datacenters by 2025, versus 10% today.

32%: of US CFOs anticipate layoffs in the next 6 months. 49% plan on making remote work a permanent option.

50%+: expected drop in NY Times advertising next quarter. Added 587,000 net new digital subscriptions compared to the end of the fourth quarter of 2019.

50%: of online ad spending goes to industry middlemen. Out of a total of 267m of ads placed online, it was only possible to match the end-to-end process for 31m.

22K: new mobile games on Apple app store. Down from 285K in 2016.

2 billion: TikTok downloads. 5 months ago they were at 1.5 billion. Generated most downloads for any app ever in a quarter, accumulating more than 315 million installs across the App Store and Google Play.

14.7%: US unemployment rate. 20.5 million people lost their jobs in April.

>0: Dropbox’s first quarterly profit

$4,800: revenue from 2 million subscriber TierZoo YouTube channel, down from $11,000 a month. Advertising down all around. Two-thirds of income from sponsorship deals inked separately with advertisers.

3.3: workers replaced by each additional robot. That increased use of robots in the workplace also lowered wages by roughly 0.4 percent. It turns out robots are good at competing against people.

65%: Apple's gross margin on services. 515 million subscriptions. 1B+ active iPhones.

$10 billion: AWS quarterly revenue. 33% growth on an annualized basis. AWS is now bigger than all of Oracle.

59%: Growth in Azure cloud.

1,000 light-years: closest blackhole to Earth.

30,000: bugs generated each month by 47,000 Microsoft programmers. An AI system can find 97 percent of critical and non-critical security bugs.

29: day cluster trace — a record of every job submission, scheduling decision, and resource usage data for all the jobs in a Google Borg compute cluster, from May 2011.

$27 Billion: cost of cybercrime damages by 2025.

70%: of @DynamoDB IOPS and storage is being consumed by attribute names. Short names will eliminate 14K WCU at peak and reduce table size by 3.5TB.
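
To see why attribute names can dominate, here is a rough sketch. It assumes string-only items and approximates an item's size as the UTF-8 bytes of its attribute names plus its values. The attribute names and values are made up for illustration and are not from the tweet.

```python
def approx_item_bytes(item: dict[str, str]) -> int:
    """Approximate item size: UTF-8 bytes of each attribute name plus each string value."""
    return sum(len(k.encode("utf-8")) + len(v.encode("utf-8")) for k, v in item.items())

# Hypothetical item with descriptive attribute names.
long_names = {
    "customer_identifier": "c-1001",
    "order_creation_timestamp": "2020-05-09T12:00:00Z",
    "shipping_status_code": "OK",
}
# The same values stored under short attribute names.
short_names = {"cid": "c-1001", "ts": "2020-05-09T12:00:00Z", "st": "OK"}

long_size = approx_item_bytes(long_names)
short_size = approx_item_bytes(short_names)
value_bytes = sum(len(v.encode("utf-8")) for v in long_names.values())
name_share = (long_size - value_bytes) / long_size  # fraction of bytes spent on names

print(long_size, short_size, f"{name_share:.0%}")
```

In this toy item the names take up more bytes than the data, which is the effect the tweet describes at table scale.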

~2x: traffic at Shopify.

$6.9 billion: raised by AI startups in Q1 2020.

more: zero-day exploits in 2019 than any of the previous three years. Private companies are likely creating and supplying a larger proportion of zero-days than they have in the past.

200 million: packages delivered every day by pypi.org.

$2.7 billion: total cybercrime loss in 2018 in the US according to the FBI. Investment scams were #1.

@sfiscience: Every computation must produce heat, but both energy consumption and heat production can be outsourced — by, say, human leadership, distributed computing, or a simple virus. SFI Prof David Wolpert on Landauer's Bound

Urs Hölzle: Efficiency improvements have kept data center energy usage almost flat across the globe, even as demand for cloud computing has skyrocketed. Compute capacity up 550%, energy use up 6% (!) according to a paper in Science. Glad we've [Google] done our part!

Lotem Finkelstein: The Naikon group has been running a longstanding operation, during which it has updated its new cyberweapon time and time again, built an extensive offensive infrastructure and worked to penetrate many governments across Asia and the Pacific

latch: We recently migrated a few small systems to CockroachDB (as a stepping stone). Overall, the experience was positive. The hassle free HA is a huge peace of mind. I know people say this is easy to do in PG. I have recently setup 2ndQuadrant's pglogical for another system. That was also easy (though the documentation was pretty bad). The end result is quite different though and CockroachDB is just simpler to reason about and manage and, I think, more generally applicable.

@johncutlefish: OH: “At Acme, we’re really good at spending $2 million+ on something. Less than $500k not so much. Less than $100k is credit card money almost.” Good reminder for “enterprise #SaaS startups”. The problem may be that you’re not charging enough for them to even pay attention.

@rakyll: The goal was to abstract away the infrastructure. We instead converted every application developer into an infrastructure developer.

@iamwcr: Our actual use-case is a little complex to go into in tweets. But suffice to say, the PUT costs alone to S3 if we did 1-to-1 would end up being just under half our total running costs when factoring in DDB, Lambda, SQS, APIG, etc.

@paulg: Mere rate of shipping new features is a surprisingly accurate predictor of startup success. In this domain, at least, slowness is way more likely to be due to inability than prudence. The startups that do things slowly don't do them any better. Just slower.

@lowellheddings: Publications that depend on extremely overpriced direct ads are having serious problems. Vox, Buzzfeed, etc, were able to command 10-20x ad prices compared to everybody else despite having low ROI. It's never going back

Lev Grossman: He [William Gibson] said once that he was wrong about cyberspace and the internet when he first conceived it, he thought it was a place that we would all leave the world and go to. Whereas in fact, it came here.

@Molson_Hart: Amazon just wiped out the affiliate advertising game. Lowered commissions from 10% to 3%, effective 7 days from today. This is what it is like to dance with the beast.

@chamath: A case study will be written on how Microsoft allowed Zoom to eat their lunch. They spent millions on subterfuge trying to paint Slack as an inferior enemy when MSFT Teams actually can't do what Slack does and Teams' real competitor was Zoom. Now Zoom has 300M Daily Users. Lol.

@tmclaughbos: Serverless will be the biggest LOL f*uck you on developers since full stack. They think they‘re buying into only having to focus on code and not infrastructure, but instead we are secretly turning them into distributed systems engineers more than developers.

Rob Pike: So although it was gratifying and important to see Docker, Kubernetes, and many other components of cloud computing written in Go, it's perhaps not too surprising. Go has indeed become the language of cloud infrastructure.

stanmancan: I run a website that gets decent ad revenue (usually around $2,000/month) and so far month to date in April is down 54% versus the same day last month, and March was already down 14% from February. CPC for March was $0.60 but April is only $0.30 so far. Traffic, CTR, and fill rate haven’t changed, just drastically lower CPC.

John Conway: You know in my early twenties, let's say, people always thought that I would, you know, be a great mathematician. And be good at various things and so on. And in my late twenties, I hadn't achieved any of the things that people were predicting, and so I call it my black period. I started to wonder you know whether it was all nonsense. Whether I was not a good mathematician after all and so on, and then I made a certain discovery and was shot into international prominence. As a mathematician when you become a prominent mathematician, in that sense, it doesn't mean that many people know your name. It means that many mathematicians know your name, and there aren't many mathematicians in the world anyway, you know. So it doesn't count very much, but it suddenly released me from feeling that I had to live up to my promise. You know, I had lived up to my promise. I sort of made a vow to myself. It was so nice not worrying anymore that I thought I'm not going to worry anymore ever again, I was going to study whatever I thought was interesting and not worry whether this was serious enough. And most of the time I've kept to that.

@Carnage4Life: 25% of advertisers have paused their ad campaigns while 46% reduced budgets. 75% of advertisers expect this to be worse impact than the housing crisis. Every business that relies on advertising is going to have the same story. More traffic, less money.

@houlihan_rick: Here are the access patterns and #NoSQL schema for the Amazon Kindle Collection Rights Service. Running 9 distinct queries millions of times a day on RDBMS was a massive waste of CPU. Switching to @DynamoDB saved big $$.

David Rosenthal: Productivity is the ratio of outputs to inputs. It is easy for companies or institutions to believe that by outsourcing, replacing internal opex and capex by payments to suppliers, they will get the same output for less input. But experience tends to show that in most cases the result is less output for more input. The Gadarene rush to outsource may be a major contributor to the notorious decay of productivity

@pitdes: CEO of a 400+ employee business says WFH is working well, he may not renew his SF lease to save $10m a year (office + lunches etc) and instead do a couple of all-hands offsites a year (much cheaper). Lots of others must be thinking similarly- CRE could be permanently altered.

@Grady_Booch: "Good design means that when I make a change, it’s as if the entire program was crafted in anticipation of it. I can solve a task with just a few choice function calls that slot in perfectly, leaving not the slightest ripple on the placid surface of the code." @munificentbob

@_joemag: In recent design reviews, I'm observing more and more teams plan to write their new data planes in Rust. It looks like the tide has turned, and Rust is becoming the default choice for new data plane software.

Giff Constable: I joined Meetup having been an organizer for many years. I had lived the frustration of being a customer. By the late-2010s, I felt like the main product had gotten worse over the years, not better. One cause was the overwhelming amount of technical debt that had built up in the 17-year old company. As my former colleague Yvette Pasqua, a brilliant CTO and a big reason why I joined in the first place, once said, “Meetup is carrying four to five times the amount of code that it should for what it does.” But, as is often the case, the problem was not solely technical. There was an enormous amount of design debt that increased the complexity of every change, as well as a need to improve how the teams were actually setting goals and going about their work.

@levie: How IT strategies will change overnight: Some cloud -> 100% cloud Trusted devices -> Any device Protecting perimeter -> No perimeter Monolithic tools -> Best-of-breed apps UX secondary -> UX above all Employees only -> Extended enterprise

@timbray: Just now an SQS team chat flurry caught my eye. Seems there was a queue with 253,477,653,099 unread messages. Owner was reading at only 1.56M messages/second. It’s gonna take them a while to clear the backlog. But they say it’s a backfill op, so OK. Everything worked fine.
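
How long is "a while"? A back-of-the-envelope sketch, assuming the read rate stays constant at 1.56M messages/second and nothing new is enqueued (both are assumptions; the tweet gives only the two numbers):

```python
backlog_messages = 253_477_653_099   # unread messages, from the tweet
read_rate_per_s = 1_560_000          # messages read per second, from the tweet

drain_seconds = backlog_messages / read_rate_per_s
drain_hours = drain_seconds / 3600
drain_days = drain_hours / 24

print(f"{drain_seconds:,.0f} s = {drain_hours:.1f} hours = {drain_days:.2f} days")
```

Under those assumptions the backlog clears in a bit under two days.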

@sfiscience: "If you look at science over the last 50 years ask, what are the drivers of progress?, it's the ability of computers to simulate complex systems." J. Doyne Farmer (@Oxford)

dredmorbius: Hierarchies appear where there is a clear gradient or flow. The canonical examples are river networks and trees (from which: dendritic). Even here, flows are often not entirely directed -- a river system experiences evaporation, precipitation, ground seepage, and aquifer flows, in addition to its dominant gravity-induced current. A tree has flows which originate at leaves (photosynthesis) and roots (water, mineral sourcing), as well as elsewhere (flowers, parasites and symbiotes, community flora and fauna). Similarly, evolutionary trees represent not only ancestral inheritance, but cross-species gene transfers -- bacteria, viruses, mitochondria, even other higher-order organisms (generally through an intermediary). A true ontology to me is a largely nondirected graph. This may be a union of several directed graphs, a region of low-directedness within a larger directed space, or (fairly rarely) a near-universal truly nondirected system. Given that even cosmological megastructures are defined in terms of attractors and repellers, that is: gradients, truly nondirected spaces seem likely rare. Though that's describing physical entities, conceptual spaces seem to me similar -- at least analogous, perhaps even more than that. A tremendous risk in conceptual spaces is to equate some gradient with a moral, ethical, or social value gradient.

Hannu Rajaniemi: Tools always break. She should have remembered that.

@houlihan_rick: Need operational analytics in #NoSQL? Maintain time bound rollups in @DynamoDB with Streams/Lambda then query relevant items by date range and aggregate client side for fast reporting on scaled out data. Turn complex ad hoc queries into simple select statements and save $$$
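
The rollup pattern can be sketched without any AWS at all. Below, an in-memory dict stands in for the rollup items and a plain function stands in for the stream-triggered Lambda. The record fields (`created_at`, `amount_cents`) and function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rollup store: one item per day, keyed by an ISO date string.
daily_rollups = defaultdict(lambda: {"orders": 0, "revenue": 0})

def on_stream_record(record: dict) -> None:
    """Stand-in for the stream-triggered function: fold one write into its day's rollup."""
    day = record["created_at"][:10]  # "YYYY-MM-DD" prefix of the timestamp
    daily_rollups[day]["orders"] += 1
    daily_rollups[day]["revenue"] += record["amount_cents"]

def report(start_day: str, end_day: str) -> dict:
    """Select rollup items in a date range and aggregate them client side."""
    # ISO date strings sort lexicographically, so string comparison works here.
    selected = [r for day, r in daily_rollups.items() if start_day <= day <= end_day]
    return {
        "orders": sum(r["orders"] for r in selected),
        "revenue": sum(r["revenue"] for r in selected),
    }

for rec in [
    {"created_at": "2020-05-01T09:00:00Z", "amount_cents": 1000},
    {"created_at": "2020-05-01T17:30:00Z", "amount_cents": 2500},
    {"created_at": "2020-05-02T08:15:00Z", "amount_cents": 400},
    {"created_at": "2020-05-09T11:00:00Z", "amount_cents": 9900},
]:
    on_stream_record(rec)

first_week = report("2020-05-01", "2020-05-07")
print(first_week)
```

The report touches a handful of rollup items rather than every underlying record, which is where the savings would come from.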

@snowded: In the past every village had an idiot, and we could all deal with that. Now the internet is allowing idiots to connect and it is normalising idiocy.

@DAlexForce: AWS DAX (DynamoDB Accelerator) is WICKED fast ⚡. Regular caching (left) is much different from DAX (right). Improving response from milliseconds to microseconds. Now that is caching on steroids.

@tmclaughbos: Any critique of company's architectural choices needs to also include their business metrics too. X: That architecture is terrible! Me: Yet looking at their revenue, growth, and stock price it appears to be working just fine for them.

@dabit3: 8 things that have made me a better programmer: - Be ok with not being the best - Small steps > no steps - Look for negative feedback - Don't be embarrassed to ask questions - Learn new tech early - Talk less, do more - A student / growth mindset - Master *something*

@tqbf: The boy just asked if he should learn C++. It’s long past due for THAT conversation, and bad parenting that I waited for him to ask. I’m glad we caught him before he started experimenting on his own.

@dvassallo: 2000 customers @ $39/month is almost $1M/year. If your software can handle 2000 customers, you can worry about scalability once the $1M/year is flowing in. Scalability doesn’t get you customers. First have customers, then worry about scalability. The order matters.

@houlihan_rick: Data modeling in #NoSQL is about replacing JOINs with indexes. Store all objects in one table, decorate them with common attributes, and index those attributes to produce groupings of objects needed by the app. Time complexity of an index query is far less than a JOIN. @dynamodb
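
A toy illustration of the one-table idea, using plain Python dicts rather than DynamoDB. The key names (`pk`, `sk`, `gsi1`) and the sample items are assumptions made up for this sketch, not the schema from the tweet.

```python
from collections import defaultdict

# One "table" holding different object types, decorated with common attributes.
table = [
    {"pk": "CUSTOMER#1", "sk": "PROFILE", "name": "Ada"},
    {"pk": "CUSTOMER#1", "sk": "ORDER#100", "gsi1": "STATUS#OPEN", "total": 25},
    {"pk": "CUSTOMER#1", "sk": "ORDER#101", "gsi1": "STATUS#SHIPPED", "total": 40},
    {"pk": "CUSTOMER#2", "sk": "ORDER#102", "gsi1": "STATUS#OPEN", "total": 15},
]

def build_index(items: list[dict], attribute: str) -> dict[str, list[dict]]:
    """Group items by one attribute; items lacking it are skipped (a sparse index)."""
    index = defaultdict(list)
    for item in items:
        if attribute in item:
            index[item[attribute]].append(item)
    return index

by_customer = build_index(table, "pk")
by_status = build_index(table, "gsi1")

# "Customer with their orders" is one index lookup instead of a JOIN.
customer_1 = by_customer["CUSTOMER#1"]
# "All open orders across customers" is a lookup on the second index.
open_orders = by_status["STATUS#OPEN"]

print([i["sk"] for i in customer_1], [i["sk"] for i in open_orders])
```

Each access pattern the app needs becomes a precomputed grouping, so the read is a key lookup rather than a join computed at query time.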

@benedictevans: Google Meet and Microsoft Teams combined are doing close to 5bn minutes of calling each day. The entire UK telecoms system does about 600m.

@randompunter: Serverless does not imply microservices. We run full web applications in a single lambda (with static content off S3/Cloudfront).

@timfox: Currently trying to "cloud first development" from home. Incredibly painful experience. This model seems fundamentally broken for the new normal of working from home as it assumes you're in an office with a fast upload connection so you can push images. Give me local dev any day.

@igvolow: I’ve found that creating new microservices for a new project is the epitome of and posterboy for premature optimization. Make your code loosely coupled enough so any functionality can be easily extracted out into a microservice if need be. Embrace the modulith!

@ajaynairthinks: OH: “a customer will always choose a generic alternative you pitch over their incumbent (tool) because it could fix everything, but rarely pick the specific alternative you end up building over the same incumbent unless it solves their specific problem”. #PollingFallacy

@mikeb2701: I've literally seen latency spikes caused by the system page faulting in rarely used code. In some cases requiring mlocking the pages containing the shared objects to mitigate.

@anshelsag: Anyone look at the @Steam hardware survey for April? @AMD now has 21.89% CPU market share to Intel's 78.28%. Up from 17.5% in November 2018. AMD gained over half a percent just from March to April which is pretty significant.

David Rosenthal: Although there are significant technological risks to data stored for the long term, its most important vulnerability is to interruptions in the money supply.

@troyhunt: There's been a significant uptick in @haveibeenpwned usage since COVID-19 restrictions. Way more registrations (previous spikes after major breaches), daily unique users up from about 150k to 200k (Cloudflare) and monthly users up 41% on previous 12 month high (Google Analytics)

Hannu Rajaniemi: As you are no doubt aware, things are a little bit … restless on Earth at the moment.’ ‘If by restless, you mean eaten by recursively self-improving non-eudaimonistic agents, then, yes,

John Gruber: So, yes, a $400 iPhone SE bests a $3,000 top-of-the-line MacBook Pro in single-core CPU performance.

@txase: #Serverless is how all apps will be built in the future. The benefits are simply too great. 90% reduction in cost, for example, is what our customers see as well when they modernize apps.

@joshelman: Zoom vs Google Meet is the difference of a singular company and product focus vs just thinking you are building a feature as part of an enterprise suite.

@ShortJared: Fun fact... the entire content of the classic "Moby-Dick" fits into a single DynamoDB row with xz compression. Therefore, I conclude that @dynamodb is the ultimate docker container.

Chip Overclock: This article is about the iterations I went through to establish my own permanent Differential GNSS base station, and to cobble together a Rover that I could use in the field for precision geolocation. I leveraged and significantly modified my existing Hazer GPS software (com-diag-hazer on GitHub), but most of that was just bookkeeping and glue code; the firmware in the u-blox chip does all of the heavy lifting, including sending and receiving the RTCM messages.

@randompunter: There is an inflection point where EC2 is more cost effective but I feel it's much higher than what people realise. Lambda I can deploy some code and forget about it for a year. For EC2 I have to plan a 30 day machine re-image/rebuild maintenance schedule.

@mjpt777: I know the math yet it always surprises me to see the difference in performance when buffers are appropriately sized. Flow is key to distributed systems yet so few people give it any consideration.

Nestor: This is what they do to people! They put a machine on the floor, and if it has programming that doesn't take your money and you win on their machine, they will throw you in jail!

Corey Quinn: Their “region-pairing” strategy indicates that you can think of each Azure region as an AWS or GCP Availability Zone. That’s great right up until it isn’t, and you wind up with a bunch of small regions rather than fewer more robust ones, and a sudden influx of demand causes those regions to run out of headroom.

@AngieMaxwell1: Found the kid playing with her dog instead of Zooming with her teacher. She told me not to worry. She took a screenshot of herself “paying attention,” then cut her video & replaced it with the picture. “It’s a gallery view of 20 kids, mom. They can’t tell.” She is 10. #COVID19

@simonw: SQL is a better API language than GraphQL. Convince me otherwise! To counter some obvious arguments...You don't have to expose your entire schema, instead expose carefully designed SQL views (so you can refactor your tables without breaking your API) Read-only, obviously! Use time limits to cut off expensive queries (GraphQL needs this too)
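
A minimal sketch of the three ideas in that tweet, using Python's built-in `sqlite3` module. The table, the view name `api_users`, and the helper `run_api_query` are hypothetical. Note the sketch does not stop a caller from querying the underlying table directly; a real service would need to restrict that too.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, password_hash TEXT);
INSERT INTO users (name, email, password_hash)
VALUES ('ada', 'ada@example.com', 'x'), ('bob', 'bob@example.com', 'y');
-- The view is the public contract; the table behind it can be refactored freely.
CREATE VIEW api_users AS SELECT id, name FROM users;
""")
# Read-only: reject any statement that would modify the database.
conn.execute("PRAGMA query_only = ON")

def run_api_query(sql: str, time_limit_s: float = 1.0) -> list[tuple]:
    """Run a caller-supplied query, aborting it once the time limit has passed."""
    deadline = time.monotonic() + time_limit_s
    # The handler runs every 1000 VM instructions; a non-zero return aborts the query.
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 1)

rows = run_api_query("SELECT id, name FROM api_users ORDER BY id")
print(rows)
```

An expensive query, such as an unbounded recursive CTE, would be cut off with an `sqlite3.OperationalError` once the deadline passes.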

@m_a_y_o_w_a: As a Software Engineer, the more you gain professional experience, the more you realize that there is no 'one size fits all' solution to all Engineering problems. Every potential solution comes with its pros and cons, the key decision is in what you are willing to tradeoff.

L. B. Lewis: It’s an open secret in Silicon Valley that the tech industry suffers from ageism — “the stereotyping, prejudice, and discrimination against people on the basis of their age.” In a recent survey, it’s been shown that tech companies hire on the young side. And, by the age of 29, tech workers already start to feel the effects of ageism.

Bastian Spanneberg: When we recently did a review of our cost structure with Mike Julian and Corey Quinn from Duckbill Group, they pointed out that we have a lot of volumes that could be actually switched from GP2 to ST1. This would be a significant cost reduction as ST1 volumes are roughly half the cost!

datenwolf: Gaaah, please stop advertising optical computers as the technology that will overcome Moore's law. It makes no effing sense. Wavelength of the light emitted by these devices: ~4000nm. Latest generation commodity CPU transistor structure size: 7nm. Add to that that photons really don't like being trapped; you essentially need a delay line and optical amplifier to hold them indefinitely (that's essentially the core technology my whole PhD thesis centers around), it makes them a really impractical thing to store bits with. Things with a rest mass can be stored easily, though. Things like, say, electrons!

@BrianRoemmele: I looked at the code of a fintech startup that had some of the best programmers with the best degrees and asked about their precision on floating point with their “agile back end”. Beautiful code. Brilliant. But it also cost the company ~4 million. They think different now. I looked at an “upgrade” to Python at a major insurance company. The company was losing $2 million per year since moving off the COBOL platform and did not know why. It used floating point math and the rounding errors that were not anticipated. They went back to COBOL.
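
For what it's worth, the rounding problem is easy to reproduce, and Python ships a fixed-point alternative in its standard `decimal` module. This is a small sketch with made-up amounts, not a reconstruction of either company's code:

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2
float_matches = float_sum == 0.3

# Decimal arithmetic on string literals keeps the decimal values exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_matches = decimal_sum == Decimal("0.3")

# Rounding to cents: the float 2.675 is stored slightly below 2.675, so it rounds down.
float_cents = round(2.675, 2)
decimal_cents = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(float_matches, decimal_matches, float_cents, decimal_cents)
```

The float results are off by tiny amounts per operation, which is how errors can quietly accumulate across millions of transactions.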

@JoeEmison: Running a serverless insurance startup, selling home, auto, renters, and umbrella in five states (http://ourbranch.com) on AWS. April AWS bill was just under $740. DynamoDB - $202 CodeBuild - $116 Cloudwatch - $100 S3 - $66 AWS Directory Service - $36. Our back end is AppSync, Cognito, Lambda. Combined charges for all three of those is $32...Oh, yes—algolia, cloudinary, launchdarkly, LogRocket, sentry, segment, http://customer.io—they add up to more than AWS. But overall so much cheaper/faster/better than anything else.

JoeAltmaier: So I took 2 days, made a chart of every path and pattern of events (restarting a timer from a timer callback; having another time interval expire while processing the previous; restarting while processing and another interval expires; restarted timer expires before completing previous expiration and on and on). Then writing exhaustive code to deal with every case. Then running every degenerate case in test code until it survived for hours. It never had to be addressed again. But it did have to be addressed. So many folks are unwilling to face the music with complexity.

Joel Hruska: The history of computing is the history of function integration. The very name integrated circuit recalls the long history of improving computer performance by building circuit components closer together. FPUs, CPU caches, memory controllers, GPUs, PCIe lanes, and I/O controllers are just some of the once-separate components that are now commonly integrated on-die. Chiplets fundamentally reverse this trend by breaking once-monolithic chips into separate functional blocks based on how amenable these blocks are to further scaling...The most exciting thing about chiplets, in my opinion, isn’t that they offer a way to keep packing transistors. It’s that they may give companies more latitude to experiment with new materials and engineering processes that will accelerate performance or improve power efficiency without requiring them to deploy these technologies across an entire SoC simultaneously.

Steven Levy: This encryption function was only part of Diffie’s revolutionary concept, and not necessarily its most important feature. Public key crypto also provided the first effective means of truly authenticating the sender of an electronic message. As Diffie conceived it, the trapdoor works in two directions. Yes, if a sender scrambles a message with someone’s public key, only the intended recipient can read it. But if the process is inverted, if someone scrambles some text with his or her own private key, the resulting ciphertext can be unscrambled only by using the single public key that matches its mate.
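
The two directions of the trapdoor can be shown with textbook RSA and toy numbers. RSA is swapped in here purely as an illustration; the quote describes Diffie's concept, not a specific algorithm. The tiny key and the lack of padding make this completely insecure.

```python
# Toy textbook-RSA key pair (illustrative numbers only, far too small to be secure).
p, q = 61, 53
n = p * q            # public modulus: 3233
e = 17               # public exponent
d = 2753             # private exponent; (e * d) % ((p - 1) * (q - 1)) == 1

message = 65         # a message encoded as a number smaller than n

# Direction 1, secrecy: scramble with the public key, unscramble with the private key.
ciphertext = pow(message, e, n)
decrypted = pow(ciphertext, d, n)

# Direction 2, authentication: scramble with the private key, anyone can check it
# with the matching public key.
signature = pow(message, d, n)
verified = pow(signature, e, n) == message

print(decrypted, verified)
```

Only the holder of `d` could have produced a `signature` that unscrambles to the message under `e`, which is the authentication property Levy describes.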

@Carnage4Life: The idea that a bad quarter for Google means they ONLY grew revenues 13% year over year due to #COVID19 is a mind blowing statistic around how much headroom there still was in online advertising until the lockdowns.

@rakyll: Google Cloud's Go and Node support was originated by a small group of people including me. Today, we were reflecting that there was a time in cloud it was extremely easy to make such huge impact. Today, the scale is immense. I can't easily keep up with minor launches anymore.

@copyconstruct: An ontology of “stateful” systems ... Or at least my attempt at it. All stateless systems are stateless in no more than 4 ways. All stateful systems are stateful in ... 21 possible ways?

npunt: Magic Leap made one of the classic mistakes that other before-their-time products make: they tried to create a general purpose product because they didn't have a killer app that could focus their efforts. When you're building a product without a focused use case, you are pulled in a ton of different directions. In AR, this means focusing on fidelity, embodied in high resolution, wide field of view visuals, powerful processing, and compelling input methods.

goatherders: I ran an AWS practice for a large hosting company for two years. We had a number of situations where our prospect would say "annual spend with your managed services and our expected AWS bill would be $1M. Google is offering us a 2 million dollar credit to choose their cloud. What do you say to that?" "You should take them up on that offer." The incentives available from other cloud providers are MASSIVE if your business has the chance to grow in coming years. They will literally buy your business for years on the bet that at some point they will make it back.

ram_rar: I worked in a startup that was eventually acquired by Cisco. We had the same dilemma back then. AWS and GCP were great, but also fairly expensive until you get locked in. Oracle's bare metal cloud sweetened the deal so much that it was a no brainer to go with them. We were very heavy on using all open source tech stuff, but didn't rely on any cloud service like S3 etc. So the transition was no brainer. If your tech stack is not reliant on cloud services like S3 etc, you're better off with a cloud provider who can give you those sweet deals. But you'll need in house expertise to deal with big data.

Rimantas Ragainis: The first two — fio and sysbench — test scenarios indicated that GCP NVMes are slower than on AWS side.

Tim Bray~~ Event driven and asynchronous application design should be used by almost everyone who is trying to build something big and that has good scaling characteristics because you can't build a super cloud scale application without having some asynchronous buffering in it to deal with load surges. That typically means being message or event driven. Eventing is at the center of everything. It all has to do not so much with the classical paradigm of software that handles requests and looks something up in the database, but with state that comes in from the outside and flows through the system, which is the way to go for large heavily loaded applications. The state of the art is not that well defined. There aren't text books that say how to do this.

pachico: One of our applications receives more than 10m hits a day through Kong, which uses Redis for its rate limit plugin. We put a t3.micro for that and never had any issue. In reality, during our performance tests we got to much higher volumes and it always worked fine.
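
A minimal single-process sketch of that buffering idea, using Python's standard `queue` and `threading` modules. The queue size, the burst size, and the "work" done per event are arbitrary choices for illustration; a cloud-scale system would use a managed queue or event bus instead.

```python
import queue
import threading

# Bounded buffer between producer and consumer; it absorbs bursts of events.
buffer: queue.Queue = queue.Queue(maxsize=1000)
processed: list[int] = []

def consumer() -> None:
    """Drain events at the consumer's own pace until the sentinel arrives."""
    while True:
        event = buffer.get()
        if event is None:            # sentinel: no more events
            break
        processed.append(event * 2)  # stand-in for real work

worker = threading.Thread(target=consumer)
worker.start()

# A load surge: the producer enqueues quickly and never waits on the work itself.
for i in range(500):
    buffer.put(i)
buffer.put(None)
worker.join()

print(len(processed), processed[:3])
```

The producer only blocks if the buffer fills up, which is the back-pressure signal that the consumer side needs more capacity.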

throwaway_aws: Throwaway account for obvious reasons. In the past, AWS has used the data from third party hosted services on AWS to build a similar service and in fact start poaching their customers. Source: I used to be at AWS and know the PM & his manager who built a service this way. I was hired on that team.

1_person: The CDN will be fronting most of the load, behind that 10 decently specced servers running sanely architected code can scale to millions, if not tens of millions of requests per second. Drop the servers in HA sets of 2-3 nodes across 3-4 regions, anycast your service endpoint from each cluster. The hardest thing to replicate without AWS is the 6-7 figure bills.

Cloudflare: For a small test page of 15KB, HTTP/3 takes an average of 443ms to load compared to 458ms for HTTP/2. However, once we increase the page size to 1MB that advantage disappears: HTTP/3 is just slightly slower than HTTP/2 on our network today, taking 2.33s to load versus 2.30s.
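
Put as percentages, those differences are small in both directions. A quick sketch using only the four load times quoted above (the helper name is ours):

```python
def percent_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent (negative means faster)."""
    return (new - old) / old * 100

# HTTP/3 versus HTTP/2 load times from the quote.
small_page = percent_change(443, 458)     # 15KB page, milliseconds
large_page = percent_change(2.33, 2.30)   # 1MB page, seconds

print(f"15KB page: {small_page:+.1f}%   1MB page: {large_page:+.1f}%")
```

So HTTP/3 is roughly 3% faster on the small page and roughly 1% slower on the large one.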

@lynncyrin: I feel like writing "production grade" python code is much much harder than doing so for golang @lizthegrey: This is exactly why the SRE team at Google mandated "no new Python projects" and shifted all new automation to golang.

gregdoesit: I joined Uber in 2016, right around when on every conference you'd hear a talk along the lines on "Lessons learned at Uber on scaling to thousands of microservices" [1]. After a year or two, those talks stopped. Why? Turns out, having thousands of microservices is something to flex about, and make good conference talks. But the cons start to weigh after a while - and when addressing those cons, you take a step back towards fewer, and bigger services. I predict Monzo will see the same cons in a year or two, and move to a more pragmatic, fewer, better-sized services approach that I've seen at Uber. In 2020, Uber probably has fewer microservices than in 2015. Microservices are fantastic for autonomy. However, that autonomy also comes with the drawbacks. Integration testing becomes hard. The root cause of most outages become parallel deployments of two services, that cause issues. Ownership becomes problematic, when a person leaves who owned a microservice that was on the critical path. And that no one else knew about. Oncall load becomes tough: you literally have people own 4-5 microservices that they launched. Small ones, sure, but when they go down, they still cause issues. To make many services work at scale, you need to solve all of these problems. You need to introduce tiering: ensuring the most critical (micro)services have the right amount of monitoring, alerting, proper oncall and strong ownership. Integration testing needs to be solved for critical services - often meaning merging multiple smaller services that relate to each other. Services need to have oncall owners: and a healthy oncall usually needs at least 5-6 engineers in a rotation, making the case for larger services as well.

Nelson Elhage: For the solutions, a big theme — also cited by several of the people I link to — is moving to empiricism and experiment, instead of abstract reasoning. I think this need additionally explains some of the recent observability movement; as we rely more and more on empirical observation of our systems, we need better and better tools for actually making and analyzing observations and characterizing the empirical behaviors.

AusIV: With much of the economy moving online, spot instance termination rates have gone up a lot. We've had to switch our autoscaler's SpotAllocationStrategy from the default to "capacity-optimized", then added a few more instance types, and our termination rates seem to have dropped off a bit. The SpotAllocationStrategy setting wasn't a thing when we set up our autoscalers in the first place, or we probably would have done it then.

