Date: 2021-06-25

Hey, it's HighScalability time!



nicoff: Genie: I’ll give you one billion dollars if you can spend 100M in a month. There are 3 rules: No gifting, no gambling, no throwing it away. SRE: Can I use AWS? Genie: There are 4 rules

artichokeheart: It reminds me of that ancient joke: A QA engineer walks into a bar. He orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd. First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.

@drbarnard: Over dinner I told my wife one of my tweets ended up in a court case against Apple. Her first reaction, without even knowing what it said: “Is Apple going to retaliate?!" So yeah, for almost 13 years of making a living on the App Store, we‘ve lived in fear of Apple.

spamizbad: Innovation is the product manager coming to your desk and saying "Hey, [competitor] has feature X. How long would it take for us to build something like that?"

@jamie_maguire1: As I look back on 20 years in tech, I should have moved every 2-4 years. That said, staying a bit longer meant I could see implementations go from inception right through to delivery into prod and gain the experience.

Theo Schlossnagle: the leading reason we decided to create our own database had to do with simple economics. Basically, in the end, you can work around any problem but money. It seemed that by restricting the problem space, we could have a cheaper, faster solution that would end up being more maintainable over time.

Giancarlo Rinaldi~ Lewis Fry Richardson derived many of the complex equations needed for weather prediction in the 1920s. However, the math was so difficult that to predict the weather six hours in advance, it took him six weeks to do the calculations.

Rick Houlihan~ the cost of S3 is in the puts and gets. If you're doing high velocity puts and gets on S3, make sure they are on large objects; otherwise bundle smaller objects into larger objects to control costs.
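To make the bundling advice concrete, here is a rough sketch of the request-cost arithmetic. The price per 1,000 PUT requests is an assumed illustrative parameter, not a quoted AWS rate, and `put_request_cost_usd` is a hypothetical helper; storage and GET charges are ignored.

```python
import math

# Assumed, illustrative price per 1,000 PUT requests (check current S3 pricing).
ASSUMED_PRICE_PER_1000_PUTS_USD = 0.005

def put_request_cost_usd(num_objects, objects_per_bundle=1,
                         price_per_1000=ASSUMED_PRICE_PER_1000_PUTS_USD):
    """Request cost of uploading num_objects, packed objects_per_bundle per PUT."""
    puts = math.ceil(num_objects / objects_per_bundle)
    return puts * price_per_1000 / 1000

unbundled = put_request_cost_usd(1_000_000)                         # one PUT per object
bundled = put_request_cost_usd(1_000_000, objects_per_bundle=1000)  # 1,000 PUTs total
print(f"unbundled: ${unbundled:.2f}, bundled: ${bundled:.3f}")
```

Under these assumptions the request bill shrinks by the bundling factor, which is the whole point of packing small objects together.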

Agentlien: Near the beginning they talk about how targeting the PlayStation 5, which has an SSD, drastically changed how they went about making the game. In short, the quick data transfer meant they were CPU bound rather than disk bound and could afford to have a lot of uncompressed data streamed directly into memory with no extra processing before use.

string: My primary use has been for serving image assets, switched over from Cloudfront and have seen probably a >80% cost reduction, and no noticeable performance reduction, but as I mentioned I'm operating at a scale where milliseconds of difference don't mean much.

manigandham: All this focus on redundancy should be replaced with a focus on recovery. Perfect availability is already impossible. For all practical uses, something that recovers within minutes is better than trying to always be online and failing horribly.

Mars Helicopter: On the sixth flight, 54 seconds into the flight, a single image did not arrive to be processed. Landmarks were in the wrong place from where they were predicted, and the helicopter adjusted speed and tilt to compensate. This was the reason for the constant oscillation, as every following image bore the wrong timestamp and showed the “wrong" landmarks for the calculated position.

Riccardo Mori: Going through Big Sur’s user interface with a fine-tooth comb reveals arbitrary design decisions that prioritise looks over function, and therefore reflect an un-learning of tried-and-true user interface and usability mechanics that used to make for a seamless, thoughtful, enjoyable Mac experience.

VMO: Risk is a factor of amount of loss x probability of that loss. Cyber risk is treated as high risk because it has higher losses and the probability of occurrence is also really high. This is causing a lot of stress on insurers' profitability, as the premiums that they earn are not sufficient to cover all payouts. A mere 5 payouts can wash out the entire annual premium earned from 250 companies that have at least $200MM in insurance. This is leading to a different dynamic in the industry.
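A back-of-the-envelope reading of those numbers, as a sketch: if 5 payouts consume the premium from 250 insured companies, the book breaks even only when the annual claim rate stays at or below 5/250. This assumes equal premiums and full-limit payouts, which the quote does not state; `break_even_claim_rate` is a hypothetical helper.

```python
def break_even_claim_rate(payouts_that_exhaust_premium, insured_companies):
    """Fraction of insured companies that can claim before premiums are gone."""
    return payouts_that_exhaust_premium / insured_companies

rate = break_even_claim_rate(5, 250)
print(f"break-even claim rate: {rate:.1%}")
```

Any claim probability above that rate means expected loss (loss x probability) exceeds premium income.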

balabaster: A savvy engineer, like a spy, knows how to gain the confidence of the CEO long before they put a plan of "Gee, I think you should do this" in the CEO's ear. I've spent months ingratiating myself with the political circle before someone says "Hey Ben, what do you think?" and from there, it's game on.

enesakar: Redis REST has overhead less than a millisecond compared to Redis TCP. But when EDGE caching is enabled, reads become much faster as expected. Rough numbers we have seen: REST: ~0.70ms REST (EDGE cached): ~0.12ms

dkubb: I think the next generation of ORMs will be built on top of this approach. Being able to query and receive an entire object graph in a single round trip has changed how we develop apps at my company.

Michelle Hampson: His team tested the new software in a series of simulations. “The results revealed that logic circuits could be flawlessly designed, simulated, and tested," says Marks, noting that they were able to use DNAr-Logic to design some synthetic biological circuits capable of generating up to 600 different reactions.

@kellabyte: So basically it’s cheaper AND faster to store non-indexed data on S3 and pay for ridiculously cheap compute to buy enough machines to process it in parallel at blazing speeds? And just forget about optimized data structures and local SSDs?

@johncutlefish: "Let’s start with a basic truth: Great products and services do NOT mainly come from hours worked. The real drivers are things like creativity, collaboration and good decision making." @IamHenrikM

@LiveOverflow: 95% of all vulnerabilities can easily be found by scanners or inexperienced penetration testers. Skilled hackers probably cost 10x-100x and only get you another 2-3%. Nobody can guarantee 100%, so do not waste money on higher quality services #cisotips

@matthlerner: Eventually, I discovered Jobs To Be Done. And here’s how I use it specifically for marketing. First, I interview people who recently signed up for my service or a competitor and ask questions like: 1. What were you hoping to do? 2. Why is that important to you? 3. Where did you look? 4. What else did you try?

DSHR: In other words it is technically and economically infeasible to implement tests for both storage systems and CPUs that assure errors will not occur in at-scale use. Even if hardware vendors can be persuaded to devote more resources to reliability, that will not remove the need for software to be appropriately skeptical of the data that cores are returning as the result of computation, just as the software needs to be skeptical of the data returned by storage devices and network interfaces.

staticassertion: #1 assumption I always make sure to check now is "is the code that's running actually from the source code I'm looking at". So many "impossible" bugs I run into are in fact impossible... in the codebase I'm looking at. And it turns out what's deployed is some other code where it is very much possible.

@emileifrem: In my NODES keynote today, we ran a live demo of a social app with more people nodes than FB (!), backed by a trillion+ relationship graph sharded across more than 1,000 servers, executing deep, complex graph queries that return in <20 ms.

@uhoelzle: Very excited about the new Tau VMs on GCP: 56% faster than Graviton2. 42% better price/performance. That's "leapfrog", not "better".

@gurlcode: Raise your hand if you’ve taken down prod before 🙋‍♀️

smueller1234: The larger your infrastructure, the smaller the relative efficiency win that's worth pursuing (duh, I know, engineering time costs the same, but the absolute savings numbers from relative wins go up). That's why an approach along the lines of "redundancy at all levels" (raid + x-machine replication + x-geo replication etc) starts becoming increasingly worth streamlining.

Katy Milkman: My colleagues argue that their study highlights a common mistake companies make with gamification: Gamification is unhelpful and can even be harmful if people feel that their employer is forcing them to participate in “mandatory fun." Another issue is that if a game is a dud, it doesn’t do anyone any good. Gamification can be a miraculous way to boost engagement with monotonous tasks at work and beyond, or an over-hyped strategy doomed to fail. What matters most is how the people playing the game feel about it.

Robin George Andrews: Altogether, these simulations hint that magnetism may be partly responsible for the abundance of intermediate-mass exoplanets out there, whether they are smaller Neptunes or larger Earths.

Nicole Hemsoth: The pair see the next major challenges and opportunities in the I/O realm revolving around computational storage, which they’ve laid the groundwork for in the new, fully refactored 2.0 spec, even if there is not enough common ground to directly build into the NVMe standard just yet.

murat: Maybe [fail-silent Corruption Execution Errors (CEEs) in CPU/cores] will lead to abandonment of complex deep-optimizing chipsets like Intel chipsets, and make simpler chipsets, like ARM chipsets, more popular for datacenter deployments. AWS has started using ARM-based Graviton cores due to their energy-efficiency and cost benefits, and avoiding CEEs could give a boost to this trend.

James Hunter: “No," the chief replied sternly. “Not more important. Just different. As a leader, you must learn that every part is equally important, and every job must be done just so, or things fall apart.

Jill Bolte Taylor: So, from a purely biological perspective, we humans are feeling creatures who think, rather than thinking creatures who feel. Neuroanatomically you and I are programmed to feel our emotions, and any attempt we may make to bypass or ignore what we are feeling may have the power to derail our mental health at this most fundamental level.

Aleron Kong: That was a lesson he’d learned from the internet. As a compendium of all human knowledge, it should have made everyone smarter. Instead, it just gave a pulpit to every idiot who otherwise could have been ignored. The resulting noise was so loud it drowned out logic and good sense.

Robert M. Sapolsky: The cortex and limbic system are not separate, as scads of axonal projections course between the two. Crucially, those projections are bidirectional: the limbic system talks to the cortex, rather than merely being reined in by it. The false dichotomy between thought and feeling is presented in the classic Descartes’ Error, by the neurologist Antonio Damasio of the University of Southern California.

Geoff Huston: We appear to be seeing a resurgence of expressions of national strategic interest. The borderless open Internet is now no longer a feature but a threat vector to such expressions of national interest. We are seeing a rise in the redefining of the Internet as a set of threats, in terms of insecurity and cyber-attacks, in terms of work force dislocation, in terms of expatriation of wealth to these stateless cyber giants, and many similar expressions of unease and impotence from national communities. It seems to be a disillusioning moment where we’ve had a brief glimpse of what will happen when we bind our world together in a frictionless highly capable universally accessible common digital infrastructure and we are not exactly sure if we really liked what we saw! The result appears to be that this Internet that we’ve built looks like a mixed blessing that can be both incredibly personally empowering and menacingly threatening at the same time!

Tim Bray: In my years at Google and AWS, we had outages and failures, but very very few of them were due to anything as simple as a software bug. Botched deployments, throttling misconfigurations, cert problems (OMG cert problems), DNS hiccups, an intern doing a load test with a Python script, malfunctioning canaries, there are lots of branches in that trail of tears. But usually not just a bug. I can’t remember when precisely I became infected, but I can testify: Once you are, you’re never going to be comfortable in the presence of untested code.

cjg: We are using Rust for backend web development and other things. For us, the safety is the critical reason to choose Rust - particularly the thread-safety. Also the relatively small memory footprint compared to something like Java. Performance hasn't driven our decision at all - the number of requests per second is very low. It's correctness that matters.

@jordannovet: new: YouTube, the second largest internet site, is moving parts of its service from Google's internal infrastructure to the Google public cloud that external customers use. also, Google Workspace is now using Google public cloud

Ivan: Smaller containers don't just mean faster builds and smaller disk and networking utilization, they mean safer life.

Surjan Singh: I think of the factor of safety as a modern day version of the libation or offering. I’d rather keep pouring out the same amount of wine as my ancestors rather than skimping and risking offending the gods. That $1.5 billion pricetag, which seemed absurd when I began writing this, now just looks like the cost of being human.

Sanjay Mehrotra: Micron has now determined that there is insufficient market validation to justify the ongoing high levels of investments required to successfully commercialise 3D XPoint at scale to address the evolving memory and storage needs of its customers.

@saranormous: the number of companies out there that won't pay $20K a year for off-the-shelf SaaS, but then spend unlimited engineering effort building internal tooling themselves and then complain they don't have enough engineering capacity

@JannikWempe: DynamoDB tip Prefer KSUIDs over something like UUIDv4. KSUIDs include the current time and are sortable (and even more unique than UUIDv4). You can get an access pattern (sorted in chronological order) for free.
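The sortable property comes from putting the timestamp in the most significant bytes. Below is a minimal sketch in the spirit of KSUID (4-byte timestamp, 16 random bytes, base62 text); it is not the reference implementation, and `sortable_id` and `base62` are hypothetical helper names.

```python
import os
import time

KSUID_EPOCH = 1_400_000_000  # seconds; epoch offset used by the KSUID format
# Base62 alphabet in ASCII order, so string order matches numeric order.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(n, width=27):
    """Encode a non-negative integer as fixed-width base62 text."""
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars)).rjust(width, "0")

def sortable_id(now=None):
    """Time-prefixed ID: 4-byte timestamp followed by 16 random bytes."""
    ts = int(time.time() if now is None else now) - KSUID_EPOCH
    raw = ts.to_bytes(4, "big") + os.urandom(16)
    return base62(int.from_bytes(raw, "big"))

earlier = sortable_id(now=1_624_579_200)
later = sortable_id(now=1_624_579_201)
print(earlier < later)  # IDs created in later seconds sort later
```

Used as a DynamoDB sort key, IDs like these would return items in creation order from a plain query, which is the "free" access pattern the tip refers to. IDs minted within the same second sort by their random part, not by creation order.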

boulos: Disclosure: I used to work on Google Cloud. Perlmutter seems like an awesome system. But, I think the “ai exaflops" is a “X GPUS times the NVIDIA peak rate". The new sparsity features on A100 are promising, but haven’t been demonstrated to be nearly as awesome in practice (yet). It also all comes down to workloads: large-scale distributed training is a funny workload! It’s not like LINPACK. If you make your model compute intensive enough, then the networking need mostly becomes bandwidth (for which multi-hundred Gbps worth of NICs is handy) but even without it there are lots of ways to max out your compute. Similarly, storage is a serious need for say giant video corpora, but not for things like text! GPT-2 had like a 40 GiB corpus.

hocuspocus: my employer is becoming an AWS partner and we'll contractually need to spend a lot of money, meaning it's better if we use their SaaS offerings directly rather than going through the marketplace or a 3rd party.

Scott Aaronson: when information from the state (say, whether a qubit is 0 or 1, or which path a photon takes through a beamsplitter network) “leaks out" into the environment, the information effectively becomes entangled with the environment, which damages the state in a measurable way. Indeed, it now appears to an observer as a mixed state rather than a pure state, so that interference between the different components can no longer happen. This is a fundamental, justly-famous feature of quantum information that’s not shared by classical information.

@tlipcon: One of the coolest things about Spanner is that you can scale more or less indefinitely with strong semantics. But previously the scaling was only in the up direction, and the price of entry was prohibitive for small use cases. No more!

Orbital Index: A protoplanet lovingly named Theia is thought to have slammed into the early Earth 4.5 billion years ago, dislodging enough combined material to form the Moon and leaving Earth with a 23-degree tilt. A new paper (pdf) links this event to two continent-sized higher density regions of the Earth’s mantle that underlie Africa and the Pacific Ocean.

@Nick_Craver: If you don't have worker space to complete that operation: you're using memory for the file buffer, things like a socket, if it's off-box, etc. until the completion can happen. If other tasks queued up are waiting on those resources: uh oh, deadlock city. That's no good! Let me give an example: When we had a load balancer bug at Stack Overflow assuming web servers were *up* going into rotation, and not *ready to be health checked*, it'd slam the app with 300-500 req/sec instantly. This caused our initial connection to Redis to take 2 minutes.

@lil_iykay: Messed with a varying combination of stacks over the years. Here's what seems to be working super well for me. Context: building a FinTech mobile app that would be used across the country. Being FinTech, the must-haves for me are Security, easy-to-rollback deployments - with < 1000ms downtime across all stacks in the worst case scenario, real-time updates visible to users. Mobile App connects over a bidirectional gRPC stream to the serverless gateway which are several instances of docker containers running on @flydotio - they leverage firecracker VM to deploy your containers to edge servers closest to your users that respond elastically to traffic 🔥. Database (data API I should say) is @fauna - a truly excellent serverless semi-structured NoSQL database - which doesn't depend on syncing clocks for transactions and data consistency - currently using their streaming API on the JVM client for real-time data updates. For syncing and setting up hooks for communicating with our Banking Services platform (http://silamoney.com), I use @Cloudflare workers - serverless functions which aren't isolated on the container level like docker, but by Chrome V8 Isolates. This gets rid of cold starts and the need to keep your functions "warm" (which isn't even deterministic at this scale) and are super lightweight - And orders of magnitude cheaper than AWS Lambda, Google Cloud Functions etc. Oh and the KV service from Cloudflare just comes in as a global cache for basically nothing. This stack makes me smile daily 😜 Happy to learn more on this journey.

khanacademy: Yep, Go is more verbose in general than Python…But we like it! It’s fast, the tooling is solid, and it runs well in production

@GergelyOrosz: Coinbase rewrote both their Android and iOS apps in React Native, and are happy with the end result. My takeaways from their writeup (thread) 1. It's hard to hire lots of good native mobile engineers. Heck, this is what was probably a major trigger for their whole transition

@dhh: It’s safe to say that Shopify is pushing the absolute frontier of what’s possible with Rails and Ruby. Their main majestic monolith is a staggering 2.8m lines of Ruby responsible for processing over $100m in sales per hour at peak 🤯. And they’re doing it running latest Rails

Charles R. Martin: Flight software for rockets at SpaceX is structured around the concept of a control cycle. “You read all of your inputs: sensors that we read in through an ADC, packets from the network, data from an IMU, updates from a star tracker or guidance sensor, commands from the ground," explains Gerding. “You do some processing of those to determine your state, like where you are in the world or the status of the life support system. That determines your outputs – you write those, wait until the next tick of the clock, and then do the whole thing over again."
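The control cycle described above (read inputs, determine state, write outputs, wait for the next tick) can be sketched as a fixed-period loop. This is an illustrative toy, not SpaceX's flight software; every function and parameter name here is hypothetical.

```python
import time

def run_control_cycles(read_inputs, determine_state, compute_outputs, write_outputs,
                       period_s=0.02, cycles=3,
                       clock=time.monotonic, sleep=time.sleep):
    """Run a fixed number of read -> process -> write cycles, one per clock tick."""
    next_tick = clock()
    for _ in range(cycles):
        inputs = read_inputs()               # sensors, network packets, commands
        state = determine_state(inputs)      # e.g. position or subsystem status
        write_outputs(compute_outputs(state))
        next_tick += period_s                # schedule against the clock, not loop time
        delay = next_tick - clock()
        if delay > 0:
            sleep(delay)                     # wait until the next tick

# Toy usage: three fake sensor readings, doubled into a "state", then offset.
readings = iter([1.0, 2.0, 3.0])
written = []
run_control_cycles(lambda: next(readings), lambda x: x * 2, lambda s: s + 1,
                   written.append, period_s=0.0)
print(written)
```

Advancing `next_tick` by a fixed period, instead of sleeping a fixed amount after each iteration, keeps the cycle rate from drifting when processing time varies.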

@jrhunt: Summary: Functions are for quick request/response rewrites (< 1ms runtime and 2MB memory) Pricing: $0.10 per 1 million invocations ($0.0000001 per request) Lambda@Edge still exists for longer lived (> 1ms) functions that run at regional cache nodes.
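For a sense of scale, the quoted price works out as follows. This sketch uses only the $0.10 per million figure from the summary above; `invocation_cost_usd` is a hypothetical helper and the traffic volume is made up for illustration.

```python
PRICE_PER_MILLION_USD = 0.10  # figure quoted in the summary above

def invocation_cost_usd(invocations, price_per_million=PRICE_PER_MILLION_USD):
    """Cost of a number of function invocations at a flat per-million price."""
    return invocations * price_per_million / 1_000_000

# Illustrative volume: one billion rewrites in a billing period.
print(f"${invocation_cost_usd(1_000_000_000):.2f}")
```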

throwdbaaway: In the worst case, there can be a 20% variance on GCP. On Azure, the variance between identical instances is not as bad, perhaps about 10%, but there is an extra 5% variance between normal hours and off-peak hours, which further complicates things.

Cal Newport: This way of collaborating, this hyperactive hive mind, took over much of knowledge work. Now my argument is, once you are collaborating using the hyperactive hive mind, any non-trivial amount of deep work becomes almost impossible to accomplish.

Mikael Ronstrom: The results show clearly that AWS has the best latency numbers at low to modest loads. At high loads GCP gives the best results. Azure has similar latency to GCP, but doesn’t provide the same benefits at higher loads. These results are in line with similar benchmark reports comparing AWS, Azure and GCP.

Michael Drogalis: Databases and stream processing will increasingly become two sides of the same coin. At this point in engineering history, the metal of those technologies are still melding together. But if it happens, 10 years from now the definition of a database will have expanded, and stream processing will be a natural part of it.

papito: No, Stack Overflow architecture is still a simple server farm with aggressively maintained code and a lite SQL ORM that lets them highly optimize queries. One of the architects talked about it on their pod a few weeks ago. She mentioned explicitly how they don't jump on new hotness and keep it classic, maintainable, and highly performant without spending tens of millions of dollars on FAANG alumni who seem to think that every shop needs a Google infrastructure because THIS IS THE WAY.

rualca: it seems you're missing the forest for the trees. They put in the work to refactor their services so that they could be treated as cattle and also autoscale, and the author stated that after doing his homework he determined that for his company Kubernetes offered the most benefits. There isn't much to read from this.

