Date: 2021-03-17

Hey, HighScalability is here again!

Reverse engineering an ancient analog computer is a detective story worth reading. A Model of the Cosmos in the ancient Greek Antikythera Mechanism.

@tommchenry: Star Trek story where they invent the replicator, then scramble to invent new tech to add to the replicator to ensure it chokes the infinite plenty with artificial self-imposed scarcity so the social relations of political and economic power aren’t threatened.

REvil’s Unknown: Yes, as a weapon [ransomware] can be very destructive. Well, I know at the very least that several affiliates have access to a ballistic missile launch system, one to a U.S. Navy cruiser, a third to a nuclear power plant, and a fourth to a weapons factory. It is quite feasible to start a war. But it’s not worth it—the consequences are not profitable.

@vzverovich: 500,000 lines of code to control the Mars rover entry, descent and landing. Almost as big as std::tuple implementation!

@axboe: Personal goal achieved: IOPS=3004074, IOS/call=31/31, inflight=128 (128) which is > 3M IOPS on a single CPU core, using io_uring in polled mode and an Optane2 device. 512b random reads. I like moving goal posts, so next milestone is 4M per core.

Karen Hao: I pressed him [Joaquin Quiñonero Candela, a director of AI at Facebook] one more time. Certainly he couldn’t believe that algorithms had done absolutely nothing to change the nature of these issues, I said. “I don’t know,” he said with a halting stutter. Then he repeated, with more conviction: “That’s my honest answer. Honest to God. I don’t know.”

@jamesurquhart: [With the new API destinations feature of AWS EventBridge] AWS has basically coopted the entire HTTP API ecosystem into its event-driven applications story. It's a brilliant move.

@TrungTPhan: State of Cloud Wars • AWS has plateaued at 33% market share for past few years • Microsoft has doubled share (10-20%) in the last 4 years • Google (9%) and Alibaba (6%) follow while IBM (never mind)

peter_d_sherman: One of the most interesting things which will happen in the near future in computing is the adoption of serial memory interfaces. For pretty much the entire history of modern computing, RAM has been attached to a system via a high-speed parallel interface. Making parallel interfaces fast is hard and requires extremely rigorous control of timing skew between pins, therefore the routing of PCB traces between a CPU and RAM slots must be done with great precision. At the speeds of modern parallel RAM interfaces like DDR4, what is theoretically a digital interface in practice must be viewed as practically analog (to the point that part of a CPU DDR4 controller is called the “PHY”). Moreover, the maximum distance between a CPU and its RAM slots is extremely tight. The positioning of RAM slots on a motherboard is largely constrained by these physics considerations. For these reasons you have never seen anything like the flexibility with RAM attachment that you can get with, for example, PCIe or SAS. PCIe and SAS are serial architectures which support cabling and even switching, allowing entire additional chassis of PCIe and SAS devices to be attached to a system via cables.

jxf: The main advantage of Kubernetes isn't really Kubernetes anymore — it's the ecosystem of stuff around it and the innumerable vendors you can pay to make your problems someone else's problems.

Roger Lee: Subscriptions are more important than box office.

@chrismunns: “Buttt Daaaaaaddddd, I wanna build my own traffic routerrrrrr.” “Sweetie, when you are 30% of the internet’s traffic you can do that. Until then loadbalancers and DNS are fine for you and your other friends” “Ugh I have no fun” “You can have fun with loadbalancers”

@mbushong: A little surprised at people dunking on OVH. Stuff happens, and I’d think a little empathy is a stronger starting position than some of what I have seen. Reasoned discussions about availability and architecture are obviously positive outcomes. But someone is having a bad day :(

@jesterxl: 1. not having to do a TON of work & maintenance, easier to build things 2a. designing back-ends not in a monolith, adopting functional programming and allowing AWS to own state & side effects 2b. S3's (used to) eventual consistency 2c: active/active architectures - no AMI refreshes - no Docker security finding updates - no rehydrations - no log configuration (Splunk & ELK, wtf is fluentd?) The above seems trite, but dude, MONTHS of slow, boring, pain

@chantastic: My 85 year old grandmother passed today. This is the woman who taught me how to program BASIC on an ATARI. She helped us get our first computer (Intel 386) and told me to "break it". I did and she was the first human to debug my mistakes. I am because she was

@haysstanford: Ask a programmer to review 20 lines of code, they'll find 7 issues. Ask them to review 500 lines & they'll find 0 issues.

qmarchi: Some interesting facts to know for those who don't dig into it. Walmart: - has 80+ internal apps, mostly variants but still unique - runs k8s inside of Distribution Centers - maintains a fleet of >180k mobile devices in the US alone - has a half-dozen data centers in the US - has most International infrastructure separate from US Stores'

tothrowaway: I'm at OVH as well (in the BHS datacenter, fortunately). I run my entire production system on one beefy machine. The apps and database are replicated to a backup machine hosted with Hetzner (in their Germany datacenter). I also run a tiny VM at OVH which proxies all traffic to Hetzner. I use a failover IP to point at the big rig at OVH. If the main machine fails, I move the failover IP to the VM, which sends all traffic to Hetzner.

ev1: It is kind of interesting that on the US side everyone is in disbelief, or like "why not use AWS" - while most of the European market knows of OVH, Hetzner, etc. My own reason for using OVH? It's affordable and I would not have gotten many projects (and the gaming community I help out with) off the ground otherwise. I can rent bare metal with NVMe, and several terabytes of RAM for less than my daily wage for the whole month, and not worry about per-GB billing or attacks. In the gaming world you generally do not ever want to use usage based billing - made the mistake of using Cloudfront and S3 once and banned script kiddies would wget-loop the largest possible file from the most expensive region botnet repeatedly in a money-DoS. I legitimately wouldn't have been able to do my "for-fun-and-learning" side projects (no funding, no accelerator credits, ...) without someone like them. The equivalent of a digitalocean $1000/m VM is about $100 on OVH.

@_KarenHao: Reporting this thoroughly convinced me that self-regulation does not, cannot work. Facebook has only ever moved on issues because or in anticipation of external regulation. If we have any hope of fixing FB's problems, we can no longer afford to wait for it to do so itself.

@brintly: Limited sample set, I have worked with a few that were going Azure because either the fear of Amazon or Wal-Mart was one of their biggest customers and they felt pressure to move away from AWS. I’ve worked with more retailers that are on/going to AWS than other platforms.

@colmmacc: Also: it's striking that VC relies on pitch decks. I can never decide if this is either A) terrible, and the industry is fundamentally lazy or B) genius, because it focuses almost exclusively on founders' ability to sell and that's all that matters.

Stef Shrader: over twice as many of the drivers who weren't experienced with Level 2 automation but used it for the test didn't remember the bear at all compared to either of the other groups in the test.

Orbital Index: “This little demonstration EST [Enormous Space Telescopes], with a total mass less than 20 kg, including optics that would be positioned along or suspended from the tether at the parabola focal point, would have four times the light gathering capacity of Webb (about thirty times that of Hubble), while costing on the order of 1/1,000th as much.”

Jeff Lawson: By 2013 [at Twilio], because of the growth of the codebase and the complexity of the tests and builds, the process was sometimes taking as long as 12 hours! Not only that, but the build would actually fail a substantial number of times — at worst, up to 50 percent of the time — and the developer would have to start over again. We regularly lost days of productivity just getting code out. This was the opposite of moving fast. Writing the code wasn’t the hard part. Wrangling our antiquated systems was. Talk about a self-inflicted wound. As a result, our best engineers started quitting, frustrated at the inability to do their jobs. At first it was a few, and before we knew it, nearly half of our engineers had quit. Half!

@Foone: FUN FACT: the original RFC defining how domain names work doesn't use COM/ORG/NET as the first example. It uses COLORS, FLAVORS, and TRUTH. So your domains are GREEN.COLORS, TRUTH., and CHOCOLATE.NATURAL.FLAVORS

@QuinnyPig: This tells us something about their architecture. ~500GB transferred out, ~620GB transferred in, and ~10TB transferred between AZs (it gets billed twice). Why are they taking data in, then moving it back and forth this much? Kafka, Cassandra, or Kubernetes is likely...Found the bastard! EKS without Fargate costs indicates that they're running Kubernetes. That'd speak to the data transfer cross-AZ charges. Does the newest version respect topology yet? This one doesn't.

ed25519FUUU: To put it in perspective, a single bitcoin transaction can power the average american's household electricity needs for... an entire month.

AgentK20: I'm the CTO of a moderately sized gaming community, Hypixel Minecraft, who operates about 700 rented dedicated machines to service 70k-100k concurrent players. We push about 4PB/mo in egress bandwidth, something along the lines of 32gbps 95th-percentile. The big cloud providers have repeatedly quoted us an order of magnitude more than our entire fleet's cost....JUST in bandwidth costs. Even if we bring our own ISPs and cross-connect to just use cloud's compute capacity, they still charge stupid high costs to egress to our carriers. Even if bandwidth were completely free, at any timescale above 1-2 years purchasing your own hardware, LTO-ing, or even just renting will be cheaper. Cloud is great if your workload is variable and erratic and you're unable to reasonably commit to year+ terms, or if your team is so small that you don't have the resources to manage infrastructure yourself, but at a team size of >10 your sysadmins running on bare metal will pay their own salaries in cloud savings.
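
As a rough sanity check on the two bandwidth figures in that quote, here is a back-of-envelope conversion from monthly egress volume to an average line rate. It assumes decimal petabytes and a 30-day month, neither of which the quote specifies, so treat the result as an approximation rather than Hypixel's actual numbers.

```python
# Back-of-envelope: convert a monthly egress volume to an average line rate.
# Assumptions (not from the quote): decimal petabytes (1 PB = 10**15 bytes)
# and a 30-day month.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def average_gbps(petabytes_per_month: float) -> float:
    """Average throughput in gigabits per second for a monthly volume."""
    bits = petabytes_per_month * 10**15 * 8
    return bits / SECONDS_PER_MONTH / 10**9


avg = average_gbps(4)        # the quoted ~4 PB/mo
peak_to_average = 32 / avg   # the quoted ~32 Gbps 95th percentile
print(f"average ≈ {avg:.1f} Gbps, 95th percentile ≈ {peak_to_average:.1f}x average")
```

Under those assumptions the average works out to roughly 12 Gbps, so the quoted 95th-percentile figure is about two and a half times the average. That gap is one reason 95th-percentile billing and per-GB billing can produce very different invoices for the same traffic.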

spenczar5: Cool! Some thoughts from a former Twitch engineer: - Probably the hardest part of running these things is managing outbound bandwidth costs. You'll need to either limit inbound bitrate or transcode video down to a manageable rate, or else you'll quickly spend a lot of money on shipping 4k video for people. - Right now, your nginx hosts both do ingest and playback, if I understand it right. You might want to separate the two. It makes maintenance easier, and it lets you scale much better - right now, a single stream probably maxes out on viewership based on the CPU capacity of the single nginx host that is ingesting the stream, transcoding it, and delivering it. If you have multiple nginx hosts which could deliver the already-transcoded stream, you could scale much better.

@rafrasenberg: My current AWS full-stack set: ⚒️ Backend: TypeScript GraphQL AppSync DynamoDB Lambda AWS CDK ⚒️ Front-end: TypeScript Svelte S3 API Gateway Cloudfront ⚒️ Additional: Cognito Amplify

fmajid: IIRC at one point Facebook was a 1GB+ executable transpiled from PHP to C++ using HipHop, and that certainly fits any reasonable definition of a monolith, so yes, monoliths can be scaled to an absurd degree.

Joe Jacobsen: none of us were briefed on the original design and most aspects were delegated to just a small number of Boeing … engineers for approval.

Tim Wu: Recall that telephone technology was at the time both primitive and a luxury. For that reason, it is possible that Western Union thought it wasn’t such a big deal to let Bell establish a phone service, imagining it was simply letting Bell run a complementary but unrelated monopoly.

@tmclaughbos: This morning had a brief slack exchange with an engineer about my troubles with split stack AWS APIG. They said, “It was easier when we just added routes in our spring boot app.” And there I realized as ops and dev people we’re coming to serverless with very different experiences. My immediate reaction was, “Hold on! I don’t miss the days of orchestrating load balancers, nginx proxies, and not to mention making deploying your application reliably into something you don’t think about.” And later I realized they never had to think about that before. Coming from an ops and infrastructure background I both see and love the progress serverless brings for infrastructure management and the problems I don’t have to solve. But many developers are being exposed to this layer of their application for the first time.

Edward Teller: I have no hope of clearing my conscience. The things we are working on are so terrible that no amount of protesting or fiddling with politics will save our souls.

@adrianco: Netflix was using commercial CDNs and outgrew them. There used to be spare CDN capacity in the evenings, but when Netflix peak exceeded daytime peak, they had to deploy their own system, and could get deeper into ISPs with open connect devices.

@iann0036: Just got mind blown by @QuinnyPig on Twitch with the mention of a technique of spinning up a separate account to buy RIs / Savings Plans as they apply to all accounts but the support cost % is per account

Forrest Brazeal: Containers are about repackaging the past. Serverless is about reimagining the future.

Neil C. Thompson: We do mean that the economic cycle that has led to the usage of a common computing platform, underpinned by rapidly improving universal processors, is giving way to a fragmentary cycle, where economics push users toward divergent computing platforms driven by special purpose processors. This fragmentation means that parts of computing will progress at different rates. This will be fine for applications that move in the 'fast lane,' where improvements continue to be rapid, but bad for applications that no longer get to benefit from field-leaders pushing computing forward, and are thus consigned to a 'slow lane' of computing improvements. This transition may also slow the overall pace of computer improvement, jeopardizing this important source of economic prosperity.

_fat_santa: This might be an unpopular opinion around here but I believe Leetcode is a terrible metric for assessing developer performance[...]As a programmer, your performance is based upon what you can build. You're never going to have a manager or customer go "well this looks good, but I noticed that your link list function is 3 lines vs 1 line". Leetcode at the end of the day is semantics of programming, no one besides the most hardcore programmers give a damn about it. Sure you can reverse a linked list in one line of code, but can you build a piece of working functioning software?

mfer: There are numerous schools of thought on programming. For example, I've seen those who are interested in leetcode, algos, and the like. I remember this one time where management wanted to have JS/front-end devs answer questions about C and b-trees. They couldn't find anyone to make it through the whole interview process. The problem was that people who could handle the C and b-trees couldn't cut it at the JS questions that came later. The JS devs never got past the C/b-tree questions. There is a culture of elite knowledge and a club around that. Some are into the school people have degrees from and that kind of thing. There is another side of it that's about the ability to use code to problem solve. I remember meeting this senior engineer that customers used to constantly request by name. He was one of the most senior levels at the company. I later learned that he had no degree. He had a ton of hands on knowledge and understood the technology from years of working. He learned it like a skilled trade and he was valuable to everyone involved.

Geoff Huston: This is not a new problem by any means. It appears that the common theme of the Internet’s growth over the past thirty years has been one where the capabilities of the infrastructure is the limiting factor, while the underlying dynamics of demand continue to completely outpace the delivery capacity of these platforms. There is no reason to suspect that this will change anytime soon.

Geoff Huston: A more abstract view of the dilemma in security by design was provided by Russ White in his presentation on security by design. As Bruce Schneier pointed out: “The Internet and all the systems we build today are getting more complex at a rate that is faster than we are capable of matching. Security in reality is actually improving, but the target is constantly shifting. As complexity grows, we are losing ground.” The consequent question is: “Can we contain the complexity of these systems?” Russ’ answer is not all that encouraging: “Reducing complexity locally almost always leads to increasing complexity globally. Optimizing locally almost always leads to decreasing optimisation globally.” Oh dear!

Exxact: For CS nerds, it’s [DeepMind’s AlphaFold & the Protein Folding Problem] like trying to find a minimum description length with O(n) complexity for an O(n^3) algorithm, as compared to going from O(n^3) to O(n^2).

Guy Meynants: The cameras on Perseverance have three main improvements over those that flew on Curiosity, say the JPL scientists in the Space Science Reviews paper. Firstly, the Cmosis CMV20000 sensors are colour chips, which gives better contextual imaging capabilities than the monochrome predecessors. The second improvement is that the cameras have a wider field of view – 90° x 70° as opposed to 45° x 45° – which means only five overlapping images are needed to create a 360° panoramic view (Curiosity needed 10 images to achieve the same effect). The third improvement is that the 20-megapixel sensors can resolve greater detail than the older model.

Paul Ratner: Qin developed this algorithm to predict the orbits of planets in the solar system, training it on data of Mercury, Venus, Earth, Mars, Ceres, and Jupiter orbits. The data is "similar to what Kepler inherited from Tycho Brahe in 1601," as Qin writes in his newly-published paper on the subject. From this data, a "serving algorithm" can correctly predict other planetary orbits in the solar system, including parabolic and hyperbolic escaping orbits. What's remarkable, it can do so without having to be told about Newton's laws of motion and universal gravitation. It can figure those laws out for itself from the numbers. Unfortunately, I’m unable to discover where I found this related quote: You can imagine a universe where Newton's laws are never discovered because the human need for abstraction isn't there.

Daesol: Life is an accumulation of bets we’ve made (as well as bets made by our parents, and the society we grew up in etc). The future is also just a series of bets. A non-ergodic path-dependent sequence...So it’s worth putting in more conscious effort into assessing the favourability of a bet. Especially for higher stakes.

Chip Overclock: As you might expect, using the guaranteed delivery Transmission Control Protocol (TCP) instead of UDP solves this problem. And indeed, my initial implementation years ago used TCP. But what I found during these kinds of real-time visualizations, especially in locations with spotty cellular reception, is that the display could lag many seconds behind real-time as the TCP/IP stack retransmitted lost packets. And once behind, it never caught up. In fact, it sometimes got progressively worse. It was far better to have the map pointer jump forward than to permanently and increasingly lag behind what was happening in meat-space.
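
The "jump forward rather than lag" behavior described above can be sketched with a sequence number on each UDP datagram: the receiver keeps only updates newer than the last one it displayed and silently drops anything late. This is a minimal illustration, not Chip Overclock's actual implementation. The 4-byte header format and the names `encode`, `LatestOnly`, and `receive_forever` are assumptions made up for the example, and sequence-number wraparound is ignored.

```python
import socket
import struct
from typing import Callable, Optional

# Hypothetical wire format: 4-byte big-endian sequence number, then payload.
HEADER = struct.Struct("!I")


def encode(seq: int, payload: bytes) -> bytes:
    """Prefix a payload with its sequence number."""
    return HEADER.pack(seq) + payload


class LatestOnly:
    """Accepts only datagrams newer than anything seen so far."""

    def __init__(self) -> None:
        self.last_seq = -1

    def accept(self, datagram: bytes) -> Optional[bytes]:
        """Return the payload if the datagram is fresh, else None."""
        if len(datagram) < HEADER.size:
            return None  # too short to carry a header
        (seq,) = HEADER.unpack_from(datagram)
        if seq <= self.last_seq:
            return None  # late or duplicate: drop it instead of moving backwards
        self.last_seq = seq
        return datagram[HEADER.size:]


def receive_forever(port: int, on_update: Callable[[bytes], None]) -> None:
    """Blocking receive loop; lost datagrams are never waited for."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    state = LatestOnly()
    while True:
        datagram, _addr = sock.recvfrom(2048)
        payload = state.accept(datagram)
        if payload is not None:
            on_update(payload)
```

Because nothing is retransmitted, a lost position report simply never arrives and the next one moves the pointer forward, which is the trade-off the quote argues for.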

Pat Helland: Unlike the olden days, we continue to get more and more memory, persistent storage (SSDs and NVMe), network bandwidth, and CPU. Hard disk drive (HDD) storage capacity continues to increase but it is getting colder (e.g., less access bandwidth relative to its capacity)...With all these wonderful improvements, the amount of time waiting to get to something else has become the bottleneck. Latency is the design point.

Benedict Evans: Newspaper revenue really started to collapse well over a decade ago, and we've been discussing what to do about it for almost as long.

anchochilis: I work on a 3-person DevOps team that just finished migrating ~20 services from GCE vms running docker-compose to GKE. It's taken us a little over a year. Partly because K8s has a steep learning curve, but also because safely transitioning services without disrupting product teams adds a lot of overhead. The investment is already yielding great returns. Developers are happy. Actual quote: "Kubernetes is the biggest quality-of-life improvement I've experienced in my career."...1. Reliable rolling deployments. 2. Seamless horizontal scale-out. 3. GitOps/ArgoCD.

mynameisash: One morning when I came in and sat down at my desk, all of the old-timers were having coffee and discussing the fiasco. I was very happy to hear all of them talk about how mistakes happen, and the last person to be blamed for such an outage is the poor guy or gal that hit the ENTER button. Rather, blame falls (to various degrees) on: the engineers in their orbit who should be backing them up; the managers helping to onboard them; the chain of command; the entire system that is in place to prevent inappropriate access.

Lyft: Consider fetching only the fields you need and sorting by _doc (if possible) in Elasticsearch Scroll requests while making use of _routing and terminate_after in Elasticsearch Count requests. These simple changes yielded performance improvements that ultimately helped us reduce cluster resources while ensuring we maintain SLAs. Since Elasticsearch performance is largely based on a variety of factors (document size, search operation rate, document structure, index size, etc.), it is recommended you test with tools like JMeter to accurately measure performance and tune to your needs.
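
To make those suggestions concrete, here is a sketch of what the two request shapes might look like, built as plain dictionaries and URL paths rather than through any particular client library. The index name, field names, and routing value are hypothetical. The parameters themselves (`_source`, `sort` by `_doc`, `scroll`, `routing`, `terminate_after`) are standard Elasticsearch options, but check them against the version you run.

```python
import json
from urllib.parse import urlencode

# Hypothetical index name, for illustration only.
INDEX = "rides"


def scroll_request(fields, query, page_size=1000, keep_alive="1m"):
    """First request of a scroll: only the needed fields, sorted by _doc."""
    path = f"/{INDEX}/_search?" + urlencode({"scroll": keep_alive})
    body = {
        "size": page_size,
        "_source": fields,   # fetch only the fields you need
        "sort": ["_doc"],    # index order, avoids scoring and sorting work
        "query": query,
    }
    return "POST", path, body


def count_request(query, routing, terminate_after):
    """Count restricted to one routing value, stopping early on each shard."""
    params = {"routing": routing, "terminate_after": terminate_after}
    path = f"/{INDEX}/_count?" + urlencode(params)
    return "POST", path, {"query": query}


method, path, body = scroll_request(["ride_id", "status"], {"match_all": {}})
print(method, path, json.dumps(body))
```

Note that `terminate_after` makes the count a lower bound once the limit is hit, so it suits "are there at least N?" checks rather than exact totals.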

@simoncrosby: Repeat after me: "store then analyze" is a 2005 mindset that doesn't meet the needs of 2020s data-driven apps. AKA: Big data, data lakes... you're about to drown

@mathowie: September 2000: Me and @ev jumped in my car in SF and drove to the Palo Alto Fry's to buy a $500 Celeron-powered HP home computer that was on sale. We booted it up back in the office, installed Apache, and it started serving up every early *.blogspot.com site that evening.

@Scott_Wiener: MAJOR WIN FOR NET NEUTRALITY! The federal court just rejected the effort by telecom & cable companies to block enforcement of the net neutrality law I authored, #SB822! The court ruled that California has the authority to protect net neutrality. SB 822 can now be enforced!

Werner Vogels: I think one of the tenets up front was don't lock yourself into your architecture, because two or three orders of magnitude of scale and you will have to rethink it. Some of the things we did early on in thinking hard about what an evolvable architecture would be—something that we could build on in the future when we would be adding functionality to S3—were revolutionary. We had never done that before.

Werner Vogels: There's one other thing that I want to point out. One of the big differences between Amazon the Retailer and AWS in terms of technology is that in retail, you can experiment the hell out of things, and if customers don't like it, you can turn it off. In AWS you can't do that. Customers are going to build their businesses on top of you, and you can't just pull the plug on something because you don't like it anymore or think that something else is better.

@giltene: The core thing you are probably wrestling with is how to encourage timely release of referred-from-heap resources that are not part of the GC’ed heap. Triggering GC based on trending and thresholding of those things is usually the answer. E.g. JVMs trigger on a native memory use.

jhurliman: I had the opportunity to go down to JPL and speak with team members about this design decision. The space hardened processors are not fast enough to do real time sensor fusion and flight control, so they were forced to move to the faster snapdragon. This processor will have bit flips on Mars, possibly up to every few minutes. Their solution is to hold two copies of memory and double check operations as much as possible, and if any difference is detected they simply reboot. Ingenuity will start to fall out of the sky, but it can go through a full reboot and come back online in a few hundred milliseconds to continue flying. In the far future where robots are exploring distant planets, our best tech troubleshooting tool is to turn it off and turn it on again.
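
The "two copies, compare, reboot on mismatch" idea can be sketched in a few lines. This is only a toy model of the pattern described in the comment, not JPL's flight software. The names `checked`, `run`, and `MismatchDetected` are invented for the example, and "reboot" here just means starting over from a clean copy of the initial state.

```python
class MismatchDetected(Exception):
    """Raised when the redundant copies disagree; the caller should restart."""


def checked(compute, state_a, state_b):
    """Run the same step on two independent copies of state and compare."""
    result_a = compute(state_a)
    result_b = compute(state_b)
    if result_a != result_b:
        raise MismatchDetected(f"{result_a!r} != {result_b!r}")
    return result_a


def run(steps, initial_state, max_restarts=3):
    """Execute steps with double-checking; restart from scratch on mismatch."""
    for _attempt in range(max_restarts + 1):
        # Each attempt starts from fresh, independent copies of the state.
        a, b = dict(initial_state), dict(initial_state)
        try:
            return [checked(step, a, b) for step in steps]
        except MismatchDetected:
            continue  # the "reboot": throw everything away and try again
    raise RuntimeError("too many restarts")
```

The design choice worth noticing is that the code never tries to work out which copy is correct. Detection plus a fast restart is cheaper than correction, provided the restart is quick compared with how fast things go wrong.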

gresrun: 5+ yrs @ Google, Google is my 5th company. Google has all the building blocks for great backend services and front-end development and, if you know where to look and have some experience with them, you can build a rock-solid product in <6mos, also assuming you have a team that can execute and the political will to ship it. Politics/consensus building is where the real roadblocks lie in Google, and presumably other large companies. Trying to make high-level product & technical decisions when you have 10 stakeholders with 3 VPs, all in different orgs, is a serious exercise in patience; months of emails & meetings await you.

@NorminalNews: BREAKING: SpaceX reports that they accidentally uploaded Starship SN10 flight code to Falcon B1059, the F9 booster broke up shortly after the reentry burn as it attempted to transition itself to a bellyflop maneuver.

@greglinden: Dirty secret of cloud computing, lots of inefficiency (most are overprovisioning, lots of idle servers, complexity, switching costs): "subscription mode soon gets soured as the rising monthly bills come in for services nobody knows where and when they are being used"

Bill Joy (1984): These editors tend to last too long - almost a decade for vi now. Ideas aren't advancing very quickly, are they?

TruthWillHurt: Here's mine - We were running on Cloud Foundry, had one DevOps person that mostly dealt with Jenkins, paid for 32-64GB RAM. Decided to move to K8s (Azure AKS), Three months later we have 4-6 DevOps people dealing with networking, cross-az replication, cluster size and autoscaling, And we're paying thousands of $$$ pm for a minimum of 6 64GB VMs. FAIL. Corporate decided to stop trying to compete with cloud vendors and shut down our in-house Cloud Foundry hosting. Also Microsoft sales folks worked client decision makers pretty hard.

Randolph Nesse: Why so many false alarms? An optimal system generates many false alarms. When information is limited the cost of defense is less than the cost of no defense.

Brian Bailey: Perhaps the biggest change is that we need to start teaching a new generation of software engineers who are not constrained by the notions of single-threaded execution, by the notion of a single, contiguous, almost limitless amount of memory, and who accept that what they do consumes power and that waste is expensive. Today, indirectly, software engineers are responsible for about 10% of worldwide power consumption, and that number is rapidly rising. It has to stop.

Benedict Evans: Part of the promise of the internet is that you can take things that only worked in big cities and scale them everywhere. In the off-line world, you could never take that unique store in London or Milan and scale it nationally or globally - you couldn’t get the staff, and there wasn’t the density of the right kind of customer (and that’s setting aside the problem that scaling it might make people less interested anyway).

Brent Ozar: Oracle – massively expensive. Microsoft SQL Server – pretty doggone expensive. AWS RDS Postgres and Aurora – inexpensive to mildly expensive. Postgres – free to inexpensive, depending on support

Wayfair: From our testing, it’s clear that the geographic latency impact of switching to Google Cloud Spanner would be significant, especially when compared to similar timings from on-prem SQL Server infrastructure. In Spanner’s best case (nam6 with client in us-central1), read and write timings are double that of SQL Server and in its worst case (nam-eur-asia1 with client in europe-west3), latency is up to 15 times greater.

@dustinmoris: The more I work with @GCPcloud and @Azure at the same time - doing pretty much the same stuff across both clouds for different projects/work - the more I'm astonished by how much better GCP is than Azure. It's on so many levels better, that it's even hard to explain.

Jonathan Brooks: You’re never too old, never too experienced, and never too practiced at what you do to learn. Even though I’ve been doing this for centuries, I’ve discovered that there is always something more to learn–you just need to know where to look. So, keep learning, and someday you might be as good or better than I am.

Stef Shrader: Farmers would rather not deal with this black market at all, so they've become some of the loudest voices in the fight to enshrine a formal right to repair act that would guarantee access to the tools and diagnostic systems necessary to fix their own stuff. At least 20 states including farm-heavy Nebraska have introduced right to repair legislation, per Freethink.

M.G. Siegler: As with many of the things Amazon bakes into Prime, Apple is starting to understand the value of creating the illusion of value.

@cpswan: I asked somebody at GCP about this a little while ago. Seems that egress pricing is viewed as a digital moat keeping data on their services (whoever 'they' are).

@SeanMcTex: As a 50 year old human doing software development for a living, I wonder about age's effects on what is sometimes seen as a young person's game. I feel like I've continued to get better at it over the years; good to see research supporting this:

@changeinside: As an early customer of cloudcheckr during a period of fast cloud expansion I can confirm - a great tool, but there’s a point where 0.5% of annual spend is better invested in staff to reduce waste than a tool to track it

@slava_pestov: If you’re a senior engineer, it’s important to understand that nobody wants to “finish up” your hacked up, half-assed “prototype implementation” of anything. Either do the job properly, or let someone else tackle it

@qhardy: Google Cloud Revenue up 47% year on year Backlog $30 billion, up from $19 billion in the previous quarter Deals over $250 million up 3x Multicloud, real-time analytics, meaningful applied Machine Learning - stuff the others don't have - will continue to differentiate.

Chris Fields: We suggest a developmental explanation for this evolutionary phenomenon: obligate gametic reproduction is the result of germline stem cells winning a winner-take-all competition with non-germline stem cells for control of reproduction and hence lineage survival. We develop this suggestion by extending Hamilton’s rule, which factors the relatedness between parties into the cost/benefit analysis that underpins cooperative behaviors, to include similarity of cellular state. We show how coercive or deceptive cell-cell signaling can be used to make costly cooperative behaviors appear less costly to the cooperating party. We then show how competition between stem-cell lineages can render an ancestral combination of vegetative reproduction with facultative sex unstable, with one or the other process driven to extinction. The increased susceptibility to cancer observed in obligately-sexual lineages is, we suggest, a side-effect of deceptive signaling that is exacerbated by the loss of whole-body regenerative abilities.

Paul Vixie: Engineering economics requires that the cost in CPU, memory bandwidth, and memory storage of any new state added for rate limiting be insignificant compared with an attacker's effort.
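One way to read that constraint: the limiter's state should be bounded no matter how many source addresses an attacker invents. The following is a minimal sketch under that assumption, not Vixie's design. The class name, the slot count, and the rate and burst values are all hypothetical.

```python
import time
import zlib


class BoundedRateLimiter:
    """Token-bucket limiter whose state is a fixed-size table.

    Clients are hashed into `slots` buckets, so memory use does not grow
    with the number of (possibly spoofed) client identifiers.
    """

    def __init__(self, slots=1024, rate=5.0, burst=10.0):
        self.slots = slots
        self.rate = rate      # tokens added per second (hypothetical value)
        self.burst = burst    # maximum tokens a bucket can hold
        self.tokens = [burst] * slots
        self.stamp = [0.0] * slots

    def allow(self, client, now=None):
        """Return True if a request from `client` may proceed."""
        now = time.monotonic() if now is None else now
        i = zlib.crc32(client.encode()) % self.slots
        elapsed = max(0.0, now - self.stamp[i])
        # Refill the bucket for the time that has passed, up to the burst cap.
        self.tokens[i] = min(self.burst, self.tokens[i] + elapsed * self.rate)
        self.stamp[i] = now
        if self.tokens[i] >= 1.0:
            self.tokens[i] -= 1.0
            return True
        return False
```

The trade-off in this sketch is that clients hashing to the same slot share a budget, which gives up some accuracy in exchange for constant memory and constant work per request.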

hallenworld: RISC-V is a great soft-core for FPGAs. I no longer have to use vendor cores or SDKs for this.

@randybias: So there seems to be a state of affairs where "DevOps" is basically: operators deploy and manage the k8s clusters and maybe some shared app infra services and devs manage the micro services (by team) they deploy on top. Everyone uses the same tools to see the deployment.

Jack Dangermond: I went to design school, first environmental design school and then landscape architecture and then city planning. And in that progression, I came to understand very clearly the idea of problem-solving, because that's what design really is about, you see a problem and you come up, creatively, with something that solves the problem.

Viviane Callier: But why would metamorphosis be better than having two specialized proteins? The scientists theorize in their paper about a couple of linked possibilities. If a single protein can do double duty, it spares the cell from transcribing, translating and maintaining more than one gene. But the more compelling advantage may be that the protein’s ability to transform may give the body a more dynamic way to control its defenses against bacteria.

Project Zero: This blog post discussed three improvements in iOS 14 affecting iMessage security: the BlastDoor service, resliding of the shared cache, and exponential throttling. Overall, these changes are probably very close to the best that could’ve been done given the need for backwards compatibility, and they should have a significant impact on the security of iMessage and the platform as a whole. It’s great to see Apple putting aside the resources for these kinds of large refactorings to improve end users’ security. Furthermore, these changes also highlight the value of offensive security work: not just single bugs were fixed, but instead structural improvements were made based on insights gained from exploit development work.
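The idea behind exponential throttling can be sketched in a few lines: each consecutive crash doubles the wait before the next restart, so an exploit that needs many attempts becomes slow. This is an illustration only. The function names, the one-second base, and the cap are assumptions, not Apple's actual values.

```python
def restart_delay(crash_count, base=1.0, cap=1200.0):
    """Seconds to wait before restarting after `crash_count` consecutive crashes.

    The delay doubles with every crash and is clamped to `cap`.
    `base` and `cap` are hypothetical values chosen for illustration.
    """
    if crash_count <= 0:
        return 0.0
    return min(cap, base * 2 ** (crash_count - 1))


def total_time_for_attempts(attempts, base=1.0, cap=1200.0):
    """Total seconds an attacker spends waiting across `attempts` crashes."""
    return sum(restart_delay(n, base, cap) for n in range(1, attempts + 1))
```

With these assumed parameters, the wait before each restart quickly dominates the cost of any attack that relies on crashing and retrying many times.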

suborbital/atmo: Building web services should be simple. Atmo makes it easy to create a powerful server application without needing to worry about scalability, infrastructure, or complex networking. adlrocha: The basic idea is that if every peer in a decentralized network includes a common runtime, and all functions and data are uniquely identified in the network, you can run anything, anywhere. And the fact that content-addressed networks give a CDN-by-default capability would allow an IPFS-based Atmo to scale seamlessly as long as there is a peer with available resources to run your bundle. This would enable a global serverless infrastructure and a seamless developer experience (no more worrying about what cloud provider to choose). Also @adlrocha - Building a scalable monolith



CondensationDB/Condensation: a zero-trust distributed database that ensures data ownership and data security. Inspired by the blockchain system, the email system, and git versioning, Condensation's architecture is a unique solution to develop scalable and modern applications, excelling at synchronization.



AbstractMachinesLab/lam (article): a lightweight, universal actor-model vm for writing scalable and reliable applications that run natively and on WebAssembly. It is inspired by Erlang and Lua, and it is compatible with the Erlang VM.



bastion-rs/bastion: a highly-available, fault-tolerant runtime system with a dynamic, dispatch-oriented, lightweight process model. It supplies actor-model-like concurrency with a lightweight process implementation and utilizes all of the system resources efficiently while guaranteeing at-most-once message delivery.



An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers. We present an in-depth data-driven analysis on the correlated failures in the SSD-based data centers at Alibaba. We study nearly one million SSDs of 11 drive models based on a dataset of SMART logs, trouble tickets, physical locations, and applications. A non-negligible fraction of SSD failures belong to intra-node and intra-rack failures (12.9% and 18.3% in our dataset, respectively). Also, the intra-node and intra-rack failure group size can exceed the tolerable limit of some typical redundancy protection schemes. The likelihood of having an additional intra-node (intra-rack) failure in an intra-node (intra-rack) failure group depends on the already existing intra-node (intra-rack) failures. The relative percentages of intra-node and intra-rack failures vary across drive models. Putting too many SSDs from the same drive model in the same nodes (racks) leads to a high percentage of intra-node (intra-rack) failures. Also, the AFR and environmental factors (e.g., temperature) affect the relative percentages of intra-node and intra-rack failures. MLC SSDs with higher densities generally have lower relative percentages of intra-node and intra-rack failures. The relative percentages of intra-node and intra-rack failures increase with age. The SMART attributes have limited correlations with intra-node and intra-rack failures. Write-dominant workloads lead to more SSD failures overall, but are not the only impacting factor on the AFRs. Erasure coding shows higher reliability than replication based on the failure patterns in our dataset. Redundancy schemes that are sufficient for tolerating independent failures may be insufficient for tolerating the correlated failures as shown in our dataset.
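To make the "tolerable limit" point concrete, here is a small sketch that compares a correlated failure group size against what a few common redundancy schemes can absorb. It is not the paper's analysis. The scheme list and names are illustrative, and it assumes the worst case where every failed drive in the group holds a chunk of the same stripe or replica set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    data_units: int     # chunks holding data
    parity_units: int   # extra chunks: parity, or additional replicas

    @property
    def tolerated_failures(self):
        # An MDS code or replica set survives losing this many chunks.
        return self.parity_units

    @property
    def storage_overhead(self):
        # Raw bytes stored per byte of user data.
        return (self.data_units + self.parity_units) / self.data_units


# Illustrative schemes only; not taken from the paper.
SCHEMES = [
    Scheme("RAID-5 (4+1)", 4, 1),
    Scheme("RAID-6 (4+2)", 4, 2),
    Scheme("3-way replication", 1, 2),
    Scheme("RS(6,3)", 6, 3),
]


def survivors(group_size, schemes=SCHEMES):
    """Names of schemes that survive `group_size` failures in one stripe."""
    return [s.name for s in schemes if s.tolerated_failures >= group_size]
```

Under these assumptions, a correlated group of three failures defeats everything in the list except the RS(6,3) code, which also stores 1.5 bytes per byte of data compared with 3 for 3-way replication.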



virtualagc/virtualagc (video): the previously lost Apollo 10 LM software, as flown (also known as Luminary 69 Rev 2)

Foundational distributed systems papers: here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.)



A network analysis on cloud gaming: Stadia, GeForce Now and PSNow: We find that GeForce Now and Stadia use the RTP protocol to stream the multimedia content, with the latter relying on the standard WebRTC APIs. They are bandwidth-hungry and consume up to 45 Mbit/s, depending on the network and video quality. PS Now instead uses only undocumented protocols and never exceeds 13 Mbit/s.



Cloud Native Transformation: How do you serve your customers faster, better, smarter? That’s an easy one: with Cloud Native technology, culture, and strategy. But how do you get started in moving your organisation toward the cloud? That’s not so easy—the choices are many, risk shadows every decision, and the complexity of the whole thing grows as you move forward.



HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale: In this paper, we argue for HHVM’s Jump-Start approach, describe it in detail, and present steady-state optimizations built on top of it. Running the Facebook website, we demonstrate that Jump-Start effectively solves the warmup problem in HHVM, reducing the server capacity loss during warmup by 54.9%, while also improving steady-state performance by 5.4%.



FirePlace: Placing Firecracker virtual machines with hindsight imitation: We see that in production traffic from Amazon Web Services (AWS), µVM resource use is spiky and short lived, and that forecasting algorithms are not useful. We evaluate Reinforcement Learning (RL) approaches for this task, but find that off-the-shelf RL algorithms are not always performant. We present a forecasting-free algorithm, called FirePlace, that learns the placement decision using a variant of hindsight optimization, which we call hindsight imitation. We evaluate our approach using a production traffic trace of µVM usage from AWS Lambda. FirePlace improves upon baseline algorithms by 10% when placing 100K µVMs.



Silent Data Corruptions at Scale: We [Facebook] provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs detected for these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.



Apple Platform Security: This documentation provides details about how security technology and features are implemented within Apple platforms. It also helps organizations combine Apple platform security technology and features with their own policies and procedures to meet their specific security needs.



Algorithms: This web page contains a free electronic version of my self-published textbook Algorithms, along with other lecture notes I have written for various theoretical computer science classes at the University of Illinois, Urbana-Champaign since 1998.



Scalable Statistical Root Cause Analysis on App Telemetry: In this paper, we propose Minesweeper, a technique for RCA that moves towards automatically identifying the root cause of bugs from their symptoms. The method is based on two key aspects: (i) a scalable algorithm to efficiently mine patterns from telemetric information that is collected along with the reports, and (ii) statistical notions of precision and recall of patterns that help point towards root causes. We evaluate Minesweeper on its scalability and effectiveness in finding root causes from symptoms on real world bug and crash reports from Facebook's apps. Our evaluation demonstrates that Minesweeper can perform RCA for tens of thousands of reports in less than 3 minutes, and is more than 85% accurate in identifying the root cause of regressions.
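A rough sketch of what precision and recall of a pattern could mean here, assuming each report is a set of telemetry features and reports are split into a group showing the bug and a control group. This is one plausible reading of the abstract, not Minesweeper's actual algorithm. The function names and feature strings are hypothetical.

```python
def matches(pattern, report):
    """A pattern matches a report when all its features appear in the report."""
    return pattern <= report


def precision_recall(pattern, bug_reports, control_reports):
    """Precision and recall of `pattern` for separating bug reports from controls.

    precision: of all reports the pattern matches, the fraction that are bug reports.
    recall:    of all bug reports, the fraction the pattern matches.
    """
    hit_bug = sum(matches(pattern, r) for r in bug_reports)
    hit_ctrl = sum(matches(pattern, r) for r in control_reports)
    matched = hit_bug + hit_ctrl
    precision = hit_bug / matched if matched else 0.0
    recall = hit_bug / len(bug_reports) if bug_reports else 0.0
    return precision, recall


# Hypothetical telemetry: each report is a set of "key=value" features.
bug = [
    {"os=14", "screen=feed"},
    {"os=14", "screen=chat"},
    {"os=13", "screen=feed"},
    {"os=14", "screen=feed"},
]
control = [
    {"os=13", "screen=feed"},
    {"os=14", "screen=chat"},
    {"os=13", "screen=chat"},
    {"os=13", "screen=feed"},
]
```

In this toy data, the narrow pattern `{"os=14", "screen=feed"}` never matches a control report but covers only half of the bug reports, while the broader `{"os=14"}` covers more bug reports at the cost of also matching a control report. Ranking patterns means weighing that trade-off.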



Reading and Writing the Morphogenetic Code: We focus on the morphogenetic code: the mechanisms and information structures by which cellular networks internally represent the target morphology, and compute the cell activities needed at each time point to bring the body closer to that morphology.

