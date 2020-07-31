Hey, it's HighScalability time!





Serverless is really complex. Or is it? @paulbiggar sparked a thoughtful Twitter thread.

@tsimonite: Larry Page in 2004: "We want to get you out of Google and to the right place as fast as possible" Google in 2020: Google is the right place, no outward links for you.

@TwitterSupport: To recap: 130 total accounts targeted by attackers. 45 accounts had Tweets sent by attackers. 36 accounts had the DM inbox accessed. 8 accounts had an archive of “Your Twitter Data” downloaded, none of these are Verified.

Charles Fitzgerald: Hyperclown status is best thought of as the ratio of cloud rhetoric to CAPEX spending. The most efficient hyperclowns maximize their marketing while minimizing actual spend.

Dylan: They said, well, you’ve been in the App Store for eight years and you haven’t contributed anything. That is like, wow, this is a couple days before they’re about to have their really feel-good WWDC where they tell all the developers how much we appreciate you. No, we have not contributed to App Store revenue. However, Apple sells $1000 phones. And these $1000 phones they can only sell because of this entire ecosystem so to say that we’ve added no value to their system, that’s offensive. It was a situation where both of us benefitted.

DSHR: So even long before 2017 no-one could have gained access to Google's systems by phishing an employee. Twitter's board should fire Jack Dorsey for the fact that, after at least seven years of insider attacks, their systems were still vulnerable.

@Steve_Yegge: Using Google APIs is like a Choose Your Own Adventure where *almost* every path continues for 2 years before it is deprecated and dies. I will never EVER use Google Cloud Endpoints again, for instance, nor App Engine Java. It's insane how they neglect their docs and libraries.

@alistairmbarr: Some Googlers pushed back against adding a 4th ad to the top of search results because it was such low quality next to the first organic result. Google went ahead anyway because it was under pressure to meet Wall St expectations

Jared Short: When building our application using serverless and cloud-native services, the building blocks we use have greater shared context among individuals; access to a richer vocabulary of pieces. A byproduct of having discrete components means our diagrams communicate intention and purpose with more clarity and less abstraction...What many of us are saying and I think we are finally figuring out how to, is that serverless helps communicate business complexities in a standardized way. Clear purpose components and services help reduce abstraction, and make it easier to understand what you are trying to build. In addition, you can usually understand where your system will have distinct types of issues (scalability, reliability, security) and architect specifically to address those areas without needing to increase the complexity of all the other parts of the system.

@lhochstein: I bet "received input in a form never imagined by the developer" is a more common failure mode for distributed systems than "race conditions".

Ammar: Linus Tech Tips (a technology channel) was the channel that produced most trending [YouTube] videos in 2019. It is a technology (not entertainment nor music) and personal (not corporate) channel. More than that, the 2nd and 3rd channels after Linus Tech Tips were cooking channels.

@tmclaughbos: And I can tell you no team has just switched from containers to serverless and magically gotten faster. They probably got slower.

pm edited: My experience with [K8s] has been extremely negative. Have been working with it for the past few years in three companies I consulted for. In one instance, they replaced two REST APIs running on EC2 with handrolled redundancy and zero downtime automated deployments using only AWS CLI and a couple of tiny shellscripts with a kubernetes cluster. Server costs jumped 2000% (not a typo) and downtimes became a usual thing for all sorts of unknown reasons that often took hours for the “expert” proponents of kubernetes to figure out.

Gerrit De Vynck: On smartphones, the change has been more pronounced. From June 2016 to June 2019, the proportion of mobile searches that led to clicks on free web links dropped to 27% from 40%. No-click searches, which Fishkin says suggests the user found the information they needed on Google, rose to 62% from 56%. Meanwhile, clicks on ads more than tripled, Jumpshot data show. “This has been the slowest but most consistent march in tech,” venture capitalist Bill Gurley wrote on Twitter last year. “If you are still holding out hope for a SEO strategy you must be intentionally ignoring all of the data in front of you,” he added, referring to search engine optimization, a popular way of improving websites to rank higher in Google’s free results.

@schneems: If you provide an API client that doesn't include rate limiting, you don't really have an API client. You've got an exception generator with a remote timer.
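One common way to build rate limiting into a client is a token bucket. The sketch below is only an illustration of that idea, not something from the tweet: the `TokenBucket` class, its parameters, and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Client-side limiter: allows short bursts, then a steady request rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False if it should wait."""
        now = self.clock()
        # Refill based on elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock: 2 requests per second, bursts of up to 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]
print(burst)
```

A real client would call `try_acquire()` before each request and sleep or queue when it returns False, rather than letting the server answer with an error.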

@eastdakota: Details on how we caused a 23 min outage for ~50% of @Cloudflare's network today. The root cause was a typo in a router configuration on our private backbone. We've applied safeguards to ensure a mistake like this will not cause problems in the future.

@QuinnyPig: Lock-in: people, data gravity. The second form of lock-in that gets overlooked is whatever your provider is using for Identity and Access Management (IAM).

hibikir: Having worked at some of those companies, along with small startups, the biggest difference isn't really plain old sophistication, but expenses so high as to make the engineering effort worthwhile. If your production system is 10 instances on some random cloud, a 10% efficiency savings saves me 1 instance, so maybe $2k a year. Taking into account opportunity costs vs doing things that raise revenue, said startup would consider the effort a waste of time unless it took a few hours. With the same architecture, but instead 20K instances, then suddenly that 10% is saving 2000 instances, and 4 million a year. Unless there's a major engineering shortage, chances are that spending over a month on 4 million in yearly savings will be completely justified, and would even be a highlight in some SRE's review.
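The arithmetic in that comment can be checked with a few lines. This is a minimal sketch that assumes a flat cost of about $2,000 per instance per year, the figure the quote implies; the function name is made up for illustration.

```python
# Assumption: roughly $2,000 per instance per year, as implied by the quote.
COST_PER_INSTANCE_PER_YEAR = 2_000


def yearly_savings(instances, gain_percent):
    """Dollars saved per year when an efficiency gain removes part of the fleet."""
    instances_saved = instances * gain_percent / 100
    return instances_saved * COST_PER_INSTANCE_PER_YEAR


startup = yearly_savings(10, 10)         # 10 instances, 10% gain
big_fleet = yearly_savings(20_000, 10)   # 20K instances, same 10% gain

print(f"startup:   ${startup:,.0f} per year")
print(f"big fleet: ${big_fleet:,.0f} per year")
```

The same 10% gain is worth 2,000 times more on the large fleet, which is why a month of engineering effort pays off there and not at the startup.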

Gartner: Adopt a tactical requirement approach to your CDN selection. Choose a CDN provider that will support all your Tier 1 and at least 75% to 85% of your Tier 2 requirements, and use location, service differentiation and support as key considerations rather than price.

rdkls: All these experts saying don't lift and shift. We did, it worked well. Afterwards we continue to update and optimize. I headed up cost op initiative with each team, strong management support, our bill continues to drop. You need strong direction, clear timelines and diplomacy and everyone aligned (not hard when cloud is clearly the future). Until we had that (new CTO) the initiative floundered. We had a big countdown on one TV till DC exit.

Slack: As a result of the pandemic, we’ve been running significantly higher numbers of instances in the webapp tier than we were in the long-ago days of February 2020. We autoscale quickly when workers become saturated, as happened here — but workers were waiting much longer for some database requests to complete, leading to higher utilization. We increased our instance count by 75% during the incident, ending with the highest number of webapp hosts that we’ve ever run to date.

Corey Quinn: A majority of spend across the board is and always has been the direct cost of EC2 instances. The next four are, in order, RDS, Elastic Block Store (an indirect EC2 cost), S3, and data transfer (yup, another EC2 cost). That's right. The cloud really is a bunch of other people's virtualized computers being sold to you. The rest of it is largely window dressing.

Andy Cockburn: Given the high proportion of computer science journals that accept papers using dichotomous interpretations of p, it seems unreasonable to believe that computer science research is immune to the problems that have contributed to a replication crisis in other disciplines. Next, we review proposals from other disciplines on how to ease the replication crisis, focusing first on changes to the way in which experimental data is analyzed, and second on proposals for improving openness and transparency.

@mcnees: "The whole matter of the world may have been present at the beginning, but the story it has to tell may be written step by step." — Georges Lemaître

@abbyfuller: Software engineering doesn’t need to be some crazy thing where you have to work nights and weekends and make it your sole hobby and love to code! It’s ok to do it as “just” a job. It’s a job I happen to like, but it’s still a job.

@mipsytipsy: the key is to emit only a single arbitrarily-wide event per request per service, and pack data in densely (typically 300-400 dimensions per event for a mature service). and then at very high throughput, some form of intelligent sampling strategy should kick in
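As a rough illustration of the pattern in that tweet, the sketch below accumulates fields on one event per request and emits it once at the end. The `WideEvent` class, the `emit` function, and the keep-all-errors sampling rule are assumptions made for this example, not any vendor's API.

```python
import json
import random


class WideEvent:
    """Accumulates every field for one request in one service."""

    def __init__(self, service, request_id):
        self.fields = {"service": service, "request_id": request_id}

    def add(self, **fields):
        # Called from anywhere in the request path to attach more context.
        self.fields.update(fields)


def emit(event, sample_rate, rng, sink):
    """Emit the event once. Errors are always kept; others about 1 in sample_rate."""
    is_error = event.fields.get("status", 200) >= 500
    rate = 1 if is_error else sample_rate
    if rng.randrange(rate) != 0:
        return False
    # Record the rate so counts can be re-weighted at query time.
    record = dict(event.fields, sample_rate=rate)
    sink.append(json.dumps(record))
    return True


sink = []
event = WideEvent("checkout", "req-123")
event.add(user_id=42, cart_items=3)
event.add(db_ms=18.5, cache_hit=False)
event.add(status=500)
emit(event, sample_rate=20, rng=random.Random(0), sink=sink)
print(sink[0])
```

Storing `sample_rate` on each record is what lets a query multiply sampled events back up to an estimate of real traffic.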

AEROSPACE SAFETY ADVISORY PANEL: SpaceX and Boeing have very different philosophies in terms of how they develop hardware. SpaceX focuses on rapidly iterating through a build-test-learn approach that drives modifications toward design maturity. Boeing utilizes a well-established systems engineering methodology targeted at an initial investment in engineering studies and analysis to mature the system design prior to building and testing the hardware.

Stuff Made Here: Using your tools to build new tools is one of life's great pleasures.

@NinjaEconomics: "Our [Facebook] algorithms exploit the human brain's attraction to divisiveness."

dougmwne: In 13 years at 5 companies, I have literally never had the experience you're describing. Workplaces were always mainly adversarial and extractive with a thin veneer of "we're a family" and "I work with the most amazing people" BS spread on top. I think my coworkers have all generally been fine and competent people, it's just that the workplace is a hostile environment where everyone is either trying to swim with the big fish or at least keep from drowning. My parents are retired now, but describe the work place in your terms of real human connection. I can't help but feel that world is gone. One of their older friends was shocked that I worked from home and couldn't understand why I'd want to separate myself from people like that and forgo all the friendships. But again, I think that world is mostly gone, chewed up by toxic business culture which is why so many people would rather sit at home than deal with it.

Shourya Pratap Singh: While On-Demand sounds good, it can get pretty costly. Recently, we changed some code that required a particular table to be accessed more often, so we switched to On-Demand from provisioned to monitor the capacities, and it cost us almost 2x! We did this to decide the future capacity but the costs were pretty well visible in just 3 days of making the switch. Also, it is to be noted that on-demand to provisioned capacity mode conversion is allowed only once per day. As a general rule of thumb, avoid on-demand as most of the use-cases have a predictable load.

Stephen O'Grady: But it seems equally plausible that Anthos and BigQuery are merely the first manifestation of a fundamentally new approach for Google. For years, would be challengers, armed only with largely similar offerings, dutifully charged up the hill that was an incumbent AWS operating at velocity in its core market. Most were cut down. Google, in particular, never seemed to benefit from this approach. More recently, however, there have been signs of creativity.

cs702: The main point of this article is that Robinhood has brought Silicon Valley-style maximization of user engagement to retail stock-market trading without regard for the psychological, social, and financial consequences to the people who use the service.

Rule11: Sometimes we have to remember the cost of the network is telling us something—just because we can do a thing doesn’t mean we should. If the cost of the network forces us to consider the tradeoffs, that’s a good thing. And remember that if your toaster makes your bread at the same time every morning, you have to adjust to the machine’s schedule, rather than the machine adjusting to yours…

crazygringo: Unfortunately, the author/article seems to completely miss the meaning of "joins don't scale". Obviously indexed joins on a single database server scale just fine, that's the entire selling point of an RDBMS! The meaning of "joins don't scale" is that they don't scale across database servers, when your dataset is too big to fit in a single database instance. Joins scale across rows, they don't scale across servers. Now a lot of people don't realize how insanely powerful single-server DB's can be. A lot of people that assume they need to architect for a multi-server DB don't realize they can get away with a hugely-provisioned SSD single-server indexed database with a backup, with performant queries.

jonathantn: The bottom line is that thinking in "sets of data" instead of "individual records" will pay off handsomely on the operational cost of such a data ingestion pipeline.

The Orbital Index: Al Amal also carries a high resolution imager, capable of 12 megapixel monochrome images (with discrete RGB filters) at 180 fps, creating an opportunity for the first 4K video from another planet—just not in anywhere near real-time since the probe sports 250 kbps - 1.6 Mbps of bandwidth depending on distance to Earth.

Julia Ebert: Positive feedback. Longer intervals between observation gave consistently higher accuracy. Taking observations more frequently got samples quicker but since they were so spatially correlated it meant the overall decisions were less accurate because they didn't represent the true environment.

@mipsytipsy: if you can solve your app problem with a monolith, do that. if you can solve your architecture problem with a LAMP stack, do that. if you can debug your problem with printf to stdout, do that. just watch out for the day when the solution reaches its edge & becomes the new problem.

martintrapp: AD does automatic differentiation, PPLs transform a generative model into some suitable form to perform automatic Bayesian inference, e.g. by using AD and black-box variational inference. Or said differently, in a PPL you specify the forward simulation of a generative process and the PPL helps to automatically invert this process using black-box algorithms and suitable transformations. Without a PPL, you would traditionally write your code for your model and would have to implement a suitable inference algorithm yourself. With a PPL you only specify the generative process and don't have to implement the inference side of things nor care about an implementation of your model that is suitable for inference.

Samuel Greengard: The value of neuromorphic systems is they perform on-chip processing asynchronously. Just as the human brain uses the specific neurons and synapses it needs to perform any given task at maximum efficiency, these chips use event-driven processing models to address complex computing problems. The resulting spiking neural network—so called because it encodes data in a temporal domain known as a "spike train"—differs from deep learning networks on GPUs. Existing deep learning methods rely on a more basic brain model for handling tasks, and they must be trained in a different way than neuromorphic chips.

@GemmaBlackUK: I "love" serverless. But infrastructure complexity, plus setting up IAM policies of least privilege against each resource is why I have decided instead to go to containers for the majority of my backend apps. If it can be abstracted away, I think I'd be tempted to revisit it.

Gerry McGovern: In 1994, there were 3,000 websites. In 2019, there were estimated to be 1.7 billion, almost one website for every three people on the planet. Not only has the number of websites exploded, the weight of each page has also skyrocketed. Between 2003 and 2019, the average webpage weight grew from about 100 KB to about 4 MB. The results? “In our analysis of 5.2 million pages,” Brian Dean reported for Backlinko in October 2019, “the average time it takes to fully load a webpage is 10.3 seconds on desktop and 27.3 seconds on mobile.” In 2013, Radware calculated that the average load time for a webpage on mobile was 4.3 seconds.

Tyler Treat: Luckily, there is a better solution—one that fits our serverless model and enables us to control external traffic while allowing App Engine services to securely communicate internally. IAP supports context-aware access, which allows enforcing granular access controls for web applications, VMs, and GCP APIs based on an end-user’s identity and request context. Essentially, context-aware access brings a richer zero-trust model to App Engine and other GCP services.

@benthompson: Notable to see Apple confirming a point I’ve been trying to make: the company believes it is entitled to *all* commerce that happens on an iPhone.

throwaway_aws: "An Amazon spokesman said the company doesn’t use confidential information that companies share with it to build competing products" Maybe...but in the past, AWS proactively looked at traction of products hosted on its platform, built competing products, and then scraped & targeted customer list of those hosted products. In fact, I was on a team in AWS that did exactly that. Why wouldn't their investing arm do the same?

John Hagel: That sets the stage for a new way of organizing the gig economy. We’re going to begin to see impact groups forming and coming together into broader networks that will help them to learn even faster. That’s where guilds come in.

Bryon Moyer: Nakamura started by reminding us that 5G went live just this last March and that it will be the dominant technology through the 2020s. He said that 6G is a technology for the 2030s...6G then will focus on solving social issues and a closer fusing of the physical and the cyber-worlds, enabled by an expanded set of higher-bandwidth communications options and by more sophisticated fusion between the physical and cyber realms.

Bruce Schneier: But inefficiency is essential security, as the COVID-19 pandemic is teaching us. All of the overcapacity that has been squeezed out of our healthcare system; we now wish we had it. All of the redundancy in our food production that has been consolidated away; we want that, too. We need our old, local supply chains -- not the single global ones that are so fragile in this crisis. And we want our local restaurants and businesses to survive, not just the national chains. We have lost much inefficiency to the market in the past few decades. Investors have become very good at noticing any fat in every system and swooping down to monetize those redundant assets. The winner-take-all mentality that has permeated so many industries squeezes any inefficiencies out of the system.

Frederic Lardinois: Indeed, Workers Unbound, Cloudflare argues, is now significantly more affordable than similar offerings. “For the same workload, Cloudflare Workers Unbound can be 75 percent less expensive than AWS Lambda, 24 percent less expensive than Microsoft Azure Functions, and 52 percent less expensive than Google Cloud Functions,”

Slack: The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.

@johncutlefish: “Scaling” is very different from breaking things up and making them functional “at scale”. Which is why big lumbering orgs can’t just copy the structure of rapidly scaled/scaling new ventures and expect it to work.

@ShaiDardashti: "AMZN has invested $18B over the past 10 years to turn every major cost into a source of revenue." - @chamath on $AMZN (2016);

Tsuyoshi Hirashima: Cells are tightly connected and packed together, so when one starts contracting from ERK activation, it pulls in its neighbors. This then caused surrounding cells to extend, activating their ERK, resulting in contractions that lead to a kind of tug-of-war propagating into colony movement

Necessary-Space: Scaling a web service is trivial if you just have one database instance: Write in a compiled fast language, not a slow interpreted language. Bump up the hardware specs on your servers. Distribute your app servers if necessary and make them utilize some form of in-memory LRU cache to avoid pressuring the database. Move complicated computations away from your central database instance and into your scaled out application servers. A single application server on a beefed up machine should have no problem handling > 10k concurrent connections.

snoob2015: People usually jump into redis when it comes to cache. IMO, if the traffic is fit inside a server, just use a caching library (in-process cache?) in your app ( for example, I use java caffeine). It doesn't add another network hop, no serialization cost, easier to fine tune. I added caffeine into my site and the cpu goes from 50% to 1% in no time, never have another perf problem since then.

throwawaymoney666: I've watched our Java back-end over its 3 year life. It peaks over 4000 requests a second at 5% CPU. No caching, 2 instances for HA. No load balancer, DNS round robin. As simple as the day we went live. Spending a bit of extra effort in a "fast" language vs an "easy" one has saved us from enormous complexity. In contrast, I've watched another team and their Rails back-end during a similar timeframe. Talks about switching to TruffleRuby for performance. Recently added a caching layer. Running 10 instances, working on getting avg latency below 100ms. It seems like someone on their team is working on performance 24/7.

Schneier: Class breaks are security vulnerabilities that break not just one system, but an entire class of systems. They might exploit a vulnerability in a particular operating system that allows an attacker to take remote control of every computer that runs on that system's software. Or a vulnerability in internet-enabled digital video recorders and webcams that allows an attacker to recruit those devices into a massive botnet. Or a single vulnerability in the Twitter network that allows an attacker to take over every account.

Sabine Hossenfelder: The brief summary is that if you hear something about a newly proposed theory of everything, do not ask whether the math is right. Because many of the people who work on this are really smart and they know their math and it’s probably right. The question you, and all science journalists who report on such things, should ask is what reason do we have to think that this particular piece of math has anything to do with reality. “Because it’s pretty” is not a scientific answer. And I have never seen a theory of everything that gave a satisfactory scientific answer to this question.

DANIEL OBERHAUS: Autonomy is critical for Perseverance’s mission. The distance between Earth and Mars is so large that it can take a radio signal traveling at the speed of light up to 22 minutes to make a one-way trip. The long delay makes it impossible to control a rover in real time, and waiting nearly an hour for a command to make a round trip between Mars and the Earth isn’t practical either...Most of Perseverance’s navigation algorithms were tested in virtual simulations, where the rover team threw every conceivable scenario at the rover’s software to get an idea of how it would perform in those situations.
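The 22-minute figure follows from dividing distance by the speed of light. The sketch below assumes approximate Earth-Mars distances of about 54.6 million km at closest approach and about 401 million km at the farthest; those distances are rounded values supplied for this example, not numbers from the article.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact by definition

# Assumed approximate Earth-Mars distances, in kilometers.
CLOSEST_KM = 54.6e6
FARTHEST_KM = 401e6


def one_way_delay_minutes(distance_km):
    """Minutes for a radio signal to cover the given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_S / 60


near = one_way_delay_minutes(CLOSEST_KM)
far = one_way_delay_minutes(FARTHEST_KM)

print(f"one-way delay at closest approach: {near:.1f} min")
print(f"one-way delay at farthest:         {far:.1f} min")
print(f"round trip at farthest:            {2 * far:.1f} min")
```

Even at closest approach the delay is measured in minutes, so joystick-style control is never an option.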

cactus2093: The companies I've seen succeed were 100% focused on shipping their product to customers. Not 90% focused on customers and 10% focused on code quality, but 100% focused. They'd rather have to spend 30 engineer-days a few years from now fixing an issue if they get to that point than spend 3 hours getting it right upfront. As an engineer that goes against every instinct I have, it really seems like spending a couple hours upfront must be a better use of time. It seems like it should be possible to spend 10% of your time setting yourself up well for the future, that's still just a rounding error of your time. And then if you do survive another few years, you'll have a huge leg up on other series B or C stage competitors if you're not hindered by a lot of tech debt at that point. But from a capitalist perspective, it's probably not so crazy. If you are working with a $200,000 seed round in the beginning, and an engineer costs $80,000 a year, 3 hours of their time costs $115 which is 0.06% of your funds. And more importantly, that $200k is maybe enough for a year of runway, so 3 hours is 0.14% of the time you have to live given a 40-hour a week year (or 0.07% of an 80 hour a week year). Every bit of that starts to add up. Whereas by the time you're a later-stage company and you've raised, say, $40 million dollars and are paying engineers $150k, 30 engineer-days of work is $17,300 but that's only 0.04% of the money you've raised plus your runway is now approaching infinity if you're close to profitable.
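The percentages in that comment can be reproduced directly. This sketch assumes a 2,080-hour working year (40 hours times 52 weeks) and an 8-hour engineer-day, which are the assumptions that appear to match the quoted numbers; the helper name is made up for illustration.

```python
HOURS_PER_YEAR = 40 * 52       # assumed 2,080-hour working year
HOURS_PER_ENGINEER_DAY = 8     # assumed 8-hour engineer-day


def cost_of(salary_per_year, hours):
    """Salary cost of a given number of engineer hours."""
    return salary_per_year / HOURS_PER_YEAR * hours


# Seed stage: $200k raised, $80k engineer, 3 hours spent getting it right upfront.
seed_cost = cost_of(80_000, 3)
seed_share_of_funds = seed_cost / 200_000 * 100
seed_share_of_runway = 3 / HOURS_PER_YEAR * 100

# Later stage: $40M raised, $150k engineer, 30 engineer-days fixing it later.
late_cost = cost_of(150_000, 30 * HOURS_PER_ENGINEER_DAY)
late_share_of_funds = late_cost / 40_000_000 * 100

print(f"seed:  ${seed_cost:,.0f} = {seed_share_of_funds:.2f}% of funds, "
      f"{seed_share_of_runway:.2f}% of a year of runway")
print(f"later: ${late_cost:,.0f} = {late_share_of_funds:.2f}% of funds")
```

Under these assumptions the 80-times-larger fix costs a smaller share of the later-stage company's money than the 3-hour task costs the seed-stage one, which is the comment's point.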

amdelamar: Eventually you get to enjoy deprecating old services as much as building new ones, simply because you never have to teach others about them again.

LinkedIn: For those considering a similar migration, a key piece of advice we’d like to share is to defer timelines until you are certain about your throughput. Our initial calculations were based on the ideal performance of our databases, and we soon found that real-world differences in hardware, the extended duration of the migration and more, caused multiple revisions to our timelines. In the end, this entire data migration effort took us nearly six months to complete. It’s important to remember that a data migration is a marathon, not a sprint.

Khan Academy: Our iOS and Android apps share a single codebase [using React Native], with engineers specializing in features of the app, rather than platform. This means we’re way better about improving the quality of a given feature over time, and we can make incremental improvements to features, rather than feeling like we need to get everything in the initial version.

Audrius Kucinskas: In our last AWS bill we found an additional $4k (total bill was 8k) increase in what appeared to be NAT Gateway traffic cost. After tracking down recent pull requests, we found one obscure change that essentially added a VPC (which routed all traffic through NAT GW) to a global `serverless.yml` scope in order to access MongoDB. This routed all the traffic of every lambda in that stack through NAT. One of those lambdas was an ML lambda which was downloading and uploading Models to and from S3. Needless to say, paying NAT GW traffic price for S3 traffic is not fun.

Matt Lacey: You've only added two lines - why did that take two days! It might seem a reasonable question, but it makes some terrible assumptions: lines of code = effort; lines of code = value; all lines of code are equal. None of those are true.

cpnielsen: We have been running our own self-hosted kubernetes cluster (using basic cloud building blocks - VMs, network, etc.), starting from 1.8 and have upgraded up to 1.10 (via multiple cluster updates). First off, don’t ever self-host kubernetes unless you have at least one full-time person for a small cluster, and preferably a team. There are so many moving parts, weird errors, missing configurations and that one error/warning/log message you don’t know exactly what means (okay, multiple things). From a “user” (developer) perspective, I like kubernetes if you are willing to commit to the way it does things. Services, deployments, ingress and such work nicely together, and spinning up a new service for a new pod is straightforward and easy to work with. Secrets are so-so, and you likely want to do something else (like Hashicorp Vault) unless you only have very simple apps.

Mark Litwintschik: Many believe that for near-instant analytics on billions of records you'd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings. Some of my fastest benchmarks were run on such environments. But in 2020, an off-the-shelf MacBook Pro using OmniSciDB (formerly MapD) can happily do the job.

Wisdom: Dual-channel memory has measurable performance improvement over single-channel memory. With DDR5 adopting 2x 32-bit channels per DIMM, dual-channel will likely act more like a quad-channel when compared to a dual-channel DDR4 with 1x 64-bit channel per DIMM. I fully expect DDR5 to increase FPS by a noticeable amount. Plus, DDR5-4800 starting point is quite a bit faster than the officially supported DDR4-3200.

Tel Aviv University: We discovered that brain connectivity — namely the efficiency of information transfer through the neural network — does not depend on either the size or structure of any specific brain," says Prof. Assaf. "In other words, the brains of all mammals, from tiny mice through humans to large bulls and dolphins, exhibit equal connectivity, and information travels with the same efficiency within them. We also found that the brain preserves this balance via a special compensation mechanism: when connectivity between the hemispheres is high, connectivity within each hemisphere is relatively low, and vice versa."

alex-petrenko/sample-factory (paper, article): Codebase for high throughput asynchronous reinforcement learning.



GhostDB: a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale. GhostDB is designed to speed up dynamic database or API driven websites by storing data in RAM in order to reduce the number of times an external data source such as a database or API must be read.

The Bit Player: In 1948, Claude Shannon introduced the notion of a "bit", laying the foundation for the Information Age. His ideas power our modern life, influencing computing, genetics, neuroscience and AI. Mixing contemporary interviews, archival film, animation and dialogue from interviews with Shannon, The Bit Player tells the story of an overlooked genius with unwavering curiosity.



Thomas Edison's Stunning Footage of the Klondike Gold Rush. Money was made by the suppliers. Only 30,000 made it to the fields, extracting 75 tons of gold. Only a few made money.

