Stuff The Internet Says On Scalability For December 6th, 2019

Wake up! It's HighScalability time:

Formation of a single massive galaxy through time in the TNG50 cosmic simulation. It traces the simultaneous evolution of thousands of galaxies over 13.8 billion years of cosmic history. It does so with more than 20 billion particles representing dark matter, stars, cosmic gas, magnetic fields, and supermassive black holes. The calculation required 16,000 cores working together, 24/7, for more than a year.

Do you like this sort of Stuff? Your support on Patreon is appreciated more than you can know. I also wrote Explain the Cloud Like I'm 10 for everyone who needs to understand the cloud (which is everyone). On Amazon it has 63 mostly 5 star reviews (140 on Goodreads). Please recommend it. You'll be a real cloud hero.

Number Stuff:

  • $2.9B: of $7.9B of Black Friday spending (up $1.2B from 2018) was spent on smartphones. Large retailers saw a 62% sales boost versus 27% for smaller retailers.
  • 71%: AWS's percentage of Amazon's operating profit.
  • $13B: Google's spend on new datacenters in 2019. The scale of Google's infrastructure is remarkable. The company now has 19 datacenter campuses around the globe: 11 in the United States, five in Europe, two in Asia/Pacific, and one in South America.
  • 63%: of JavaScript developers use React.
  • 430 million: Reddit monthly active users, 30% YoY growth. 199 million posts. 1.7 billion comments. 32 billion upvotes.
  • 4: average lifespan in years of Google products before Google kills them.
  • $50,000,000: yearly revenue California makes by selling our captured DMV information.
  • 62.7%: of the total system energy in mobile devices is spent on data movement.
  • $11 million: Cost to Costco when its website went down for 16 hours on Black Friday.
  • $23 to $500: per contract price to hire a hacker for targeted account hijacking.
  • $4,000,000: saved by NextRoll using batch and spot markets.
  • $8 billion: Worldwide drone sales to agricultural businesses by 2026. Drones do the work of 500 farmers on palm plantations the size of the UK.
  • 1.5 billion: TikTok downloads. It is the third most-downloaded app outside of gaming this year. Numbers one and two are WhatsApp and Messenger, while four and five are the Facebook app and Instagram.
  • 1.4 billion: 5G smartphones by 2022.
  • $9.4 billion: spent on Cyber Monday. A 20% increase from 2018.
  • €12.5 billion: EU's increased space program budget.
  • $545,000: earned by White Hat hackers creating zero-day exploits targeting products from VMware, Microsoft, Google, Apple, D-Link, and Adobe at the 2019 Tianfu Cup hacking competition.
  • 74.6%: improvement in Lambda cold start performance for runtimes like Node.js over the last 16 months.
  • 1%: of app store publishers drive 80% of the total 29.6 billion app downloads in the third quarter of 2019. The bottom 99% (784,080 publishers) averaged approximately 7,650 downloads each, which is less than one-thousandth of a percent of the downloads Facebook generated in the quarter (682 million). When apps were analyzed by revenue, the gap was wider: just 1,526 publishers generated $20.5 billion out of the total $22 billion in revenue in the quarter. There were 3.4 million apps available across the App Store and Google Play in 2018, up 65% from the 2.2 million apps available in 2014.

Quotable Stuff:

  • Memory Guy: So, although DRAM spot prices recently reached a historic low, expect for them to fall further as the ongoing overcapacity plays out.
  • jedberg: The number one cost of any distributed system is data movement. Mobile phones are essentially leaf nodes in a huge distributed system, where they interact with "downstream dependencies", i.e., the servers that provide all the functionality of their apps. So it makes sense that moving data would use the most power.
  • Vladislav Hrčka: The Stantinko botnet operators are adding cryptomining capabilities to the computers under their control, according to ESET researchers.
  • Paul Graham: If I had to put the recipe for genius into one sentence, that might be it: to have a disinterested obsession with something that matters.
  • CockroachDB: So with all these changes, what did we gain? Using Parallel Commits, CockroachDB is now able to commit cross-range transactions in half the time it previously was able to.
  • Andy Jassy (AWS Chief Executive): We’re growing at a meaningfully more significant absolute dollar rate than anybody else, even at a much larger absolute size. If you look carefully at the capabilities that different platforms have and the features with the most capabilities, I think the functionality gap is widening and we’re about 24 months ahead of the next largest provider, and the next largest provider is meaningfully ahead of the third-largest provider...Our goal is to be the infrastructure technology platform underneath all of these enterprises in their transformation strategies and to enable them to be able to invent and to build better customer experience and to help them grow
  • Barbara Liskov: In my version of computational thinking, I imagine an abstract machine with just the data types and operations that I want. If this machine existed, then I could write the program I want. But it doesn’t. Instead I have introduced a bunch of subproblems — the data types and operations — and I need to figure out how to implement them. I do this over and over until I’m working with a real machine or a real programming language. That’s the art of design.
  • @kelseyhightower: The cloud made the hypervisor disappear. Kubernetes will be next.
  • Clara Gaggero Westaway: In every moment of frustration there is a little moment of fuzzy feeling just waiting to be freed. Frustration is opportunity.
  • @landryst: Glenn Gore, Dir., Solutions Architecture, at AWS #reInvent: Latency matters for many applications; 4G's 120ms latency is too high for many applications, such as gaming. The key advantage of 5G is its low latency of sub 20ms in some installations.
  • @AngelaTimofte: Wait what? Giving more memory(power) to your #Lambda you can get your function run faster but the price stays almost the same 😱 Great info from @alex_casalboni chalk talk #reInvent #aws #AWSreinvent2019
  • @Kasparov63: Apple changing its maps inside Russia to make Crimea part of Russia is a huge scandal. Regionalization of facts is unacceptable appeasement.
  • Charles Fitzgerald: The fact is some of the very largest OSS companies have recently lost sales momentum, relevance, valuation, and/or their independence. And judging from the size and shape of the bite marks on their bodies, it looks like the work of the new apex predator, the cloud.
  • @kevins8: #awsTip 18: serverless appreciates in value over time. manage your own server/db, come back in five years, and find that it is many versions behind and in dire need of patches. run lambda/dynamo and find that all patches are taken care of and performance is better
  • @alexbdebrie: I'm still a huge user of APIGW + Lambda b/c I still think it's better overall. My big point is that 80% of Lambda web app users are way overserved by the APIGW offering and would be happier to use a service that is less feature-rich but much cheaper.
  • Denis Bakhvalov: Often changing one line in the program source code can yield 2x performance boost. Performance analysis is all about how to find and fix this line! Missing such opportunities can be a big waste.
  • Yan Cui: Which means if you configure 1 Provisioned Concurrency on a function with 1GB of memory, then you will pay $0.015 per hour for it (rounded up to the next 5 minute block) even if there are no invocations. If you configure 10 Provisioned Concurrency for this function, then you’ll pay $0.15 per hour for them, and so on. Eagle-eyed readers might notice that $0.035 + $0.015 = $0.05 per GB-hour for a fully utilized concurrent execution. Which is $0.01 (16%) cheaper than on-demand concurrency! So a system with high Provisioned Concurrency utilization can also save on Lambda cost too
  • Murat: Blockchains are supposed to solve the trust problem. But blockchains attack only the easy part of the trust problem, and avoid the hard part. The easy part is to store the transactions in a tamper-resistant database. The hard part is to attest to physical world actions and state.
  • @jeremy_daly: And one more note on the MASSIVE NUMBER of schemas. With the introduction of Event Destinations, I'm feeling like the go-to pattern will be Lambda->EventBridge for nearly every asynchronous task. This will allow other services (and other Lambdas) to subscribe to Lambda outputs...
  • @Carnage4Life: Uber, Lyft, Juul & WeWork have lost a combined $100B in valuation this year. Media will try and make this a narrative about tech but the industry is diverse enough that these companies dying has no bearing on health of $GOOG, $AMZN, $FB, $MSFT, $CRM, etc
  • @veenadubal: Google wiped the personal phone of Rebecca, one of its employees, who has been targeted for organizing with her co-workers. What kind of digital dystopia do we live in?? Power is so concentrated that the tech monarchy can intervene in our lives in ways we don’t dare imagine.
  • SlowRobotAhead: Total opposite. I’m having much more luck with FreeRTOS now that it’s an Amazon product than I was with Zephyr. I’m using it with AWS though, so this isn’t much of a surprise.
  • drh: GitHub is changing. It is important to understand that GitHub is moving away from being a Git Repository hosting company and towards a company that provides an entire ecosystem for software development. Git is still at the core, but the focus is no longer on Git. Fossil already provides many of the features that GitHub wraps around Git. Prior to the meeting, someone told me that "GitHub is what make Git usable." Fossil has a lot of integrated capabilities that make GitHub unnecessary. Even so, there is always room for improvement and Fossil should be adapting and integrating ideas from other software development systems.
  • Netflix: The development of an effective data compression strategy completely changed the impact of our statistical tools for streaming experimentation at Netflix. Compressing the data allowed us to scale the number of computations to a point where we can now analyze the results for all metrics in all streaming experiments, across hundreds of population segments using our custom bootstrapping methods. The engineering teams are thrilled because we went from an ad-hoc, on demand, and slow solution outside of the experimentation platform to a paved-path, on-platform solution with lower latency and higher reliability.
  • Daniel Lemire: If you do theoretical work and arrive at a model that suggests that you have a better, faster algorithm, then you are not nearly done. The map is not the territory. If you are good, then your model should match closely reality. But, no matter how smart you are, and no matter how clever your mathematical work is, you can only claim to be solving a problem faster after you have built it into a computer and recorded the elapsed time.
  • m12k: So an alternative to object-oriented design was proposed: data-oriented design - there's a good video about this from 2014 by Mike Acton [1]. But in short, the idea is to go 'back to basics' and focus on this: You're pushing around and transforming bits of data so you can eventually give an output. So the goal of your design is to make it explicit how and when you do any of this, so you can avoid unnecessary copying, you can lay out data to fit efficiently in the cache, parallelize as much of this as possible, and do as little else as you possibly can. The result is the difference between opening Word and Sublime Text. In a web context, Map-Reduce is a similar way of explicitly expressing transformations in a way that enables parallelization.
  • osrec: Nitpick from the article: it suggests kdb underpins most algorithmic trading systems. This is simply not true. I've worked in a number of banks and hedge funds, and most have their own home-brew storage and analysis engines (I've helped build a few myself).
  • @paulswail: DynamoDB data modelling can be difficult. AWS recommend using a single table to store all your entity types. But is this always the best approach? Are you optimising for your app's performance now while setting yourself up for future schema migration pain?
  • jedberg: I got an offer to do this out of the blue once. They wanted me to buy some cheap product and give it a good review. Of course I understand why this is wrong and unethical, so I tried to report it to Amazon. There is no way to report this to Amazon. They have no facility for reporting a product that someone tried to pay you to review positively. The only thing you can do is contact customer service, who as far as I can tell do nothing with the report.
  • @kevins8: #awsTip 29: its just a series of pipes. consider using aws pipeline as a central ci/cd solution. it supports all deployments you can think of (servers, ios, android, s3 bucket or lambda just to name a few) across multiple regions/accounts
  • Facebook: In our tests, we found that COPA consistently provided lower video latency as measured with the application-observed RTT than CUBIC. For sessions with a seemingly good network and already low latencies, BBR offered greater reduction than COPA. P50 App RTT was down from 499 ms for CUBIC to 479 ms for COPA and 462 ms for BBR, a 4 percent reduction with COPA and an 8 percent reduction with BBR.
  • @kevins8: because lambda is billed by execution length, increasing the memory size can actually decrease the cost by reducing the time that lambda is executing...another consideration - increasing memory size doesn't always result in speedup. if your service is network i/o bound, you might want to look at reducing your memory (and cost) without losing out on performance
  • @snajper47: Evolution of a programmer: learning C during studies: "I have no idea what I'm doing"; "I get it, give me more power": Lisp, Mongo, Scala, Ruby, JS; "It is harder on big projects, give me restrictions": SQL, F#, Haskell, Elm; "I need more cores": Erlang/Elixir. TBC
  • intermix: This same task running on the 2-node ra3.16xlarge took on average 32 minutes and 25 seconds, an 18% improvement! This result is pretty exciting: For roughly the same price as a larger ds2.8xlarge cluster, we can get a significant boost in data product pipeline performance, while getting twice the storage capacity.
  • @ahachete: Mind blowing demo of #Firecracker restoring a snapshotted VM in a few ms. Including running 4K VMs on a single (beefy) i3.metal instance.
  • @jthomerson:  Use auto-scaling. Except when you shouldn't. This is from a #dynamodb user with a Super Bowl ad. If you have a huge event planned, provision for peak and put the auto scaling above that. "Don't stand on the track waiting for the train to hit you if you know it's coming" Yes, that's 5.8 million WCUs. Because of the bucket provisioning of #dynamodb, they didn't even notice the spike ... #dynamodb just kept working for them straight through it
  • @jeremy_daly: Your workload most likely fits into a single DynamoDB table. Rick's latest example shows 23 access patterns using only *THREE* GSIs. Is it easy? Not really. Can it be done? Yes, and I have faith in you. #SingleTableForEveryone
  • Luqiao Liu: People are beginning to look for computing beyond silicon. Wave computing is a promising alternative. By using this narrow domain wall, we can modulate the spin wave and create these two separate states, without any real energy costs. We just rely on spin waves and intrinsic magnetic material.
  • TrollFactoryEmployee: I calculate the price of the smallest [AWS Outposts] option (4xM5.12xlarge + 2.7TB of EBS) at about $103,000 for 3 years (using upfront for EC2 and the full EBS pricing on gp2) vs. $225,500. For the largest option (11xR5.24xlarge + 11TB of EBS) it's about $730,000 vs. $900,000. Plus you have to pay for your own power and cooling
  • @jeremy_daly: Use an attribute versioning pattern to store deltas for large items with minimal changes between mutations. Don’t duplicate a massive item every time some small part changes. Instead, store each attribute as an item with the same partitionKey.
  • @jeremy_daly: Avoid running aggregation queries, because they don’t scale. Use DynamoDB streams to process data and write aggregations back to your DynamoDB table and/or other services that are better at handling those types of access patterns.
  • @jeremy_daly: Big documents are a bad idea! It’s better to split data into multiple items that (if possible) are less than 1 WCU. This will be a lot cheaper and cost you less to read and write items. You can join the data with a single query on the partition key.
  • @jeremy_daly: DynamoDB performance gets BETTER with scale. Yup, you read that correctly, the busier it gets, the faster it gets. This is because, eventually, every event router in the fleet caches your information and doesn’t need to look up where your storage nodes are.
  • @kevins8: #awsTip 14: did you cache that? the fastest packet is the one not sent. use cloudfront to cache at the edge or elasticache to cache expensive data queries inside your network. some services also come with cache integration (eg. api gateway, dynamo dax, etc)
  • Warren Parad: I’ve seen this happen so many times. CQRS is an implementation detail, like so many other things, async updates, which cloud provider you are using, or if your database is relational or document store. You can’t break apart an implementation into two services, it never works. Microservices are built around logical business concepts (ones that are purely API driven or something less complex like a UI), breaking apart the internal implementation into two services will always cause a problem.
  • Nicole Nguyen: She's purchased over 700 products, including three vacuum cleaners, six desk chairs, and no fewer than 26 pairs of earbuds. And even though most of the products are cheaply made, she’s given each a 5-star review. The twentysomething who lives on the East Coast isn’t a bad judge of quality — the companies that sell these products on Amazon reimburse her for the purchases.
  • @FryRsquared: All you do, is just label everyone as straight. And then because 94% of the adult male population identify as heterosexual, you beat this other algorithm by an astonishing 13%!
  • Michael Keller: The researchers led by Axhausen and Menendez identified four factors that shape a city’s road network and ultimately define its traffic capacity: the road network density (measured in kilometres of lanes per surface area), and the redundancy of the network in providing alternative routes for getting to a particular destination. The frequency of traffic lights also had an impact, as did the density of bus and tram lines that compete with vehicular traffic for both space and rights of way (such as signal priority or bus lanes, a common sight in Zurich).
  • Tomer Shay: Right-sizing RDS instances is a great way to drive RDS costs down. It can be done by analyzing the actual resource usage on your instances to identify down-sizing opportunities which do not require compromising on performance or availability. Also, you can take actions to actively optimize your SQL queries and your database workload and drive the CPU and memory usage down, which in turn can allow you to safely down-size your RDS instances while keeping the same service level.
  • Vinton G. Cerf: Looking back on this experience, my sense of assurance that these cars really are equipped to handle unusual or at least complex traffic situations rose significantly. The level of care for safety that Waymo has taken has been documented in its reports to the U.S. Department of Transportation and the National Transportation Safety Board. This personal experience on the ground (err, road) reinforced my belief that self-driving car service is demonstrably feasible, especially in areas where weather conditions are favorable to its operation.
  • Freeman Dyson: Brains use maps to process information. Information from the retina goes to several areas of the brain where the picture seen by the eye is converted into maps of various kinds. Information from sensory nerves in the skin goes to areas where the information is converted into maps of the body. The brain is full of maps. And a big part of the activity is transferring information from one map to another.
  • @benedictevans: Google: there are now 2.5bn active Android devices. Developer dashboard says 95% are phones: say 2.4bn. Apple said 900m iPhones in January. Chinese Android (not in Google’s stats) phones are ~650m. Total: 4bn smartphones in use today, out of 5bn total mobile phones (& 5.5bn adults)
  • Casey Handmer: It turns out that the main advantage of domes – no internal supports – becomes a major liability on Mars. While rigid geodesic domes on Earth are compressive structures, on Mars, a pressurized dome actually supports its own weight and then some. As a result, the structure is under tension and the dome is attempting to tear itself out of the ground. Since lifting force scales with area, while anchoring force scales with circumference, domes on Mars can’t be much wider than about 150 feet, and even then would require extensive foundation engineering.
  • Caroline Jones: Part of the definition of intelligence is always this representation model. . . . I’m pushing this idea of distribution—homeostatic surfing on worldly engagements that the body is always not only a part of but enabled by and symbiotic on. Also, the idea of adaptation as not necessarily defined by the consciousness that we like to fetishize. Are there other forms of consciousness? Here’s where the gut-brain axis comes in. Are there forms that we describe as visceral gut feelings that are a form of human consciousness that we’re getting through this immune brain?
  • Don Monroe: As if wormholes were not exotic enough, Carroll, Swingle, and other physicists are exploring the idea that the entire structure of spacetime emerges from entangled quantum information. This alternative approach, sometimes called "It from Qubit," starts with abstract points, with no sense of space between them at all, said Swingle. "Then you start entangling them in some characteristic pattern, and that pattern can take on a geometric structure, in that you follow a link from one particle to another particle, eventually you have some sense of being able to go somewhere, some sense of distance, some sense of space."
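
    Yan Cui's Provisioned Concurrency arithmetic above is easy to replay. A minimal Python sketch using the per-GB-hour rates quoted in his piece (actual AWS prices vary by region and change over time, so treat the constants as assumptions):

      # Rates in $ per GB-hour, as quoted by Yan Cui above (assumed; check
      # the current AWS price list for your region).
      PROVISIONED_IDLE = 0.015   # keeping a provisioned instance warm
      PROVISIONED_EXEC = 0.035   # execution time on provisioned concurrency
      ON_DEMAND_EXEC = 0.06      # execution time on on-demand concurrency

      fully_utilized = PROVISIONED_IDLE + PROVISIONED_EXEC  # $0.05 per GB-hour
      savings = ON_DEMAND_EXEC - fully_utilized             # $0.01 per GB-hour

      print(f"fully utilized provisioned: ${fully_utilized:.3f}/GB-hour")
      print(f"on-demand:                  ${ON_DEMAND_EXEC:.3f}/GB-hour")
      print(f"savings: ${savings:.2f}/GB-hour ({savings / ON_DEMAND_EXEC:.1%})")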

Useful Stuff:

  • Great stories explaining how adding complexity to increase resiliency almost always backfires. La La Land: Galileo's Warning
    • The first story describes how, during the Oscars, the envelope for Best Actress was given to the presenters instead of the envelope for the Best Picture Oscar. A very embarrassing moment witnessed by billions.
    • Who was at fault? Was the presenter wrong for not noticing they had the wrong card? Was it the accountant’s fault for handing over the wrong envelope? Neither. The problem was in the system.
    • The idea is that in these situations we should blame the system, not the individuals, who are victims of poorly designed systems. It's the system that allowed the wrong result to occur. Humans make mistakes. Systems must be designed to overcome the mistakes humans inevitably make. In this case the system failed for several reasons.
    • The first reason is bad typography. The writing on the card did not make it clear what the card was for. So anyone just glancing at the card in a pressure situation would not be able to detect an error. 
    • Design matters. Accidents happen because of bad design. Design to increase the probability of success. Another example is the poor design of the control panels at Three Mile Island. 
    • “Galileo's Warning” is the idea that precautionary measures often lead to disasters. Galileo’s example involves how to safely lay a large marble column horizontally on the ground. If you rest the column on two blocks, one at each end, the column will sag and break. So add a third block in the middle, you say? That sensible idea doesn’t work. The column in question was found broken because the end blocks decayed and sank while the middle block did not. The added central block pressed into the column, snapping it in half.
    • The steps we take to make ourselves safe sometimes lead us into danger. Great examples in No good deed goes unpunished: Case studies of incidents and potential incidents caused by protective systems. For example: two pressure relief systems in a chemical plant interact in such a way that neither of them work; an explosion suppression device that causes an explosion.
    • Safety systems backfire because they are complex tightly coupled systems. Tightly coupled means you get a domino effect when one thing fails. A complex system has elements that interact in unexpected ways. There will be surprises. Tightly coupled complex systems are dangerous. When something goes wrong, there's usually no time to figure out why and fix it.
    • Every time you add a feature to prevent a problem you are adding complexity. When you create complex tightly coupled systems you should expect catastrophic accidents. 
    • At the Oscars there were two sets of envelopes given to two different accountants. Two sets just in case one person got in a traffic accident on the way over to the event. The accountants stand just off stage, one in each wing. When a card is given to a presenter its duplicate must be discarded. This time the card was not discarded, so the wrong card was given to the next presenter.
    • The card system was overly complex. So how did the Oscars decide to fix the problem? By adding a third set of cards!
    • Normal Accidents: Charles Perrow argues that the conventional engineering approach to ensuring safety - building in more warnings and safeguards - fails because system complexity makes failures inevitable. He asserts that typical precautions, by adding to complexity, may help create new categories of accidents. (At Chernobyl, tests of a new safety system helped produce the meltdown and subsequent fire.) By recognizing two dimensions of risk - complex versus linear interactions, and tight versus loose coupling - Perrow provides a framework for analyzing which systems are most vulnerable to these normal accidents.

  • The Library of Amazonia is a set of resources describing how Amazon builds and operates systems. They call it The Amazon Builders' Library. There are titles like "Avoiding insurmountable queue backlogs" and "Challenges with distributed systems", but if you dig in it's all at a pretty high level. There aren't a lot of architecture-level details. It's more of a general-principles approach.

  • How did Basecamp save more than $700K on their $3 million cloud spend? Spending in the Clouds. Basecamp started moving to AWS in 2016 for all the usual reasons. They went to GCP, had some problems, and moved back to AWS. 
    • First, someone needs to be on the cost reduction job. Someone needs to go line by line through the bill. It just won't happen by itself and it's a very time consuming process.
    • They looked at all their systems and the traffic they were actually receiving versus the assigned capacity. They found a lot of overcapacity in the system. Rightsizing capacity and moving to reserved instances (which save 40-50%) saved $250K.
    • Ask questions. Why did this go up 300% last month? What caused the change? Can we do something about it? Be able to explain all the costs. For example, they found that moving files from GCP to S3 was costing $13K a month. Rewriting the system brought the cost down to $15 a month. Another unnecessary cost was writing metrics into Elasticsearch.
    • Share costs with the company so that they’re aware of the money being spent. Hold people accountable. 
    • Overall they're looking at maybe a 20 to 30% cut in their operations spend this year.

  • Quite a rogues' gallery. Why Microservices Fail?

  • Interested in cloud network performance? ThousandEyes has a data-filled report for you. A Comparative Study of Cloud Performance (podcast): When we analyzed bi-directional network latency, all three public cloud providers—AWS, Azure and GCP—showed an improvement in inter-AZ latency when compared to the 2018 results. The results revealed that GCP performed the best, with an overall average improvement in latency across global regions of 36.37%, and Azure followed closely with a 29.29% improvement. AWS, however, showed only marginal improvement in latency—less than 1% YoY; Users in Europe are subject to 2.5-3x the network latency while accessing compute engine workloads hosted in GCP’s asia-south1 region in Mumbai, India; Network path data for Alibaba Cloud reveals a clear behavior of forcing traffic across the public Internet prior to absorbing the traffic into its backbone network.

  • We're nearing the end of the year so it's a good time for reflection. What have you learned in all your years of doing whatever you do? 
    • 5 Things I’ve Learned in 20 Years of Programming: 1. Duplication of Knowledge is the Worst; 2. Code is a Liability; 3. Senior Developers: Trust but Verify; 4. TDD Is Legit, and It’s a Game-Changer; 5. Evidence is King
    • carl_sandland: Some things I've learned after 35 years: 1. learn how to communicate: being a good developer requires as much (more?) social as it does technical skills. I would recommend formal training and practice. 2. question all assumptions. 3. there is no silver bullet (read mythical man month). 4. fight complexity and over-engineering: get to your next MVP release. 5. always have a releasable build (stop talking and start coding). 6. code has little value in and of itself; only rarely is a block of code reusable in other contexts. 7. requirements are a type of code, they should be written very precisely. 8. estimation is extrapolation (fortune telling) with unknown variables of an unknown function. 9. release as frequently as possible, to actual users, so you can discover the actual requirements. 10. coding is not a social activity.
    • gilbetron: write code in tiny pieces where you don't do the next piece until you're confident the current piece works. I get fidgety when I have a bunch of code I don't know works. Mostly, though, I can tell a grizzled veteran (and I've known some that hardly had any years under their belt, it's all in the soul!) because they rarely claim to know The Answer, and have a weary self-confidence that, even though they don't know how, they will get the job done..."I don't know, but we'll figure it out" is what I like to hear, versus, "This is how it Must Be Done!"
    • johnwheeler: 20 years of programming has taught me to avoid working with people who take themselves so seriously and think they’ve figured it all out. 20 years has taught me the importance of compromise, give-and-take, and following as much as leading. It’s also taught me that there are people who’ve only been doing this for half as long that are 10 times better.
    • curiousfiddler: From my ~15 years of experience, if there is just one advice I could give to someone starting their career as a software developer: irrespective of whether your work is exciting or not as much, it is a joy in itself to keep working on improving your craft, which is writing software systems. Excellence is always a moving target, but if you stop working on your craft, the joy you will experience as you become senior and older, will keep decreasing.

  • Events are again all the rage, but just what is an event? What data should an event contain? As little as possible? Changes in state? A full elaboration of all the related elements? Event Notification vs. Event-Carried State Transfer: In stark contrast to the event notification model, the event-carried state transfer model puts the data as part of the event itself. There are two key variants to implementing this. Fine-Grained and Snapshots. Also, EventBridge Schema Registry -- what it is and why it matters for Serverless applications
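    To make the contrast concrete, here is a minimal Python sketch of the payload styles (all field names are hypothetical):

      # Event notification: a thin pointer; consumers call back for state.
      notification = {
          "type": "OrderUpdated",
          "orderId": "1234",  # consumer must query the order service for details
      }

      # Event-carried state transfer, fine-grained variant: just the delta.
      delta_event = {
          "type": "OrderUpdated",
          "orderId": "1234",
          "changes": {"status": {"old": "PENDING", "new": "SHIPPED"}},
      }

      # Event-carried state transfer, snapshot variant: full state after the change.
      snapshot_event = {
          "type": "OrderUpdated",
          "orderId": "1234",
          "order": {"status": "SHIPPED", "items": ["sku-1", "sku-2"], "total": 42.0},
      }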

  • Halving our AWS Lambda bill with parallel processing in Python: Overall compute time for all our Lambda functions was slashed by 65% and our AWS bill reduced by 53%, all thanks to a tiny bit of optimization...Since AWS Lambda functions are billed according to the number of invocations as well as the compute time, optimization improvements directly result in cost savings...Setting up a parallel process in any language can be a little tricky, but by using 2 lines of code by means of a context-manager it is a breeze in Python...the core function is called 60 times and all run in parallel to accomplish the data processing in a fraction of the time it took initially. A sketch of the pattern appears below.
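    The article's code isn't reproduced here, but the context-manager pattern it alludes to looks roughly like this sketch. One caveat worth knowing: multiprocessing.Pool relies on /dev/shm, which Lambda does not provide, so for I/O-bound work a thread pool is the usual stand-in (the article's exact approach may differ):

      from concurrent.futures import ThreadPoolExecutor

      def process(record):
          # Stand-in for the real per-record work (often network/IO-bound).
          return len(str(record))

      def handler(event, context):
          records = event.get("records", [])
          # The two-line context-manager pattern: fan out, gather in order.
          with ThreadPoolExecutor(max_workers=60) as pool:
              results = list(pool.map(process, records))
          return {"processed": len(results)}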

  • 19 Takeaways From React Conf 2019. There's also a list of conference videos at the end. Data-Driven Code Splitting: This one blew my mind a bit. Relay is powerful, no question there. Relay has a new feature that lets you expand your queries to express which component code you need to render specific data types. 🤯 You can think of your code as data. As the server is resolving your GraphQL query, it can let the client know what component code it is going to need to download so it can get it faster!

  • Our machine learning engine is doing 80,000,000,000 predictions every single day. Over 2019 alone, Batch and Spot market usage has saved us $4,000,000. How NextRoll leverages AWS Batch for daily business operations
    • There are 150B+ auctions each day, of those we participate in at least 80B and in those 80B there are at least 5 separate predictions, to determine the type of auction (1st price v 2nd price for example), determine the price likely to win, determine the likelihood of the placement being viewable, determine the likelihood of the user to click, determine the likelihood of the user to convert given that they click, and then we run these last 2 for each candidate (campaign, creative) that is eligible for the current auction. We obviously don't analyse the stuff we didn't buy but 80B IS the number of top level ML-generated prices from our system.
    • What is Batch? AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. 
    • Why Batch? Cost Saving; Scale of data; Freedom of stack; Custom data processing formats; Ease of deployment
    • What is Batch good for? Periodic or nightly pipelines; Automatic scaling processes; Tasks that are flexible in the stack; Tasks that can benefit from the use of different instance types; Orchestrating instances to run tasks
    • How Batch? The data flow for Batch starts with Jenkins. After checking out the source code and building the Dockerfile, Jenkins builds the docker image and pushes it to ECR. ECR keeps all of the images that we need to run on Batch. We also run containers that control and organize our jobs on Batch. The first one is the scheduler, which kicks off jobs based on the time of day. Luigi acts as a dependency checker and ensures the dependencies of the jobs are met. A minimal job-submission example appears below.
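    For flavor, submitting one of these jobs to an existing Batch queue from Python looks roughly like this (the queue, job definition, and command names are hypothetical):

      import boto3

      batch = boto3.client("batch")

      # Submit one nightly job; the image it runs was pushed to ECR by CI.
      response = batch.submit_job(
          jobName="nightly-prediction-rollup",   # hypothetical names
          jobQueue="spot-compute-queue",
          jobDefinition="rollup-job:3",
          containerOverrides={
              "command": ["python", "rollup.py", "--date", "2019-12-06"],
              "environment": [{"name": "STAGE", "value": "prod"}],
          },
      )
      print("submitted job:", response["jobId"])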

  • While this is specifically about music, the same ideas apply to apps, books, and other digital content. How I Make a Living with MUSIC. There are a lot of ways these days of making money as a musician: digital downloads, streaming, physical album sales, merch, YouTube ad revenue, Patreon, sheet music, live performances, brand deals and sponsorships, licensing and publishing, online educational programs, sample libraries, live streaming on Twitch, and teaching online music lessons. Streaming is now the number one source of income; digital downloads used to be. YouTube, even with millions of subscribers, does not earn a lot, not enough to cover personal expenses. YouTube keeps 45%. Cover songs mean publishers take 15-25% on top of that. YouTube helps you get discovered and flow into other channels. Build multiple streams of income; it's rare for YouTube to be the only one. The more sources of income, the more stable your career. The more passive sources of income, the more stable your income will be; it can't just disappear overnight. Being a musician can be more stable than a 9-to-5 job, and it isn't considered irresponsible anymore. You're an entrepreneur. You need to develop a back catalogue of content. You need to be OK with asking for support and getting paid for your work. Let people know they need to support you if they enjoy what you're producing. You need to be hard-working and patient, and have a business plan. It's not a get-rich-quick scheme. It takes a lot of hard work and time.

  • Videos from Serverlessconf New York 2019 are now available

  • Using (and Ignoring) DynamoDB Best Practices with Serverless | Alex DeBrie: Use on-demand pricing until it hurts. The single-table model Amazon recommends is inflexible. It can be done, but you aren't the Amazon shopping cart; you don't have the same scaling problems. Single table design (STD) makes it hard to add new features or change the data model if access patterns change. And it's hard for analytics because you need to renormalize. STD is indecipherable. It's more like machine code, and STD with GraphQL doesn't work. For the unfamiliar, a sketch of what single-table design looks like appears below.
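    A minimal boto3 sketch of the idea (table layout and names are hypothetical); the overloaded, generic PK/SK attributes are what give single-table design its machine-code feel:

      import boto3
      from boto3.dynamodb.conditions import Key

      table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

      # One table, many entity types, distinguished by overloaded key values.
      table.put_item(Item={"PK": "USER#alice", "SK": "PROFILE",
                           "email": "alice@example.com"})
      table.put_item(Item={"PK": "USER#alice", "SK": "ORDER#2019-12-06#0042",
                           "total": 42})

      # One query on the partition key "joins" the profile and all its orders.
      resp = table.query(KeyConditionExpression=Key("PK").eq("USER#alice"))
      for item in resp["Items"]:
          print(item["SK"])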

  • Make your programs more reliable with Fuzzing - Marshall Clow [ACCU 2019]. pstomi: Genetic fuzzers: out-of-process with American Fuzzy Lop, in-process with llvm libfuzzer and others; How libfuzzer could have found the heartbleed exploit in less than 5 minutes; Structured fuzzing : how to trick your program into trusting the data by feeding it with some random valid structured data at the beginning, and then let the fuzzer try to actively explore different code paths when it adds additional random data; Fuzzing on clusters; Permanent fuzzing for open source projects, with oss-fuzz by google
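    The core mutational loop behind these tools fits in a few lines. A toy Python sketch of the idea (real fuzzers add coverage feedback, corpus management, and far smarter mutations):

      import random

      def mutate(data: bytes) -> bytes:
          # Replace one random byte; real fuzzers have many mutation strategies.
          if not data:
              return bytes([random.randrange(256)])
          i = random.randrange(len(data))
          return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

      def fuzz(target, seed: bytes, iterations: int = 100_000):
          # Start from valid structured data, per the "structured fuzzing" idea.
          corpus = [seed]
          for _ in range(iterations):
              sample = mutate(random.choice(corpus))
              try:
                  target(sample)
              except Exception as exc:
                  print("crashing input:", sample, exc)
                  return sample
          return None

      # Usage: fuzz(my_parser, b'{"name": "a valid seed"}')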

  • Computer Architecture Lecture 6b: Computation in Memory I. We Do Not Want to Move Data! A memory access consumes ~100-1000X the energy of a complex addition. We Need A Paradigm Shift To Enable computation with minimal data movement; Compute where it makes sense (where data resides); Make computing architectures more data-centric.

  • Episode #23: Serverless Application Security with Ory Segal (Part 1). Great point about how serverless reduces the blast radius of a hack. Attackers can't get into an infrastructure and move horizontally across a system and infect all the servers in a system.

  • Good advice for startups. Positive thinking doesn't work. Episode 7: Don't Accentuate the Positive. Positive thinking fools our minds into perceiving that we’ve already attained our goal, slackening our readiness to pursue it. Use a technique called mental contrasting instead. Plan for every eventuality. Michael Phelps planned for every negative outcome, as do the Navy SEALs and Bill Belichick. Simulating barriers to goals gives you the energy to overcome them. Start with obstacles. Think of every if-then situation and plan how to handle it. When an obstacle comes up, imagine implementing your plan. The process is called WOOP: Wish, Outcome, Obstacle, Plan.

  • Comparative Benchmark of Ampere eMAG, AMD EPYC, and Intel XEON for Cloud-Native Workloads: Lokomotive Kubernetes and Flatcar Container Linux worked well on arm64 with Ampere’s eMAG. Even hybrid clusters with both arm64 and x86 nodes were easy to use either with multi-arch container images...It excels at multi-thread performance, particularly with memory I/O heavy and throughput related workloads (e.g. multi-thread key-value stores and file operations on a tmpfs). These are the clear places where it would pay off to switch the server type. In conclusion the eMAG also feels well-positioned for cloud-native workloads, which tend to fully commit memory and compute resources to pods...AMD’s EPYC has a slight integer/floating point processing advantage with vector arithmetics and is in the same cost range as eMAG, as well as faster IP-in-IP overlay networking, but suffers from lower overall multi-thread throughput...Intel’s XEON, while leading with raw performance in a number of benchmarks, comes in last when cost is factored in. 

  • The OWASP Top 10 list consists of the 10 most commonly seen application vulnerabilities: Injection; Broken Authentication; Sensitive Data Exposure; XML External Entities (XXE); Broken Access Control; Security Misconfigurations; Cross-Site Scripting (XSS); Insecure Deserialization; Using Components with Known Vulnerabilities; and Insufficient Logging and Monitoring. A classic example of the first item appears below.
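    Injection tops the list because string-built queries are still everywhere. A minimal Python illustration using the stdlib sqlite3 driver (the parameterized-query fix carries over to any database driver):

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
      user_input = "alice' OR '1'='1"  # hostile input

      # Vulnerable: attacker-controlled text becomes part of the SQL itself.
      # query = f"SELECT role FROM users WHERE name = '{user_input}'"

      # Safe: the driver passes the value as data, never as SQL.
      rows = conn.execute(
          "SELECT role FROM users WHERE name = ?", (user_input,)
      ).fetchall()
      print(rows)  # [] -- the injection attempt matches nothing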

  • Interesting new UI. How vibration can turn any object into a data-enabled interface. A data-enabled hyper surface. Surfaces of objects can become aware of what's going on. Pick up a cup or a hanger and it knows. Knock on a door and it knows. Touch an AC vent in a car and that can turn it on or off. No need for switches anymore. Objects embed tiny vibration sensors that use machine learning to identify patterns. Every time you touch something it creates a unique vibration pattern. ML categorizes the pattern. Vibrations are a rich signal that can describe interactions. It's an edge technology. All processing happens on chip so it's local and private.

  • BGP sucks. Fastly improves delivery reliability with its fast path failover technology: The key insight is that we do not need to wait for peers or transit providers to tell us that a path is broken. We can determine this for ourselves...Since its release, this technique has allowed us to mitigate an average of around 130 performance degradation events impacting at least one of our PoPs every day, each with a median duration of approximately 9.9 minutes. In these cases, our fast reroute technology provides an improvement of 7% on the probability of connections establishing successfully.

  • Facebook with their Networking @Scale Boston 2019 recap. You may like: All the Bits, Everywhere, All of the Time: Challenges in Building and Automating a Reliable Global Network; Anycast Content Delivery at Akamai; Self-Organizing Mesh Access (SOMA); Adaptive Cache Networks with Optimality Guarantees; and Building Stadia's Edge Compute Platform.

  • Before the current age of fast centralized systems, everything was slow and distributed. The Early History of Usenet: Usenet—Netnews—was conceived almost exactly 40 years ago this month. To understand where it came from and why certain decisions were made the way they were, it's important to understand the technological constraints of the time. In 1979, mainframes still walked the earth. In fact, they were the dominant form of computing. The IBM PC was about two years in the future...At the time, Unix ran on a popular line of minicomputers, the Digital Equipment Corporation (DEC) PDP-11. The PDP-11 had a 16-bit address space...For most people, networking was non-existent. The ARPANET existed (and I had used it by then), but to be on it you had to be a defense contractor or a university with a research contract from DARPA...The immediate impetus for Usenet was the desire to upgrade to 7th Edition Unix.

  • The unexpected problems with automation and autonomy. Mike Elgan makes an interesting point: in our autonomous future, where autonomous devices do more and more of the work, the star of the company will be IT. IT operates, provisions, repairs, and secures autonomous systems. Is IT ready?

  • Reviewing Microsoft Ignite 2019
    • Azure Stack has now become Azure Arc. Unlike AWS, Microsoft recognizes there's a multi-cloud world out there. Arc wants to be a management control plane to manage other services in multiple areas—Azure, on-prem, hybrid clouds, and other clouds. Microsoft wants to be your operations control plane no matter where you are. There are still a lot of questions. With an operations overlay, how do you manage other clouds? Do you target a lowest common denominator? Do you let people go deep into each cloud? How do you abstract away complexity while still offering functionality? We don't know which approach Arc is taking. But Microsoft wants a vendor-customer relationship. Google and Amazon will likely get into one of your businesses; Microsoft wants to distinguish itself by not doing that. Microsoft has the advantage of enterprise experience. They have their tentacles deep in the enterprise. Arc is changing the Stack hardware model by handling the hardware themselves.
    • There are a lot of synergies between GitHub and Microsoft. Expect GitHub Events to link into Azure Functions. You already have all your code in GitHub so GitHub can transition to be your execution environment. Microsoft does not want to scare users by turning GitHub into Azure. The idea would be to reduce friction for developers without exposing Azure. Just click this or choose that and you don’t have to think about it. Stuff just happens.

  • Amazon has co-opted another open source project. As discussed before, this isn't always a bad thing. Amazon has unique expertise in creating and operating managed services that sets them apart. But how they did it was a little different this time. Managed Cassandra on AWS, Our Take: let’s take a look at what Amazon did here. MCS is a form of chimera: the front end of Apache Cassandra (including the CQL API) running on top of a DynamoDB back end (both for the storage engine and the data replication mechanisms).

  • Don't drink the anti-monolith kool-aid. Just because you have a monolith doesn't mean it can't be modular. Modular Monolith: A Primer: 1. A monolith is a system that has exactly one deployment unit. 2. Monolith architecture does not imply that the system is poorly designed, not modular, or bad. It does not say anything about quality. 3. Modular Monolith architecture is an explicit name for a monolith system designed in a modular way. 4. To achieve a high level of modularization, each module must be independent, have everything necessary to provide the desired functionality (separation by business area), be encapsulated, and have a well-defined interface/contract. A toy sketch of point 4 appears below.
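    A toy Python sketch of that last point (module names are hypothetical): two business modules in one deployment unit, where one depends only on the other's public contract:

      class BillingModule:
          """Public contract for the billing business area."""

          def charge(self, customer_id: str, amount_cents: int) -> None:
              self._write_ledger(customer_id, amount_cents)

          # Internal detail; other modules must never reach in here.
          def _write_ledger(self, customer_id: str, amount_cents: int) -> None:
              print(f"ledger: charged {customer_id} {amount_cents} cents")

      class OrdersModule:
          """Depends only on billing's interface, never its internals."""

          def __init__(self, billing: BillingModule) -> None:
              self._billing = billing

          def place_order(self, customer_id: str, amount_cents: int) -> None:
              self._billing.charge(customer_id, amount_cents)

      OrdersModule(BillingModule()).place_order("alice", 4200)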

  • Amazon is not the only company breaking out of the digital-only world—bookstores and automated grocery stores—and venturing into the meat world. Online to Offline: The Technical Backbone of Wayfair’s First Physical Store. Is escaping from digital a future trend? You can bet digital companies won't do retail in the same old way. They are looking to apply all the lessons of digital to physical.

  • Heavy Networking 485: Understanding Edge Exchanges. A fascinating topic. If you want to understand more about how the internet works, then this podcast is for you. 5G makes it possible to change the structure of the internet. Edge networks are the next generation of internet connectivity.
    • Internet Exchange (IX): a large hyperscale datacenter where many different networks come together physically to exchange data. The internet is a network of networks. At some point those networks have to come together to exchange data between them. That makes the internet. A network operator brings their big fat fiber pipe into an exchange and decides which other network operators to interconnect with. Then packets can flow freely between the networks.
    • The main reason for an exchange is to, say, transit data from AT&T’s network to Verizon’s network. There are very few places in the US where this actually happens. One is in downtown Chicago. So if you are on AT&T and a friend is on the Verizon network, for you to talk to each other the traffic has to travel first to an exchange, which may be in a different state. That’s the core function of an exchange: to allow networks to transfer data between themselves.
    • There are smaller exchanges dotted throughout the country, but you are limited by who is at an exchange. If the network you want to connect to isn’t in the smaller exchange, then your traffic still has to route through one of the bigger exchanges.
    • Companies like large banks, manufacturers, Google, can connect directly to exchanges as well. It’s just a matter of scale.
    • Now we want to talk about edge exchanges, which brings up edge computing. By edge computing we’re talking about infrastructure edge computing, which is the deployment of micro datacenter facilities positioned on the network operator side of the last mile networks (4G, 5G, cable, ADSL, etc.). It’s the ability to put a 150-250kW micro datacenter in that location and have the associated network connectivity to connect the last mile networks (where the customers are) to the middle mile networks that connect the last mile networks to the rest of the network.
    • Today with 3G/4G we have baseband units at the bottom of cell towers connected to radio heads at the top of cell towers, which are responsible for baseband processing and for transmitting across the fronthaul connectivity back to a carrier location where traffic can be backhauled out to the rest of the carrier network. At the moment there’s no exchange service where you can get your traffic out as normal IP packets at the bottom of the cell tower and put it onto a different network. You’re still taking CPRI or eCPRI, which encapsulates all the mobile traffic for the specific network, back to a carrier hotel or an IX where it then gets split out and given public IP addresses so it can be exchanged between networks. Even if AT&T and Verizon are on the same tower, they can’t exchange packets at the tower. Traffic has to go back to a centralized point and be sent back through the RAN (radio access network) for you to talk to a friend.
    • With a 5G network maybe there’s a 60kW/100kW micro datacenter that’s fiber-connected 10-15km from a tower, so you are able to get sub-100 microsecond latencies on a straight dark fiber run. You can then do local breakout within that facility, break out all the traffic from the telco encapsulations, perform this edge exchange functionality, and exchange data between those two networks at the edge location without having to take the longer path described before through the network.
    • It saves backhaul costs to provide breakout at the edge location. There’s a huge amount of data (terabits) sent unnecessarily across the country with the current system, and that costs money. It also introduces latency and jitter that latency-sensitive applications can’t tolerate. Imagine supporting autonomous vehicles that require sub-10 millisecond round trip times between the user and the application; they can’t afford to backhaul traffic across the state, let alone across the country to an IX.
    • You want to be off the mobile network as fast as possible because the mobile networks are oversubscribed and pretty disastrous. This is a network topology problem. You want the breakout as close to the tower as possible and to do the exchange at the edge. You want traffic to never have to leave the city if possible.
    • Ideally the exchange would be a fiber aggregation hub in that area. 
    • 5G doesn’t pretend voice is the main traffic anymore. With 5G, traffic gets to IP as soon as possible.
    • In Europe internet exchanges are non-profit; in the US they are for-profit businesses. IX prices in the US are much higher.
    • It doesn’t seem like cloud operators will operate inside one of these exchanges, but they could direct connect so there’s faster routing to a regional datacenter. I’m wondering if that will be fast enough?
    • This could be huge. @stu: "Verizon CEO on #reInvent stage to announce AWS Wavelength - embeds compute and storage inside telco providers’ 5G networks. Enables mobile app dev to deliver apps with single digit millisecond latencies. Pay only for what you use. Verizon is first of global partners." More info: Verizon and AWS team up to deliver 5G edge cloud computing

Soft Stuff:

  • jeremydaly/dynamodb-toolbox: a simple set of tools for working with Amazon DynamoDB and the DocumentClient. It lets you define your data models (with typings and aliases) and map them to your DynamoDB table.
  • Zephyr Project: The Zephyr™ Project is a scalable real-time operating system (RTOS) supporting multiple hardware architectures, optimized for resource constrained devices, and built with safety and security in mind. The Zephyr Project’s goal is to establish a neutral project where silicon vendors, OEMs, ODMs, ISVs, and OSVs can contribute technology to reduce the cost and accelerate time to market for developing the billions of devices that will make up the majority of the Internet of Things
  • AdRoll/batchiepatchie: a service built on top of AWS Batch that collects information on all the jobs that are running and makes them easily searchable through a beautiful user interface.
  • cloudflare/flan (article): a lightweight network vulnerability scanner. With Flan Scan you can easily find open ports on your network, identify services and their version, and get a list of relevant CVEs affecting your network.
  • NASA Software: 704 available programs and counting.
  • vmware-tanzu/antrea:  a Kubernetes networking solution intended to be Kubernetes native. It operates at Layer3/4 to provide networking and security services for a Kubernetes cluster, leveraging Open vSwitch as the networking data plane.
  • uber/cadence (article): a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
  • slackhq/nebula: a scalable overlay networking tool with a focus on performance, simplicity and security. It lets you seamlessly connect computers anywhere in the world. Nebula is portable, and runs on Linux, OSX, and Windows. (Also: keep this quiet, but we have an early prototype running on iOS). It can be used to connect a small number of computers, but is also able to connect tens of thousands of computers.
  • SudeepDasari/RoboNet (article): A Dataset for Large-Scale Multi-Robot Learning.

Pub Stuff:

  • SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution: The new storage layer, BlueStore, achieves very high performance compared to earlier versions. By avoiding data journaling, BlueStore is able to achieve higher throughput than FileStore/XFS. When using a filesystem, the write-back of dirty meta/data interferes with WAL writes and causes high tail latency. In contrast, by controlling writes and using a write-through policy, BlueStore ensures that no background writes interfere with foreground writes. This way BlueStore avoids tail latency for writes.
  • Newton vs the machine: solving the chaotic three-body problem using deep neural networks: We show that an ensemble of solutions obtained using an arbitrarily precise numerical integrator can be used to train a deep artificial neural network (ANN) that, over a bounded time interval, provides accurate solutions at fixed computational cost and up to 100 million times faster than a state-of-the-art solver.
  • Portfolio rebalancing experiments using the Quantum Alternating Operator Ansatz: This paper investigates the experimental performance of a discrete portfolio optimization problem relevant to the financial services industry on the gate-model of quantum computing. We implement and evaluate a portfolio rebalancing use case on an idealized simulator of a gate-model quantum computer. The characteristics of this exemplar application include trading in discrete lots, non-linear trading costs, and the investment constraint.