
Stuff The Internet Says On Scalability For August 3rd, 2018

High Scalability

03 Aug 2018 — 17 min read

Hey, it's HighScalability time:

Everything starts with Doug Engelbart — Jane Metcalfe.
It was the very first time (1968) the world had ever seen a mouse, seen outline processing, seen hypertext, seen mixed text and graphics, seen real-time video conferencing. — Doug Engelbart (Valley of Genius).
ARPA funded the demo at a cost of $1 million. Most importantly? It was the first use of a todo list as an example. A tradition unlike any other.

Do you like this sort of Stuff? Please lend me your support on Patreon. It would mean a great deal to me. And if you know anyone looking for a simple book that uses lots of pictures and lots of examples to explain the cloud, then please recommend my new book: Explain the Cloud Like I'm 10. They'll love you even more.

$1 trillion: Apple; 45: mean founder age for fastest growing new ventures; 1 trillion: files created by Trinity in 2 minutes; $3.93 billion: Baidu's AI driven quarterly revenues; 13.7: microsecond error in GPS timestamps for nearly 14 hours; 240 million: Apple CarPlay cars by 2023; 99%: confidence setting to use when face matching politicians to criminals; 31%: Apple services revenue increase to ~$10B; 4 billion: containers Google launches each week; 0,1,2: likely per person mutations on a coding gene; 1,000x: solid-state memory density; $10 billion: Pentagon cloud contract up for bid; 661Tbps: through a single optical fiber; $49bn: last 6 months of M&A; $1.1 trillion: projected 2040 space economy revenue;

Quotable Quote:
- @BrianRoemmele: This shift away from affiliate compensation for IOS app sales will have a rather large impact on niche apps. Many are high ticket and require rather involved promotion and support by third parties that were partly compensated by the affiliate compensation.
- Paco Nathan: Frankly, I’d feel a lot more comfortable sending my kids off to school in a self-driving bus if the machine learning models hadn’t been trained solely by Google’s proprietary data. Instead, let’s get every possible edge case understood by mingling Google’s training data with that from the other manufacturers.
- David Rosenthal: The margins on AWS, averaging 24.75% over the last twelve quarters, are what enables Amazon to run the US retail business averaging under 3% margin and the international business averaging -3.7% margin over the same period.
- @asymco: Apple Q4 Sales Guidance $60B-$62B vs $52.58B year ago, Gross Margin 38-38.5% vs. 37.9% year ago.
- Valley of Genius: The best way to think about Silicon Valley is as one large company, and what we think of as companies are actually just divisions. Sometimes divisions get shut down, but everyone who is capable gets put elsewhere in the company: Maybe at a new start-up, maybe at an existing division that’s successful like Google, but everyone always just circulates. So you don’t worry so much about failure. No one takes it personally, you just move on to something else. So that’s the best way to think about the Valley. It’s really engineered to absorb failure really naturally, make sure everyone is taken care of, and go on to something productive next. And there’s no stigma around it.
- David Gerard: Companies are shocked to realise that blockchain — an expensive and useless idea that has soaked up millions of dollars for zero return — may not be a good technology. “Many companies will halt their blockchain tests this year. The pullback could hurt IBM and Microsoft, analyst says … The expectation was we’d quickly find use cases.”
- Tullis: The U.S. Department of Homeland Security has designated 16 sectors of infrastructure as 'critical', and 14 of them depend on GPS.
- Valley of Genius: The people who really create things that change this industry are both the thinker and the doer in one person.
- Joanna Hoffman [General Magic]: If you have a bunch of self-motivated and smart people and you put them together they’ll produce something incredible. But you can’t minimize the importance of management. It’s a dirty word. It’s prosaic. It’s not vision. It’s not dream. It’s not technological excellence. But unfortunately, it makes all the difference.
- @timoreilly: OH: "We can't entirely eliminate our technical debt. My goal is to refinance it at a lower interest rate."
- Alok Pathak: While both (Multi-AZ and Read replica) maintain a copy of database but they are different in nature. Use Multi-AZ deployments for High Availability and Read Replica for read scalability. You can further set up a cross-region read replica for disaster recovery.
- Stacey Higginbotham: a startup in France called GreenWaves Technologies has built a dedicated chip for the Internet of Things. The company chose the RISC-V architecture because it wanted to avoid raising the crazy amounts of money typically needed for chip startups. GreenWaves CEO Loic Lietar said the company has raised €3.1 million (US $3.6 million) and has already managed to produce a sample of its silicon.
- mark_l_watson: I have been doing general AI and machine learning since the 1980s. I now manage a team that specializes in deep learning. My bet is that most of data engineering/data science/designing neural network architectures will be commoditized in 5 to 10 years. Maybe sooner.
- Laura Davison: A recent U.S. federal court ruling means companies such as Facebook Inc. and Alphabet Inc.’s Google won’t be able to deduct the full cost of the stock payments they make to employees when calculating their corporate tax bills, which they’ve done for years.
- Singhal: Parallelization is really cool and powerful, but it is extremely hard to implement and get right. The reason it is so hard is the same as why hardware design and verification is so much harder than software design and verification. If you think of hardware design, it is essentially one massively parallel program where every block is a parallel process. What that does is make design hard, debug hard, synchronization hard, and they all add a layer of complexity that is hard to manage with software.
- Xiaowei R. Wang: Chinese tech isn’t an imitation of its American counterpart. It’s a completely different universe. And at the heart of it always hums the question, just how cultural is the construction of technology? As many scholars such as Joe Karaganis, Jinying Li, and Lucy Montgomery have asked, how can piracy help expand media and technological access in China? Why is creative reuse and the lack of intellectual property protections in Shenzhen seen as knock-off culture by many in the US, when the actual conditions are more like open-source? How does the vibrant economy in southern China challenge Western notions of authorship and copyright? Is technology less universal than we think?
- Liz Fong-Jones: There definitely is a distinction between building technical infrastructure at Google whether it’d be GCP or not GCP and building products like Google Web Search or ads, that’s definitely true. There’s a difference between being someone who develops technical infrastructure and someone who uses technical infrastructure.
- Michael Bohm: During the research and development process [of the Alibaba Cloud], engineers were often woken up at midnight to handle online failures. Some of my colleagues even set the voices of their kids as their ringtones as they were often away from home; many stayed up for work in the middle of the night for more than two hundred days.
- Tim Bray: This is a new thing. I’m a greybeard and have seen a lot of technology waves roll through. By and large, what’s driven the big changes are technical advantages: PCs let you recompute huge spreadsheets at a keystroke, in seconds. Java came with a pretty big, pretty good library, so your code crashed less. The Web let you deliver a rich GUI without having to write client-side software. But Serverless isn’t entirely alone. The other big IT wave I’ve seen that was in large part economics driven was the public cloud. You could, given sufficient time and resources, build whatever you needed to on-prem; but on the Cloud you could do it without making big capital bets or fighting legacy IT administrators. Serverless, cloud, it all goes together.
- @joeparlock: There's a game on Steam called Abstractism that is actually a bitcoin mining, TF2 item-scamming virus. So Steam's content curation is working great.
- @goserverless: How to deliver on the Serverless promise of "if 1, then N"? 1. Partition for horizontal scalability 2. Embrace eventual consistency 3. Idempotency throughput 4. Stateless(ish) compute 5. Understand your bottlenecks @RobGruhl #Serverlessconf
- @goserverless: We hear stories like @LesliePajuelo's (from Walmart) all the time. A developer thinks #Serverless can solve a big problem, easily. They build it out with a 1 or 2 person team. It does so well they move to adopt it org wide. #ServerlessConf
- @mweagle: "There are legitimate reasons for needing to run kubernetes" Eg: * Data locality * Legal * Custom hardware/GPU/TPU @BretMcG #serverlessconf
- ryandrake: Definitely looks like several "teach-able moments" here: They learned the hard way about: 1. Developing a fix without understanding root cause (try-something development). 2. Sufficient testing, including load testing, prior to initial deployment. 3. Better change control after initial deployment. 4. Sufficient testing for changes after initial deployment. 5. Rollback ability. 6. Crisis management (What was the plan if they didn't miraculously find the bad line of code? When would they pull the plug on the site? Was there a contingency plan?) 7. Perfect being the enemy of good enough. Looks like they were bailed out of the cost but what if that didn't happen?
- NVRLand: EVERYONE at Spotify are working towards the same goal: making more freemium users go premium. That's the only thing that matters to them and it drives all decisions. "Will Feature A turn more freemium users into premium users?". I've been to the Spotify HQ a couple of times and talked to developers and one told me that they did A/B test lossless streaming for some users but couldn't see that they listened to more music than users without and they didn't buy premium to a higher degree either. So they just scrapped that.
- Fanghao (Robin) Chen: Discord is thriving through learning, growing and leveraging expertise every day. As a result, we are incredibly proud of our amazing engineer to user ratio at Discord — 40 engineers to 130M+ users.
- Charity Majors: You can't rely on your customers to provide the vision, but for implementation their feedback is absolutely critical.
- Glenn Fleishman: Gutenberg was so far ahead of his time that this punch/matrix/mold method persisted largely through the early 1800s.
- Nick Chater: In my perspective, it comes from this central idea that the brain is a sequential processor, thinking one thought at a time. With each thought, you’re taking up massive fragments of information and trying to pull them together. It could be perceptual information, it could be linguistic information, it could be fragments of memory. And one of the things that presumably religion’s good at doing is giving us the sense that it all somehow fits. It has the sense that all these different aspects of life click into place. Rather like when you see a hard-to-process image and suddenly think, “Oh, I see, it’s a cow’s face” or “It’s a dog.”
- Bill Kleyman: This is a major reason why edge is now supporting so many different types of use cases, and not just faster Netflix so you can binge on “Stranger Things.” In my experience, edge is now being used for: Branch and micro data centers; Hybrid cloud connectivity; IoT processing (Azure IoT Edge, for example); Firewall and network security; Internet-enabled devices and sensors collecting and analyzing real-time data; Connecting entire networks of devices; Asset tracking; Streamlining research; Managing inventory for pharmaceutical, manufacturing, and corporate operations; Reducing latency for specific services and latency-sensitive applications
- Ed Sperling: There are differences in Russia versus the United States. When people tried to invent supersonic flights, the first parties were going up in planes and then diving down to reach the speed limit. Both parties found the wings were flying off the planes, and unfortunately people lost their lives over this. But the Russians went back to the drawing board to look at the theory, and they found that with the Mach challenge, the problem stemmed from overlapping waves, so they developed wings that were not in the way of those waves. The Americans tried to make the wings stronger and stronger, but that didn’t work, so eventually they had to take a different approach. When engineering doesn’t work anymore, some theory may be necessary to overcome a problem.

Statelessness in space? Amp Hour Interview with Brent and Bryce Salmi about SpaceX, developing systems that work in space, and FaradayRF.
- Stateless isn't just for programming, it's for satellites too. What if you want a cubesat to last over 5 years instead of the usual 6 months? A microcontroller can suffer from radiation bit flips. A couple of solar storms in five years and you're dead. Data and code can get corrupted. What do you do? Create an analog computer. An analog computer is constantly performing computations and refreshing itself. So if a particle hits a board and causes voltage or current spikes, it will degrade slowly over time instead of failing suddenly.
- With rockets milliseconds matter, you can’t go into safe mode to fix a problem. With satellites there's time. Options? Lock step. Have two processors in lock step, if they disagree go to a cold third standby, and if that dies you’re dead. Voting system in triplicate. You don’t even notice a failure, you just take the two that agree.
- Modern processors are very reliable. They are designed to never fail for a mass market. Don’t necessarily need space hardened parts.
- Your car can be autonomous, why not a rocket? There’s a difference between designing something to work and figuring out all the ways something can fail and designing so the failure can’t happen or putting in protections when a failure occurs. Making it work is just the start. The rest of the time is in figuring out how it can fail. Only the paranoid survive. Derating over spec gives you head room. Even with triple redundancy when there’s a design error all three will fail.
- Why can’t you just use an iPhone to launch a rocket? High current. High temps. High vibration. It's a crazy environment. You need to thread the needle of performance and reliability. Go towards performance and you invite failure. Go towards reliability and it will never work.
- A failure of one thing can’t cascade. When one sensor fails the others must be protected. The others can't even change accuracy. One sensor shorted can’t affect accuracy. But you want to keep it simple to keep the board small.
- One board of space rated components can cost a couple of hundred thousand dollars. With commercial parts you can perform destructive tests to see what will fail and protect against those failures. It's cheap enough you can test. Everyone should be simulating their circuits. Hit a crystal oscillator hard enough and it will change frequency. You can see jitter on a clock.
- Ask why is that requirement there? Is this a valid fault condition?
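
The "voting system in triplicate" idea above can be sketched in a few lines. This is only an illustration, not anything from the interview: the function names, the 8-bit word, and the simulated bit flip are all assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(readings: Sequence[int]) -> Optional[int]:
    """Return the value reported by a strict majority of replicas, or None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) // 2 else None


def flip_bit(word: int, bit: int) -> int:
    """Simulate a radiation-induced single-bit upset (hypothetical helper)."""
    return word ^ (1 << bit)


healthy = 0b1011_0010
# Three replicas compute the same word; one of them takes a bit flip.
replicas = [healthy, flip_bit(healthy, 3), healthy]
voted = majority_vote(replicas)  # the two agreeing replicas win
```

Note what the sketch cannot do: if all three replicas share the same design error, they agree on the wrong answer and the vote passes it straight through, which is the point made above about triple redundancy.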

Only in the movies. The Rebel Alliance would defeat the United Federation of Planets.

Great interview with Stewart Brand. De-Extinction, The Whole Earth, & Way More.
- He talks about his early years in the military; the hippie/acid years; campaigning for a picture to be taken of the earth from space; the Whole Earth Catalogue; the Well; bringing back the Woolly Mammoth; and fixing climate change.
- Fixing climate change would be the first time humanity has ever fixed a humanity-scale problem. Thinking as a planet is what climate change is forcing us to do...thinking as a civilization rather than just a society or just a nation, but as a civilization. People in this century get to live in the century where humanity discovers itself as the keeper of life on a whole planet. What an amazing realization, what an amazing job, what an amazing thing to work on how that actually works. But all the tools are in place to make it Earth national park and the city you want to live in and the society that continues to become ever more amazing from year to year, decade to decade, in a non-destructive way.

Make a coding mistake on your own hardware? There's a natural bulkhead preventing it from spreading. On the cloud? Not so much. Well, that's not completely true, Firestore has hard daily caps, they just weren't used. How we spent 30k USD in Firebase in less than 72 hours. How? Since the campaign was released, and for the next 48 hours, we used a lot of Firestore resources, and our billing came up to $35,000 USD!!! We did more than 46 BILLION requests to Firestore. Yes, billion with a B. The cause? Doing work in constructors. Never do work in constructors! Here's why: Every time we call the service vakis.ts, the constructor ran the line this.loadPayments(), which called the service payments.ts, and with that service we were printing the information of a Vaki. This means that with every visitor to our site, we needed to call every document of payments in order to see the number of supports of a Vaki, or the total collected. On every page of our app!
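
To see why this read pattern gets expensive so fast, here is a rough sketch comparing it with keeping a running summary that is updated at write time. It does not use the Firestore API: `FakeStore` and every name in it are hypothetical, and the only thing it models is "one billed read per document fetched."

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeStore:
    """In-memory stand-in for a document store that bills per document read."""
    payments: Dict[str, List[int]] = field(default_factory=dict)
    summaries: Dict[str, Dict[str, int]] = field(default_factory=dict)
    reads: int = 0

    def add_payment(self, campaign: str, amount: int) -> None:
        self.payments.setdefault(campaign, []).append(amount)
        summary = self.summaries.setdefault(campaign, {"count": 0, "total": 0})
        summary["count"] += 1  # keep the aggregate current at write time
        summary["total"] += amount

    def read_all_payments(self, campaign: str) -> List[int]:
        docs = self.payments.get(campaign, [])
        self.reads += len(docs)  # one billed read per payment document
        return list(docs)

    def read_summary(self, campaign: str) -> Dict[str, int]:
        self.reads += 1  # a single billed read
        return dict(self.summaries.get(campaign, {"count": 0, "total": 0}))


def total_naive(store: FakeStore, campaign: str) -> int:
    """What the page effectively did: fetch every payment on every view."""
    return sum(store.read_all_payments(campaign))


def total_aggregated(store: FakeStore, campaign: str) -> int:
    """Alternative: read one precomputed summary document per view."""
    return store.read_summary(campaign)["total"]


store = FakeStore()
for _ in range(1000):
    store.add_payment("campaign-1", 5)

for _ in range(100):  # 100 page views, naive path
    total_naive(store, "campaign-1")
naive_reads = store.reads  # 100 views x 1000 documents

store.reads = 0
for _ in range(100):  # 100 page views, aggregated path
    total_aggregated(store, "campaign-1")
aggregated_reads = store.reads  # 100 views x 1 document
```

Reads on the naive path grow with visitors times payments, while the aggregated path grows with visitors alone. The trade-off is an extra write on every payment.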

Videos are now available from Facebook's Systems @Scale 2018. Titles include: Geo-Replication in Amazon DynamoDB Global Tables, Kubernetes Application Migrations: How Shopify Moves Stateful Applications Between Clusters and Regions, Scaling Data Distribution at Facebook Using LAD.

When is a platform no longer a platform? Apple says it is removing apps & in-app purchases from its iTunes Affiliate Program. A platform must generate more value for the ecosystem than it does for the platform owner. Participants drive more revenue to Apple than Apple reradiates and amplifies out to the ecosystem, so the App store isn't a platform, it's just a brilliant way for Apple to make a lot of money.

We're used to servers being treated like cattle in the cloud. Through automation Google is treating their network like cattle too, mostly. Repairing network hardware at scale with SRE principles. Now hardware engineers in a datacenter can replace a failed control plane using a UI. The automated system performs automated safety checks and an entire device audit at the end of the operation. What can't they automate? For now, a human is needed for the "hardware gets replaced" step. Furthermore, they can't automate steps where they interface with outside suppliers. Vendors like Juniper or Cisco must be convinced a part is bad before they'll replace it. If you want to know why hyperscalers want to build all their own equipment, automation is a big reason. When you build your own you can build in automation from the start.

Videos from Google Cloud Next '18 are available.

Isn't it odd how precise sounding numbers are often anything but? A 10 nanometer chip isn't really 10 nanometers. That's a marketing number. It does not represent physical dimensions in a chip. A 2.9 gigahertz processor is another marketing number, as we all learned from the MacBook Pro throttling incident. A feeling frisky Core i9 can Turbo Boost up to 4.8GHz. Are you actually using those processors? Is it a touch warm outside? Then the average clock speed could be 2.2GHz or even much lower.

Key Takeaway Points and Lessons Learned from QCon New York 2018. @whereistanya: That was an engaging and informative talk by Susheel Aroskar about how Netflix does push notifications. Key takeaways: - recycle connections often - randomize connection lifetime - use small servers - autoscale on open connection count - use websocket-aware or TCP LB
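
The "randomize connection lifetime" takeaway is easy to picture in code. This is a minimal sketch, not Netflix's implementation: the 30 minute base, the 20% jitter, and the function name are all assumptions made for illustration.

```python
import random


def connection_lifetime(base_seconds: float, jitter_fraction: float,
                        rng: random.Random) -> float:
    """Pick a lifetime uniformly within +/- jitter_fraction of base_seconds."""
    spread = base_seconds * jitter_fraction
    return rng.uniform(base_seconds - spread, base_seconds + spread)


rng = random.Random(42)  # seeded only so the sketch is repeatable
# Assumed values: 30 minute base lifetime with 20% jitter either way.
lifetimes = [connection_lifetime(1800.0, 0.2, rng) for _ in range(1000)]
```

With a fixed lifetime, every client that connected right after a deploy would be recycled at the same instant and reconnect together. Spreading lifetimes over a window spreads those reconnects out as well.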

This Bomb-Simulating US Supercomputer Broke a World Record: Settlemyer’s solution was, instead, to create more files with less information: one file for every particle, tracing each one through the entirety of the simulation. If Settlemyer put those files into a searchable index, the scientist could simply ask the computer, “Which of those particles’ lives ends with the biggest bang?” The scientist can then just pull and parse those personal dossiers. “We’re able to retrieve the data between 1,000 and 5,000 times faster,” adds Settlemyer. Fast enough to make the scientist’s Fermi acceleration research doable. Trinity created a trillion files in two minutes—a world record. It’s not just an academic achievement. That speed could allow scientists to follow the trajectory of a particle (or 10,000 particles) in a trillion-particle warhead simulation. The warheads whose integrity, remember, Los Alamos is tasked with maintaining.
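
The one-record-per-particle plus searchable-index idea can be shown at toy scale. This is a sketch only, with nothing in common with the actual Los Alamos software: the data layout and the reading of "biggest bang" as "highest final energy" are assumptions.

```python
from typing import Dict, List, Tuple

# particle id -> (timestep, energy) samples; stands in for one file per particle
Trajectory = List[Tuple[int, float]]


def build_index(trajectories: Dict[int, Trajectory]) -> Dict[int, float]:
    """Index each particle by its final energy so queries skip the raw data."""
    return {pid: samples[-1][1] for pid, samples in trajectories.items() if samples}


def most_energetic(index: Dict[int, float]) -> int:
    """Return the id of the particle whose final energy is largest."""
    return max(index, key=lambda pid: index[pid])


trajectories: Dict[int, Trajectory] = {
    1: [(0, 1.0), (1, 1.5), (2, 2.0)],
    2: [(0, 1.0), (1, 4.0), (2, 9.5)],
    3: [(0, 1.0), (1, 0.8), (2, 0.5)],
}
index = build_index(trajectories)
winner = most_energetic(index)
dossier = trajectories[winner]  # pull only the one record of interest
```

The query touches the small index and then a single particle's record, rather than scanning every timestep of every particle.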

Good recap. ServerlessConf 2018 San Francisco: key takeaways for the future of serverless. Best practices: Functions logic should be stateless; Functions should be idempotent; One task per function ("do one thing"); Functions should finish as quickly as possible; Avoid recursions; Concurrency limitations and rate limits.
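
"Functions should be idempotent" is the item on that list people most often get wrong, so here is a small sketch of one common approach: remember results by event id and return the recorded result on replay. The class and field names are hypothetical, and the in-memory dictionary stands in for a durable store.

```python
from typing import Dict


class PaymentHandler:
    """Hypothetical function handler that tolerates duplicate deliveries."""

    def __init__(self) -> None:
        # Assumption: a real deployment would keep this in a durable store,
        # because function instances are stateless and short-lived.
        self.processed: Dict[str, int] = {}
        self.balance = 0

    def handle(self, event_id: str, amount: int) -> int:
        if event_id in self.processed:
            # Replay of an event already applied: return the recorded result.
            return self.processed[event_id]
        self.balance += amount
        self.processed[event_id] = self.balance
        return self.balance


handler = PaymentHandler()
first = handler.handle("evt-1", 10)
replay = handler.handle("evt-1", 10)  # duplicate delivery, no second charge
second = handler.handle("evt-2", 5)
```

Most event sources retry on failure, so the same event can arrive more than once. Deduplicating on the event id makes a retry harmless.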

Grubhub moved from on-premise monoliths to microservices in AWS. Cloud infrastructure at Grubhub. They use Docker, Netflix Eureka, Cassandra, Elasticsearch, Datadog, Splunk, and various AWS services.

Want to compete with Facebook, Apple, Amazon, Netflix and Google? David Rosenthal says it's primarily a business problem, not a technology problem. Why? Look at the economics. Amazon's Margins Again: Contrary to the myth that Amazon has revenue but not profits, Del Rey and Molla show that it has made a profit for thirteen straight quarters. And, not content with margins rising from 20.5% to 26.9% over that period...A business growing nearly 50%/yr with 25%+ margins is in an amazingly strong position...The advocates of decentralized storage naively imagine that they can compete with the S3 part of the AWS juggernaut. Their product doesn't perform as well, isn't integrated with a cloud computing environment, is more expensive to implement and operate, and lacks economies of scale...AWS is the source of the investments that drive Amazon's takeover of retailing. So Amazon's "slow AI" would view a threat to disrupt AWS, or even just the S3 part, as an existential threat, and respond accordingly. It has over $30B in cash to fuel the response, and the infrastructure to out-compete a storage network's nodes, driving the price down to make it uneconomic for anyone else to run a node.

How we scaled nginx and saved the world 54 years every day. Cloudflare found SSD performance varied a lot, sometimes taking as long as one second for a read. Performance consistency is as important as peak performance. Google found this out a long time ago. Spikes cause head-of-line blocking on queues, so you need to load balance and not block in the event loop. Cloudflare used SO_REUSEPORT to solve the uneven distribution problem, which improved peak p99 by 33%. Moving read() to a thread pool was not a silver bullet. Moving non-blocking open() to the thread pools helped. All changes together improved the overall peak p99 TTFB by a factor of 6.
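
Head-of-line blocking in an event loop is easy to reproduce. The sketch below is not nginx code; it uses Python's asyncio (3.9+ for `asyncio.to_thread`) to show the general idea, with a made-up 100 ms "slow disk read" and a heartbeat task standing in for the other requests waiting on the loop.

```python
import asyncio
import time
from typing import List


def slow_disk_read(delay: float) -> str:
    """Stand-in for a blocking read() that stalls on a slow SSD."""
    time.sleep(delay)
    return "file-bytes"


async def heartbeat(ticks: List[float], interval: float, count: int) -> None:
    """Represents other requests the event loop should keep serving."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def serve(offload: bool) -> int:
    ticks: List[float] = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01, 20))
    await asyncio.sleep(0)  # let the heartbeat task start
    start = time.monotonic()
    if offload:
        await asyncio.to_thread(slow_disk_read, 0.1)  # runs in a worker thread
    else:
        slow_disk_read(0.1)  # blocks the loop: everything queues behind it
    done = time.monotonic()
    await beat
    # Count heartbeats that were served while the read was in flight.
    return sum(1 for t in ticks if start < t < done)


ticks_blocking = asyncio.run(serve(offload=False))
ticks_offloaded = asyncio.run(serve(offload=True))
```

When the read runs on the loop, no heartbeat is served until it returns. When it is handed to a worker thread, the loop keeps ticking during the read. That is the whole argument for keeping blocking file operations out of the event loop.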

Looks like Facebook and Apple aren't the only ones reaching peak total addressable market. China's Baidu credits AI for robust ad sales: Robin Li, Baidu's chief executive, said in a Tuesday call with analysts that tech companies in China can no longer rely primarily on gaining new users to fuel growth. The internet population in China is not growing as quickly as before. That really means the technology will play a much more important role, both in terms of user experience and in terms of monetization.

A good pre-serverless overview. Scaling webapps for newbs & non-techies.

Saving With MyRocks in The Cloud: MyRocks outperformed InnoDB in every single combination. InnoDB requires three times more in storage throughput. If we take a look at the storage cost, it corresponds to three times more expensive storage. Given that MyRocks requires less storage, it is possible to save even more on storage capacity. On the most economical storage (3400GB gp2, which will provide 10000 IOPS) MyRocks showed 4.7 times better throughput. For the 30000 IOPS storage, MyRocks was still better by 2.45 times. However it is worth noting that MyRocks showed a greater variance in throughput during the runs.

Where did the Microsoft Tech Stack disappear? Microsoft's new stack is Azure, which does an end-around the Windows tax. The old stack didn't escape the enterprise for all the same old reasons. feoh: I think part of the problem with .Net in many people’s perceptions is the fact that it’s joined at the hip with the Windows platform. Alex: The problem is pricing of Visual Studio. All the nice developer things are reserved for Enterprise version which costs around 3k USD per year so basically the new features are behind a paywall. James Cole: This is an opinion, but one coming from years of experience; IIS is a complete disaster in comparison to Apache and NGINX and should have been discontinued years ago. wenc: SQL Server and Windows licensing are actually major inhibitors, even at large enterprises. We have to manage our VL/CALs conservatively to control costs. Wanna spin up n VMs for production / distributed computing? For Windows, you have to license each VM, whereas with Linux, the incremental cost of another node is nearly 0.

How can you model human memory in software? No conclusion, but the exploration is the thing. The Mind Wanders.

awslabs/aws-cdk: an infrastructure modeling framework that allows you to define your cloud resources using an imperative programming interface. The CDK is currently in developer preview. We look forward to community feedback and collaboration.

Lithography for robust and editable atomic-scale silicon devices and memories: We created two rewriteable atomic memories (1.1 petabits per in²), storing the alphabet letter-by-letter in 8 bits and a piece of music in 192 bits. With HL no longer faced with this trade-off, practical silicon-based atomic-scale devices are poised to make rapid advances towards their full potential.

Speed up solving complex problems: be lazy and only work crucial tasks: Research teams at Aalto University in Finland and KU Leuven in Belgium have developed an approach to "lazy grounding" that could solve hard-set and complex issues in freight logistics, routing, and power grids by significantly reducing computation times. Conventional methods to "ground" such computations free up memory, but may cause the system to get stuck in searching for a solution and suddenly require an unreasonable amount of time. The new method pinpoints the small subset of decisions that contribute to a wrong turn somewhere in the computation, and ignores the rest. Aalto's Antonius Weinzierl says, "Our approach essentially draws a local part of the map on demand and allows you to pinpoint where exactly the initial wrong turn was and how to get straight back on track."
