
Stuff The Internet Says On Scalability For November 30th, 2018

Wake up! It's HighScalability time:

We all know the oliphant in the room this week (reinvent)

Do you like this sort of Stuff? Please support me on Patreon. I'd really appreciate it. Know anyone looking for a simple book explaining the cloud? Then please recommend my well reviewed (30 reviews on Amazon and 72 on Goodreads!) book: Explain the Cloud Like I'm 10. They'll love it and you'll be their hero forever.

8: successful Mars landings; $250,000: proposed price for Facebook Graph API; 33: countries where mobile internet is faster than WiFi; 1000s: Facebook cache poisoning; 8.2 million: US Nintendo Switch sales; 40+%: Rust users feel productive; 15 terabytes: monthly measurements of third-party web transparency tracking data; $133.20: total music sales by Imogen Heap on blockchain; 8.3 million: concurrent Fortnite players; 6.2 Billion: fuel costs saved by smart car drivers; 80: salad bags assembled per minute by smart machines, 2x the output of a worker; 1/10th: power used by ebike compared to Nissan Leaf; 100,000: new micro industries; 40MW: solar plant floats on water; 20%: car crashes reduced using Waze-fed AI platform; 14%: decline in Sillycon Valley median wage over 20 years; 36.7%: smartphone e-commerce sales; many: Google Earth datasets; 6 zetta tons: earth's mass;

Quotable Quotes:
- @william_r_kerr: "The number of new international students enrolling at American institutions fell by 6.6% during the 2017-18 academic year, on top of a 3.3% decline the year before."
- @codinghorror: “Apple’s silicon is now so well ahead that we’re not really expecting Android vendors to catch up any time soon.”
- @emileifrem: Adobe used 125 MongoDB servers to run their activity feed. It was replaced by 48 Cassandra servers. Now it runs on 3 (THREE!) servers of Neo4j. With more data (yet smaller disk footprint), higher load and more functionality. The scalability of a native graph database.
- @swardley: First up at #AWSreInvent is McKinsey revisiting their 2009 report on Cloud Computing with their talk "We were so wrong, you'd have been a muppet to follow our advice" ... damn, daydreaming again. X : What did McKinsey do? Me : Clearing the Air on Cloud, 2009. Basically an argument that cloud was for startups but proper enterprises would be better off investing in their data centres and adopting a cloud model would be a money-losing mistake
- @AstroKatie: Hold your hand up 12 inches from your face: you’re seeing your hand as it was a nanosecond ago. Everything you look at is, to one degree or another, in the past. The farther away in space, the more ancient in time.
- Ms. Hou: I used to think the machines are geniuses. Now I know we’re the reason for their genius.
- @lightbend: How to scale? Here you go: With over 125 million players, and supporting over 8 million *concurrent gaming sessions*, we are really happy to learn that #Fortnite is running #Akka under the hood!
- @mattklein123: The Firecracker announcement from @awscloud is super, super cool. I love seeing real innovation in the OS/VMM space, and a willingness to toss away legacy in order to vastly simplify the problem space. Kudos on making this OSS as well.
- @hichaelmart: So my hot take is that Firecracker is gVisor done right: at the VM level, and more importantly, NOT using a garbage collected language 🔥
- @profgalloway: More Americans have Prime than go to church or own a landline. #thefourbook
- @mjpt777: SBE 1.10.0 is out. Java codecs see ~25% performance bump with this release and more if using G1 GC (those GC write barriers can hurt!).
- @mattbeane: AI platform finds accident predictors in Waze data, gives drivers real-time feedback, 90% of drivers slow down, accidents down nearly 20%. Seems like benefit might extend to non-Waze users too, but they don’t say...
- @anliguori: No BIOS in Firecracker. We are expressly not trying to support legacy OSes. The device model ends up being extremely simple. No live migration either because serverless workloads are short lived. We really want to explore this space deeply and not get tied down by legacy.
- David Gerard: The price peak [for bitcoin] was December 2017 — but the hash rate was six times that by August 2018, at one-third of the price. Mining 1 BTC cost way less than 1 BTC during the bubble. So, to compete, miners built out big. The hash rate changes approximately every two weeks. But capital expenditure — building and deploying single-purpose mining hardware — has a rather longer lead time.
- @michael_at_work: AWS has done so much for the world, and there’s so much money in OSS, it only makes sense that OSS give a little something back to AWS to help support them going forward, especially now with AWS in a financial crisis. (Oh wait, got my acronyms mixed up. This, but the opposite.)
- @kellabyte: Your DB loses writes & pages every night. Your nerd pride is too tight to admit it just keeps compacting gigabytes. You just like to be a hero fixing things in the limelight. Your boss asks if everything is alright and you give them the green light and return to fortnite #opslife
- James Hamilton: Faster than a Pi with more memory than a Raspberry Pi but, yes, it’ll [Graviton] run much of the same software.
- @bcantril: Am waiting for the year that reInvent goes full Red Wedding, locking the doors and announcing that every attendee's product or service is now a forthcoming AWS offering. Or maybe that was this year?
- @rakyll: Great that AWS went with @envoyproxy. Fragmentation in sidecars and load balancers is a huge issue and multiples work tremendously. Great that cloud providers and community seems to be agreeing on something, and we can just focus our energy on Envoy. To give an example, how many load balancers do we maintain to provide similar telemetry collection at load balancer level? Too many. The stack is highly fragmented, any simple feature requires months of planning and execution.
- Mikhail Davidov: This sounds like a pretty big ‘if’ but there lies the crux of the T2: it does too much. The T2 is a fully featured and bulky platform. It shares many common components and drivers from the iOS and macOS platforms that frankly, just should not be there.
- Chris Benner: The returns to capital are significantly outpacing the returns to labor.
- @tjholowaychuk: $10/TB for queries, might as well use BigQuery at $5/TB unless it has a much faster lower-bound for latency, will have to see about that
- @abooghost: Hard to believe a decade plus spent hiring psychologists to design monetary feedback loops in free to play games with the explicit purpose of getting children addicted could've gotten a bunch of children addicted
- @drbarnard: “Apple can and does dramatically shape the App Store economy… I’d love to see Apple wield that power to shape the App Store in ways that will sustain and encourage meaningful development instead of continuing to allow the deck to be stacked against it.”
- @ajaynairthinks: And love how casually Jeff threw in “Lambda processes trillions of executions for hundreds of thousands of active customers every month” - and growing :) Wanna hear more about the isolation, density and other cool things it enables #AWS Lambda to do? Checkout SRV409 !
- @mckelveyf: Yeesh...airlines in UK used an algorithm to NOT assign families near each other when randomly picking seats, nudging families to pay for pre-assigned seats. Algorithms might hide the decision, but it's someone special to cook something like up
- Karl Rupp: Clearly, transistor counts still follow the exponential growth line. AMD's Epyc processors with 19 billion transistors contribute the highest (publicly disclosed) transistor counts in processors to-date. For comparison: NVIDIA's GP100 Pascal GPU consists of 15 billion transistors, so these numbers are consistent. With the upcoming introduction of 10nm process nodes it is reasonable to assume that we will stay on the exponential growth curve for transistor counts for the next few years. These additional transistors will provide more cores, while the gains we see for SpecINT to measure single-threaded performance are primarily due to compilers employing auto-vectorization and auto-parallelization. Thus, if you want to benefit from future processors over what you have now, make sure to have parallel workloads. 🙂
- David Rosenthal: Until May 27, 2020 there is a constant flow of 75BTC/hour coming in to the market. For the price to average $4K over the next 18 months, speculators have to inject on average $300K/hour in real money into the market. That is almost $3.9B. Right now it's hard to see why they would do that. If they don't, the price crash will continue, making it even harder to see why they would pump money in. This is starting to look like a death spiral.
- @tanepiper: I maintain the most popular library for Bitly for node. I don't use it myself anymore and no one from Bitly has ever offered money to continue to maintain it, or offered to take it over. I fix bugs when I have time to or accept PRs. Beyond that I have more important things.
- @rbranson: Did some basic benchmarking of a1.4xlarge vs c5.4xlarge using Phoronix C-Ray benchmark. Render took 108s on c5, 112s on a1. c5 is $0.68/hr, a1 is $0.408/hr. Looks like it really is quite the bargain. #reinvent
- devonkim: I view the democratization of infrastructure similar to democracy- the best part of it is that anyone can do it, and the worst part of it is that anyone can do it. On the flipside of specialists getting involved, I also see an awful lot of bad / inappropriate networks and security layouts in cloud environments created by traditional infrastructure engineers because they carried too many principles from managing physical networks.
- @Obdurodon: Good analogy 1/2 from HN today: You don't need an architect to design a garden shed, but you do for a house or an office building. SW devs who only build garden sheds should STFU about building houses.
- @ajassy: New A1 instances are 1st to be powered by custom #AWS Graviton processors, based on @Arm architecture. Excited to pass cost savings back to customers, reducing them by up to 45% for scale-out workloads (like microservices & web servers): https://cnb.cx/2ShHEKx . #reInvent
- Tehnix: I've seen a lot of people complain about [Amazon Timestream] pricing, so I thought I'd share a little why we are excited about this: We have approximately 280 devices out, monitoring production lines, sending aggregated data every 5 seconds, via MQTT to AWS IoT. The average messages published that we see is around ~2 million a day (equipment is often turned off, when not producing). The packet size is very small, and highly compressable, each below 1KB, but let's just make it 1KB. We then currently funnel this data into Lambda, which processes it, and puts it into DynamoDB and handles rollups. The costs of that whole thing is approximately $20 a day (IoT, DynamoDB, Lambda and X-Ray), with Lambda+DynamoDB making up $17 of that cost. Finally, our users look at this data, live, on dashboards, usually looking at the last 8 hours of data for a specific device. Let's throw around that there will be 10,000 queries each day, looking at the data of the day (2GB/day / 280devices = 0.007142857 GB/device/day)...From these (very) quick calculations, this means we could lower our cost from ~$20/day to ~$4.5/day. And that's not even taking into account that it removes our need to create/maintain our own custom solution.
- @jessfraz: It seems like a lot of people are confused with the difference between container runtimes and firecracker. Firecracker is a virtual machine manager. Qemu is also a virtual machine manager. Katacontainers uses Qemu. I’d love to see them switch to Firecracker. :)
- John McClean: The real tension isn’t between OO and FP, it’s between mutable imperative code that bypasses where possible compiler checks to maximise runtime dynamism (as if all we needed was to migrate to Ruby) and constrained functional code that enables the compiler to find problems in our code before we do.
- @HishamElfar: 2019 Prediction: Amazon will launch a nationwide grocery delivery service in April 2019 as they have a monopoly on food storage and are the only game left in town.
  
  2021: The UK formally changes its name to Amazon UK. All citizens are judged on their Prime Membership.
- @aniccia: This study found ebikes expand bike ridership in age and gender (see Oslo graphs). Women were ~2x as likely as men to ebike v bike, and those 35+ age were ~50% more likely to ebike v bike as a 25-34 yr old. Ave speeds tail off >27 kph for both, perhaps due to safety & road net.
- @xaprb: Something that isn’t a “Law,” but has held true through my entire career: when you instrument and measure something, you always learn surprising things. The code is doing WHAT? That query usually runs in microseconds but sometimes hours? We have a server nobody knew about? etc. This isn’t just database-monitoring-related. It’s true of any system, like… application code; off the shelf software; company finances; my personal health and vital signs—whatever.
- AlexDeBrie: One issue with DynamoDB with serverless is that you had to determine your capacity ahead of time. No more. Like AWS Lambda, you can now pay per-request. This is great for coupling the cost to the value you're providing your users.
- @bodil: (And if you don't want it bad enough to maintain it, maybe it's time to move on to something with a development model that suits you better. There are lots of them out there. Rust's model is particularly nice. Python too, if you still prefer a dictator in charge.)
- @chamath: This is an amazing company we’ve backed since inception. DroneSeed has created a fleet of FAA approved aerial drones that conduct reforestation. The technology they’ve built to do this is amazing...and they’re replanting our forests at the same time.
- @stu: The pace of innovation is getting faster and faster. AWS is a nearly $27 billion revenue run rate business growing 46% y/y. There is no compression algorithm for experience. Containers and serverless computing are completely rethinking the basic unit of compute. @ajassy #reInvent
- Andy Jassy: I would’ve made sure that we hired more salespeople and more professional services faster. We launched AWS in 2006 with two salespeople. At this point, we have a very large field [organization], with a lot of account managers and a lot of solutions architects and a lot of professional services folks, but we have just insatiable demand.
- tnolet: Not dissing the general message. Startups can be brutal, but still. Dude is 22 and claims having seen more hardship than others. Try dealing with raising kids, health issues, sick family members, dying friends, mid life (all of which will probably happen as you near the age 40) and a startup will look like a pleasant distraction from the actual hard stuff in life. I'd say enjoy being 22, don’t take yourself too seriously and reflection will come with age.
- Joe Emison: Here is a much better serverless architecture, which I will call a Serviceful Serverless application. In this architecture, all aspects of the application that do not need to be unique or differentiated from standard functionality (e.g., user management and authentication) are handled by a managed service (e.g., AWS Cognito, Auth0, Google Firebase Auth). One immediate reaction that some practitioners have to the above table is disbelief and skepticism--disbelief that any such reductions in code count are possible, and skepticism that, if it is possible, that the architecture is flexible and extensible enough for continued development as new feature requests come up. There is only so much that I can do in this article to convince you that these architectures do deliver extensible, extremely low-line-of-code-count applications; the best way for you to understand is to try them.
- Ankur Sethi: Are Frameworks Evil? I’m not advocating that frameworks are evil, or that you should write all your applications in VanillaJS. That would waste far too much productive developer time, and the result will probably end up performing badly at runtime. Frameworks exist for a reason. But it’s important to consider your audience. If you’re building for resource constrained devices — which you certainly are if your product targets a country like India — you could consider using a lighter framework such as Riot or Preact. Your users will thank you.
- njs: The popular concurrency primitives – go statements, thread spawning functions, callbacks, futures, promises, ... they're all variants on goto, in theory and in practice. And not even the modern domesticated goto, but the old-testament fire-and-brimstone goto, that could leap across function boundaries. These primitives are dangerous even if we don't use them directly, because they undermine our ability to reason about control flow and compose complex systems out of abstract modular parts, and they interfere with useful language features like automatic resource cleanup and error propagation. Therefore, like goto, they have no place in a modern high-level language. Nurseries provide a safe and convenient alternative that preserves the full power of your language, enables powerful new features (as demonstrated by Trio's cancellation scopes and control-C handling), and can produce dramatic improvements in readability, productivity, and correctness.
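
David Rosenthal's numbers in the quotes above are easy to sanity-check. The sketch below redoes his arithmetic; the 75 BTC/hour issuance rate, the $4K average price, and the 18-month window come from the quote, while the 730 hours-per-month figure is an assumption made here for illustration.

```python
# Back-of-the-envelope check of the quoted bitcoin inflow figures.
BTC_PER_HOUR = 75        # issuance rate from the quote
AVG_PRICE_USD = 4_000    # average price from the quote
MONTHS = 18              # window from the quote
HOURS_PER_MONTH = 730    # assumption: 8,760 hours per year / 12

# Dollars that must flow in each hour to absorb newly mined coins.
usd_per_hour = BTC_PER_HOUR * AVG_PRICE_USD

# Total over the whole window.
total_usd = usd_per_hour * HOURS_PER_MONTH * MONTHS

print(f"${usd_per_hour:,}/hour, ${total_usd / 1e9:.2f}B over {MONTHS} months")
```

With these assumptions the hourly figure matches the quoted $300K and the total lands within rounding distance of the quoted "almost $3.9B".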

What is Amazon Quantum Ledger Database (QLDB) good for? Colm MacCárthaigh, Senior Principal Engineer at AWS, says in a tweet: "As a QLDB user for a few years now ... it is awesome, and game-changing. In EC2, we are big big fans of it ... and thrilled that we're opening it up to everyone!" How enigmatic. What's he talking about? Won't say. Directly. But, Dan Brown-like, he points to this talk as solving the mystery: AWS re:Invent 2018: Close Loops & Opening Minds: How to Take Control of Systems, Big & Small ARC337. It's an awesome talk. Well worth watching. Lots of deep stuff not covered elsewhere.
- The talk is about creating simple and stable control systems. So presumably QLDB (a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority) is infrastructure for building control systems as large as the one used to control S3.
- What's a control plane? CloudFront, a distributed HTTP cache, is given as an example. The cache is the data plane. The control plane launches new sites, adds servers, directs user requests, measures latency, and so on. So control planes manage the life cycle of resources, provision software, provision service configuration, and provision user configuration.
- Building high quality systems is about habit: teams paying attention to detail, building test cases, and worrying about what could go wrong.
- High quality designs come from diverse, creative minds working in a fearless environment; systematic reviews and mechanisms to share lessons; using well-worn patterns where possible and focusing invention where it is truly needed; testing, testing, testing.
- Every design requires making tradeoffs. Order of priority: security, durability, availability, speed.
- Control planes are a lot of effort. You need to approach building them with the right level of investment and effort.
- What makes a stable control system? Some kind of measurement process to tell what state it's in. An algorithm that issues corrections to move to the desired state. An actuator to perform the actions. It's done in a closed loop: constantly measuring, constantly deciding, constantly actuating. Systems like this are tremendously sensitive to lag, so they should be fast. According to the talk, the only kind of stable control system contains a PID (proportional–integral–derivative) controller, which is a mathematical component. If you understand PID you can quickly spot what's wrong with a control system. He recommends the book Designing Distributed Control Systems.
- Pattern 1: Checksum all the things. An S3 outage was caused by a single-bit corruption from one box with a bad network card. Because it was an open loop system, the single-bit error propagated everywhere. MD5 checksums are used everywhere now. Also, don't push configuration state around in YAML. A truncated YAML file is still a valid, but wrong, file.
- Pattern 2: Cryptographic authentication everywhere. Make sure nothing malicious is inserted. Use HMAC signatures. Everything talking to everything else should have a strong sense of identity and authenticity. Building a system based on credentials, where you can revoke and rotate credentials, is great for operational health. It means you can create a system humans shouldn't have access to. You know machines are being managed by the system and not from someone's desk. You can also make sure a preproduction control plane can't talk to a production data plane.
- Pattern 3: Limit blast radius. Assume everything can fail and try to limit the damage it can do. Divide control planes horizontally. Regions don't know about each other. Compartmentalize in a vertical direction. User input from APIs is guarded by bulkheads. One example is a poison taster: put input through as much of the processing the data plane will do as possible before accepting a request.
- Pattern 4: Asynchronous coupling. Separate systems by using async calls. Put work on a queue and drive it asynchronously through a state machine. This makes the system much more deterministic. Workflows and queues can retry and won't cause as much amplification as big trees of sync calls. Cross-region replication is all async. One region won't impact another.
- Pattern 5: Closed feedback loops. If you're not measuring your system you have no hope of achieving stability.
- Pattern 6: Push or pull? It's the wrong question. The better way to think about it is to consider the relative size of the fleets. You don't want big fleets connecting to small fleets. The big fleet will hammer the small fleet and it will never recover. Choose the direction that will prevent thundering herds.
- Pattern 7: Avoid cold starts and cold caches. Another pattern to prevent thundering herds. A control plane is small relative to S3, for example. Caches are bi-modal. They are fast when they have entries and slow when they are empty. Try not to have caches. Avoid them. If you need them, warm them before accepting requests. You can also serve stale cache entries (DNS resolvers should do this, for example).
- Pattern 8: Throttles. You need to be able to throttle incoming requests, at least until the system recovers. Throttling is also a form of smart prioritization. Load balancers are deliberately throttled during recovery.
- Pattern 9: Deltas. It's more efficient to compute deltas and distribute patches rather than push all the original data around. The best way to have deltas is a versioned data scheme. Add a version column to your key-value pairs. Make the entries append-only and immutable so that they record every version of a value. Keep a versioned data structure in your data plane. This makes it easy to implement snapshots and rollbacks. I'm assuming this is where QLDB comes in.
- Pattern 10: Modality and constant-work. You want to minimize the number of modes of operation. You want to minimize the number of code branches. Every time you execute combinations of code that have never occurred before you are at risk. The more paths and the more state you have, the closer you get to an untestable system.
  - Relational databases are terrible for control planes. That's because relational databases have a lot of modes. They have emergent query optimizers so you can't predict what will happen. They can radically change in an instant even though you didn't do anything. That mode can be catastrophic. NoSQL databases are much easier to reason about. Performance characteristics, what the code is doing, how joins work, how merges are occurring, are all obvious to the programmer and make it easier to build stable systems.
  - Do things the dumb way every time. Always do full table scans. Always do full joins.
  - For Route 53 network health checks they don't use deltas, they do something really dumb. A user calls an API that edits a configuration file on S3. There are no queues or workflows pushing data to the data plane. Instead, the data plane is in a loop, fetching that file from S3 every 10 seconds whether or not it has changed. It's very reliable and robust. Very resilient. It won't build up backlog. It won't build up queues. It will just work. The system doesn't care how much changes. If every file changes that's OK because they are doing the same thing every time. This is how Route 53 health checks work. Distributed fleets all over the world check your service in real-time. That aggregates into a bitmap with two bits per health check status, indicating if it's healthy or stale. Those bits are pushed to the DNS no matter what. If the DNS sees a one in the bitmap then it serves that IP. Every health check could fail in a zone and the system wouldn't be bogged down. This is a constant-time operation. It's a surprisingly cheap way of doing things. A hundred nodes fetching a config every 10 seconds costs $1200 a year, which is nothing compared to developer time. Few systems work like this.
- When you're using API Gateway and AWS Lambda you're using a system that has all these ideas baked in.
- How do you institutionalize these learnings? Weekly tech talks. Design review process. Operational readiness review assessment. Codify processes and make them the default in the stack.
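
The measure, decide, actuate loop from the talk summary can be sketched in a few lines. This is a minimal illustration, not anything shown in the talk: the `PID` class, the gains, and the toy "plant" (a level that simply accumulates the correction) are all assumptions chosen so the loop visibly settles on its target.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller; gains are illustrative, not tuned for a real system."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured                     # measure: how far off are we?
        self.integral += error * dt                     # I: accumulated past error
        derivative = (error - self.prev_error) / dt     # D: how fast the error changes
        self.prev_error = error
        # decide: the correction combines the P, I, and D terms
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 200, dt: float = 0.1) -> float:
    """Run the closed loop against a hypothetical plant and return the final level."""
    pid = PID(kp=1.2, ki=0.5, kd=0.05)
    target, level = 10.0, 0.0
    for _ in range(steps):
        correction = pid.update(target, level, dt)
        level += correction * dt                        # actuate: toy plant integrates the correction
    return level


if __name__ == "__main__":
    print(f"final level: {simulate():.4f}")
```

The loop never stops measuring, which is the point of "closed": the controller reacts to where the system is, not to where it was told to go.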
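
Pattern 9's versioned, append-only scheme is easy to picture with a toy store. The sketch below is an assumption-laden illustration (the `VersionedStore` class and its method names are hypothetical and have nothing to do with QLDB's actual API): every write gets a new version number, nothing is overwritten, and snapshots, rollbacks, and deltas all fall out of replaying the log.

```python
class VersionedStore:
    """Hypothetical append-only key-value log; each write is recorded under a new version."""

    def __init__(self):
        self._log = []       # immutable history of (version, key, value) entries
        self._version = 0

    def put(self, key, value):
        """Append a new version of `key`; earlier entries are never modified."""
        self._version += 1
        self._log.append((self._version, key, value))
        return self._version

    def snapshot(self, version=None):
        """State as of `version` (latest by default); a rollback is a snapshot of an older version."""
        if version is None:
            version = self._version
        state = {}
        for v, key, value in self._log:
            if v > version:
                break
            state[key] = value
        return state

    def delta(self, since):
        """Entries written after `since`: the patch a replica at that version needs."""
        return [entry for entry in self._log if entry[0] > since]


store = VersionedStore()
store.put("weight", 1)
store.put("region", "us-east-1")
store.put("weight", 3)

print(store.snapshot())     # latest state
print(store.snapshot(2))    # state before the last write
print(store.delta(2))       # what changed since version 2
```

A replica that already holds version 2 only needs the single entry returned by `delta(2)`, rather than the whole data set.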
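
The constant-work idea behind the Route 53 health-check bitmap can also be sketched. This is a simplification under stated assumptions: it uses one bit per check rather than the two bits the talk describes, and `pack` and `is_healthy` are hypothetical helper names. What it shows is that the payload pushed on each cycle has the same size whether zero checks or all checks changed.

```python
def pack(statuses):
    """Pack a list of booleans (True = healthy) into a fixed-size bitmap."""
    bitmap = bytearray((len(statuses) + 7) // 8)
    for i, healthy in enumerate(statuses):
        if healthy:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_healthy(bitmap, i):
    """Read bit `i`; a one means the endpoint may be served."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


# The full bitmap is sent every cycle, so a mass failure costs exactly
# as much to propagate as a quiet day.
all_up = pack([True] * 16)
all_down = pack([False] * 16)
print(len(all_up), len(all_down))
```

Because there is no "something changed" code path, there is no rarely exercised mode to fail when everything changes at once.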

Who was the first hacker? Medea. Ancient Dreams of Technology, Gods & Robots. Her method was figuring out how systems worked so she could hack them. Sound familiar? Over two thousand years ago Greeks were imagining all sorts of non-biological automata: things given life through craft. Hephaestus, the god of blacksmithing, forging, invention, and technology, was a combination of Elon Musk and Steve Wozniak for lazy Olympians. He made lots of cool stuff, like automatic gates for Mount Olympus and a fleet of automatic butlers to deliver ambrosia at parties. The very first robot to walk the earth was Talos, a giant bronze automaton invented by who else? Hephaestus. Talos patrolled the island of Crete, protecting it from sea invaders. It identified which ships were friend or foe, showering boulders on the less righteous. Talos could also kill enemies by heating its body to red hot and hugging them to its chest. Talos was powered by ichor that ran through a single artery. The entire system was sealed by a bronze bolt on its ankle. Medea, sensing Talos would like to live forever, promised Talos she could make him immortal if only she could remove the bolt. A very human-like Talos agreed. Using a wrench, Jason, as in Jason and the Argonauts, removed the bolt. Ichor poured out like molten metal. Talos tumbled over as a tear fell from his cheek. Medea also reasoned out how to thwart an unstoppable skeleton army that grew from planted dragon's teeth. The skeletons had one drive: go forward and attack. Medea figured out how to trigger their programming so they would destroy themselves. She advised Jason to throw rocks at the army. The blows on their shields triggered their attack programming and they hacked each other to death. Magic and technology, doesn't matter, the issues are all the same.

How Cheap Labor Drives China’s A.I. Ambitions. Here's a sentence that could only make sense this year: "Hou Xiameng runs a data factory out of her in-laws’ former cement factory in the Hebei city of Nangongshi." A data factory?: Inside, Hou Xiameng runs a company that helps artificial intelligence make sense of the world. Two dozen young people go through photos and videos, labeling just about everything they see. That’s a car. That’s a traffic light. That’s bread, that’s milk, that’s chocolate. That’s what it looks like when a person walks...the ability to tag that data may be China’s true A.I. strength, the only one that the United States may not be able to match. In China, this new industry offers a glimpse of a future that the government has long promised: an economy built on technology rather than manufacturing...We’re the assembly lines 10 years ago...now employs 300 workers but plans to expand to 1,000...

Amazon's new Graviton processor is not meant to be a speed demon. The A1 instance type will do best on highly scalable, loosely coupled workloads. They targeted price/performance. James Hamilton explains in AWS Designed Processor: Graviton: "The AWS Graviton Processor powering the Amazon EC2 A1 Instances targets scale-out workloads such as web servers, caching fleets, and development workloads. These new instances feature up to 45% lower costs and will join the 170 different instance types supported by AWS, ranging from the Intel-based z1d instances which deliver a sustained all core frequency of 4.0 GHz, a 12 TB memory instance, the F1 instance family with up to 8 Field Programmable Gate Arrays, P3 instances with NVIDIA Tesla V100 GPUs, and the new M5a and R5a instances with AMD EPYC Processors. No other cloud offering even comes close." Why not RISC-V? What Arm brings to the table is massive volume with over 90 billion cores delivered. It is a commercially licensed rather than open sourced design but because they are amortizing their design costs over a very broad usage base, the cost per core is remarkably low. Still not free but surprisingly good value. Because Arms are used in such massive volume, there is a well developed software ecosystem, the development tools are quite good, and the Arm licensing model allows Arm cores to be part of special purpose ASICs. Arm cores can be inexpensive enough to be used in very low cost IoT devices. They perform well enough they can be used in specialized, often very expensive embedded devices. They are excellent power/performers and are used in just about every mobile device. And, they deliver an excellent price/performing and power/performing solution for server-side processing. It’s an amazingly broadly used architecture that continues to evolve quickly
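
To make "price/performance" concrete, the C-Ray numbers @rbranson quotes earlier in this post can be turned into a cost per render. This is a rough sketch: the run times and hourly prices come from that tweet, the `cost_per_run` helper is a hypothetical name, and a single benchmark says nothing about other workloads.

```python
def cost_per_run(seconds, usd_per_hour):
    """Dollars spent on one benchmark run at a given on-demand hourly price."""
    return seconds / 3600 * usd_per_hour


# Figures from the quoted benchmark: c5.4xlarge vs a1.4xlarge.
c5_cost = cost_per_run(108, 0.68)
a1_cost = cost_per_run(112, 0.408)
savings = 1 - a1_cost / c5_cost

print(f"c5: ${c5_cost:.4f}/render, a1: ${a1_cost:.4f}/render, a1 saves {savings:.0%}")
```

On this one benchmark the A1 instance is slightly slower but noticeably cheaper per render, which is the trade the processor was designed to make.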

Azure will soon have on-prem competition. AWS Outposts bring native AWS services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility. You can use the same APIs, the same tools, the same hardware, and the same functionality across on-premises and the cloud to deliver a truly consistent hybrid experience. Outposts can be used to support workloads that need to remain on-premises due to low latency or local data processing needs. Workloads like this: Peter Sbarski: Google’s Bret McGowen, at Serverlessconf San Francisco ’18, gave an example of a real-life customer out on an oil rig in the middle of an ocean with poor Internet connectivity. The customer needed to perform computation with terabytes of data but uploading it to a cloud platform over a connection equivalent to a 3G modem wasn’t feasible. “They cannot use cloud and it’s totally unfair to say — sorry buddy, hosted functions-as-a-service or bust — their developers deserve to have the same serverless experience as the rest of us” was McGowen’s explanation why, in this case, running kNative locally on the oil rig made sense.

Your One-Stop Shop For Everything React Boston 2018: GraphQL was a big player; ReasonML was the subject of an enthusiastic presentation; React is everywhere.

5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code: every time you go to work on a new piece of infrastructure, go through a checklist (it's a long list); using infrastructure as code is an investment: there’s an up-front cost to get going, but if you invest wisely, you’ll earn big dividends over the long-term; build your code out of small, standalone, reusable, composable modules; if your infrastructure code does not have automated tests, it’s broken; follow a specific process for building and managing infrastructure. Good discussion on HN.

CockroachDB 2.1 is now 50x more scalable than Amazon Aurora. awoods187: I'm the author. We've introduced transactional write pipelining (covered in a forthcoming blog post), load-aware rebalancing, and completed general performance tuning which all contribute to our improved performance numbers. manigandham: CRDB is a great product with some of the easiest operations (although key management is a nightmare that they do not have a good plan for). It's fast enough for point-lookups and makes it easy to distribute and replicate your data across zones and regions. All nodes are part of a single cluster so read and write latencies will be high for global deployment, with the enterprise version having a workaround for local regional reads using pinned covering indexes. That works, but further lowers write performance. It also has trouble with large transactions and the middle ground between OLTP and OLAP with heavy joins. Good choice if you need easy scalability and SQL interface over performance and complex queries.

At one point in time you make a build decision rather than a buy decision. Time marches on and things change. Hardware improves (cheap SSDs improve random IO, better NICs push more bandwidth). Competing systems improve (Kafka supports replication). Do you make the effort to reevaluate your decision? Twitter did. Twitter’s Kafka adoption story. The decision was based on several months of evaluating Kafka under workloads similar to those run on EventBus. Kafka had significantly lower latency, regardless of the amount of throughput. Kafka requires fewer network hops, uses zero-copy, fsyncs in the background. Kafka requires fewer machines resulting in 68% resource savings, and for fanout cases with multiple consumers we saw a 75% resource savings. In practice fanout has not been extreme enough to merit separating the serving layer, especially given the bandwidth available on modern hardware.
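
For readers unfamiliar with the zero-copy point above, here is a minimal Python sketch of the general idea: handing a file to the kernel so it can push the bytes to a socket without copying them through a user-space buffer. This is an illustration only, not Kafka's or Twitter's code (Kafka is a JVM system). The function names `send_log_segment` and `demo`, the payload, and the use of a local socket pair in place of a real network connection are all assumptions made for the example. Python's `socket.sendfile` uses `os.sendfile` where the platform supports it and quietly falls back to ordinary `send` calls otherwise.

```python
import os
import socket
import tempfile


def send_log_segment(path, sock):
    """Send a whole file over a connected stream socket.

    socket.sendfile() delegates to os.sendfile() when available, so the
    kernel moves the bytes from the page cache to the socket directly.
    Returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo():
    # Hypothetical "log segment": a small file of repeated records.
    payload = b"message-batch " * 256
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    # A connected local socket pair stands in for broker -> consumer.
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_log_segment(path, broker_side)
        broker_side.close()  # signal end-of-stream to the reader

        received = bytearray()
        while chunk := consumer_side.recv(65536):
            received.extend(chunk)
    finally:
        broker_side.close()
        consumer_side.close()
        os.unlink(path)

    return sent, bytes(received) == payload
```

Calling `demo()` returns the byte count sent and whether the consumer side received exactly what was in the file. The payload is kept small on purpose so it fits in the socket buffer before anything reads from the other end.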

How many years ago was it that building your own datacenter was just how things were done? The hype cycle has gone to that place where it's now OK to make fun of a behaviour that was once SOP. Like smoking. AWS Chief Executive Andy Jassy now says building one’s own data centers is “kind of wacky."

Emerging Memories Today: Understanding Bit Selectors: Every bit cell in a memory chip requires a selector. This device routes the bit cell’s contents onto a bus that eventually makes its way to the chip’s pins, allowing it to be read or written. The bit cell’s technology determines the type of selector that is appropriate: SRAMs use two transistors, DRAMs use one transistor, and flash memories combine a transistor with the bit cell so that the transistor both stores the bit and selects it. Emerging memory technologies use simpler selectors than are required for today’s leading memory technologies. They can get by with either two-terminal selectors or three-terminal selectors. The circuits for both types of cells, with three-terminal and two-terminal selectors, are shown in the graphic to the left. You can see that there’s not much difference between the two. In both cases the selector controls the current through the cell either by turning it off with a transistor, or by turning it off when the current reverses with a diode (or something similar)...All of this has been presented to explain that the selector has a significant influence on the amount of area a memory array consumes, and that the array’s cost is proportional to its area. For this reason memories that can use a two-terminal selector are much more likely to compete against established memories than are memories that must use a three-terminal selector.

An excellent and thorough description. Stack Overflow: How We Do Monitoring - 2018 Edition.

It's free! Pattern Recognition and Machine Learning: This leading textbook provides a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed.

Cloudflare explains how they can drop 8 million packets per second using only 10% CPU. The key is using eBPF: The softirq CPU usage (blue) shows the CPU usage under which XDP / L4drop runs, but also includes other network related processing. It increased by slightly over a factor of 2, while the number of incoming packets per second increased by over a factor of 40!
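
The two factors in that quote imply a rough per-packet cost, which is easy to check with back-of-the-envelope arithmetic. In the sketch below the function name is made up for illustration, and the inputs are the rounded "factor of 2" and "factor of 40" from the quote rather than measured values, so treat the result as a ballpark.

```python
def relative_cost_per_packet(cpu_factor, packet_factor):
    """Ratio of per-packet CPU cost after vs. before the traffic increase.

    cpu_factor:    how much total softirq CPU usage grew (about 2x)
    packet_factor: how much the incoming packet rate grew (about 40x)
    """
    return cpu_factor / packet_factor


ratio = relative_cost_per_packet(2.0, 40.0)
print(f"per-packet CPU cost is roughly {ratio:.0%} of the baseline")
```

In other words, under those rounded figures each additional dropped packet costs on the order of one twentieth of the CPU that the baseline traffic costs per packet.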

@RogerGrosse: Important paper from Google on large batch optimization. They do impressively careful experiments measuring # iterations needed to achieve target validation error at various batch sizes. The main "surprise" is the lack of surprises. [thread]. Measuring the Effects of Data Parallelism on Neural Network Training: In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured in the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike.

No time to read AI research? We summarized top 2018 papers for you. They covered 10 papers. Titles include: Universal Language Model Fine-tuning for Text Classification, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, World Models.
