Stuff The Internet Says On Scalability For February 9th, 2018

Hey, it's HighScalability time:

To those living in the past: the future is coming. (launch, great sound).

If you like this sort of Stuff then please support me on Patreon. And I'd appreciate if you would recommend my new book—Explain the Cloud Like I'm 10—to anyone who needs to understand the cloud (who doesn't?). I think they'll learn a lot, even if they're already familiar with the basics.

  • 7.2 terabytes: data used during Super Bowl; $220 million: projected podcast revenue this year; $127 billion: total addressable value of drone-powered solutions in all applicable industries; 100 billion billion billion: living microbial cells underlying all the world's oceans, 200x the biomass of humans; 110 billion: total market for memory; 1,000: drones North Korea may have, possibly with chemical or biological weapons, ready to attack South Korea; $123 billion: US apparel market; 100 million: iOS devices sold in Q4; $3: earnings in 24 hours from conscripting 5,000 Android devices into a mining botnet; 46%: cloud market growth in Q4; 104%: YoY Alibaba cloud growth;

  • Quotable Quotes:
    • John Perry Barlow: The Internet is the most liberating tool for humanity ever invented, and also the best for surveillance. It's not one or the other. It's both. 
    • Fernando J. Corbato: Our use of the word daemon was inspired by the Maxwell's daemon of physics and thermodynamics. (My background is Physics.) Maxwell's daemon was an imaginary agent which helped sort molecules of different speeds and worked tirelessly in the background. We fancifully began to use the word daemon to describe background processes which worked tirelessly to perform system chores.
    • Monica Alleven: T-Mobile also released some additional tidbits: Tom Brady’s fumble and the Eagles’ field goal were the most shared moments of the game on social, with a 33% increase in posts. Social posts doubled during Timberlake’s halftime show versus the rest of the game, and nationwide, group and picture messaging went up by nearly 50%, with texting increasing nearly 10%.
    • @BretWeinstein: The most important patterns: 1. Prisoner's Dilemma 2. Race to the Bottom 3. Free Rider Problem / Tragedy of the Commons / Collective Action 4. Zero Sum vs. Non-Zero Sum 5. Externalities / Principal Agent 6. Diminishing Returns 7. Evolutionarily Stable Strategy / Nash Equilibrium
    • @Jason: US content companies should remove their content from @Facebook until facebook gives them 70% of revenue from the ads around their content — or a yearly license fee equal to 10-20% of their content budget. Facebook is the enemy of content companies — period.
    • @Carnage4Life: Slack officially states the only browser they support is Chrome. Striking to see history repeat itself and the dream of the Open Web die so ignominiously
    • @swardley: #noops, #nocode ... it's funny how some parts of the DevOps community are trying to create strawmen in the #serverless world. It's like EC2 in 2009 and the fight back from old practices against #DevOps. This time, the boot is on the other foot.
    • Troy Hunt: Every single minimum password length is an even number! How scientific do you think the process of determining the perfect minimum length is when all the big players just happened to land on 4, 6 or 8?
    • @seanjtaylor: TL;DR - There are about 100K-200K unique workers on MTurk. - On average, there are 2K-5K workers active on MTurk at any given time - 50% of the worker population changes within 12-18 months. - MTurk has a yearly transaction volume of a few hundreds of millions of dollars.
    • schappim: The TL;DR; of this article [How to Design a New Chip on a Budget] is: "...a simple ASIC (say one that is a few square millimeters in size, fabricated using the 250-nm technology node) might cost a few thousand bucks for a couple dozen samples." 
    • @j_s_n_d: Always store in UTC    @girlziplocked: What’s the best piece of dating advice you’ve ever encountered?
    • shadow31: Why? The entire point of doing a rewrite like that is to compare the implementations. No, you can't conclude that Rust is 100x more memory efficient than Java or something silly like that, but it's obvious that compiled, native code without a massive runtime VM layer is going to be more efficient. In this case, it's much more efficient.
    • Shar Darafsheh: Don’t start scaling until you have gathered all your requirements and understand the full scope of your SaaS scale processes. Give yourself enough time to scale; no software ever works well when development is rushed. Make sure there is proper architecture in place prior to development.
    • fackin_samsquamch: OK. I keep hearing this claim [There is no "skills gap." Employers are just cheapskates], and yet I see dozens of resumes a day and interview several candidates a week with unbelievably shit technical abilities and a barely passing ability to communicate with other humans.
    • @etiene_d: - How do french people send files? - Pierre-to-Pierre
    • @omphe: I'm working through the anger of clients who call meetings for "new projects" as a veiled attempt to get  free consultancy.  From now on the response is "Just pay for Heroku".  It's usually true and I never need speak to them again.
    • @rtfeldman: After 2 years and 200,000 lines of production @elmlang code, we got our first production runtime exception. (We wrote code that called Debug.crash and shipped it. That function does what it says on the tin. 😅) In that period, our legacy JS code has crashed a mere 60,000 times.
    • chx: A spinning disk dissipates about 6W on average, for 100,000 disks that's 600 kW. With an average of 0.12 USD paid for a kWh times 24 hours, their [Backblaze] energy bill is less than 2000 USD a day. Their HDD cost, based on "81.76 — The number of hard drives that were installed each day in 2017" and the fact that they mostly now do 8TB at roughly 0.02 USD per GB according to link, is about 13000 USD a day. That means even if we had SSDs at zero watts, they would need to be ((13+2)/13) ~15% more expensive than HDDs in order to break even, but if you look at link you will see it's still about 8 times as expensive. Obviously there's some savings due to floor space but the savings are going to be really low there. And this is just the economics side.
    • @cloud_opinion: The sigh you hear is from the VCs who recently wrote checks for container startups
    • Matt Klinman: Facebook is essentially running a payola scam where you have to pay them if you want your own fans to see your content. If you run a large publishing company and you make a big piece of content that you feel proud of, you put it up on Facebook. From there, their algorithm takes over, with no transparency.
    • Eric Green: My guess is that people who are dealing with much fewer rows, but who are querying those rows with much greater frequency, will have a more successful experience with Aurora Postgres. But then they'll run into the IOPS costs. Basically, to get the same IOPS that would cost me $1200/month on EBS, I'd end up paying around $4,000/month on Aurora.
    • mythrwy: Organizing all your jQuery into meaningful classes and keeping the whole thing organized and orchestrated is very much the essence of programming. It's a rare skill (as a trip through many large jQuery code bases reveals). Filling in some templates causing magic to happen behind the scenes on the other hand, not so much.
    • @CodeWisdom: "It turns out that style matters in programming for the same reason that it matters in writing. It makes for better reading." - Douglas Crockford
    • @copyconstruct: From a Slack channel I’m a part of: Here is how to be 95-98% of a Database DBA 1. add an index 2. don't use it as a message queue 👏
    • tinco: I worked on a competitor to Skylight, our agent was in C++ for the same reason they built theirs in Rust, but internally we also did processing. One of those processing tools was an industry standard tool written in Python and it frequently gathered over 10 GB of RAM, after which it would become slow. We ported it to Go, and it would never go over 100 MB again. Same fishy improvement.
    • FPGAhacker: FPGAs are definitely not a dead end. By virtue of being reconfigurable, they will never be obsolete as long as ASICs are a thing. Now, some whole new technology will come along eventually, supplanting present day ASICs and FPGAs... but until then...Program as a term means something different with chip design than it does with software. An analogy is that to program an FPGA is to paint a canvas. The source code in chip design is instructions for how the canvas should be painted. Another analogy would be to program an FPGA is to cook a meal. The source code is the recipe for the meal. But one doesn't run a recipe on a meal. These analogies break down because a painting and a meal is passive... it doesn't do anything by itself, or react to the outside world.
    • @nicoleperlroth: Mickos: Those who participate in bug bounty programs should "not be legally exposed." "We need hackers," Mickos says. "Ethical hacking may be the only force that can combat criminal hacking." #uberhearing
    • Backblaze: The HGST/Hitachi 4 TB models delivered sub 1.0% failure rates for each of the three years. Amazing.
    • @cloud_opinion: 10x Rockstar engineer: "Simultaneous update of two production instances is risky" He said it right after watching the Falcon launch.
    • @benhammersley: Chinese police in Zhengzhou's high-speed rail station, using a Google-Glass type headset with facial recognition to look for fugitives. 7 arrests so far, apparently.
    • Peter Welch: So no, I'm not required [as a programmer] to be able to lift objects weighing up to fifty pounds. I traded that for the opportunity to trim Satan's pubic hair while he dines out of my open skull so a few bits of the internet will continue to work for a few more days.
    • @rob_carlson: AI bites into manufacturing. "We [Foxconn] will reduce our total workforce to less than 50,000 people by the end of this year, from some 60,000 staff at the end of 2017; [up to 75% of production will be fully automated by the end of 2018]."
    • Simone Robutti: Code is not the object of programming. It’s merely the result.
    • @copyconstruct: "Often times, when debugging distributed systems, it is hard to tell the difference between cause and effect without some experimentation. For example, we noticed slight CPU throttling by Mesos around the timeout events, so we tried allocating more CPU. This did not help, indicating it was an effect, not the cause. With our long-running JVM-based services, we often suspect garbage collection (GC), but we didn’t see any correlation between GC events and timeout events in our logs. However, when inspecting the logs, we did notice that the logs themselves were being printed during the events! From this observation, we were able to trace the issue back to a release of a new Twitter-specific JVM. With the smoking gun in hand, we worked with our VM team to identify synchronous GC logging in the JVM as the culprit. The VM team implemented asynchronous logging and the issue disappeared, clearing the SuperRoot for launch."
    • Claudia Lutz: This apparent parallel between human and bee social interactions hid a surprise. When the researchers simulated how fast a piece of information (for bees, this could be anything from a chemical signal to a disease-causing pathogen) might spread through the network, they found that this occurred rapidly, unlike the slow spreading found in bursty human networks. This feature was robust to changes in colony demography, even re-emerging in the interaction networks of hives from whom many individuals had been suddenly removed.
    • Greg Ferro: What's striking here is the change in storage technology now that legacy storage vendors are no longer involved. When most storage came through EMC, it made sense to prevent innovation so they could sell old products at higher prices. Make something once, sell it many times is a good business model that inherently prevents innovation. Now that public cloud/hyperscalers are driving technology and spending money to buy the newer storage, a wider range of different products is possible. So we will see SATA SSD and NVMe SSD exist, plus 3D XPoint, 3D NAND, etc., where before it was difficult to bring them to market via the storage vendors.
    • perfectstorm: I agree with the author here but not for the reasons he stated. my biggest gripe with Slack is that it results in less and less one-to-one conversation. We used Slack at my last job but we were encouraged to stop by the person's desk for any matter that requires immediate attention. I sat near our CTO and sometimes I saw many engineers standing around his desk trying to debug a production issue. This resulted in knowing other people's name/face (which I realized after I joined my current company). At my new company we use Slack for pretty much everything. If there's a production bug you're encouraged to @here on the dedicated channel and someone would take a look at it. There's no one-to-one interaction to debug it. We had our holiday party last month where I introduced myself to some of my coworkers and once we started talking we realized that we have chatted on Slack but never saw each other even though we were all working in the same office. I never associated the engineer's face to the name.
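
A quick sanity check of chx's Backblaze arithmetic above, using only the figures from the quote (a sketch, nothing from Backblaze directly):

```python
# Energy side: 100,000 spinning disks at ~6W average draw.
disks = 100_000
watts_per_disk = 6
usd_per_kwh = 0.12

power_kw = disks * watts_per_disk / 1000            # 600 kW
energy_usd_per_day = power_kw * usd_per_kwh * 24    # ~1,728 USD/day

# Drive side: 81.76 drives installed per day, mostly 8 TB at ~0.02 USD/GB.
drives_per_day = 81.76
gb_per_drive = 8_000
usd_per_gb = 0.02
drive_usd_per_day = drives_per_day * gb_per_drive * usd_per_gb  # ~13,082 USD/day

# Even a zero-watt SSD only buys back the energy bill, so it can cost
# at most ~13-15% more per GB before HDDs win on economics.
break_even_premium = (drive_usd_per_day + energy_usd_per_day) / drive_usd_per_day
```

The numbers hold up: the energy bill is small change next to the daily drive spend, which is why the 8x SSD price premium is nowhere near break-even.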

  • Oh so spooky. Worm Uploaded to a Computer and Trained to Balance a Pole: the nematode C. elegans is about one millimetre in length and is a very simple organism. But for science, it is extremely interesting. C. elegans is the only living being whose neural system has been analysed completely. It can be drawn as a circuit diagram or reproduced by computer software, so that the neural activity of the worm is simulated by a computer program. Such an artificial C. elegans has now been trained at TU Wien (Vienna) to perform a remarkable trick: The computer worm has learned to balance a pole at the tip of its tail. 

  • The best way to improve your cell service? Host a Super Bowl. Super Bowl shines light on selfies, speed claims and 5G: AT&T enhanced or built 122 new permanent cell sites near heavily trafficked places like hotels, airports and convention centers in Minneapolis; Sprint upgraded hundreds of cell sites in the market to include all three of its spectrum bands. Like AT&T, it also benefited from distributed antenna system upgrades; T-Mobile reported that its 35x capacity increase in Minneapolis paid off, with T-Mobile customers clocking the fastest upload and download speeds at U.S. Bank Stadium according to Speedtest data from Ookla. T-Mobile said its speeds were 2.2x faster than Verizon's; Verizon streamed 180-degree stereoscopic video from the stadium in Minneapolis to virtual reality headsets in New York City to show live action on the field; they also showed high-resolution replays on secondary screens using 5G technology.

  • I'm always confused by this too. What We Talk About When We Talk About Performance: If you optimize a task so that it takes 90% less time than before then that is a ten-times speedup, and it should be described as such. It doesn’t matter whether the task is flying a plane from London to Seattle, loading a game, or finding a file. A 90% reduction in time is a ten-times speedup, period, and the five examples above are all incorrect. You should never describe something as being a “90% improvement” or “90% better” – these phrases are meaningless. Instead, embrace the big and accurate 10-times as fast number. It’s 10 times better!
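
The arithmetic is worth a moment (my own illustration, not from the article):

```python
def speedup(old_seconds, new_seconds):
    """How many times faster the optimized task is."""
    return old_seconds / new_seconds

def time_reduction(old_seconds, new_seconds):
    """Fraction of the original time eliminated."""
    return 1 - new_seconds / old_seconds

# A task cut from 100s to 10s is a 10x speedup AND a 90% time reduction:
# the same fact, stated two ways. Reading "90% improvement" as "1.9x
# faster" would imply ~52.6s, not 10s, which is why the phrase misleads.
```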

  • Making a site static is one of the recurring memes of bloggers. There are almost more static site generators than there are JavaScript frameworks—and that's saying a lot. Here's how Smashing Magazine—no strangers to the blog biz—made their site static, with a twist. How To Make A Dynamic Website Become Static Through A Content CDN. They switched from WordPress to Netlify to save money. The cost? WordPress has quite a nice ecosystem play going: need to support AMP, Dropbox, S3, etc., and there's a plugin for it. So they combine the best of both worlds by generating a static version of the website on the fly, page by page. Good detailed explanation with code examples. Not for the faint of heart.

  • Videos from FOSDEM 2018 are now available. FOSDEM is a free event for software developers to meet, share ideas and collaborate.

  • They really should have started with a rowboat MVP and then iterated with continuous incremental releases until a warship popped out. New German Warship Fails Sea Trials Due to Tech Woes: The main problem is integration, since 90 percent of the components on the frigate are brand new. 

  • If you suspend disbelief and accept a programmer would write so concentrated a batch of bad code, this is a great code example of Memory Safety in Rust:  A Case Study with C: In sum, the guarantees provided by Rust helped us fix every memory-related error in our buggy C implementation (with the exception of the capacity issue, which at least would have had a better error message). And remember–these are guarantees, meaning no matter how large your code base, Rust enforces them everywhere, all the time. Because if we can pack so many memory errors into 50 lines of C, imagine the nightmare of a large codebase. All this, of course, comes at the price of fighting with Rust’s borrow checker, both the initial learning curve as well as working around its limitations (see: non-lexical lifetimes), but for a codebase of sufficient scale, the pain is quite likely worth the payoff.

  • Facebook's Android @Scale 2018 recap, featuring stories told by engineering leaders from Audible, Facebook, Google, Instagram, Oscar Health, Pinterest, Spotify, Tumblr, and Twitter. You might like Migrating Apps To React Native from Android (and iOS). An award for the most creative use of the word jank goes to Eliminating Long-Tail Jank With StrictMode.

  • One of the most surprising aspects IMO of building a product: the fact that, regularly, we access new corpuses of knowledge that we did not have before, which help us improve the product significantly. How we grew from 0 to 4 million women on our fashion app, with a vertical machine learning approach: We decided we were going to build that tool to understand taste. We focused on developing the correct dataset, and built two assets: our mobile app and our data platform...We launched an extremely early alpha of Chicisimo with one key functionality. We launched under another name and in another country. You couldn’t even upload photos… but it allowed us to iterate with real data and get a lot of qualitative input...We spent a long time trying to understand what our true levers of retention were, and what algorithms we needed in order to match content and people...(a) identify retention levers using behavioral cohorts (we use Mixpanel for this)...(b) re-think the onboarding process, once we knew the levers of retention...(c) define how we learn...When we’ve obtained these game-changing learnings, it’s always been by focusing on two aspects: how people relate to the problem, and how people relate to the product...We soon saw that there was a lot of data coming in. After thinking “hey, how cool we are, look at all this data we have”, we realized it was actually a nightmare because, being chaotic, the data wasn’t actionable...The end result is our current system. A system that learns the meaning of an outfit, how to respond to a need, or the taste of an individual.

  • Big-O notation by itself can’t explain the differences we’re seeing. The code to search the trees is identical – the only difference is the order the nodes in the tree are laid out in memory. Maximize Cache Performance with this One Weird Trick: An Introduction to Cache-Oblivious Data Structures: I find cache-oblivious data structures very satisfying because they can yield huge performance gains in practice...The green line is a tree that’s been laid out with a so-called “recursive blocking” approach which is cache-oblivious. It is almost twice as fast for a tree with 16 million elements!...In designing cache-oblivious data structures and algorithms, a divide and conquer strategy frequently bears fruit. This stems from the fact that divide and conquer algorithms naturally break the work into subproblems of increasingly smaller sizes – one of those sizes will be close to B, the cache block size, and constrain the number of blocks you need to read. We’ll take a similar approach in laying out our cache-oblivious static BST.
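
For the curious, the recursive-blocking idea fits in a few lines. This is my own sketch for a perfect BST over sorted keys (n = 2**h - 1), not the article's code: emit the top half of the tree first, then each bottom subtree, recursing into every piece, so nearby levels land in nearby memory for any cache-line size.

```python
def veb_layout(keys):
    """van Emde Boas order for a perfect BST over sorted `keys`
    (len(keys) must be 2**h - 1). Split the tree at half its height,
    lay out the top subtree first, then each bottom subtree, recursing."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    h = n.bit_length()                  # tree height, since n == 2**h - 1
    bottom_h = (h + 1) // 2             # bottom subtrees get the larger half
    bottom_size = 2 ** bottom_h - 1
    top_size = 2 ** (h - bottom_h) - 1
    # Over sorted keys, the top-subtree keys sit at a fixed stride.
    top_keys = [keys[bottom_size + i * (bottom_size + 1)] for i in range(top_size)]
    out = veb_layout(top_keys)
    for i in range(top_size + 1):       # one bottom subtree per gap in the top
        lo = i * (bottom_size + 1)
        out += veb_layout(keys[lo:lo + bottom_size])
    return out

# 7 keys: the root block first, then the two height-2 subtree blocks:
# veb_layout(list(range(7))) -> [3, 1, 0, 2, 5, 4, 6]
```

Searching still walks root, child, grandchild as in any BST; only the array positions change, which is the whole trick.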

  • As only he can, Cliff Click tells you what's Under the hood of the JVM.

  • The six wins of Rust. How Rust is Tilde’s Competitive Advantage. Why not Ruby? Too soft. Why not C++? Too hard. Rust? Just right. First win: After rewriting the agent in Rust, it consistently used 8 MB: 92% smaller! Once this version was shipped to production, there were no more customer reports of being over the Heroku memory limit. In terms of raw performance, Rust was a clear win, but that wasn’t the only benefit the Tilde team saw. Second win: Rust allowed the Tilde engineers to craft differentiating features that collect more data in Skylight without having to worry about unintentional resource bloat caused by garbage collection and a language runtime. Third win: effectively zero crashes. Fourth win: More Maintainable. Over the last 3 and a half years, there have been 63 releases of the Skylight agent, each of which was quickly and easily adopted by their customers. Fifth win: Rust’s prevention of data races at compile time allowed Tilde to discover potential issues long before a user could find them in production. Sixth win: Teachable to a Broad Team.

  • Language does make a difference. Python’s Weak Performance Matters: Python has the lowest TimeToWriteCode, but very high TimeToRunCode. TimeToWriteCode is fixed as it is a human factor (after the initial learning curve, I am not getting that much smarter). However, as datasets grow and single-core performance does not get better TimeToRunCode keeps increasing, so that it is more and more worth it to spend more time writing code to decrease TimeToRunCode. C++ would give me the lowest TimeToRunCode, but at too high a cost in TimeToWriteCode (not so much the language, as the lack of decent libraries and package management). Haskell is (for me) a good tradeoff. lmm: Today, as the article says, things are different: Core counts are rising so practical Python performance is falling further and further behind, datasets have gotten large enough for Python performance to be an issue, Javascript has proven that it's possible to get much higher performance out of a scripting language, languages like Haskell have gone mainstream and offer a comparable-to-Python (better, in fact, given what a mess Python's packaging situation is) tool/ecosystem experience and comparable levels of productivity with much higher performance.

  • Seems a lot like software development. Mark Rosenfelder: David Mamet has memorably explained the basic formula for drama: Someone has a problem. They take action to solve it, and it’s going well. At the last minute it fails. The bad guys advance— they’re about to win! They’re stopped just in time. Then the pattern repeats.

  • In another episode of As the World Churns, the theme explored is how every new destabilizing weapon causes the development of an equally new destabilizing counter-strategy—Security drones are ready to intercept rogue drones during the 2018 Winter Olympics in South Korea: Drone-catching drones will be deployed and are able to cast nets over rogue drones that are considered to be a threat to the event. In addition to the drone-catching drones, security forces have been training to shoot drones down.

  • My MySQL Linux Tuning Checklist: IOschedular (noop or deadline); Linux Kernel > 3.18 (multi queuing); IRQbalance > 1.0.8; File System: noatime, nobarrier ext4: data=ordered, xfs: 64k logfiles in different partition (if possible); Swapiness; Jemalloc (if needed); Transparent hugepages; Ulimit (open files); Security: IPtables, PAM security

  • Is that paper you want hidden behind a paywall? Maybe unpaywall can find it for you. 

  • Reactive Microservices Architecture on AWS: From time to time, it is necessary to update core data in Redis. A very efficient implementation for this requirement is using AWS Lambda and Amazon Kinesis. New core data is sent over the AWS Kinesis stream using JSON as the data format and consumed by a Lambda function. This function iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped and transformed from ByteBuffer to String and converted into a Java object. The Java object is passed to the business logic and stored in Redis. In addition, the new core data is also sent to the main application using Redis pub/sub in order to reduce network overhead and convert from a pull- to a push-based model.
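
The unwrap-and-store step is only a few lines in any language. A minimal Python sketch: the event shape matches what Lambda hands a Kinesis-triggered function (base64 data under Records[].kinesis.data), while the dict stands in for the Redis client so it runs anywhere.

```python
import base64
import json

def handler(event, context=None, store=None):
    """Unwrap each Kinesis record, decode the base64 payload, parse the
    JSON core data, and hand the object to the store. `store` is a plain
    dict standing in for the Redis set + pub/sub calls in the article."""
    store = {} if store is None else store
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])  # bytes payload
        item = json.loads(raw)                             # -> Python object
        store[item["id"]] = item        # stand-in for redis.set / publish
    return store
```

A real handler would also publish the item on a Redis channel to get the push-based update the article describes; the `id` field is a hypothetical payload key, not something the article specifies.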

  • A new serverless computing podcast you might be interested in is Think FaaS. The first episode is ‘Talkin' Lock-In’. Short at < 5 minutes; targeted at a beginner to intermediate audience.

  • So, you want to use DNA for storage? DSHR has some advice for your new startup. DNA's Niche in the Storage Market: Sales team, your challenge is to own 90% of the market by finding ten customers a year who have 300EB of cold data they're willing to spend $90M to keep safe. Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr. Finance team, your challenge is to persuade the company to spend $24M a year for the next 10 years for a product that can then earn about $216M a year for 10 years.

  • Yep, I use Lambda to handle Facebook chatbot operations and it works great. Stop Using Servers to Handle Webhooks: At this point I know it’s pretty obvious that I am going to advocate for all the wonderful benefits of using FaaS for processing webhooks, though I do acknowledge there are some pretty annoying tradeoffs...the drawbacks around using FaaS are usually around maintainability, testing, and cold starts. There are some tools that help with maintaining versions of your functions.

  • Why Developers Love Node.js & what's their main issue with it? Uses: Node.js is Used Mainly for Developing API's, Backends/Servers & WebApps. Pros: Fast development, great performance & the easiness of Node.js makes it a favorite. Cons: Most Node.js Developers Face Performance & Security Problems in Production. 

  • Making 30x performance improvements on Yelp’s MySQLStreamer. Here's what they found: logging can be expensive; focus on the code that runs often; batch IO operations; AVRO serialization is slow; run your application on PyPy. Here's how they found them: Identify your key performance metrics - typically, they are “Throughput” and “Latency”; Enable your application to emit these key metrics to some kind of time-series charting software like SignalFX; Equip your application with some kind of code profiler like VmProf; Make sure you have a production-like canary environment in a “saturated” state where you can profile your application’s performance; Understand where your application should be spending the most time; Generate and inspect the VmProf flamegraph to check if it matches with your understanding. If yes, then quit. If not, document the problem and attempt to fix it; Validate by looking at SignalFX graphs; Goto (5).
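
Of the fixes, "batch IO operations" is the most reusable, and the shape is always the same (a sketch, not Yelp's code):

```python
def batched(items, size):
    """Yield lists of up to `size` items so each downstream IO call
    (a DB write, a network send, a flush) pays its fixed cost once
    per batch instead of once per item."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch     # flush the final partial batch
```

The win is entirely in amortizing per-call overhead; the total bytes moved don't change.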

  • Your Database Should Work Like a CDN. marknadal: Also a distributed database engineer, and while I disagree with CockroachDB's CAP Theorem tradeoff decisions, they are definitely well reasoned and principled. RethinkDB was Master-Slave (strongly consistent) with an amazing developer community and actually survived Aphyr's tests better than most systems. CockroachDB is also in the Master-Slave (strongly consistent) camp, but more enterprise focused, and therefore will probably not fail, however their Aphyr report worked... but was unfortunately very slow. But hey, being strongly consistent is hard and correctness is a necessary tradeoff from performance. Other systems, like Cassandra, us (https://github.com/amark/gun), Couch/Pouch, etc. are all in the Master-Master camp, and thus AP not CP. Our argument is that while P holds, realtime sync with Strong Eventual Consistency is good enough yet has all the performance benefits, for everything except for banking. Sure, use Rethink/Cockroach for banking (heck, better yet, Postgres!), but for fun-and-games you can do banking on top of AP systems if you use a CRDT or blockchain (although that kills performance, CRDTs don't) on top. So yeah, I agree with you about CAP Theorem and stuff, disagree with Cockroach's particular choice - but they do have some pretty great detailed explainers/documentation on their view, and therefore should be treated seriously and not written off.

  • Periscope Technology Stack: Programming languages/frameworks: C++ (GO), HTML5/CSS3, Java, JavaScript (Node.js, React, RxJS, Restify, EmberJS, AngularJS, BackboneJS), Python, Ruby (Ruby on Rails). Data storage/management: Atlas-DB, Cassandra, MySQL, Oracle, PostgreSQL. Cloud platforms: Amazon EC2/S3. Analytics: Google Analytics, Hadoop, Hive, MixPanel, Mode, Parquet, Pig, Presto, Spark. CDN: Amazon CloudFront, Cloudflare, Fastly, Open Connect. Streaming protocols: Adobe HTTP Dynamic Streaming, Apple HTTP Live Streaming, M2TS, MPEG-DASH, Microsoft Smooth Streaming, RTMP. Media formats: H.264. Media containers: FLV, MP4. Media processing platform: Brightcove, Contus Vplay, DaCast, Flash Media Server, JW Live, Livestream, Muvi, Ustream, Vimeo PRO, Wowza Media Systems. Geolocation: Google Maps, MapKit/Core Location (iOS). Messaging: Firebase, PubNub, Twilio.

  • An epic peak of 3.4 million concurrent players leads to an epic failure and even more epic post mortem. The lesson: scale changes everything. Postmortem of Service Outage at 3.4M CCU - Epic Games. The problems: The extreme load caused 6 different incidents between Saturday and Sunday, with a mix of partial and total service disruptions to Fortnite. The fixes: Identify and resolve the root cause of our DB performance issues...Optimize, reduce, and eliminate all unnecessary calls to the backend from the client or servers...Optimize how we store the matchmaking session data in our DB...Improve our internal operation excellence focus in our production and development process...Improve our alerting and monitoring of known cloud provider limits, and subnet IP utilization...Reducing blast radius during incidents...Rearchitecting our core messaging stack...Scaling our internal infrastructure...Performance at scale...MCP Re-architecture.

  • It took one long day to convert from Node.js to Lambda. Going serverless: Converting Yield.IO to AWS and Lambda. The Good: Free SSL certificates; Performance. Serving from CloudFront is fast; No shell accounts to maintain and upgrade; Cost. The pay as you go model is awesome for low volume side projects like Yield.IO. I expect the costs will be only a few dollars per month. The Bad: The sea of AWS console browser tabs; When using the S3 upload UI you have to constantly configure permissions; Takes a long time to deploy CloudFront distributions; Edit/Upload/Test cycle with Lambda.

  • Cars are complex embedded systems, if not hyper-converged private style datacenters. Compressing data in vehicles: As the number of cameras in automobiles is on the rise with the move to autonomous vehicles, internal vehicle networks are being pushed to their limits from the flood of data...There can be up to 12 cameras in new vehicle models, mostly in the headlights, taillights, and side mirrors, while an on-board computer built into the car uses the data for the lane assistant, parking assist system, or to recognize other road users or possible obstacles...in order to avoid latency, the team uses only special mechanisms of the H.264 coding method, whereby differences are no longer determined between images, but within an image. This makes it a low-latency method. With this method, the delay is now less than one image per second, almost real time, so the H.264 method can now be used for cameras in vehicles.

  • Excellent tutorial from Riot Games on PROFILING: MEASUREMENT AND ANALYSIS. And here's a nice tutorial with code examples on using Go serverless to process an image. Also, AWS Lambda, GoLang and Grafana to perform sentiment analysis for your company / business.

  • I frequently use the AWS Calculator to come up with rough estimates after selecting the EBS volume size, type, and IOPS configuration. The Importance of Capacity Planning for EBS Volumes in AWS: have you ever thought about whether you are provisioning under or over the required amount of storage capacity? This is one of the main places where over-provisioning happens. It's common to provision a higher amount of capacity or throughput just to be on the safe side, only to end up in problems when the application workload changes...there are other factors, such as snapshots and pre-warming, that also affect AWS EBS performance and need to be considered in capacity planning...Another important action you can take is to automate the associated volume modifications using CloudFormation, the AWS CLI, or the AWS SDKs so that you can provision the required configuration optimally and only when it is needed...Most of the time I see EC2 configurations go with the default general purpose SSD, which is alright unless you know what you are doing. However, there are cost benefits in selecting HDD storage over SSD for several sequential access scenarios or non-trivial workloads, which need to be considered while planning throughput...This is one of the areas many forget to consider, especially for high-performance applications. One of the important aspects of managing latency is managing the queue length, which depends on the number of pending I/Os for an EBS device...Knowing the limitations and the possibilities for automating EBS capacity and provisioning can save lots of money lost to over- and under-provisioning of EBS volumes...It is also important to understand that AWS costs change from region to region and with usage. So selecting the right region, along with the necessary configuration for automated tasks, can save you a considerable amount of money if done properly. This includes scheduling snapshots on the required time frames and increasing storage size with predictive analysis while applying the changes at the right times.
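The size/IOPS coupling the article warns about is easy to sketch. For gp2 volumes, baseline IOPS scales with size: 3 IOPS per GiB, with a 100 IOPS floor and a 10,000 IOPS cap (gp2 limits as of early 2018; check current AWS documentation before relying on these numbers).

```python
def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 EBS volume: 3 IOPS/GiB, floor 100, cap 10,000
    (gp2 rules as of early 2018)."""
    return min(max(3 * size_gib, 100), 10000)

# A 500 GiB gp2 volume gets a 1,500 IOPS baseline. Growing the volume is
# the knob the article suggests automating (CloudFormation, AWS CLI, or
# SDKs) rather than over-provisioning up front.
assert gp2_baseline_iops(500) == 1500
assert gp2_baseline_iops(20) == 100      # small volumes get the 100 IOPS floor
assert gp2_baseline_iops(5000) == 10000  # capped at the gp2 maximum
```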

  • A fun trip down memory lane. Writing Space Invaders with Go. All in 300 lines of code.

  • Write-Behind Logging: This paper explores the changes that are required in a DBMS to leverage the unique properties of NVM in systems that still include volatile DRAM. We make the case for a new logging and recovery protocol, called write-behind logging, that enables a DBMS to recover nearly instantaneously from system failures. The key idea is that the DBMS logs what parts of the database have changed rather than how it was changed. Using this method, the DBMS flushes the changes to the database before recording them in the log. Our evaluation shows that this protocol improves a DBMS’s transactional throughput by 1.3×, reduces the recovery time by more than two orders of magnitude, and shrinks the storage footprint of the DBMS on NVM by 1.5×.
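The "what changed, not how" distinction can be made concrete with a toy contrast (illustrative only, not the paper's implementation) between a write-ahead and a write-behind update of a single key:

```python
log = []        # stands in for the log device
database = {}   # stands in for durable storage (NVM in the write-behind case)

def wal_update(key, value):
    """Write-ahead: record HOW the database will change, then apply it.
    Recovery must replay these redo records."""
    log.append(("redo", key, value))
    database[key] = value

def wbl_update(key, value):
    """Write-behind: persist the change first (cheap on byte-addressable NVM),
    then record only WHAT changed. Recovery has no redo payload to replay,
    which is why it can be nearly instantaneous."""
    database[key] = value
    log.append(("dirty", key))

wal_update("inventory", 7)
wbl_update("balance", 100)
assert log[-1] == ("dirty", "balance")  # the write-behind record carries no new value
```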

  • Computer Architecture - ETH Zürich - Fall 2017. Onur Mutlu's lecture videos from the Computer Architecture course taught at ETH Zürich in Fall 2017. 28 videos in the series.

  • OmniLedger: a secure, scale-out decentralized ledger via sharding: OmniLedger makes a nice complement to Chainspace that we looked at yesterday. The two systems were developed independently at the same time. OmniLedger combines Visa levels of scalability (caution: the authors compare against the average Visa tps, the peak tps in the Visa network is considerably higher) with a secure decentralised ledger. It’s also a demonstration of how quickly the field is progressing, and something of a wake-up call if you’ve been working in the field of distributed systems and transaction processing but so far ignoring developments in decentralised ledgers. Standard building blocks are emerging and being combined in novel ways, and there’s a lot to learn!
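One of those standard building blocks is deterministic shard assignment by hash, so every node agrees on placement without coordination. A minimal sketch (the real OmniLedger assignment is far more involved, using bias-resistant distributed randomness for validator-to-shard assignment; names here are illustrative):

```python
import hashlib

def shard_for(tx_id, num_shards):
    """Map a transaction/UTXO identifier to a shard deterministically."""
    digest = hashlib.sha256(tx_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# All nodes compute the same placement independently:
assert shard_for("utxo:abc123", 16) == shard_for("utxo:abc123", 16)
assert 0 <= shard_for("utxo:abc123", 16) < 16
```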

  • ODINI : Escaping Sensitive Data from Faraday-Caged, Air-Gapped Computers via Magnetic Fields: We introduce a malware code-named ODINI that can control the low frequency magnetic fields emitted from the infected computer by regulating the load of the CPU cores. Arbitrary data can be modulated and transmitted on top of the magnetic emission and received by a magnetic receiver (bug) placed nearby. We provide technical background and examine the characteristics of the magnetic fields. We implement a malware prototype and discuss the design considerations along with the implementation details. We also show that the malicious code does not require special privileges (e.g., root) and can successfully operate from within isolated virtual machines (VMs) as well.
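The modulation scheme reduces to on-off keying: a '1' bit is sent by loading the CPU cores (raising the magnetic emission) for one bit period, a '0' by idling. A toy sketch of the encoding only (the actual ODINI work covers the magnetic-field physics, receiver design, and framing):

```python
import time

def modulation_schedule(bits, bit_time=1.0):
    """Map a bit string to a list of (cpu_state, duration) steps."""
    return [("busy" if b == "1" else "idle", bit_time) for b in bits]

def transmit(bits, bit_time=1.0):
    """Drive CPU load per the schedule: spin for '1', sleep for '0'."""
    for state, duration in modulation_schedule(bits, bit_time):
        end = time.monotonic() + duration
        if state == "busy":
            while time.monotonic() < end:
                pass              # spin: high CPU load -> stronger field
        else:
            time.sleep(duration)  # idle: low CPU load -> weaker field

assert modulation_schedule("101", 0.5) == [("busy", 0.5), ("idle", 0.5), ("busy", 0.5)]
```

Note that nothing here needs root or hardware access, which is the point of the attack: regulating your own CPU load is an unprivileged operation, even inside a VM.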