Stuff The Internet Says On Scalability For December 1st, 2017

Hey, it's HighScalability time:

Isn't this all of software? @thomasfuchs: Here we see a group of JavaScript engineers implementing a method that adds two numbers

If you like this sort of Stuff then please support me on Patreon. And there's my new book, Explain the Cloud Like I'm 10, for complete cloud newbies.


  • 82%: chance a file on GitHub is a duplicate; 11: new AWS regions; 42%: AWS yearly growth; 1,100: new AWS services in 2017; 300%: year-over-year growth in Lambda; 00000000: code to launch a Minuteman missile; 100 megawatts in 100 days: biggest battery in the world; 40: months in prison for VW engineer; 3,000 cores: Raspberry Pi cluster; 11: lost cities found by building a database from 4,000-year-old clay tablets; 1.25 million: Riot Games builds per year; 41.78: miles walked at re:Invent;

  • Quotable Quotes:
    • @gigastacey: This FCC is going to destroy net neutrality, strangle competition in media, let wireline providers off the hook for replacing copper with fiber or an equivalent to copper AND kill broadband access for the poor. This is an unprecedented attack on consumers.
    • @randyshoup: “My service is stateless, by which I mean I have state, but I store it somewhere else.” @samnewman #reInvent
    • @StuFlemingNZ: "Hi, I've found a fault with the English language and I need an entomologist" "An etymologist you mean?" "No. It's a bug, not a feature"
    • @copyconstruct: The future where "all the code you ever write is business logic" is one that will be facilitated by the huge cloud providers, leaving most infra startups either acquired or in the dust.
    • Mark Callaghan: At high-concurrency mysqld with jemalloc or tcmalloc can get ~4X more QPS on sysbench read-only compared to mysqld with glibc malloc courtesy of memory allocation stalls.
    • @aisipos: AWS Lambda functions can now use top memory size of 3GB. #reinvent2017
    • @cloud_opinion: It feels like AWS is putting more stress on containers than on serverless - Is it because they want to balance long game with short term revenue to fund the retail business? #reInvent
    • @__apf__: "how was your day" "today I parallelized a thing and slowed it down 100x" "you mean sped it up 100x?" "nope"
    • @mipsytipsy: It’s this simple: if you don’t sample, you don’t scale.
    • Daniel Dennett: The key insight, which I’ve known for years, is that we have to get away from the idea of there being the pure ultimate fixed proposition that captures the information in any informational state.
    • @kelseyhightower: I need to put my hands on EKS before I can speak on it, but my initial reaction: this is a good thing for the community and adds weight to the Kubernetes anywhere promise.
    • @Koffie_kopjes: Ok, so far for #Cloud9 It could be a great IDE, but requiring third party cookies.... thought @Werner told developers are the new security team, but if they require third party cookies in 2017, they aren't very aware... ;) #security #reinvent2017
    • @GossiTheDog: I honestly think IT is backsliding in InfoSec across the world at the moment. I’ve said it before, but a decade ago we had two factor VPNs etc - now there’s a massive tilt towards open RDP, AWS keys everywhere etc etc.
    • melissa mcewen: If I won the lottery would I still code? I would, but it would not be like work. It would be projects I enjoyed. And it would be fewer hours.
    • olalonde: I feel like that should be the other way around. When all the "blockchain startups" and ICOs blow up, Bitcoin will be left standing. The true innovation behind the "blockchain" was its decentralised consensus mechanism. That mechanism is only secure as long as no single entity controls over 50% of the hash rate. Some of the largest Bitcoin miners have so much hash rate today that they could attack any (SHA-256 based) blockchain but the Bitcoin one.
    • @ben11kehoe: "The future" in this keynote is apparently 2020, which will still be containers for most customers. #serverless is on a bit longer timeline for the masses #reinvent
    • @irwin: I’ve seen things you people wouldn’t believe. Gopher, Netscape with frames, the first Browser Wars. Searching for pages with AltaVista, pop-up windows self-replicating, trying to uninstall RealPlayer. All those moments will be lost in time, like tears in rain. Time to die. 
    • Andy Jassy: We're just at the beginning of mainstream enterprise mass migration to the cloud...The torrid pace of adoption and innovation in the serverless (Lambda) space has totally blown us away...in fact, he says that if Amazon.com were starting today, it would go serverless
    • Andy Jassy: In our [AWS] business, you have to be able to have access to capital. It's part of why I think it's hard at the scale that we're operating at. It’s hard for others to start from scratch and pursue it because not only do you need hundreds of services to have a competitive offering, but you need large amounts of capital.
    • RightScale: 70 percent of the 104 price points we include in our comparison have gone down since our last comparison in April 2017. Although these comprise a fraction of the total price points, they represent some of the most commonly used instances
    • RightScale: Overall, Azure is the cost leader, with the lowest price across scenarios about 71% of the time with the highest price just 8% of the time. AWS fell in the middle, while surprisingly, Google Cloud had the highest price half the time, 
    • Takashi Nishikawa: The power grid is quite robust against the propagation of failures — perhaps surprisingly robust, when we consider all the complexities involved
    • @vgcerf: Today is the 40th anniversary of the first three-network test of the Internet Protocols: joining ARPANET, Packet Radio and Packet Satellite networks linking the US and Europe!
    • Andy Jassy: What's different is with every successive year, as we launch a thousand plus features and services, we just have the capabilities to make it easier for the rest of the market to use us. So I think the total addressable market for the areas that we touch, which is infrastructure software, hardware and data center services, is trillions of dollars
    • @GossiTheDog: Again: stop paying the ransoms. We’re creating a billion dollar criminal industry instead of, well, setting up backups. We are monetising low skill crime.
    • @cogconfluence: Asked a bunch of mechanical turkers what one question they would ask to determine if they were talking to a human or AI. fave reply: When is the last time your teeth felt like they had little sweaters on them?
    • @EricJorgenson:  I still find this concept absolutely staggering: "On a daily basis, 15 percent of searches -- 500 million -- have never been seen before by Google's search engine, and that has continued for 15+ years"
    • @mijndert: Things I’m most excited about from the @awscloud #reInvent announcements: Fargate, EKS, Launch Templates, Aurora Multi-Master, Aurora Serverless, MediaLive, Inter-Region VPC Peering, and GuardDuty.
    • @0x424c41434b: You are probably tired of hearing me talk about rust but one reason I like it is that, I feel like a better programmer because it takes out that fear of something going wrong. Concentrating on the logic only made me do things much quicker than I did in the past. More confidence
    • Steve Konves: For those of us developers who have an unwavering love for our craft, there will always exist a bias to make decisions based on our passion for coding rather than profitability or cost savings.
    • @dcaoyuan: With new tuned #akka http client, our crawler can fetch and process 300k+ web page per day, 100 millis per year, on a 16 cores 64G memory machine.
    • @benschwarz: When Amazon changed their pricing to per minute billing I implemented an aggressive autoscaling policy. This policy (with tweaks and improvements along the way) has reduced EC2 costs by >30% and improved service dramatically.
    • @CodyBrown: Seriously. I don’t think people quite understand how many lawyers are getting into Space Law right now. Satellite internet is so feasible and the economics are changing. Massssssive terrestrial infrastructure is about to get competition
    • @Silver_Watchdog: The Evolution of Bitcoin 1. It's the future of global payments. The revolution! 2. So what about Mt.Gox and robo traders. Stocks are rigged too. 3. Yes it forks and creates more supply. Your point? 4. Everyone knows it's all traded for speculation and not used for payments. Geez
    • DHH: Etsy corrupted itself when it sold its destiny in endless rounds of venture capital funding. This wasn’t inevitable, it was a choice. One made by founders and executives who found it easier to ask investors for money than to develop the habits and skills to ask customers.
    • @bitfield: “In cloud-native, network issues—mapping IP addresses, latency, retries—are now falling into the lap of developers.”
    • @nathankpeck: Okay so I ran a small load test of 1000 reqs/sec, against a 4 machine cluster of containerized Node.js processes backed by DynamoDB Accelerator. DAX kept response times under 5ms for most services!
    • @tsrandall: Musk claims that "jacknifing is impossible." There are four independent Model 3 motors on two axles, and they'll adjust. Even if multiple motors failed, the truck will keep driving.
    • Pavan Patibandla: If you are using SSDs and running PostgreSQL with default configuration, I encourage you to try tuning random_page_cost & seq_page_cost. You might be surprised by some huge performance improvements.
    • Jim Baggott: What we recognize as mass is a behavior of these quantum fields; it is not a property that belongs or is necessarily intrinsic to them.
    • @mweagle: Hot take: AWS aiming to be the 21st century version of Sears + General Electric/Westinghouse
    • Andy Jassy: All the rumors and crazy claims from competitors creates a lot of wasted energy and time, which we [AWS] don’t pay attention to. Instead, we spend that energy on trying to listen to what customers care about, and really inventing and iterating on their behalf
    • @Techmeme: YouTube says it has terminated 270+ accounts, taken down 150K+ videos, removed ads from ~2M videos over past week as it purges content that endangers children (Noah Kulwin / VICE News) 
    • @ben11kehoe: I think it’s a real thing. We at @iRobot have not had to build a competency in much of cloud infrastructure and operations. Never had EC2 in production.
    • @AWSreInvent: Need multi-region and multi-master in your Managed NoSQL database? Announcing DynamoDB Global Tables #reInvent
    • @Jason: Instagram will pass 1B users in Q1; they sold for $1B — $1 per user AR 1b.  If they stayed private — which they could easily done — they would have a $150-200B IPO (at $200 per user, like fb is valued). Founders: never sell your company — sell secondary shares, go public.
    • @marcwin: Classic failure to predict exponential growth. We really have no idea how fast technological change is coming. It should be the first thing on every Government's agenda.
    • @ben11kehoe: FYI, Amazon MQ is not a serverless messaging fabric. It's more like the Elasticsearch service: you're just not managing the instances. Still useful for lots of people.
    • @QuinnyPig: DynamoDB backup and restore announced, throwing away a giant pile of code every shop’s been running for years. #reinvent
    • @AWSreInvent: The explosion of virtual IPs at Amazon. 6,000 to 600,000 in 10 years. #reInvent
    • @cloud_opinion: There are two kinds of ops people: 1) People who think k8s is the solution to every problem 2) Ops people
    • @ekaulberg: Peter @awscloud: exponential growth in #GPGPU / #FPGA deployments in #EC2 this year. #reinvent17 #AWSreInvent
    • @brendangregg: Bare metal systems these days are huge. 72 CPUs or more. You want all that, or do you want to carve it up? If the latter, then you're putting on a hypervisor anyway. You think you can do better than Nitro?
    • bupku5: this is such a common pattern and it is lame that Go makes developers hunt down blogs to figure it out. 99% of the time you want a goroutine to gracefully exit on some signal or just be killed after a timeout...why are we constantly reimplementing these with hacked variants of WaitGroups, chans and mutexed flags?
    • TheAceOfHearts: If I were creating a new app I'd probably contain each backend service to a single cloud environment. Different services can be in different cloud environments, but no cross-cloud services. I'm uncertain the extra complexity would be worth it, at least for many of the use-cases I'm considering. Having fewer environments to setup and maintain is a big win in my book.
    • @brendangregg: AWS announces "Nitro hypervisor" used by c5s, KVM based, plus more: "we want customers to have instances that are indistinguishable to bare metal" #reInvent
    • @AndySugs: RT:  (TamaraMcCleary)#BIGDATA: By 2020 there will be 44 Zetabytes of Data & an astonishing 5.2 TB of information per person! #reInvent
    • @seldo: For a bargain $5,000/hour consulting fee I will tell you how to build your blockchain-based startup to launch 10x sooner by not using the blockchain.
    • @chamath: Over the course of re:Invent this year, AMZN reaffirmed complete dominance in cloud computing and rendered about $2B of venture funding in dumb businesses by sheepy VCs obsolete...containerization, ML-aaS, etc etc. these guys are cranking.
    • manigandham: Running multiple Kubernetes clusters in different cloud vendors with configuration details abstracted away in a more self-contained service that can be deployed by itself as the bootstrap process makes things much easier. Consul is a good system for this, deploy it once (manually if necessary) then everything else refers to it to figure out the environment details automatically.
    • Google: At Google, we’ve found that larger clusters are more efficient — especially when running multiple workloads. So if you were hesitating to create larger clusters worry no more and scale freely!
    • Joshua Burgin: We launched a new pricing model and simplified access to Amazon EC2 Spot Instances. With these changes, you can launch Spot instances the same way you launch On-Demand instances, and customers can count on low, predictable prices.
    • kgilpin: +1 The real money in the cloud business is in the Fortune 1000, not the small fry. 5 years ago it was only startups and Netflix using the cloud. Now the battle is on between AWS, Azure and Google to be the “new data center”.
    • FBISurveillance: I'm serving 1.2MM requests per second from 3 GCP regions, managing instances and GKE clusters with terraform, and I cannot see how could I possibly set that up in a resilient fashion with DigitalOcean. I think DO is perfect for certain scale apps. You usually care about UI things mostly when you spin up couple servers; but when you operate hundreds of machines you need automation.
    • ethomson: GitHub only shares objects among forks. Source: I used to work on GitHub's Git Infrastructure team, but this is publicly available information.
    • @copyconstruct: OH: “their infra is 50 shades of broken and they’re now setting up their own frankenetes cluster hoping it’ll fix everything” 😮😕
    • dmitrygr: Classic silicon valley-like thinking. Try to "disrupt" a system without understanding what it is for and why it was designed the way it was.
    • Leveson: The safety culture is the general attitude and approach to safety reflected by those working in an industry. The accident reports all described various aspects of complacency and a discounting or misunderstanding of the risks associated with software. Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk. This phenomenon is not new, and is extremely difficult to counter when it enters the engineering culture of an organization.
    • @shivenz: my favourite #reinvent2017 announcements were: 1. DynamoDB Global tables, backups 2. Neptune 3. Fargate
    • rjzzleep: Dude that elitist crap you're spouting is exactly what killed Solaris. Repeat after me: I read the manual pages and it's still crap.
    • Brendan Gregg: I've been investigating the overhead of Nitro and have so far found it to be minuscule, often less than 1%. It is hard to measure. Nitro's performance is near-metal. I'm also excited for Nitro as it exposes all PMC counters. I previously posted The PMCs of EC2: Measuring IPC, covering the architectural set of seven PMCs that were recently exposed to certain instance types in EC2. That's only 7 PMCs. On the c5 Nitro instances, you have hundreds of PMCs, and can truly analyze low-level CPU performance in detail. This should help find wins of 5%, 10%, and more.
    • @Nick_Craver: "Nothing is worth making 1ms faster" is a bit ignorant. We get 8B+ hits a month. If I shave 1ms off each, that's over 26,000 dev hours/year.
    • @mattklein123: I agree that in the case of Mesos vs. K8s, the choice of C++ vs. Go is a major factor in why K8s won. As I keep saying (ah the joy of shouting into the Twitter void): there are many factors in choosing which tool is right for a job. I do not use C++ for everything I do! 1/
    • Rodney Brooks: The idea that space can have properties does not come easily, but by the time you finish this book you will be comfortable with the concept of fields.
    • Orion Edwards: What's my preference? Superficially Kotlin and Swift are very similar. Assuming you could somehow magically eliminate all the platform specific differences between iOS and Android - say perhaps you were writing server-side software on linux or something - I think I'd choose Swift. There's a variety of reasons, but the thing that I guess I place the most weight on is the modern features/paradigms (see above). From that point of view, Kotlin and Swift are actually very different.
    • rocketplex: My rule of thumb is that if I have to spend more than 10m on the query, I do it in SQL. That's served me well for a long time. Leave me and my ORMs alone.
    • Jeff Hawkins: This provocatively suggests that all processing in the neocortex is associated with locations, even if those locations do not correspond to physical locations in the world. It suggests that we manipulate abstract concepts using the same neural mechanisms we use to manipulate physical objects. Of course, manipulating concepts is a core feature of general intelligence.
    • John Tooby: Forming coalitions around scientific or factual questions is disastrous, because it pits our urge for scientific truth-seeking against the nearly insuperable human appetite to be a good coalition member. Once scientific propositions are moralized, the scientific process is wounded, often fatally.  No one is behaving either ethically or scientifically who does not make the best case possible for rival theories with which one disagrees. 
    • _msw_: The NIC that is used by EC2 Bare Metal instances is an Elastic Network Adapter (ENA) PCI device that surfaces a logical VPC Elastic Network Interface. ENA is implemented in an ASIC that we design and build. When ENA is used in virtualized instances, Intel VT-d and SR-IOV are used to bypass the hypervisor. When ENA is used in a bare metal instance, the OS simply has direct access to the PCI device. In either case the device is a controlled surface, and VPC software defined networking deals with verifying and encapsulating network traffic.
    • jatsign: I've been programming my own Ethereum smart contract (virtual currency) for awhile now. Here's some gotchas off the top of my head: - You have about 500 lines of code to work with. This of course varies, but smart contracts have to be really small to fit in the max gas limit (6.7 million wei). - You can break up your code into multiple contracts, but the tradeoff is an increased attack area. - Dumb code is more secure than smart code. - The tooling is very immature. You'll probably use truffle, which just released version 4. It makes some things easier, some harder. Its version of web3 (1.0) may differ from what you were expecting (0.2). - The Ethereum testnet (Ropsten) has a different gas limit than the main net (4.7 million vs 6.7 million).
    • Kranar: So I took the next logical step and used a BigDecimal. Now my rounding issues were solved but the performance of my applications suffered immensely across the board. Instead of storing a 64 bit int in a database I'm now storing a BigDecimal in Postgresql, and that slowed my queries immensely. Instead of just serializing raw 64-bits of data across a network in network byte order, I now have to convert my BigDecimal to a string, send it across the wire, and then parse back the string. Every operation I perform now requires potentially allocating memory on the heap whereas before everything was minimal and blazingly fast. I feel like there is a general attitude that performance doesn't matter, premature optimization is evil, programmer time is more expensive than hardware, so on so forth... but honestly nothing feels more demoralizing to me as a programmer than having an application run really really fast one day, and then the next day it's really really slow.

  • As usual, Amazon's programmers have been as busy as Santa's little elves. Assuming you aren't already tired of AWS re:Invent 2017, there are a lot of packages to unwrap under the tree:

  • The art of doing without doing, assuming you are well funded. Scaling Unsplash with a small team: On an average day, our API handles 10M+ requests from unsplash.com...our team is relatively small: 2 designers, 3 frontends, 3 backends, and 1 data engineer...Build boring, obvious solutions...On the backend, there are very few problems that can’t be made “good enough” using standard workhorse tools and a few tried-and-true patterns, like caching, batching, asynchronous operations, and pre-request aggregation...Focus on solving user problems, not technology problems...we focus our time on connecting pre-built technologies in a way that solves our users’ problems and expands Unsplash’s community...Deployment pipelines, server configuration, system dependencies, data processing, data analysis, image processing, and personalization (to name a few) are examples of areas we chose not to focus on investing our engineering resources in, choosing instead 3rd-party services to handle each of them...Throw money at technical problems...Throwing money at a technological problem frees our team up to focus on the non-repeatable, hard problems...We use Heroku wherever we can to simplify deployment, configuration, testing, maintenance, and scaling of our primary applications...We lean heavily on Redis, ElasticSearch, and Postgres...We aggressively use worker queues...Our data processing uses Snowplow...We use an array of cloud monitoring services like Datadog, New Relic, Sentry, and Logentries...We outsource all of our image hosting and infrastructure to Imgix...We push all of our user activities to Stream...We don’t train our own image recognition algorithms, and instead use TinEye for reverse image search and Google Vision for image understanding and classification...We push all of our behavioural events to Vero, an email marketing and notification platform...we’ve transitioned the app progressively from a single Rails application to a Rails API, a Node + React powered web app...We’re developing a new internal GraphQL API to speed up independent iterations of experiments.

  • Memory prices are increasing 6% to 10%. Demand is increasing while production capacity isn't. The Week In Review: Manufacturing. 

  • Amazon has released an excellent 50-page paper on Serverless Architectures with AWS Lambda. If you started early with Lambda it has grown quite a bit. There's probably advice here you may have missed. Serverless: applications that don't require you to provision or manage any servers. Lambda: a high-scale, provision-free serverless compute offering based on functions. At its core, you use Lambda to execute code. Central tenet: your code cannot make assumptions about state. Events: you can associate your Lambda function with event sources occurring within AWS services that will invoke your function as needed. You don’t have to write, scale, or maintain any of the software that integrates the event source with your Lambda function. Security best practices: One IAM Role per Function; You should not have any long-lived AWS credentials included within your Lambda function code or configuration; Secrets should always only exist in memory and never be logged or written to disk; API Gateway can perform much of the heavy lifting by providing things like native AWS SigV4 authentication, generated client SDKs, and custom authorizers; inside a VPC, you should apply network security best practices through use of least privilege security groups, Lambda function-specific subnets, network ACLs, and route tables that allow traffic coming only from your Lambda functions to reach intended destinations. Reliability Best Practices: The availability posture of your Lambda function depends on the number of Availability Zones it can be executed in; What can be complex, like most multi-region application designs, is coordinating a failover decision across all tiers of your application stack; take advantage of dead letter queues and implement how to process events placed on that queue after recovery occurs. Performance Best Practices: By analyzing the Max Memory Used: field, you can determine if your function needs more memory or if you over-provisioned your function's memory size; choose the language you’re already most comfortable with; Always use the default network environment unless connectivity to a resource within a VPC via private IP is required; Choose an interpreted language over a compiled language; Trim your function code package to only its runtime necessities; After initial execution, store and reference any externalized configuration or dependencies that your code retrieves locally; Limit the reinitialization of variables/objects on every invocation; Keep alive and reuse connections; use AWS X-Ray. X-Ray lets you trace the full lifecycle of an application request. Operation Best Practices: log; Create a custom metric and integrate directly with the API required from your Lambda function as it’s executing. And much more. Also, Serverless Applications Lens.
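
    Much of the performance advice boils down to "initialize once, outside the handler." Here's a minimal Python sketch of that idea, assuming a hypothetical DynamoDB table and environment variable; it is not taken from the whitepaper itself:

```python
# Sketch of "limit reinitialization" and "keep alive and reuse connections":
# expensive setup runs once at cold start, warm invocations reuse it.
# TABLE_NAME and the key schema are hypothetical.
import os
import boto3

dynamodb = boto3.resource("dynamodb")                      # created once per container
table = dynamodb.Table(os.environ.get("TABLE_NAME", "example-table"))

def handler(event, context):
    # Only per-request work lives here; no client setup, no config fetches.
    item_id = event.get("id", "unknown")
    response = table.get_item(Key={"id": item_id})
    return response.get("Item", {})
```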

  • How did we ever survive? Things that did not exist on Thanksgiving 10 years ago: Uber, Airbnb, Instagram, Snapchat, Bitcoin, iPad, Kickstarter, Pinterest, App Store, Angry Birds, Slack, Siri, Lyft, Google Chrome, WhatsApp, Venmo, Candy Crush, Alexa, Tinder, Stripe, Square, Apple Watch, FB Messenger

  • DDOS and the DNS: I wonder if perhaps we are just not looking at this problem in a way that leads to different ways to address the issue. The role of authoritative name servers in the DNS, from the root downward, is not in fact to answer all queries generated by end users. The strength of the DNS lies in its caching ability, so that recursive resolvers handle the bulk of the query load...It took some hundreds of years, but Europe eventually reacted to the introduction of gunpowder and artillery by recognising that they simply could not build castles large enough to defend against any conceivable attack. So they stopped. I hope it does not take us the same amount of time to understand that building ever more massively fortified and over-provisioned servers is simply a tactic for today, not a strategy for tomorrow.
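
    The caching point is worth dwelling on. A toy sketch (not from the article) of why recursive resolvers absorb most of the load: within the TTL, repeated lookups never reach the authoritative server. The names and TTL below are made up:

```python
import time

AUTHORITATIVE = {"example.com": ("93.184.216.34", 300)}   # name -> (answer, TTL seconds)
cache = {}
authoritative_queries = 0

def resolve(name):
    global authoritative_queries
    answer, expires = cache.get(name, (None, 0.0))
    if time.time() < expires:
        return answer                                      # served from the resolver's cache
    authoritative_queries += 1                             # only cache misses hit the authority
    answer, ttl = AUTHORITATIVE[name]
    cache[name] = (answer, time.time() + ttl)
    return answer

for _ in range(10_000):                                    # 10,000 client queries...
    resolve("example.com")
print(authoritative_queries)                               # ...exactly 1 reaches the authority
```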

  • Linus tells Google security engineers what he really thinks about them: Some security people have scoffed at me when I say that security problems are primarily "just bugs". Those security people are f*cking morons. Because honestly, the kind of security person who doesn't accept that security problems are primarily just bugs, I don't want to work with. If you don't see your job as "debugging first", I'm simply not interested. So I think the hardening project needs to really take a good look at itself in the mirror. Because the primary focus should be "debugging". The primary focus should be "let's make sure the kernel released in a year is better than the one released today". And the primary focus right now seems to be "let's kill things for bugs". That's wrong. dmazzoni: I think this just comes from a different philosophy behind security at Google. At Google, security bugs are not just bugs. They're the most important type of bugs imaginable, because a single security bug might be the only thing stopping a hacker from accessing user data. You want Google engineers obsessing over security bugs. It's for your own protection. A lot of code at Google is written in such a way that if a bug with security implications occurs, it immediately crashes the program. The goal is that if there's even the slightest chance that someone found a vulnerability, their chances of exploiting it are minimized.

  • Wonderful description of how to run a global system from your own datacenters. RUNNING ONLINE SERVICES AT RIOT: PART IV: Riot has a massive global deployment footprint. We deploy our services to dozens of datacenters around the world and each of those datacenters can host multiple regions. We want to "build once; ship everywhere," and that means micro-services have to be highly portable. Making our services portable started with the decision to containerize them...We still have to deliver those packaged containers to our datacenters around the globe. We achieve this goal by hosting our own globally-replicated docker registry, leveraging the power and capability of JFrog’s Artifactory...Because these Docker images are built on reusable layers, they can replicate around the world in minutes. They tend to be very small as only the bits that have changed move...We’re currently running over 10,000 containers in production at Riot. Any one micro-service may consist of several containers...to maintain portability our applications must be deployable and equipped to operate in any environment at runtime, no muss, no fuss. This is where configuration as a service comes in...After a new application starts, it seeks out the discovery service to find out where the configuration service lives...We handle these cases with a simple heartbeat pattern. Services that fail to call back in the allotted time are assumed M.I.A. and dropped from discovery...Thus enters the final piece of our operable puzzle: secrets management. For this requirement, we chose to create a service wrapper around HashiCorp’s excellent Vault service.
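
    The heartbeat pattern is simple enough to sketch. A toy, in-memory version, assuming a made-up 30-second window; Riot's real discovery service is of course a networked system:

```python
import time

HEARTBEAT_TIMEOUT = 30.0        # assumed window; services silent longer than this are dropped
registry = {}                   # service name -> timestamp of last heartbeat

def heartbeat(service_name):
    """Each service instance calls this on a timer."""
    registry[service_name] = time.time()

def discover():
    """Return only the services that have called back within the allotted time."""
    now = time.time()
    for name in [n for n, last in registry.items() if now - last > HEARTBEAT_TIMEOUT]:
        del registry[name]      # M.I.A. services are dropped from discovery
    return sorted(registry)

heartbeat("configuration-service")
print(discover())               # ['configuration-service'] while heartbeats keep arriving
```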


  • Datanauts 110: The Future Of Storage: data gravity and data movement are huge problems because wide area bandwidth is still expensive. Cloud providers make it expensive to move data outside the cloud. If you have a huge amount of data don't get locked into one cloud provider. Deploy your storage in bandwidth-rich colo centers. No on-premises building anymore has the bandwidth to deal with huge storage volumes. Be in a datacenter where there are a bunch of telcos competing for bandwidth. Plus you can connect directly to cloud providers. Bandwidth is still costly compared to storage. Most big players construct their streams so data streams to more than one location. Don't try to move huge chunks from one place to another. As you gather data, make sure it ends up in more than one place. Don't hide the performance implications of where the storage is when in fact it makes all the difference to performance. By putting storage outside the container cluster you're making it harder to get to. Next step in container storage is to move it onto the same servers that run the containers. Take the storage out of the server, but leave it in the same rack. Don't put any intelligence in the storage. All the intelligence can be in containers and the storage software itself becomes stateless. S3 is now the only interesting API for object storage. With Lambda, when an S3 event triggers a Lambda function, the computation happens in S3. That's a lot more efficient, moving the computation to the data. There is no universal storage system. Take it application by application to see what you're really dealing with.

  • EC2 network performance demystified: m3 and m4. Expect a network throughput from 0.3 Gbps to 1.0 Gbps with m3 instances. The variance is much higher between measurements of the same instance type within the m4 family.

  • It's complicated. Net Neutrality Is A Sham: Is All About Capitalism and Politics: The internet has and always will be about interconnecting networks, a point that is completely lost in this discussion...Most have no idea how the internet works and many in the media don’t even know what a transit provider is. It goes back to the wrong idea that ISPs are the only ones that control the quality of the content that we consume on our devices. Most are simply uninformed on this topic and haven’t taken the time to really learn how traffic gets delivered from say Netflix to Comcast...And this is exactly what I have a problem with when people demand a bill to protect them, without knowing that the bill, as written, doesn’t apply to a company like Cogent. Cogent directly impacted the quality of the Netflix video consumers were receiving, in a negative way, and suffered no consequences, since the FCC doesn’t classify Cogent as a last-mile provider...There has never been any rule or understanding that certain networks must carry traffic for free. A lot of networks engage in settlement free peering, but that is purely at their option, as a business decision...The bottom line is that there is a cost to get the content we want, at a level of quality we want. Operating and maintaining a network is expensive and capacity limitations do exist. Many ISPs have gone under and even Google has pulled back with their fiber deployments. Net neutrality doesn’t change the fact that the underlying infrastructure of the internet is expensive. And that is what this debate should really come down to; how that cost is shared amongst all the parties involved, businesses and consumers.

  • Make your database faster by archiving old data. Archiving for a Leaner Database. Nice examples of how to handle the partitioning. 

  • Looks like our IO model at the application layer and in the OS may need to change. Datanauts 111: NVMe And Its Network Impact: non-volatile memory express SSDs are fast. Most storage vendors are on board with NVMe. It's a mature technology. One of the big advantages of NVMe is queues. A queue pair can be pinned to a CPU core. You can have many, many different queues to support multiple read and write operations in parallel. Traditional IO is serial. Threads could be retrieving data from the storage system in parallel. You can get much more throughput. Startups are rewriting operating systems to take advantage of these queues. The result is many times faster than the POSIX approach. PCIe is an interconnect protocol the CPU uses to talk to different devices. It's a bus, not a fabric. You can buy a PCIe switch to connect to multiple NVMe devices. A fabric by its nature can handle lost packets, handle retransmission, implement high availability, failover, active and passive pathing. Fabrics understand the network, not just the devices, which makes a difference in how you handle scale. Much better than a bus. The high performance of NVMe drives will swamp Ethernet networks. A typical fibre channel network using SCSI could have 200,000 IOPS per initiator. With NVMe, per drive, the network has to handle 1.5 million IOPS. You can have systems with 48 drives. You have a bandwidth problem. Storage has always been the slowest part of the system, now it's not.
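
    The "swamp the network" claim checks out with back-of-the-envelope arithmetic. A sketch using the IOPS and drive-count figures from the episode, plus an assumed 4 KB I/O size and 25 GbE link:

```python
# Rough arithmetic only; the 4 KB I/O size and 25 GbE link speed are assumptions,
# the IOPS and drive-count figures come from the episode notes above.
IOPS_PER_NVME_DRIVE = 1_500_000
DRIVES_PER_SYSTEM = 48
IO_SIZE_BYTES = 4 * 1024

bytes_per_sec = IOPS_PER_NVME_DRIVE * DRIVES_PER_SYSTEM * IO_SIZE_BYTES
gbits_per_sec = bytes_per_sec * 8 / 1e9

print(f"{gbits_per_sec:,.0f} Gbit/s of potential drive throughput")    # ~2,359 Gbit/s
print(f"~{gbits_per_sec / 25:,.0f}x a single 25 GbE link")             # ~94x
```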

  • A detailed description of the Erlang Garbage Collector. The truck comes every Thursday. Get those cans out by 1PM or you're out of luck.

  • Great experience report by Spotify on building event processing in the Google Cloud. Autoscaling Pub/Sub Consumers: 300 different types of events are being collected from Spotify clients...delivering hundreds of billions of events every day...As the backbone of our system we use [Google] Cloud Pub/Sub...each event stream is published to its own dedicated topic that is consumed and exported to Cloud Storage by a dedicated ETL process...As long as nothing unexpected happens, autoscaling works well. When problems occur, if not handled properly, things can get really ugly...Considering that Docker issues were affecting significant amounts of machines in our fleet, we needed to find a solution...Google’s autoscaler is configured using a single threshold parameter (aka target usage). This makes autoscaler tricky to configure...Issues with downstream services, upon which the Consumer depends to export data, are another common reason the autoscaler goes bonkers...We learned, the hard way, that exponential backoffs are a must in order to handle such scenarios...We’re looking closely at how we can use Kubernetes to run the system. If that were the case, the biggest resource optimisation would come from the fact that Kubernetes does bin packing.
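
    "Exponential backoffs are a must" is easy to get subtly wrong (no cap, no jitter). A minimal sketch, with made-up retry limits and a stand-in for whatever downstream call the consumer makes:

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base=0.5, cap=30.0):
    """Retry `operation` with capped, jittered exponential backoff (values assumed)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # out of retries, surface the error
            # Exponentially growing, capped, randomized sleep keeps synchronized
            # retries from hammering an already struggling downstream service.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

call_with_backoff(lambda: print("exported batch"))      # stand-in for the real export call
```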

  • Uber prepares months in advance to handle the rush of drunken revelers. Reliability at Scale: Engineering an Uneventful New Year’s Eve: In the lead up to large-scale events, engineers need to measure more things than normal and we have to scale to that demand. Our goal is to allow every engineer to measure, tune and mitigate to make sure we always provide the best user experience possible, even during times of peak traffic...The Site Reliability Engineering team runs large-scale event drills to simulate the platform running at our predicted trip volume for the event. This simulated trip volume exercises every dependent service and flow of the user experience from requesting a ride to billing after the end of a trip. If a service begins to degrade during the drill, we pause the exercise to address the issue...The frequency of these drills increases as we get closer to the event, until we’re eventually running drills multiple times a day...The capacity planning process starts months in advance with Uber’s data scientists analyzing information such as the number of riders we have supported on the platform during previous high traffic events with sophisticated machine learning techniques...In addition to system-wide drills, individual teams load test their own services independently by using Uber’s self-service load test platform...Testing the most visible degradations is done by a system that runs outside our network and exercises product flows. Each team monitors their services using extensive whitebox monitoring which looks at the state of our internal systems.

  • Interesting detective story. How a single PostgreSQL config change improved slow query performance by 50x: The same query finished 50x faster when using a Nested Loop instead of a Hash Join. So why did PostgreSQL choose a worse plan for app A?...The main culprit for this discrepancy was the sequential scan cost estimation. PostgreSQL estimated that a sequential scan would be better than 4000+ index scans, but in reality index scans were 50x faster...That led me to the ‘random_page_cost’ and ‘seq_page_cost’ configuration options. The default PostgreSQL values of 4 and 1 for ‘random_page_cost’, ‘seq_page_cost’ respectively are tuned for HDD, where random access to disk is more expensive than sequential access. However these costs were inaccurate for our deployment using gp2 EBS volumes, which are solid state drives. For our deployment random and sequential access is almost the same.
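
    The fix itself is a one-line settings change. A sketch of trying it at the session level first, assuming psycopg2, a placeholder connection string and query, and SSD-oriented values that may differ from the article's exact numbers:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")                   # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("SET random_page_cost = 1.1")           # default 4.0 assumes spinning disks
    cur.execute("SET seq_page_cost = 1.0")
    cur.execute("EXPLAIN ANALYZE SELECT count(*) FROM events")   # hypothetical query
    for (line,) in cur.fetchall():
        print(line)                                      # compare plan and timing to the defaults
```

    Once the plan change is confirmed, the same values can be made permanent with ALTER SYSTEM or in postgresql.conf.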

  • Greg Ferro's worst nightmare, someone who suggests enterprises move to the public cloud! Show 365: You Can’t Do That In Enterprise Networks. It's for all the usual reasons. Why wait months to get a switch when you can get something working in 15 minutes on the public cloud? It's about delivering functionality. Greg asks why we haven't seen private clouds. I'd say because it's hard. It takes a special team organization to make a public cloud, a special software process, etc. Cisco, etc. have misaligned incentives.

  • Is cloud cost optimisation a full time job? How else do you avoid The hidden costs of cloud?: Misuse of services, Unusual traffic spikes, Provisioned capacity being too low, Inter-zone networking costs are very high, You have to search for the discounts. 


  • Digital Ocean is all about simplicity, basic offerings, and rounded prices. This is so refreshing because cloud providers are such a Paradox of Choice. Goodbye Google Cloud, Hello Digital Ocean!: Launching cloud instances should be fun. Like invoicing customers. GCP used to be fun too...But things have changed. GCP has become cumbersome and slow to manage...DO is easy to fall in love with. I call it the hacker’s cloud since it doesn’t get in your way and packs serious power...20 instance types...8 regions and 12 availability zones...Generous bandwidth...Free DNS...Network-attached block storage — 100GB for $10/month...Simple object storage (ala S3) — 250GB for $5/month...Automatic backups — 20% of your instance cost...Free firewall, monitoring and alerts.

  • A 2-year price study put Walmart and Amazon head-to-head — and the results should terrify Amazon. Which is cheaper is the wrong question. Which is more convenient? That's the right question.

  • The Celeste project achieved a 100x increase over results previously reported in the literature. Julia Language Delivers Petascale HPC Performance: Even in HPC terms, the Celeste project is big, as it created the first comprehensive catalog of visible objects in our universe by processing 178 terabytes of SDSS (Sloan Digital Sky Survey) data. Remarkably, the combination of the Cori supercomputer and the Julia application was able to load and analyze the SDSS data set in only 15 minutes. Thus the Celeste team demonstrated that the Julia language can support both petascale compute and terascale big data analysis on a leadership HPC system plus scale to handle the seven petabytes of data expected to be produced by the Large Synoptic Survey Telescope (LSST) every year.

  • jacobdufault/cquery:  a highly-scalable, low-latency language server for C/C++. It is tested and designed for large code bases like Chromium. cquery provides accurate and fast semantic analysis without interrupting workflow.

  • aws/amazon-freertos (article): Cloud-native IoT operating system for microcontrollers. IoT microcontroller operating system that makes small, low-powered edge devices easy to program, deploy, secure, connect & maintain. 

  • NLKNguyen/awesome-language-engineering: A curated list of useful resources for computer language engineering and theory


  • Popularity predictions of Facebook videos for higher quality streaming: Today’s paper looks at the problem of predicting the popularity of videos on Facebook. Why does that matter? As we saw yesterday, videos can be encoded at multiple different bitrates. Having a broader choice of bitrates means a better overall experience for clients across a range of bandwidths, at the expense of more resources consumed on the server side in encoding. In addition, Facebook’s QuickFire engine can produce versions of a video with the same quality but approximately 20% smaller than the standard encoding. It uses up to 20x the computation to do so though! Since video popularity on Facebook follows a power law, accurate identification of e.g. the top 1% of videos would cover 83% of the total video watch time, and allow us to expend server side effort where it will have the most impact.
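
    The 1%/83% figure is just what a heavy tail looks like. A toy check with an assumed Zipf-like exponent, not Facebook's actual fit:

```python
# Toy power-law popularity: one million videos, weight 1/rank^ALPHA.
# N and ALPHA are assumptions chosen for illustration, not fitted values.
N = 1_000_000
ALPHA = 1.1

weights = [1 / rank ** ALPHA for rank in range(1, N + 1)]
top_1_percent = sum(weights[: N // 100])

print(f"top 1% of videos -> {top_1_percent / sum(weights):.0%} of watch time")  # ~82%
```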