Stuff The Internet Says On Scalability For March 11th, 2016

The circle of life. Traffic flow through microservices at Netflix (Rob Young)


If you like this sort of Stuff then please consider offering your support on Patreon.
  • 400Gbps: DDoS attack; 50,000: frames per second Mythbusters films in HD; 3,900: pages in Paul Klee’s Personal Notebooks; 1 terabit: satellites deliver in-flight Internet access at hundreds of megabits per second; 18%: overall mobile market revenue increase; 21 TB: amount of data the BBC writes daily to S3; $300 million: Snapchat revenue

  • Quotable Quotes:
    • Dark Territory:  Yes, he told them, the NORAD computer was supposed to be closed, but some officers wanted to work from home on the weekend, so they’d leave a port open.
    • @davefarley77: If heartbeat was a clock cycle, retrieving data from fastest SSD is equivalent to crossing whole of London on foot  @__Abigor__ #qconlondon
    • @fiddur: "Legacy is everything you wrote before lunch." - @russmiles #qconlondon
    • @BarryNL: Persistent memory could be the biggest change to computer architecture in 50 years. #qconlondon
    • @mpaluchowski: "You can tell which services are too big. That's the ones developers don't want to work with." #qconlondon @SteveGodwin
    • @danielbryantuk: "I'm not going to say how big microservices should be, but at the BBC we have converged on about 600 lines of Java" @SteveGodwin #qconlondon
    • Steve Kerr~ What we have to get back to is simple, simple, simple. That's good enough. The simple leads to the spectacular. You can't try the spectacular without doing the simple first. Our guys are trying to make the spectacular plays when we just have to make the easy ones. If we don't get that cleaned up we're in big trouble.
    • Dark Territory: a disturbing thought smacked a few analysts inside NSA: Anything we’re doing to them, they can do to us.
    • @andyhedges: ~100k TPS with JDK SSL, then ~500k TPS with netty equivalent on same box. Netty fully uses the server's CPU resources too. #qconlondon
    • Paul Marks: Humanoid robots can’t outsource their brains to the cloud due to network latency
    • @manumarchal: 0.5TB generated during each flight by jet engines sensors, used for optimising fuel consumption and accelerating repair #Iot #qconlondon
    • fhe: It's both exciting and eerie [AlphaGo]. It's like another intelligent species opening up a new way of looking at the world (at least for this very specific domain). and much to our surprise, it's a new way that's more powerful than ours.
    • @jaykreps: "Part of using Google's Cloud is convincing yourself that Google will invest 5+ years in really entering the market"
    • DEAN TAKAHASHI: With just 3 games, Supercell made $924M in profits on $2.3B in revenue in 2015.
    • @anne_e_currie: Even an anti-wrinkle cream liked my tweet about containers at #qconlondon. It's good to see #container appreciation has spread so wide.
    • @KingPrawnBalls: Failure is inevitable. What matters is that u learn from it. Never fail the same way twice! #qconlondon Josh Evans, Director Ops Eng Netflix
    • Quizlet: Everyone involved unanimously picked GCP. It came down to this: we believe the core technology is better.
    • @KevlinHenney: "I have to change the word 'compassion' to 'derisking the people problem' when dealing with upper management." - @kkirk #QConLondon
    • @adr: I am an "electricity native". Since I was born into a world of electricity, I am qualified to run a power plant and build a transformer
    • Goodhart's law: When a measure becomes a target, it ceases to be a good measure.
    • @alblue: “Once your unit of consistency is too large for a single server it’s no longer a unit of consistency” #QConLondon 
    • @nora_js: "The users like it, the engineers hate it. That's how you know it's a good solution." - Micah Lemonik at #qconlondon
    • @mpaluchowski: "Microservices provide a more even distribution of complexity." #qconlondon @rachelreese
    • @robyoung26: Soft routing gave Netflix the tool they needed to evacuate a large-scale regional outage. #qconlondon
    • @cmeik: A data store problem posed to me yesterday has me reviewing some papers. There's a lot of room for exploring datastores turned "inside out".
    • @fabianpiau: @mitchellh Chef, Puppet, Nagios are so 2006, now we have Nomad, Kubernetes, Sysdig in 2016 #itIsMovingSoFast #qconlondon
    • kyledrake: We're entering a new phase of the web, where almost every home internet is going to have 1Gbps connections, upwards of 10Gbps in some areas (US Internet has already started providing 10Gbps to home customers in Minneapolis). The idea that datacenter egress bandwidth can continue to be this expensive is ridiculous
    • @xjoeduffyx: Every async system faces a fundamental design decision: Either pump messages - and risk "races" - or don't - and risk hangs/responsiveness.
    • obulpathi: Google’s network is so fast, however, that this kind of multi-cloud might just be possible. To illustrate the difference in speeds, we ran a bandwidth benchmark in which we copied a single, 500 Mb file between two regions. It took 242 seconds on AWS at an average speed of 15 Mbit/s, and 15 seconds on GCE with an average speed of 300Mbit/s. GCE came out 20x faster.

  • Strange to think the impact movies have had on national security policy. Dark Territory: The Secret History of Cyber War. Ronald Reagan after watching the movie WarGames asked if someone could hack the military. The answer: Yes, the problem is much worse than you think. Did anything happen? Nope. People didn't understand computers back then so they didn't think there was a threat (or opportunity in war). A stance that wouldn't change for over a decade. Admiral John "Mike" McConnell watched Sneakers and came up with a NSA mission statement from a soliloquy in the movie: The world isn’t run by weapons anymore, or energy, or money. It’s run by ones and zeroes, little bits of data. It’s all just electrons. . . . There’s a war out there, old friend, a world war. And it’s not about who’s got the most bullets. It’s about who controls the information: what we see and hear, how we work, what we think. It’s all about the information. 

  • Think about this: Amazon launched S3 on March 14, 2006 and with it they started the cloud revolution. That's just ten years ago! James Hamilton in A Decade of Innovation takes a little trip down memory lane. He lists year by year the major AWS product releases and it's impressive. Contributing to this speed may be how decisions are made: Another interesting aspect of AWS is how product or engineering debates are handled. These arguments come up frequently and are as actively debated at AWS as at any company. These decisions might even be argued with more fervor and conviction at AWS but it's data that closes the debates and decisions are made remarkably quickly. At AWS instead of having a “strategy” and convincing customers that is what they really need, we deliver features we find useful ourselves and we invest quickly in services that customers adopt broadly. Good services become great services fast.

  • QCon threw a party in London this week and a lot of tech people showed up. You can find slides for the talks here. The videos of course must first, like Theseus, escape from the Labyrinth. Alex Blewitt has some excellent trip summaries for Day 1, Day 2, and Day 3. Akafred has also written up conference notes, well not notes really, more like works of art drawn on an iPad Pro.

  • Alex Blewitt: Google Docs started off as a web-based Excel spreadsheet server...provided a remote viewing platform for spreadsheet content, by interpreting an Excel spreadsheet on the server and then rendering a remote HTML view...The collaboration was an accident as two people edited a document at the same time and they both saw the results. Fortunately the model they built had no non-commutative operations which meant that operations could be replayed...Scaling is through sharding and consistency trade-offs; for a popular document in read/write mode, users may be switched to a read-only version that may be delayed, thereby trading availability for consistency...It turns out that once the unit of consistency is too large for a single server, it’s no longer consistent, which means finer-grained control of document sharding often implies sharding at a greater level, like per chapter on a book.
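
To see why "no non-commutative operations" makes replay safe, here is a minimal sketch. It is not how Google Docs works internally; `SetCell`, `apply_op`, `replay` and the last-writer-wins stamp are all assumptions made up for illustration. Each edit carries a (logical clock, editor id) stamp, and the higher stamp wins, so the order in which edits arrive stops mattering.

```python
from itertools import permutations
from typing import Dict, Iterable, NamedTuple, Tuple

Stamp = Tuple[int, str]  # (logical clock, editor id); hypothetical ordering key


class SetCell(NamedTuple):
    """A hypothetical edit: write `value` into `cell`, tagged with a stamp."""
    cell: str
    value: str
    stamp: Stamp


def apply_op(sheet: Dict[str, Tuple[Stamp, str]], op: SetCell) -> Dict[str, Tuple[Stamp, str]]:
    """Keep the write with the highest stamp, which makes applying ops commutative."""
    current = sheet.get(op.cell)
    if current is None or op.stamp > current[0]:
        return {**sheet, op.cell: (op.stamp, op.value)}
    return sheet


def replay(ops: Iterable[SetCell]) -> Dict[str, str]:
    """Replay edits in the given order and return the visible cell values."""
    sheet: Dict[str, Tuple[Stamp, str]] = {}
    for op in ops:
        sheet = apply_op(sheet, op)
    return {cell: value for cell, (_, value) in sheet.items()}


edits = [
    SetCell("A1", "10", (1, "alice")),
    SetCell("A1", "12", (2, "bob")),
    SetCell("B1", "x", (1, "bob")),
]

# Every arrival order converges on the same sheet.
results = [replay(order) for order in permutations(edits)]
```

The design choice worth noticing is that convergence comes from the data model, not from coordinating the editors.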

  • Given Google's historical use of infrastructure as a competitive advantage, this must mean the fight is only on one front now: software. Google joins Open Compute Project to drive standards in IT infrastructure.

  • A really detailed and fun example of using coroutines (though yield is evil) to implement simple Turret behavior for a mock game. Beyond the State Machine: the turret does the following: it looks for a target (the player) within a given radius. Once it finds a target, it does two things at once. It shoots projectiles, and it tracks the target. If it loses lock on the target (player moving too far away), it returns to its original orientation and starts over.
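
The turret behavior described above can be sketched with Python generators. This is a rough illustration, not the article's code: `turret`, `track`, `shoot`, `has_lock`, `simulate` and the action labels are all invented names, and one `next()` call is assumed to equal one game frame.

```python
import math
from typing import Dict, Iterator, List, Tuple

Vec = Tuple[float, float]


def has_lock(pos: Vec, world: Dict[str, Vec], radius: float) -> bool:
    """True while the player is within the turret's radius."""
    return math.dist(pos, world["player"]) <= radius


def track() -> Iterator[str]:
    """Sub-behavior: keep aiming at the target, one step per frame."""
    while True:
        yield "aim"


def shoot(cooldown: int) -> Iterator[str]:
    """Sub-behavior: fire on the first frame, then once every `cooldown` frames."""
    frame = 0
    while True:
        yield "fire" if frame % cooldown == 0 else "reload"
        frame += 1


def turret(pos: Vec, world: Dict[str, Vec], radius: float, cooldown: int = 2) -> Iterator[str]:
    """Main coroutine: scan, then track and shoot together, then reset and start over."""
    while True:
        while not has_lock(pos, world, radius):
            yield "scan"
        # Lock acquired: advance both sub-coroutines once per frame.
        for aim, gun in zip(track(), shoot(cooldown)):
            if not has_lock(pos, world, radius):
                break
            yield f"{aim}+{gun}"
        yield "reset"  # return to the original orientation


def simulate(path: List[Vec]) -> List[str]:
    """Move the player along `path`, one position per frame, and log the turret's actions."""
    world = {"player": path[0]}
    brain = turret((0.0, 0.0), world, radius=5.0)
    log = []
    for position in path:
        world["player"] = position
        log.append(next(brain))
    return log
```

The control flow reads top to bottom like the prose description, which is the appeal over an explicit state machine with a state enum and a switch.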

  • Evaluating Container Platforms at Scale. This benchmark was sponsored by Docker, but it seems to give a really good overview of what you can expect when operating Swarm and Kubernetes at scale. It wasn't easy to operate a 1000 node cluster: "I was able to build out both clusters to 1000 nodes. It took something like 90 iterations, more than 100 hours of research and trials." Swarm won for ease of use: "Plugging these values in to the adoption effort index equation results in an adoption effort index of 25 for Kubernetes and 6 for Swarm." But Kubernetes, the product of years of operational experience at Google, has more features: "The effort investment might be one an adopter is willing to make in exchange for the expanded feature-set currently offered by Kubernetes." Simplicity is always a good way to make an entrance into a market with a mature leader. You slurp up the large group of early adopters who are just getting their feet wet. They are happy to grow up with you. Kubernetes is way past the cute puppy stage of adoption. Kelsey Hightower: Kubernetes offers a unified set of APIs and strong guarantees about cluster state. Hence the complexity; The complexity Kubernetes abstracts at the lower level enables you to build elegant tools on top. This should be the goal of any platform; Distributed systems are hard. You cannot eliminate the complexity; only move it around. Kubernetes deals with it so you don't have to.

  • Here's a recap of Facebook's Performance @Scale 2016 event. Facebook, Google, LinkedIn, Microsoft, and Netflix all had speakers presenting. Topics: Evolution of performance, Using BPF superpowers, Web speed at Facebook, Automatic regression triaging at Facebook, Sifting for gold: Increasing ad revenue by improving performance, and several more.

  • Alex Blewitt: Pony is a fascinating language that builds upon LLVM and uses a sound type system to achieve millions of messages and actors in a single process. Each actor has its own queue of messages but also its own heap; so when garbage collection runs, it only works with an individual actor’s heap, allowing the remainder of the actors to keep processing. 

  • 12 apps with a billion users: It took Microsoft 25+ years to get to a billion. Facebook did it in 8.7 years. Microsoft’s Office took 21.7 years. Facebook’s WhatsApp took 6.8 years. Google’s Search took 12 years, YouTube 8.1 years, but Android took 5.8 years and Chrome 6.7 years. Google has 7 applications in the billion user club.

  • Using an API is riskier than unprotected sex. Instagram kills newly launched ‘Being’ app, which saw 50K downloads its first week: “Things have changed. The larger corporations that independent developers helped cultivate over the past 5-10 years are now big enough to stand on their own. The app culture is becoming more and more centralized, and it’s winner-take-all with no holds barred.”

  • Adding a scheduling layer is a common pattern for increasing performance and utilization (clusters, tasks, traffic, spectrum, networks, meeting rooms). MIT develops a new technique to load webpages faster: Their technique focuses on mapping the connections (aka ‘dependencies’) between different objects on a page in order to dynamically figure out the most efficient route for a browser to load the various interdependent elements.
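
As a rough picture of what scheduling on a dependency map buys you, here is a minimal sketch. It is not MIT's technique; `load_waves` and the sample page objects are assumptions for illustration. It groups objects into "waves" where everything in a wave can be fetched in parallel because its dependencies have already loaded.

```python
from typing import Dict, List, Set


def load_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group objects into waves; each wave depends only on earlier waves.

    `deps` maps an object to the set of objects it needs first. Every
    dependency must also appear as a key, otherwise it can never resolve.
    """
    remaining = {obj: set(needs) for obj, needs in deps.items()}
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(obj for obj, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        for obj in ready:
            del remaining[obj]
        for needs in remaining.values():
            needs.difference_update(ready)
    return waves


# A made-up page: the HTML pulls in a script and a stylesheet, which pull in more.
page = {
    "index.html": set(),
    "app.js": {"index.html"},
    "style.css": {"index.html"},
    "data.json": {"app.js"},
    "font.woff": {"style.css"},
}
waves = load_waves(page)
```

A browser that discovers dependencies only while parsing needs extra round trips; knowing the graph up front lets each wave go out as one batch.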

  • Josh Evans, team Netflix, gave what looks like an excellent talk at QCon London. Here are the slides #NetflixEverywhere Global Architecture.

  • The implicit information flow in even the simplest of activities is sobering. So be kind when Siri can't answer anything beyond the simplest of questions. Gray Kimbrough: This graphic of all the information passed during a baseball game by Megan Jaegerman (via @EdwardTufte) is amazing.

  • Aggregate then personalize. Medium is looking for a Senior Engineer (Personalization).

  • Everspan Optical Cold Storage. James Hamilton looks at an optical storage system that stores 181 petabytes on 604,928 optical disks. This thing is huge. At max configuration the 19 bay setup runs 123 feet long. I bet the testing group doesn't get to see this configuration for very long.

  • Yes, AlphaGo is beating the poor human, but remember it stands on the shoulders of giants (us). Part of its training was based on a huge corpus of human game moves. A thoughtful discussion on Hacker News. Also, The AI Revolution: The Road to Superintelligence.

  • Technologies for Testing Distributed Systems, Part I. Wonderful deep dive on regression testing. 

  • StorageMojo takes a look at memcomputing: As the prefix mem suggests, memdevices have memory. So the data the processors are working on can be integrated into the device, rather than shuttled off to a cache or RAM.

  • Spotify is moving to Google Cloud Platform. They have an excellent series of articles explaining what this means for their architecture. Spotify’s Event Delivery – The Road to the Cloud (Part I, Part II, Part III): The worst end-to-end latency observed with the new system is four times lower than the end-to-end latency of old system. But boosting performance isn’t the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. Also, What's the Best Cloud? Probably GCP with a good discussion on Hacker News.

  • Is Hell freezing over? Microsoft SQL Server is Coming to Linux. Seems like Microsoft really does want to be a cloud company.

  • Your brain, a ruthless conserver of energy, likes to approximate. Answers that satisfy rather than optimize are the most efficient, but not necessarily the best. A neuroscientist explains why your first idea is hardly ever the best one: Ask it to solve a problem, and it'll come back with the solution that's most readily available, even if it's not necessarily the best one...Whenever he presents a lab member with a problem, he asks them to come back with 10 answers instead of just one.

  • And don't forget the cheesy commercials. How Intuit got 1 million people to use TurboTax on their phones: It took changing his entire team to make mobile their focus. Not only that, the team focused on one mobile platform at a time — no more running separate teams for Android or iOS. The company built the app once and made it available on any platform, via a proprietary wrapper that could work on any native app experience or on the mobile web.

  • To be honest, I don't know quite what to make of this. The Secrets of Surveillance Capitalism.

  • Where can you host Slack integrations? Google Cloud Platform, Azure, AWS Lambda, Heroku, Beep Boop, api.ai. Something to consider: when I hosted a slackbot on Heroku's free tier, the service spun down after one hour of inactivity. You get what you pay for.

  • Most bacteria reproduce through a process called binary fission. A single E. coli cell weighs in at 665 femtograms (one femtogram is 0.000 000 000 000 001 grams). There are ten to the power of fifteen femtograms in one gram, or a quadrillion, or 1,000 trillion. One single E. coli cell can produce 3,000 tons of bacteria in 36 hours (6 elephants). Source.

  • How Spelunky Creates Amazing Unexpected Situations: “I think at one point I was asked how I could possibly handle all of the cases that come up in Spelunky, like when this item hits that,” programmer of Spelunky’s 2013 remake, Andy Hull, tells me. The answer is that every object in the world has default behaviours and properties. For instance, Olmec is actually a glorified push-block. When the spiked balls in Hell are sent flying, they become boulders. The way hawk men, croc men and shopkeepers bounce around when enraged – they share the exact same movement abilities.

  • This is so mature, using software to route around our blind spots. Computer Chooses Quantum Experiments: Quantum weirdness is hard for humans to grasp, so researchers wrote a program to suggest experimental setups.

  • Isn't replication part of the definition of life? SRI's Micro Robots Can Now Manufacture Their Own Tools: Now that the micro robots can make their own tools, the same fleet of robots can be reconfigured to make different kinds of things. SRI has been focused on constructing strong, lightweight carbon fiber trusses with its MicroFactory, but they’ve recently started developing ways to make skins. 

  • Using Apache Spark to predict attack vectors among billions of users and trillions of events: In our approach, we do not only look at a single user's behavior. We put all the users together and study correlations between the users and how users link to each other, how similar are the users' actions. Nowadays, bad attackers do not have a single bad account. They usually have tens of accounts, hundreds, even millions of accounts. Using these accounts, they can do spam, they can do "likes," they do transactions. These accounts usually have high correlations among them because they're controlled by robots or controlled by trained people. For us, we look at the user-user correlation.

  • Introducing Conversant Disruptor: an ultra-low-latency mechanism for communication between threads. It supports high performance network applications as well as processing offload operations.

  • luebken/container-patterns: Developing container based applications is still a fairly new topic. This document tries to gather some best practices and suggests some new ideas from the community. These should be container runtime agnostic but still practical relevant with concrete examples.

  • Efficient State-based CRDTs by Delta-Mutation: We introduce Delta State Conflict-Free Replicated Datatypes (δ-CRDT), which make use of δ-mutators, defined in such a way to return a delta-state, typically, with a much smaller size than the full state. Delta-states are joined to the local state as well as to the remote states (after being shipped). This can achieve the best of both worlds: small messages with an incremental nature, as in operation-based CRDTs, disseminated over unreliable communication channels, as in traditional state-based CRDTs.
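
A grow-only counter is about the smallest thing that shows the delta idea. The sketch below is an illustration under assumptions, not code from the paper: `join`, `inc_delta` and `value` are invented names, and the state is a plain dict of per-replica counts. The δ-mutator returns just the one changed entry, and the same `join` merges deltas and full states alike.

```python
from typing import Dict

State = Dict[str, int]  # replica id -> number of increments seen from that replica


def join(a: State, b: State) -> State:
    """Merge two states (or a state and a delta) by taking the pointwise maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def inc_delta(state: State, replica: str) -> State:
    """Delta-mutator: return only the entry that changes, not the whole state."""
    return {replica: state.get(replica, 0) + 1}


def value(state: State) -> int:
    """The counter's value is the sum over all replicas."""
    return sum(state.values())


# Replica A increments twice, replica B once; each applies its own deltas locally.
a: State = {}
b: State = {}
d1 = inc_delta(a, "A")
a = join(a, d1)
d2 = inc_delta(a, "A")
a = join(a, d2)
d3 = inc_delta(b, "B")
b = join(b, d3)

# Ship deltas over a sloppy channel: d2 arrives twice and before d1.
b = join(join(join(b, d2), d2), d1)
a = join(a, d3)
```

Because `join` is idempotent, commutative and associative, duplicated or reordered deltas are harmless here. Richer datatypes need more care about which deltas a replica has already seen.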

  • CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy: By using a technique known as homomorphic encryption, it’s possible to perform operations on encrypted data, producing an encrypted result, and then decrypt the result to give back the desired answer. By combining homomorphic encryption with a specially designed neural network that can operate within the constraints of the operations supported, the authors of CryptoNet are able to build an end-to-end system whereby a client can encrypt their data, send it to a cloud service that makes a prediction based on that data – all the while having no idea what the data means, or what the output prediction means – and return an encrypted prediction to the client which can then decrypt it to recover the prediction.
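
To get a feel for computing on encrypted data, here is a toy sketch using the Paillier cryptosystem. This is a swapped-in technique, not the scheme from the paper: Paillier only supports adding ciphertexts and scaling them by plaintext numbers, which is enough for one linear layer with non-negative integer weights, while CryptoNets needs a scheme that can also multiply ciphertexts. The tiny hard-coded primes, the `random` module and all function names are assumptions for illustration, and none of this is secure.

```python
import math
import random
from typing import List, Tuple

PrivateKey = Tuple[int, int]


def keygen(p: int = 10007, q: int = 10009) -> Tuple[int, PrivateKey]:
    """Toy key generation from two small primes. Returns (public n, private key)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt 0 <= m < n as g^m * r^n mod n^2 with a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv: PrivateKey, c: int) -> int:
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % (n * n)


def scale(n: int, c: int, k: int) -> int:
    """Raising a ciphertext to a plaintext power k multiplies the hidden plaintext by k."""
    return pow(c, k, n * n)


def encrypted_dot(n: int, enc_xs: List[int], weights: List[int]) -> int:
    """Server side: weighted sum of encrypted inputs, without ever seeing them."""
    acc = encrypt(n, 0)
    for c, w in zip(enc_xs, weights):
        acc = add(n, acc, scale(n, c, w))
    return acc


# Client encrypts its features; the server holds plaintext weights.
n, priv = keygen()
enc_features = [encrypt(n, x) for x in [3, 5, 2]]
enc_score = encrypted_dot(n, enc_features, [4, 1, 10])
score = decrypt(n, priv, enc_score)  # only the client can do this step
```

The server computes the score but only ever handles ciphertexts, which is the same shape as the client/cloud flow described above.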

  • BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data: BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy or response time requirements.
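
The core trade, answer from a sample and report an error bar, fits in a few lines. This is a minimal sketch under simplifying assumptions, not BlinkDB's implementation: it uses a plain uniform random sample rather than stratified samples, a normal-approximation 95% interval, and invented names (`approx_mean`, `sample_size_for`).

```python
import math
import random
import statistics
from typing import List, Tuple


def approx_mean(data: List[float], sample_size: int, seed: int = 0, z: float = 1.96) -> Tuple[float, float]:
    """Estimate the mean from a uniform sample; return (estimate, error margin).

    The margin is z standard errors, roughly a 95% interval for z = 1.96.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    estimate = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * stderr


def sample_size_for(target_margin: float, stdev_guess: float, z: float = 1.96) -> int:
    """Pick the smallest sample whose margin should meet the accuracy requirement."""
    return math.ceil((z * stdev_guess / target_margin) ** 2)


data = [float(x) for x in range(100_000)]
quick, quick_margin = approx_mean(data, 100)       # fast, wide error bar
careful, careful_margin = approx_mean(data, 10_000)  # slower, tighter error bar
```

Asking for a tighter margin costs a larger sample, and because the margin shrinks with the square root of the sample size, halving the error takes about four times the data.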

  • Design Rules, Vol. 1: The Power of Modularity: They argue that the industry has experienced previously unimaginable levels of innovation and growth because it embraced the concept of modularity, building complex products from smaller subsystems that can be designed independently yet function together as a whole. Modularity freed designers to experiment with different approaches, as long as they obeyed the established design rules. 
