Stuff The Internet Says On Scalability For January 10th, 2014

Hey, it's HighScalability time:


Run pneumatic tubes alongside optical fiber cables and we unite the digital and material worlds.

  • 1 billion: searches on DuckDuckGo in 2013
  • Quotable Quotes: 
    • pg: We [Hacker News] currently get just over 200k uniques and just under 2m page views on weekdays (less on weekends). 
      • rtm: New server: one Xeon E5-2690 chip, 2.9 GHz, 8 cores total, 32 GB RAM.
    • Kyle Vanhemert: The graph shows the site’s [Reddit] beginnings in the primordial muck of porn and programming.
    • Drake Baer: But it's not about knowing the most people possible. Instead of being about size, a successful network is about shape.
    • @computionist: Basically when you cache early to scale you've admitted you have no idea what you're doing.
    • Norvig's Law: Any technology that surpasses 50% penetration will never double again.  
    • mbell: Keep in mind that a single modern physical server that is decently configured (12-16 cores, 128GB of ram) is the processing equivalent of about 16 EC2 m1.large instances.
    • @dakami: Learned databases because grep|wc -l got slow. Now I find out that's pretty much map/reduce.
    • Martin Thompson: I think "Futures" are a poor substitute for being pure event driven and using state machines. Futures make failure cases very ugly very quickly and they are really just methadone for trying to wean programmers off synchronous designs :-) 
    • @wattersjames: Can your PaaS automate Google Compute Engine? If it can, you will find that it can create VMs in only 35 seconds, at most any scale.
    • Peter M. Hoffmann: Considering the inherent drive of matter to form ever more complex structures, life seems inevitable.

  • The marker for a new generation, like kids who will never know a card catalogue, Kodak film, pay phones, phone books, VHS tapes, typewriters, or floppy disks:  Co-Founder Of Snapchat Admits He's Never Owned A Physical Server

  • Want the secret of becoming a hugely popular site? Make it fast and it will become popular. It's science. Are Popular Websites Faster?: No matter what distribution of websites you look at – the more popular websites trend faster. Even the slowest popular website is much faster than those that are less popular. On the web, the top 500 sites are nearly 1s faster (by the median), and on mobile it is closer to 1.5s faster. 

  • In 1956 we may not have had BigData, but BigStorage was definitely in. Amazing picture of IBM's 5 megabyte drive weighing in at more than 2,000 pounds.

  • Increasing slow query performance with the parallel query execution: Splitting a complex report into multiple queries and running them in parallel (asynchronously) can increase performance (3x to 10x in the above example) and will better utilize modern hardware. It is also possible to split the queries between multiple MySQL servers (i.e. MySQL slave servers) to further increase scalability (will require more coding).
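    The mechanics are simple enough to sketch: give each piece of the report its own connection and run the pieces concurrently. A minimal Python sketch, where the queries, table names, and connection settings are all hypothetical placeholders rather than anything from the article:

        # Sketch: run the pieces of a big report as parallel queries, one connection each.
        # Queries, table names, and connection settings are illustrative placeholders.
        import pymysql
        from concurrent.futures import ThreadPoolExecutor

        QUERIES = [
            "SELECT region, SUM(amount) FROM sales WHERE year = 2013 GROUP BY region",
            "SELECT region, SUM(amount) FROM sales WHERE year = 2012 GROUP BY region",
            "SELECT region, COUNT(*)    FROM orders WHERE year = 2013 GROUP BY region",
        ]

        def run_query(sql):
            # Each worker gets its own connection; MySQL connections are not thread-safe.
            conn = pymysql.connect(host="db-slave-1", user="report",
                                   password="secret", database="warehouse")
            try:
                with conn.cursor() as cur:
                    cur.execute(sql)
                    return cur.fetchall()
            finally:
                conn.close()

        # The queries run concurrently, so the report takes roughly as long as its slowest piece.
        with ThreadPoolExecutor(max_workers=len(QUERIES)) as pool:
            results = list(pool.map(run_query, QUERIES))

    Pointing each worker's connection at a different slave gives the multi-server variant the article mentions.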

  • Some good ideas. 5 Techniques to get more from your AWS Deployments: Launch Amazon EC2 Instances within VPC, Launch Amazon EC2 Within an Auto Scaling Group, Use a Bastion Host for Administering Cloud Deployments, Control Access through Identity and Access Management (IAM), Tap into the Power of AWS Tags.
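    Two of those techniques, launching into a VPC subnet and tagging, are cheap to sketch. The snippet below uses boto3 purely for illustration; the AMI ID, subnet ID, instance type, and tag values are placeholders:

        # Sketch: launch an EC2 instance into a VPC subnet and tag it (boto3).
        # All IDs and values below are placeholders, not real resources.
        import boto3

        ec2 = boto3.resource("ec2", region_name="us-east-1")

        instances = ec2.create_instances(
            ImageId="ami-00000000",        # placeholder AMI
            InstanceType="m3.large",
            MinCount=1, MaxCount=1,
            SubnetId="subnet-00000000",    # lands inside a VPC rather than EC2-Classic
        )

        # Tags make instances discoverable for billing, automation, and auditing.
        instances[0].create_tags(Tags=[
            {"Key": "Name", "Value": "web-1"},
            {"Key": "Environment", "Value": "staging"},
        ])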

  • Why would an article on Scaling Mercurial at Facebook generate so much vitriol, er, discussion? Ah, it involves that which must not be criticized, Git. Once you get past that, it's a great discussion of the issues and fixes supporting a large team with a large and dynamic code base. Also see Google's vs Facebook's Trunk Based Development for a thoughtful discussion.

  • Geeking with Greg with another interesting set of Quick Links. And I bet Greg is always honest and treats people like adults.

  • Chartbeat shows how to reduce TCP retry timeouts, cut CPU usage across their front end machines by about 20%, and improve their average response time from 175ms to 30ms. Excellent explanation of the investigative process. Shows the tools used and the thinking needed. Nginx was ignoring the backlog setting, which caused lots of connection drops. Then they go into sysctl tuning and show their configuration. Oh, and use ELB instead of DNS for failover.
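    For flavor, the knobs in play look roughly like this; the numbers below are placeholders for illustration, not Chartbeat's settings (their post has the real ones):

        # /etc/sysctl.conf -- illustrative values only
        net.core.somaxconn = 1024             # cap on the accept (listen) queue
        net.ipv4.tcp_max_syn_backlog = 4096   # cap on the half-open SYN queue

        # nginx.conf -- one common fix for a too-small listen queue is to set the
        # backlog explicitly on the listen directive rather than rely on defaults
        server {
            listen 80 backlog=1024;
        }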

  • Those pesky leaky abstractions. Academic Research – Multi-Threaded Sequential IO can be termed as “Perceived Random IO”: sequential multi-threaded file access can appear random to the back end storage subsystem if users are accessing the same disks. In that case, the users that are accessing the data sequentially will definitely see performance degradation.

  • DigitalOcean is getting a lot more pub and traction. As you might expect, it lacks services, but gives good hardware value for the money. DigitalOcean - A Review and Comparison: lower pricing, better disk speed, better processing power value, no ELB, no DynamoDB, no ElastiCache. Also, Digital Ocean vs. Linode.

  • Airbnb with a nice write-up on the design decisions behind Hammerspace, a library that stores strings off the Ruby heap. The problem was in their translation handling system, which must store and retrieve 80,000 strings for over 30 locales. Dynamic languages aren't so good at this sort of efficient memory handling. They used an on-disk hash table with a concurrency control layer in code. Spirited discussion on Hacker News.
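    The general trick is easy to sketch even outside Ruby: keep the big, static string set out of the interpreter's heap and look entries up from an on-disk hash, so the garbage collector never has to scan them. A toy Python version using the standard library's dbm (this is not Hammerspace's implementation, and the keys and paths are made up):

        # Toy sketch of the off-heap idea: translations live in an on-disk hash,
        # not on the interpreter's heap. Keys and paths are made-up examples.
        import dbm

        def build_store(path, translations):
            with dbm.open(path, "c") as store:       # create/open the on-disk hash
                for key, value in translations.items():
                    store[key.encode()] = value.encode()

        def lookup(path, key):
            with dbm.open(path, "r") as store:       # open read-only for lookups
                return store[key.encode()].decode()

        build_store("locales.db", {"en.home.title": "Welcome",
                                   "fr.home.title": "Bienvenue"})
        print(lookup("locales.db", "fr.home.title"))  # -> Bienvenue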

  • Clear explanation and code examples on How To Work with the ZeroMQ Messaging Library from DigitalOcean.
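    If you have never touched it, the canonical request/reply pair from that kind of tutorial fits in a few lines of pyzmq (the port and message contents here are arbitrary examples):

        # Minimal ZeroMQ request/reply pair using pyzmq.
        # Run server() in one process and client() in another; port is arbitrary.
        import zmq

        def server():
            ctx = zmq.Context()
            sock = ctx.socket(zmq.REP)            # reply socket
            sock.bind("tcp://*:5555")
            while True:
                msg = sock.recv_string()          # block until a request arrives
                sock.send_string("ack: " + msg)   # REP must reply before the next recv

        def client():
            ctx = zmq.Context()
            sock = ctx.socket(zmq.REQ)            # request socket
            sock.connect("tcp://localhost:5555")
            sock.send_string("hello")
            print(sock.recv_string())             # -> ack: hello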

  • The Mathematics of Gamification from Foursquare. Not a lot of hand-holding here, but it shows how they verify crowdsourced data using user accuracy scores. One interesting bit: We can gauge each Superuser’s voting accuracy based on their performance on honeypots (proposed updates with known answers which are deliberately inserted into the updates queue). 
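    The bookkeeping behind that, heavily simplified: score each Superuser by their honeypot hit rate, then weight their real votes by how far that accuracy sits from a coin flip. The post derives a proper Bayesian update; the smoothing constants and decision rule below are my own illustrative choices:

        # Simplified sketch: estimate a voter's accuracy from honeypots, then
        # combine votes weighted by the log-odds of that accuracy.
        # The smoothing and decision rule are illustrative, not Foursquare's exact math.
        import math

        def accuracy(correct, total, prior_correct=1, prior_total=2):
            # Laplace-style smoothing so a brand-new voter starts near 0.5
            return (correct + prior_correct) / (total + prior_total)

        def weight(acc):
            # Accurate voters count for more; a 50/50 voter counts for nothing.
            return math.log(acc / (1 - acc))

        def decide(votes):
            # votes: list of (vote, accuracy) with vote in {+1, -1}
            score = sum(v * weight(a) for v, a in votes)
            return "accept" if score > 0 else "reject"

        # A 9/10 honeypot voter saying yes outweighs a 6/10 voter saying no.
        print(decide([(+1, accuracy(9, 10)), (-1, accuracy(6, 10))]))  # -> accept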

  • AWS EC2 + MySQL with Complex, Highly-Concurrent Queries. Paul Otto thought new AWS developments might make it possible to run their complex queries: 4,000 Provisioned IOPS; cc2.8xlarge - this monstrosity provides 60GB of RAM, 32 virtual CPUs, and 10 gigabit networking; a MySQL threadpool, which reduces context switching. The result: I was able to successfully provide more than 600 qps of complex aggregate queries, all while CPU load remained within healthy thresholds.

  • 4k video? Not so fast. Dan Rayburn reveals The Dirty Little Secret About 4K Streaming: Content Owners Can’t Afford The Bandwidth Costs: To put some real numbers behind it, for a content owner delivering video today at 3Mbps, one hour of video is going to consume about 1.4GB. If they are paying $0.02 (two cents) per GB, which is a low price, it’s currently costing them about $0.03 (three cents) to deliver one hour of video. If they then want to deliver that same content at 4K quality, it’s going to cost them between $0.11 (eleven cents) and $0.18 (eighteen cents), depending on the bitrate used, for one hour of video. Anyone see a problem with this? 
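    The arithmetic is easy to check. The 3 Mbps stream and the $0.02/GB price are from the article; the 15 Mbps figure below is an assumed 4K bitrate picked from the middle of commonly quoted ranges, since the article only gives the resulting cost range:

        # Back-of-the-envelope: bandwidth cost of one hour of streamed video.
        # 3 Mbps and $0.02/GB come from the article; 15 Mbps is an assumed 4K bitrate.
        def cost_per_hour(bitrate_mbps, price_per_gb=0.02):
            gigabytes = bitrate_mbps * 3600 / 8 / 1000   # Mbit/s -> GB per hour
            return gigabytes * price_per_gb

        print(cost_per_hour(3))    # ~1.35 GB -> ~$0.03 per hour today
        print(cost_per_hour(15))   # ~6.75 GB -> ~$0.14 per hour at 4K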

  • Have part of a system that can evolve faster than your core. Some of My Best Friends Are Germs: So why haven’t we evolved our own systems to perform these most critical functions of life? Why have we outsourced all this work to a bunch of microbes? One theory is that, because microbes evolve so much faster than we do (in some cases a new generation every 20 minutes), they can respond to changes in the environment — to threats as well as opportunities — with much greater speed and agility than “we” can. Exquisitely reactive and adaptive, bacteria can swap genes and pieces of DNA among themselves. This versatility is especially handy when a new toxin or food source appears in the environment. The microbiota can swiftly come up with precisely the right gene needed to fight it — or eat it. In one recent study, researchers found that a common gut microbe in Japanese people has acquired a gene from a marine bacterium that allows the Japanese to digest seaweed, something the rest of us can’t do as well.

  • The Universe in a Glass of Wine: Richard Feynman on How Everything Connects, Animated: We will probably never know in what sense he said that, for poets do not write to be understood. But it is true that if we look at a glass of wine closely enough we see the entire universe. There are the things of physics: the twisting liquid which evaporates depending on the wind and weather, the reflections in the glass, and our imagination adds the atoms. The glass is a distillation of the earth’s rocks, and in its composition we see the secrets of the universe’s age, and the evolution of the stars. What strange array of chemicals are in the wine? How did they come to be? There are the ferments, the enzymes, the substrates, and the products. There in wine is found the great generalization: all life is fermentation. Nobody can discover the chemistry of wine without discovering the cause of much disease. How vivid is the claret, pressing its existence into the consciousness that watches it! If our small minds, for some convenience, divide this glass of wine, this universe, into parts — physics, biology, geology, astronomy, psychology, and so on — remember that nature does not know it! So let us put it all back together, not forgetting ultimately what it is for. Let it give us one more final pleasure: drink it and forget it all!

  • Horton: A Distributed System for Processing Declarative Reachability Queries over Partitioned Graphs: Horton is a graph query processing system that executes declarative reachability queries on a partitioned attributed multi-graph. It employs a query language, query optimizer, and a distributed execution engine.
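    For anyone new to the term, a reachability query just asks whether one vertex can be reached from another. A plain single-machine BFS version of the idea (nothing like Horton's distributed, partitioned execution, and the toy graph is made up):

        # Plain BFS reachability over a tiny in-memory graph -- only to show what a
        # reachability query asks; Horton answers these over partitioned graphs.
        from collections import deque

        def reachable(graph, src, dst):
            seen, queue = {src}, deque([src])
            while queue:
                node = queue.popleft()
                if node == dst:
                    return True
                for nxt in graph.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            return False

        graph = {"alice": ["bob"], "bob": ["carol"], "carol": []}
        print(reachable(graph, "alice", "carol"))  # -> True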

  • A Tutorial Survey of Architectures, Algorithms, and Applications for Deep Learning: The goal of this tutorial survey is to introduce the emerging area of deep learning or hierarchical learning to the APSIPA community. Deep learning refers to a class of machine learning techniques, developed largely since 2006, where many stages of nonlinear information processing in hierarchical architectures are exploited for pattern classification and for feature learning.