Stuff The Internet Says On Scalability For July 17th, 2015

Hey, it's HighScalability time:


In case you were wondering, the world is weird. Large Hadron Collider discovers new pentaquark particle.

  • 3x: Uber bigger than taxi market; 250x: traffic in HotSchedules' DDoS attack; 92%: Apple’s share of the smartphone profit pie; 7: Airbnb rejections
  • Quotable Quotes:
    • Netflix: A slow or unhealthy server is worse than a down server 
    • @inconshreveable: ngrok production servers, max GC pause: Go1.4 (top) vs Go1.5. Holy 85% reduction! /cc Go team
    • Nic Fleming: The fungal internet exemplifies one of the great lessons of ecology: seemingly separate organisms are often connected, and may depend on each other.
    • @IBMResearch: With 20+ billion transistors on new chip, that's a 50% scaling improvement over today’s tech #ibmresearch #7nm 
    • Micah Lemonik (Google Docs)~ Scaling – declare a master server for all writes to each document. If that server goes down, fail over to a different primary. Multiple servers can’t handle non-commutative messages at the same time. Could shard by chapter. … Once your unit of consistency is too large for one server, it’s no longer a unit of consistency.
    • Scaling Stack Overflow~ Performance is considered a major feature. Their target is to have everything below 30 ms. When it becomes slower, they have a sort of "stop the line", and everything stops until performance is back to normal again.
    • Garrett Smith~ Scalability is a desirable characteristic of any system. But what does the word “scalable” actually mean? In this talk, Garrett will argue that when we use the word “scalable” we should instead use the word “awesome”. Awesome has the same meaning… but is a lot more fun to say!
    • samkone: The reason modern Paas aren't enough, is the fact at some scale you're not just running web services. But also rather complex data pipelines involving distributed data systems like kafka, spark, cassandra, etc .. and a simple paas has issues handling those workloads. That's where Mesos shines. As for the user value, I consider having an uptime system, providing reliable and intelligent service based on data processing, by efficiently using your resources has an important business value.

  • Apple and Google Race to See Who Can Kill the App First. Honest question, how are people supposed to make money in this new world? Apps are quickly becoming just an identity that ties together 10 or so components that appear integrated as part of the OS, but don't look like your app at all. Reminds me of laminar flow. We are seeing a rebirth of CORBA, COM and OLE 2, this time the container is an app bound by deep linking and some ad hoc ways to push messages around. Show developers the money.

  • The dark side of Google 10x: One former exec told Business Insider that the gospel of 10x, which is promoted by top execs including CEO Larry Page, has two sides. “It’s enormously energizing on one side, but on the other it can be totally paralyzing.”

  • Wait, are we going all RAM or all flash? So confusing. MIT Develops Cheaper Supercomputer Clusters By Nixing Costly RAM In Favor Of Flash: researchers presented evidence at the International Symposium on Computer Architecture that if servers executing a distributed computation go to disk for data even just 5 percent of the time, performance takes a hit to where it's comparable with flash memory anyway. 40 servers with 10 terabytes of RAM wouldn't chew through a 10.5TB computation any better than 20 servers with 20TB of flash memory. What's involved here is moving a little computational power off of the servers and onto the chips that control the flash drives.

  • Is disruption merely a Silicon Valley fantasy? Corporate America Hasn’t Been Disrupted: the advantage enjoyed by incumbents, always substantial, has been growing in recent years...more Americans worked for big companies...Large companies are becoming more dominant in part by buying up their rivals...Consolidation could explain at least part of the rising failure rate among startups...The startup rate has declined in every major industry, every state and nearly every city, and the failure rate’s rise has been nearly as universal. 

  • What's a unikernel and why should you care? Amir Chaudhry reveals all in his Unikernels talk given at PolyConf 15. And here's the supporting blog post. Why are we still running applications on top of operating systems? Most applications are single purpose so why all the complexity? Why are we building software for the cloud the same way we build it for desktops? We can do better with unikernels, where every application is a single purpose VM with a single address space.

  • Great tutorial on using BigQuery. Analyzing 50 billion Wikipedia pageviews in 5 seconds

  • Are users still using your software? Here's how to calculate Rolling Retention and Hard Retention.
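
As a rough illustration of the two metrics, here is a minimal Python sketch. It assumes the commonly used definitions, which may differ in detail from the linked article: hard retention for day N counts users active exactly N days after install, while rolling retention counts users active on day N or any later day. The function name `retention` and the sample users are hypothetical.

```python
from datetime import date


def retention(installs, activity, day_n):
    """Return (hard, rolling) retention for day_n as fractions of the cohort.

    installs: dict mapping user -> install date
    activity: dict mapping user -> set of dates the user was active
    Definitions are assumptions: hard = active exactly on day_n,
    rolling = active on day_n or any later day.
    """
    cohort = len(installs)
    if cohort == 0:
        return 0.0, 0.0
    hard = 0
    rolling = 0
    for user, installed in installs.items():
        # Days elapsed since install for each active date.
        offsets = {(d - installed).days for d in activity.get(user, ())}
        if day_n in offsets:
            hard += 1
        if any(offset >= day_n for offset in offsets):
            rolling += 1
    return hard / cohort, rolling / cohort


# Hypothetical cohort: four users who all installed on the same day.
installs = {name: date(2015, 7, 1) for name in ("ann", "bob", "cat", "dan")}
activity = {
    "ann": {date(2015, 7, 8)},   # active exactly on day 7
    "bob": {date(2015, 7, 11)},  # active on day 10 only
    "cat": {date(2015, 7, 3)},   # active on day 2 only
}
hard_7, rolling_7 = retention(installs, activity, 7)
print(hard_7, rolling_7)
```

Rolling retention can never be lower than hard retention for the same day, since every user counted as "exactly day N" is also counted as "day N or later".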

  • Interesting way of looking at distributed systems. CALISDO: Threat Modeling for Distributed Designs: Consistency, Availability, Latency, Integrity, Scalability, Durability, Operational Costs.

  • If you want to know what happened at WWDC then mackuba has your hookup. New stuff from WWDC 2015. If it's not there it didn't happen.

  • A damn good question. What’s The Future of Work? The WTF Economy: My [Tim O’Reilly] goal is to shed light on the transformation in the nature of work now being driven by algorithms, big data, robotics, and the on-demand economy.

  • The industrialisation of Machine Learning: Humphrey Sheil summarizes a talk on Reinforcement Learning from Google's DeepMind team. Gorila: Google Reinforcement Learning Architecture. Here are the papers and videos from the International Conference on Learning Representations. Key: Google is building the same infrastructure around ML as they have around other problems (Gorila is to Reinforcement Learning as MapReduce is to task parallelisation as BigTable is to data storage).

  • Caching is how you speed up your Rails app by 66% and reduce response times from ~250ms to 50-100ms. First profile performance using rack-mini-profiler in production mode on a copy of your data. Set a goal: a maximum acceptable average response time. Benchmark with Apache Bench, or ab. Then there's a good description of caching techniques: Key-based cache expiration; Russian Doll Caching followed by an evaluation of different caching backends.
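
The idea behind key-based cache expiration can be sketched in a few lines. This is a language-neutral illustration in Python, not Rails code and not the linked article's implementation: the cache key embeds the record's last-modified marker, so an updated record simply produces a new key and the stale entry is never read again. The `Article` class, `cache_key`, and `cached_render` are hypothetical names, and a plain dict stands in for a real cache backend.

```python
from dataclasses import dataclass


@dataclass
class Article:
    id: int
    title: str
    updated_at: int  # stand-in for a last-modified timestamp


cache = {}         # stand-in for a real cache backend
render_calls = 0   # counts how often the expensive render actually runs


def cache_key(article):
    # The key changes whenever the record changes, so no explicit
    # invalidation is needed; old entries are left for the cache to evict.
    return f"article/{article.id}-{article.updated_at}"


def render(article):
    global render_calls
    render_calls += 1
    return f"<h1>{article.title}</h1>"


def cached_render(article):
    key = cache_key(article)
    if key not in cache:
        cache[key] = render(article)
    return cache[key]


post = Article(id=1, title="Hello", updated_at=100)
first = cached_render(post)
second = cached_render(post)        # served from the cache
post.title, post.updated_at = "Hello again", 101
third = cached_render(post)         # new key, so it renders again
print(render_calls, third)
```

The trade-off is that stale entries pile up until the backend evicts them, which is why this pattern is usually paired with a cache that drops least-recently-used entries.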

  • DConf 2015 videos are available online

  • The truth about MapReduce performance on SSDs: We found that for our tests and hardware, SSDs delivered up to 70 percent higher performance, for 2.5x higher $-per-performance...The primary benefit of SSD is high performance, rather than high capacity. Storage vendors and customers should also consider $-per-performance, and develop architectures to work-around capacity constraints.

  • LOL! It’s The Future: So I just need to split my simple CRUD app into 12 microservices, each with their own APIs which call each others’ APIs but handle failure resiliently, put them into Docker containers, launch a fleet of 8 machines which are Docker hosts running CoreOS, “orchestrate” them using a small Kubernetes cluster running etcd, figure out the “open questions” of networking and storage, and then I continuously deliver multiple redundant copies of each microservice to my fleet. Is that it?

  • Ever wondered how Shazam is able to recognize even the most obscure songs? Here is the result of 200 hours of effort over 3 years to spell it out in wonderful glorious detail. Amazing work. How does Shazam work. Hint: it does not involve Solomon, Hercules, Atlas, Zeus, Achilles, or Mercury.

  • Is machine learning always the answer? In Tracking down the Villains: Outlier Detection at Netflix, Netflix describes how they go about finding a sick but not deathly sick server using an unsupervised learning technique. Some of the commenters bring up different approaches like Integral analysis that may work without the extra complexity.
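
To make the "simpler approach" side of that debate concrete, here is a minimal Python sketch of a robust-statistics baseline. It is not the technique from the Netflix post and not the commenters' proposal; it swaps in a median absolute deviation (MAD) check, which flags servers whose metric sits far from the fleet median. The function name `mad_outliers`, the 3.5 threshold, and the sample latencies are illustrative assumptions.

```python
from statistics import median


def mad_outliers(latencies, threshold=3.5):
    """Return names of servers whose latency is far from the fleet median.

    latencies: dict mapping server name -> latency in ms
    Uses the modified z-score 0.6745 * |x - median| / MAD; the 3.5
    cutoff is a conventional default, assumed here for illustration.
    """
    values = list(latencies.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the fleet reports the same value; this simple
        # sketch declines to score anything in that case.
        return []
    return sorted(
        name
        for name, v in latencies.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical fleet: five healthy servers and one slow-but-alive server.
fleet = {"s1": 102, "s2": 98, "s3": 101, "s4": 99, "s5": 100, "s6": 180}
sick = mad_outliers(fleet)
print(sick)
```

Because the median and MAD barely move when one server misbehaves, the sick server cannot hide by dragging the baseline toward itself, which is the main weakness of a mean-and-standard-deviation check.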

  • AltConf 2015 videos are now available.

  • Here's how Facebook built their Moments feature. Moments makes it easier to organize event photos you didn't take with the group that was there. They used existing work from Facebook's AI Research Lab to turn on facial recognition to figure out who was at the event. You have one of those AI labs, don't you? Flux was used for a flexible way of keeping the client and server portions of the system in sync during periods of rapid change (flux, get it?). C++11 was used to build a common project core that could be used for both the iOS and Android apps. "features such as std::shared_ptr reference counting, lambda functions, and auto variable declarations, we were able to quickly implement highly performant, memory-safe code." Impressive work.

  • A fabulously detailed comparison of Amazon CloudSearch vs ElasticSearch vs Apache Solr. If you are searching for a search solution this is the place to start. The winner? That would be way too easy.

  • Excellent summary of Key Takeaway Points and Lessons Learned from QCon New York 2015. I look forward to the future videos. Lots of good stuff.

  • If you are interested in NGINX you can download a book on it for free: O’Reilly’s NGINX: A Practical Guide to High Performance. Covers: configuration, CGI, reverse proxy, and load balancing.

  • How does Mapbox serve beautiful imagery tiles? With a selectable compression level and composited images that combine different layers into one image. Rendering is on the server side. This reduces the amount of data transferred, reduces latency, and reduces power usage.

  • Have a MySQL scalability problem? There's MaxScale for that: A new tool to solve your MySQL scalability problems...MaxScale has proven to be a very useful and flexible tool that allows to elaborate solutions to problems that were very hard to tackle before. 

  • Building Analytics at 500px. Lots of amazing detail, including a primer on data warehousing. Too much to summarize. The result: We now have a system where people know how the company is doing, and can self serve their own analysis and data pulls without asking the analyst. We’re far from where we started and well on our way to transforming 500px into a top data-driven start-up.

  • The scary part of the DDoS attack on HotSchedules is never knowing who was responsible for the attack or why they did it. Random violence taps into the monster under the bed part of our brains. The way they dealt with the 10 to 15 gigabits per second (Gbps) attack was to side with a strong man, historically a common strategy: Ultimately, our sleepless security engineers re-engineered the whole service on a subnet protected by Akamai’s cloud security solution, which can withstand over 321 Gbps of traffic.

  • A very good discussion. Ask HN: Do you use Vagrant or Docker for active development? No conclusion of course, but lots of good data. There are many many workable workflows out there.

  • Here's how Gravity Tech implements Centralized logging (and more) with Apache Kafka: Gravity’s Kafka clusters are comprised by two or three Kafka brokers with twofold replication and onefold receipt acknowledgment: even this minimally safe setup proved to be reliable.

  • CryptDB: A database system that can process SQL queries over encrypted data.

  • Decentralizing Authorities into Scalable Strongest-Link Cothorities: We propose collective authorities or cothorities, an architecture enabling thousands of participants to witness, validate, and co-sign an authority's public actions, with moderate delays and costs. Hosts comprising a cothority form an efficient communication tree, in which each host validates log entries proposed by the root, and contributes to collective log-entry signatures.