Stuff The Internet Says On Scalability For June 7, 2013

Hey, it's HighScalability time:


(Ever feel like everyone has already climbed your Everest?)

  • Trillion Particles, 120,000 cores, and 350 TBs: Lessons Learned From a Hero I/O Run on Hopper
  • Quotable Quotes:
    • @PenLlawen: @spolsky In my time as a scalability engineer, I’ve seen plenty of cases where optimisation was left too late. Even harder to fix.
    • @davidlubar: Whoever said you can't fold a piece of paper in half more than 7 times probably forgot to unfold it each time. I'm up to 6,000.
    • deno:  A quick comparison of App Engine vs. Compute Engine prices shows that App Engine is at best 10x more expensive per unit of RAM.
    • Fred Wilson: strategy is figuring out what part of the market the company wants to play in, how it goes to market, and how it differentiates itself in that market. It is about what you are going to do and, importantly, what you are not going to do.
    • Elon Musk: SpaceX was able to achieve orders of magnitude savings in rockets. Instead of looking at what other rockets cost, they looked at the material cost, which is only 1-2% of the total; clearly people were doing silly things in how rockets were put together. For SpaceX, the savings came from putting the rocket together efficiently.

  • Switched away from App Engine, couldn't be happier. Shows how GCE may start to cannibalize GAE even though they seem to be targeted at different market segments. When costs rise with usage on GAE, there's a natural force pushing programmers to drop down an abstraction level to lower costs and get more power. In this case GCE cost about a third of what GAE cost, with much more performance headroom and much better latency. What you give up is safety: GAE replicates data for you, while on GCE you have to do it by hand. It's always about tradeoffs.

  • Scaling Memcache at Facebook. Murat Demirbas with his usual excellent analysis of how Facebook handles read-dominated workloads (reads are two orders of magnitude more frequent than writes), over 1 billion reads/second, with very high fan-out. Some lessons: 1) Separating cache and persistent storage systems via memcached allows them to be scaled independently. 2) Managing stateful components is operationally more complex than managing stateless ones, so keeping logic in a stateless client helps iterate on features and minimize disruption. 3) Facebook treats the probability of reading transient stale data as a tunable parameter, and is willing to expose slightly stale data in exchange for insulating the backend storage service from excessive load.
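
    The core pattern here is the look-aside cache. Below is a minimal Python sketch of it (not Facebook's code), assuming a local memcached and the pymemcache client; db_read/db_write are hypothetical stand-ins for whatever persistent store sits behind the cache. Note the delete-on-write: invalidating rather than updating the cached value is what the paper describes.

        # Minimal look-aside cache sketch. Assumes memcached on localhost:11211
        # and the pymemcache client; db_read/db_write are placeholders.
        from pymemcache.client.base import Client

        cache = Client(("localhost", 11211))

        def db_read(key):          # stand-in for the persistent storage read
            return b"value-from-db"

        def db_write(key, value):  # stand-in for the persistent storage write
            pass

        def read(key):
            value = cache.get(key)               # 1. try the cache first
            if value is None:                    # 2. on a miss, go to storage
                value = db_read(key)
                cache.set(key, value, expire=60) # 3. populate the cache
            return value

        def write(key, value):
            db_write(key, value)   # update the source of truth
            cache.delete(key)      # invalidate rather than update the cache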

  • Michael Bernstein says Real-Time Garbage Collection Is Real as long as you are willing to make the necessary tradeoffs. But what if I want it all? Guaranteed limits on memory and scheduling perfection? Oh well.

  • Groupon with their take on Building a Distributed Messaging System. They use HornetQ servers coordinated by load balancers and a client API. It handles over 100 topics and 200 subscribers; 5 million messages are published daily to two clusters with 9 nodes, and traffic is spiky, peaking at 2 million messages per hour.
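
    A quick back-of-envelope pass over those numbers shows why the spikes, not the daily total, drive the design:

        # Rough rates implied by the figures quoted above.
        daily_messages = 5_000_000           # 5 million published per day
        peak_per_hour = 2_000_000            # spikes up to 2 million per hour

        average_per_second = daily_messages / (24 * 3600)   # ~58 msg/s
        peak_per_second = peak_per_hour / 3600              # ~556 msg/s

        print(f"average ~{average_per_second:.0f} msg/s, peak ~{peak_per_second:.0f} msg/s")
        # The ~10x gap between average and peak is why the system has to be
        # sized for spikes rather than for the daily mean.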

  • Scaling Play applications with ZeroMQ. Good look at how to use a queue to connect nodes once you have to move to more than one server. 
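
    The article is about Play on the JVM; purely as an illustration of the "queue between nodes" pattern, here is a minimal pyzmq sketch in Python. The hostname nodeA and port 5557 are made up; one node binds a PUSH socket and the others PULL work from it.

        # Minimal ZeroMQ pipeline: one node pushes jobs, workers pull them.
        import zmq

        def producer():
            ctx = zmq.Context()
            push = ctx.socket(zmq.PUSH)
            push.bind("tcp://*:5557")           # node A exposes the work queue
            for i in range(10):
                push.send_json({"job": i})      # fan work out to connected workers

        def worker():
            ctx = zmq.Context()
            pull = ctx.socket(zmq.PULL)
            pull.connect("tcp://nodeA:5557")    # nodes B, C, ... pull jobs
            while True:
                job = pull.recv_json()
                print("processing", job)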

  • Curt Monash with an informative toe-to-toe conversation on SQL-Hadoop architectures compared: Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.

  • 10 Lessons from a year of development in Ruby on Rails: watch Railscasts, put static assets anywhere but your own server, use the asset pipeline, Turbolinks, and eager loading for associations, be careful with frequently used helper methods, don't rely on Ajax, make your front-end friends heroes not foes, use memcached and Dalli, and do performance profiling.

  • Cassandra performance decreases drastically with increase in data size. GC strikes again. Bryan Talbot had a lot of useful suggestions: remove data (rows especially), add nodes, add RAM, reduce bloom filter space, reduce row and key cache sizes, increase the index sample ratio, reduce compaction concurrency and throughput, and update to Cassandra 1.2.

  • Atomic I/O operations: for many situations, atomic I/O operations look like a good way to let the hardware help the software get better performance from solid-state drives. Once this functionality finds its way into the mainline, taking advantage of fast drives might just feel a bit less like coaxing another trip out of that old jalopy.

  • Multicast replication in Percona XtraDB Cluster (PXC) and Galera. 1Gbps may seem like a lot of network but if you are replicating a lot of data it quickly becomes a bottleneck. The solution detailed is to switch to a real multicast approach instead of sending each message multiple times. There's some setup, but the traffic reductions can be great. 
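
    Rough arithmetic on why per-peer unicast hits the wall while multicast does not; the node count and write-set volume below are made up for illustration:

        # Replication traffic on the originating node's NIC, unicast vs multicast.
        nodes = 9                  # hypothetical cluster size
        writeset_mb_per_sec = 40   # hypothetical replicated write-set volume

        unicast_out = (nodes - 1) * writeset_mb_per_sec   # one copy per peer
        multicast_out = writeset_mb_per_sec               # one copy on the wire

        print(unicast_out, "MB/s vs", multicast_out, "MB/s")
        # 320 MB/s would saturate a 1 Gbps link (~125 MB/s); 40 MB/s would not.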

  • Steven Sinofsky with a great summary from the All Things D Conference (D11). Like this from Elon Musk: This is a great example of how disruption gets talked about in early stages – all the focus on whether electric cars can displace gas cars using the criteria gas cars developed over all this time. 

  • Scaling the Database: Why You Shouldn’t Cluster on GUIDs: it’s pretty common for the DBA or developer to leave the primary key set as the default clustering key — that is a problem. When this happens with a GUID column, it can create huge performance issues in the database.
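
    A toy illustration of the underlying reasoning (not SQL Server itself): sequential keys always append at the end of the clustered index, while random GUIDs land anywhere in the sorted order, which in a real B-tree means page splits and fragmentation.

        # Count inserts that do NOT land at the end of the sorted key order.
        import bisect
        import uuid

        def non_append_inserts(keys):
            ordered, count = [], 0
            for k in keys:
                pos = bisect.bisect(ordered, k)
                if pos < len(ordered):      # mid-order insert, i.e. a "page split"
                    count += 1
                ordered.insert(pos, k)
            return count

        n = 10_000
        sequential = list(range(n))
        random_guids = [str(uuid.uuid4()) for _ in range(n)]

        print("sequential:", non_append_inserts(sequential))    # 0
        print("GUIDs:     ", non_append_inserts(random_guids))  # almost all of them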

  • Bryan Wagstaff with a mystical magical Journey Through the CPU Pipeline: What goes on inside the CPU? How long does it take for one instruction to run? What does it mean when a new CPU has a 12-stage pipeline, or 18-stage pipeline, or even a "deep" 31-stage pipeline? 

  • It's good to see ants and humans finding a basis for cooperation. Majid Sarvi used repellent to make ants flee from various structures. In computer simulations of human behaviour, the layouts exited most swiftly by ants – those with exits in corners rather than in the middle of walls, for example – cut evacuation times by up to 160 per cent.

  • Scaling Storage Is Hard To Do: Loosely-coupled object storage is the future: No more monoliths or clusters. The new wave of startups recognize this.

  • Russ Cox with an awesome paper on How Google Code Search Worked. Excellent detail that makes you think hey, I could maybe possibly do that...now.
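
    The heart of the paper is a trigram index. Cox's version compiles full regular expressions into trigram queries; the Python sketch below only handles literal substrings, but it shows the shape of the idea: index documents by their 3-grams, intersect posting lists to get candidates, then verify.

        # Minimal trigram index sketch (literals only, not full regexes).
        from collections import defaultdict

        def trigrams(text):
            return {text[i:i + 3] for i in range(len(text) - 2)}

        class TrigramIndex:
            def __init__(self):
                self.postings = defaultdict(set)   # trigram -> doc ids
                self.docs = {}

            def add(self, doc_id, text):
                self.docs[doc_id] = text
                for t in trigrams(text):
                    self.postings[t].add(doc_id)

            def search(self, literal):
                # Candidates must contain every trigram of the query...
                sets = [self.postings[t] for t in trigrams(literal)]
                candidates = set.intersection(*sets) if sets else set()
                # ...but trigram containment is only necessary, so verify.
                return [d for d in candidates if literal in self.docs[d]]

        idx = TrigramIndex()
        idx.add(1, "func main() { fmt.Println(42) }")
        idx.add(2, "def main(): print(42)")
        print(idx.search("Println"))   # -> [1]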

  • How do you make money with APIs? John Musser from Programmable Web has 20 options for you. Key is: An API strategy is not an API business model. 

  • Time is money: How Clash of Clans earns $500,000 a day with in-app purchases: In other words, gems mean instant satisfaction. What's more, their cost is neatly obfuscated: purchased gems come in big, oddly numbered stashes (500, 1200, 2500, 6500, 14000), and once you have the pile sitting in your account, it's easier to whittle it away 834 gems at a time, whereas you'd probably think twice if asked to punch in your credit card details and confirm that you really want to spend the money.

  • Eric Drexler: Nature developed the first digital information systems and used them to direct atomically precise fabrication. Textbook biology teaches us that DNA encodes digital data in sequences of four kinds of monomers—nucleotides—each bearing two bits of data and read in blocks of six-bit words that encode the twenty standard amino acids found in proteins. Ribosomes read this information, produce the specified proteins, and thereby build the molecular components and devices that are so central to the workings of everything else in the cell, including the ribosomes themselves.

  • Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams: Photon is deployed within Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds.
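
    Stripped of all the fault tolerance and the Paxos-backed IdRegistry that make Photon interesting, the core join problem looks something like the toy Python sketch below: two streams share a key (here query_id), either side can arrive first, and whichever arrives early waits for its partner.

        # Toy continuous join on query_id; nothing like Photon's real design.
        queries = {}          # query_id -> query event waiting for its click
        pending_clicks = {}   # query_id -> click event that arrived first

        def on_query(event):
            qid = event["query_id"]
            queries[qid] = event
            if qid in pending_clicks:            # click beat the query log
                emit_joined(event, pending_clicks.pop(qid))

        def on_click(event):
            qid = event["query_id"]
            if qid in queries:
                emit_joined(queries[qid], event)
            else:
                pending_clicks[qid] = event      # wait for the query to show up

        def emit_joined(query, click):
            print({"query_id": query["query_id"],
                   "query": query["q"], "ad": click["ad"]})

        on_query({"query_id": 7, "q": "running shoes"})
        on_click({"query_id": 7, "ad": "acme-sneakers"})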

  • Omega: flexible, scalable schedulers for large compute clusters: We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control. We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach – all driven by real-life Google production workloads.
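
    The shared-state, optimistic-concurrency idea can be boiled down to a toy Python sketch (nothing like the real system): each scheduler works from a snapshot of cluster state and tries to commit its placement; if another scheduler committed first, the transaction fails and the scheduler retries against fresh state. The real Omega detects conflicts per resource; this sketch rejects on any intervening change.

        # Toy shared-state scheduler with lock-free optimistic concurrency.
        import copy

        class CellState:
            def __init__(self, free_cpus):
                self.free = dict(free_cpus)   # machine -> free CPUs
                self.version = 0              # bumped on every successful commit

            def snapshot(self):
                return copy.deepcopy(self)

            def commit(self, snapshot, machine, cpus):
                if snapshot.version != self.version:  # someone else committed first
                    return False                      # fail; caller retries
                if self.free[machine] < cpus:
                    return False
                self.free[machine] -= cpus
                self.version += 1
                return True

        def schedule(cell, cpus_needed):
            while True:
                snap = cell.snapshot()
                machine = max(snap.free, key=snap.free.get)  # this scheduler's policy
                if cell.commit(snap, machine, cpus_needed):
                    return machine

        cell = CellState({"m1": 8, "m2": 16})
        print(schedule(cell, 4))   # -> "m2"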