Stuff The Internet Says On Scalability For October 24th, 2014

High Scalability

24 Oct 2014 — 7 min read

Hey, it's HighScalability time:

This is an ultrasound powered brain implant! (65nm GP CMOS technology, high speed, low power (100 µW))

70: percentage of the world's transactions processed using COBOL.

Quotable Quotes:
- John Siracusa: Apple has shown that it wants to succeed more than it fears being seen as a follower.
- @Dries: "99% of Warren Buffett's wealth was built after his 50th birthday."
- @Pinboard: It is insane to run a bookmarking site on AWS at any kind of scale. Unless you are competing with me, in which case it’s a great idea—do it!
- @dvellante: I sound like a broken record but AWS has the scale to make infrastructure outsourcing marginal costs track SW curve
- @BrentO: LOL RT @SQLPerfTips: "guess which problem you are more likely to have - needing joins, or scaling beyond facebook?"
- @astorrs: Legacy systems? Yes they're still relevant. ~20x the number of transactions as Google searches @IBM #DOES14
- @SoberBuildEng: "It was all the Agile guys' fault at the beginning. Y'know, if the toilet overflowed, it was 'What, are those Agile guys in there?!'" #DOES14
- @cshl1: #DOES14 @netflix "$1.8M revenue / employee" << folks, this is an amazing number
- Isaac Asimov: Probably more inhibiting than anything else is a feeling of responsibility. The great ideas of the ages have come from people who weren’t paid to have great ideas, but were paid to be teachers or patent clerks or petty officials, or were not paid at all. The great ideas came as side issues.

With Fabric can Twitter mend the broken threads of developer trust? A good start would be removing 3rd party client user limit caps. Not sure a kit of many colors will do it.

Not only do I wish I had said this, I wish I had even almost thought it. tjradcliffe: I distinguish between two types of puzzles: human-made (which I call puzzles) and everything else (which I call problems.) In those terms, I hate puzzles and love problems. Puzzles are contrived by humans and are generally as much psychology problems as anything else. They basically require you to think like the human who created them, and they have bizarre and arbitrary constraints that are totally unlike the real world, where, as Feyerabend told us, "Anything goes."

David Rosenthal with a great look at Facebook's Warm Storage: 9 [BLOB] types have dropped by 2 orders of magnitude within 8 months...the vast majority of the BLOBs generate I/O rates at least 2 orders of magnitude less than recently generated BLOBs...Within a data center it uses erasure coding...Between data centers it uses XOR coding...When fully deployed, this will save 87PB of storage...heterogeneity as a way of avoiding correlated failures.
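The XOR coding mentioned above is easy to see in miniature. What follows is only a sketch of the general idea, not Facebook's implementation: the block names and contents are hypothetical, and a real system works on large, fixed-size blocks. Two blocks live in two data centers, a third site keeps just their XOR, and any one of the three can be rebuilt from the other two.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical blocks held in two different data centers.
block_a = b"blob-in-dc-1"
block_b = b"blob-in-dc-2"

# A third data center stores only the parity block.
parity = xor_blocks(block_a, block_b)

# If the first data center is lost, its block is rebuilt from the other two.
recovered_a = xor_blocks(parity, block_b)
```

The storage cost is one extra block for every two, rather than a full copy of each.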

Gil Tene on whether the future is CPU bound: I don't think CPU speed is a problem. The CPUs and main RAM channels are still (by far) the highest performing parts of our systems. For example, yes, you can move ~10-20Gbps over various links today (wired or wifi, "disk" (ssd) or network), but a single Xeon chip today can sustain well over 10x that bandwidth in random access to DRAM. A single chip has more than enough CPU bandwidth to stream through that data, too. E.g. a single current Haswell core can move more than that 10-20Gbps in/out of its cache levels, and even relatively low end chips (e.g. laptops) will have 4 or more of these cores on a single chip these days. < BTW, a great thread if you are interested in latency issues.

Nobody expects the double fault! Twitter: Our apologies for today’s outage.

Do you need to build a real-time sliding-window dashboard over streaming data? And who doesn't? Here's a helpful whitepaper with a good and well explained reference architecture: Amazon Kinesis and Apache Storm.
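For a feel of what "sliding window" means here, a tiny single-process sketch follows. It is not the Kinesis and Storm architecture from the whitepaper; the class name and the in-order timestamps are assumptions made for illustration.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen in the last `window_seconds` (hypothetical helper).

    Assumes timestamps are added in non-decreasing order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        self.timestamps.append(timestamp)

    def count(self, now):
        # Drop events that have slid out of the window, then count the rest.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
for t in [1, 5, 9, 12]:
    counter.add(t)
```

Calling `counter.count(now=12)` keeps only the events newer than t=2. A distributed version has to shard this state and deal with late or out-of-order events, which is where a reference architecture earns its keep.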

Message Queue vs. Web Services? [closed]. Fortunately some good answers were added before Stack Overflow closed a really interesting and useful question. Though I think there's a level confusion here. A service would use a queue, a queue can't replace a service.

Snowflake-shaped networks are easiest to mend: They found the best [networks] were made from triangle or square-shaped loops, with one side of each loop missing. The loops link together and then back to a central hub, giving them a branching structure similar to a snowflake. If a link breaks, you just add in the missing side of a loop

Muut exclaims Redis as the primary data store? WTF?: In the end we were able to build a system that can fulfill API requests under load at around 2ms...The API servers that we are able to push this load with cost a mere $90/month so we’re able to support and scale horizontally to support pretty massive loads at a very low cost.

Antirez meditates on Redis Sentinel properties and fail scenarios. Conflict is difficult, but sometimes it generates more light than heat. This post is such a case. It thoughtfully walks through failure cases and how they are handled or not handled. This is not easy stuff.

EventMachine internals and the Reactor pattern. Nicely detailed walk through a reactor loop, in Ruby. A reactor loop is just an old style loop where events are received and then distributed to handlers. Looks very clean.
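The "receive events, distribute to handlers" loop can be sketched in a few lines. This is a toy in Python rather than Ruby and is not EventMachine's code: the `Reactor` class and its method names are made up for illustration, and an in-memory queue stands in for the blocking I/O readiness call (select, epoll) that a real reactor waits on.

```python
from collections import deque


class Reactor:
    """Toy reactor: queue events, then dispatch each to its handlers."""

    def __init__(self):
        self.handlers = {}     # event type -> list of callbacks
        self.events = deque()  # pending (event type, payload) pairs

    def on(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, payload=None):
        self.events.append((event_type, payload))

    def run(self):
        # The loop itself: take the next event, hand it to every handler.
        while self.events:
            event_type, payload = self.events.popleft()
            for handler in self.handlers.get(event_type, []):
                handler(payload)


log = []
reactor = Reactor()
reactor.on("data", lambda payload: log.append(f"got {payload}"))
reactor.on("data", lambda payload: reactor.emit("done", len(payload)))
reactor.on("done", lambda n: log.append(f"processed {n} bytes"))

reactor.emit("data", "hello")
reactor.run()
```

Handlers can emit further events, as the second "data" handler does, and the same loop picks them up on a later pass.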

JesterXL with a massive post on Message Systems in Programming: Callbacks, Events, Pub Sub, Promises, and Streams. If you want to give someone a good overview of these topics in the web space then this is a great start.

Everyone wants metrics, but gathering them often has a high cost. Fernando Papa, in Using Netlink to Optimize Socket Statistics, shows how to use Netlink, which is a way to do IPC between the kernel and a process, or between processes, to go from 100% CPU consumption (user and system) to less than 2% on highly loaded systems.

Murat dissects Facebook's software architecture. He covers: TAO: Facebook's distributed data store for the social graph and F4: Facebook's warm BLOB storage system. Facebook is great at exploring different architectures under the most rigorous of production environments. F4 manages storage across warm and hot environments, a problem intimately associated with great scale.

Beware of the Dogpile Effect. HipChat had an interesting failure case: Starting Oct 1st, we started to see higher than normal load on our web tier, which caused some load issues at that tier, which in turn triggered all web clients to suddenly reconnect and cause a spike across our whole system.
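One common way to blunt a reconnect stampede like this is to have clients wait a random, growing delay before retrying. The sketch below shows exponential backoff with full jitter. It is a general technique, not necessarily what HipChat did, and the function name and default values are assumptions.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect number `attempt` (0-based).

    Full jitter: a uniform random delay between 0 and
    min(cap, base * 2**attempt), so clients spread out instead of
    reconnecting in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Example: delays for the first five attempts of a single client.
delays = [reconnect_delay(attempt) for attempt in range(5)]
```

Because each client draws its own random delay, a thousand clients dropped at the same instant come back spread over the window rather than all at once.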

How do you create repeatable builds resulting in as-static-as-possible server environments? Here are Eight Docker Development Patterns from Vidar Hokstad: The Shared Base Container(s), The Shared Volume Dev Container, The Dev Tools Container, The Test In A Different Environment containers, The Build Container, The Installation Container, The Default-Service-In-A-Box Containers, The Infrastructure / Glue Containers. With good description and example code.

Scaling MySQL in Amazon Web Services. Excellent slide deck by Laine Campbell. Yah, a multi-RDS setup with provisioned IOPS is expensive, but people are even more expensive. Moving storage to ephemeral SSD saves $$$. Lots more.

If you want to learn what's new in the world of C++ it would be hard to do better than Herb Sutter. My CppCon talks.

Domain Modeling Around Deletes or “Using Cassandra as a queue even when you know better”. Ryan Svihla shows how it can be done, laying out the options with their pros and cons: Partition tables based on domain modeling, Partition tables based on a time event. Great background on why deletes are a problem. Oh, love the diagrams!

Chartbeat shows how to do Robot Traffic Filtering in Real Time, handling traffic that regularly peaks past 250K requests/second. A very nice practical application of Big O notation in a discussion of what string matching algorithms to use.

Instagram is doing something interesting by Migrating from AWS to AWS. Inside AWS there's a way to carve out your own enterprise style setup and connect to it directly over a fast network: VPC. It's complex to do. Here Instagram shows how they "migrated thousands of running AWS EC2 instances into Amazon’s Virtual Private Cloud (VPC) in the span of 3 weeks with no downtime." It was quite an accomplishment and will help anyone considering the same move.

Autocompletion doesn't have to be hard. Sloan Ahrens shows you how in Quick and Dirty Autocomplete with Elasticsearch Completion Suggest. The solution duplicates data, but it only takes a few configuration options.

Murat with a Paper Summary: A self-configurable geo-replicated cloud storage system: An experiment with clients distributed in datacenters around the world shows that reconfiguration every two hours increases the fraction of reads guaranteeing strong consistency from 33% to 54%. This confirms that automatic reconfiguration can yield substantial benefits which are realizable in practice.

Neural Turing Machines: We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

Instant Loading for Main Memory Databases: With Instant Loading, we contribute a novel CSV loading approach that allows scalable bulk loading at wire speed. This is achieved by optimizing all phases of loading for modern super-scalar multi-core CPUs. Large main memory capacities and Instant Loading thereby facilitate a very efficient data staging processing model consisting of instantaneous load-work-unload cycles across data archives on a single node. Once data is loaded, updates and queries are efficiently processed with the flexibility, security, and high performance of relational main memory databases.

Data Integrity and Problems of Scope: The crux of the problem is that, if used to provide data integrity, these properties must be applied at the appropriate scope. Simply making individual operations commutative or immutable doesn’t guarantee that their composition with other operations in your application also will be, nor will this somehow automatically lead to correct behavior. Analyzing operations (or even groups of operations) in isolation—divorced from the context in which they are being used—is often unsafe. If you’re building an application and want to use these properties to guarantee application correctness, you likely need to analyze your whole program behavior.
