Stuff The Internet Says On Scalability For March 13th, 2015

High Scalability

13 Mar 2015 — 7 min read

Hey, it's HighScalability time:

1957: 13 men delivering a computer. 2017: a person may wear 13 computing devices (via Vala Afshar)

5.3 million/second: LinkedIn metrics collected; 1.7 billion: Tinder ratings per day
Quotable Quotes:
- @jankoum: WhatsApp crossed 1B Android downloads. btw our android team is four people + Brian. very small team, very big impact.
- @daverog: Unlike milk, software gets more expensive, per unit, in larger quantities (diseconomies of scale) @KevlinHenney #qconlondon
- @DevOpsGuys: Really? Wow! RT @DevOpsGuys: Vast majority of #Google software is in a single repository. 60million builds per year. #qconlondon
- Ikea: We are world champions in making mistakes, but we’re really good at correcting them.
- Chris Lalonde: I know a dozen startups that failed from their own success, these problems are only going to get bigger.
- Chip Overclock: Configuration Files Are Just Another Form of Message Passing (or Maybe Vice Versa)
- @rbranson: OH: "call me back when the number of errors per second is a top 100 web site."
- @wattersjames: RT @datacenter: Worldwide #server sales reportedly reach $50.9 billion in 2014
- @johnrobb: Costs of surveillance. Bots are cutting the costs by a couple of orders of magnitude more.

Is this a sordid story of intrigue? How GitHub Conquered Google, Microsoft, and Everyone Else. Not really. The plot arises from the confluence of the creation of Git, the evolution of Open Source, small network effects, and a tide rising so slowly we may have missed it. Brian Doll (a GitHub VP) is letting us know the water is about nose height: Of GitHub’s ranking in the top 100, Doll says, “What that tells me is that software is becoming as important as the written word.”

This is audacious. Peter Lawrey Describes Petabyte JVMs. For when you really really want to avoid a network hit and access RAM locally. The approach is beautifully twisted: It’s running on a machine with 6TB and 6 NUMA regions. Since, as previously noted, we want to try and restrict the heap or at least the JVM to a single NUMA region, you end up with 5 JVMs with a heap of up to 64GB each, and memory mapped caches for both the indexes and the raw data, plus a 6th NUMA region reserved just for the operating system and monitoring tasks.

There's a parallel between networks and microprocessors: network latency is not improving and microprocessor clock speeds are not increasing, yet we keep getting more bandwidth and more MIPS per box. Programming for this reality will be a new skill. Observed while listening to How Did We End Up Here?

Advances in medicine will happen because of big data. We need larger sample sizes to figure things out. Sharing Longevity Data. We are seeing 23andMe Enters Drug Development.

Which brings us to Will Apple’s ResearchKit Change Science? Unfortunately the article doesn't actually address the question. It gets lost in concerns over whether perfect anonymity is possible and whether participants understand the risks involved. Some people worry too much. Think of the power of a study with hundreds of thousands and potentially millions of participants. We could actually figure some stuff out.

Which brings us to Sewage Bacteria Linked to Obesity: “Had they shown up in my office with this idea I would have said, ‘You’re nuts. There’s no way you can pull that off,’” Randy Seeley, an obesity researcher at the University of Michigan in Ann Arbor who was not involved in the work, told Inside Science. “The fact that they are capable of doing it just shows you the power of the big-data approach.”

Something you quickly learn in distributed and multithreaded programming. There Is No Now – Problems with Simultaneity in Distributed Systems. Nice summary by marknadal: 1. You cannot beat the speed of light; 2. Machines break. Even the most reliable ones; 3. Networks are unreliable. Even local area networks; 4. It is an exciting time for distributed systems: CRDTs, Hybrid Logical Clocks, Zookeeper, etc.

Judging from the tweet stream QCon London was raging. Will Hamill wrote some really excellent glosses on several talks from the show. I have no idea how he writes such good summaries on the fly like that. Find them at QCon London 2015 Day One, QCon London 2015 Day Two. Some of the talks: Cluster management at Google; Securing PaaS with Docker and Weave; Infrastructure and Go; Docker vs PaaS; Docker clustering: batteries included, but removable; Responding rapidly when you have 100GB+ data sets in Java; Protocols: the glue for applications. And here's a summary QCon London 2015 - Day 3 by Pere Villega. Also excellent. It's almost as good as being there.

This makes you think. If State Is Hell, SOA Is Satan: State might be hell, but it's a hell we have to live. I don't advocate an all-out microservice architecture for a company just getting its start. The complications far outweigh any benefits to be gained, but it becomes a necessity at a certain point. The key is having an exit strategy.

Dang, almost 800 comments in this thread: Goodbye MongoDB, Hello PostgreSQL. A lot of rehashing of old arguments, but some good content can be had if you are brave enough to wade in.

From a summary of John Wilkes's keynote: John ended the keynote by summarizing with a call for incremental improvement, saying that the likelihood for success and building momentum is much higher than a big-bang project: “roofshot is better than moonshot”. John left us with three points to finish: Resilience is more important than performance; It’s okay to use other people’s stuff, don’t do it all yourself; Do more monitoring.

How Foursquare measures perceived latency. Cool graphs. It's full stack, measuring latency at every stage, including even serialization times in Core Data.

In the I did not know that department Robert Graham lays down some arcane knowledge in Some notes on DRAM (#rowhammer): These tiny capacitors are prone to corruption from other sources. One common source of corruption is cosmic rays. Another source is small amounts of radioactive elements in the materials used to construct memory chips. So, chips must be built with radioactive-free materials. The banana you eat has more radioactive material inside it than your DRAM chips.

The Applied Science YouTube channel is awesome. There's a comment on How Digital Light Processing (DLP) works that brings home how much we live in the future. Mythricia: What's even more amazing is that we use devices with this kind of complexity every day, and don't bat an eye. Or the fact that you can buy a chip made using this super complex and mega-expensive-to-develop manufacturing technology for literally .1$ a pop.

It's full of streams of immutable data. Why use Event Sourcing. Event Sourcing is "the rebuilding of objects based on events" with an "Event Store holding the events to rebuild an object behind the domain as opposed to something storing the current state." Seems like a log based system. Good discussion on Reddit. check3streets: "The system depends on its events and the structure of those events as the Source of Truth." Here's a group on the topic.
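To make "rebuilding of objects based on events" concrete, here is a minimal sketch rather than anything from the linked article. The bank-balance domain, the `Event` type, and the event names `deposited` and `withdrawn` are all assumptions chosen for illustration. The point is only that the store holds events, and current state is derived by replaying them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event types: "deposited" or "withdrawn"
    amount: int


def apply(balance: int, event: Event) -> int:
    """Fold a single event into the current state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def rebuild(events) -> int:
    """Rebuild the balance by replaying every stored event in order."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance


# The event store is an append-only list; the balance itself is never stored.
event_store = [
    Event("deposited", 100),
    Event("withdrawn", 30),
    Event("deposited", 5),
]
current_balance = rebuild(event_store)
```

Replaying a prefix of the list gives the state at an earlier point in time, which is one reason the approach looks so much like a log.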

How do you build a high availability distributed key-value store? Nice description of a simple system that uses LedisDB, which is protocol compatible with Redis, but without the main memory limitation because RocksDB or LevelDB is used for backend storage. It uses master/slave replication to guarantee data security; redis-failover to monitor the system and do failover; xcodis to support clustering.

Microservices even have their own home: microservices.io. I'm glad they don't have to sleep in the park anymore.

Good Lessons to be learned from a project nightmare. $100+ million U.S. Department of Labor contract down the drain. Make sure to: Own your own data, Own the licenses, Select a cloud provider with care, Be suspicious of low bids, Monitor program progress closely. You might be wondering, How Did We End Up Here?

So you think you have a good team. Could your team do this? Awe-Inspiring Amish Barn Raising – 3 Minutes and 30 seconds of Amazing Teamwork.

Ronald Bradford with an EBS money saving tip: A trivial cost saving tip for checking if you are spending money in your AWS environment on unused resources. These are available and unused EBS volumes which you should consider deleting.
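A rough sketch of that check, assuming you already have a response shaped like the one EC2's DescribeVolumes call returns (a `Volumes` list whose entries carry `VolumeId`, `Size`, and `State`). The helper name `unused_volumes` and the sample data are made up for illustration, and this is not necessarily how the linked tip does it.

```python
def unused_volumes(response):
    """Return (volume_id, size_gib) for volumes not attached to any instance."""
    return [
        (volume["VolumeId"], volume["Size"])
        for volume in response.get("Volumes", [])
        if volume.get("State") == "available"
    ]


# With boto3 installed and credentials configured, the response could come from:
#   import boto3
#   response = boto3.client("ec2").describe_volumes(
#       Filters=[{"Name": "status", "Values": ["available"]}])
# The sample below is made-up data in the same shape.
sample_response = {
    "Volumes": [
        {"VolumeId": "vol-aaa", "Size": 100, "State": "available"},
        {"VolumeId": "vol-bbb", "Size": 50, "State": "in-use"},
        {"VolumeId": "vol-ccc", "Size": 8, "State": "available"},
    ]
}
candidates = unused_volumes(sample_response)
total_gib = sum(size for _, size in candidates)
```

An "available" volume is one that exists but is attached to nothing, so it is billed without doing any work. Accounts with many volumes would likely need to page through results, and it is worth snapshotting anything you are unsure about before deleting it.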

Don't call it cold storage. It's nearline storage. And Google made it fast (~3 second response times) and Google made it cheap (1 cent per GB per month). Introducing Google Cloud Storage Nearline: (near)online data at an offline price. Just in time for 8K video and all those pesky log files. If the access costs are low enough this service could make a difference in architectures.

Tinder is Scaling with MongoDB and Some Help from ObjectRocket: Tinder is an example of a company that is outsourcing some of its most complex database management to focus on the product and what it knows best: matchmaking via geolocation.

Pinterest on Serving configuration data at scale with high availability. Pinterest used Redis for their spam domain blacklist. They moved to a system that uses ZooKeeper as the notifier and S3 for the storage: "Since S3 provides very high availability and throughput, it seemed to be a good fit for our use case in absorbing the sudden load spikes."
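The notifier-plus-storage split can be sketched in a few lines. This is an illustration under stated assumptions, not Pinterest's implementation: `BlobStore` stands in for S3, `Notifier` stands in for a watched ZooKeeper node, and every class, function, and key name here is hypothetical. The idea shown is that the notifier carries only a small pointer while the bulk data is fetched from the store.

```python
class BlobStore:
    """Stand-in for S3: holds versioned config payloads by key."""

    def __init__(self):
        self._objects = {}

    def put(self, key, payload):
        self._objects[key] = payload

    def get(self, key):
        return self._objects[key]


class Notifier:
    """Stand-in for a watched ZooKeeper node: holds a pointer, fires callbacks."""

    def __init__(self):
        self._pointer = None
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def set(self, pointer):
        self._pointer = pointer
        for callback in self._watchers:
            callback(pointer)


class ConfigClient:
    """Keeps a local copy of the blacklist, refreshed when notified."""

    def __init__(self, store, notifier):
        self._store = store
        self.blacklist = frozenset()
        notifier.watch(self._on_change)

    def _on_change(self, key):
        # The notification carries only the key; the data comes from the store.
        self.blacklist = frozenset(self._store.get(key))


def publish(store, notifier, version, domains):
    key = f"blacklist/v{version}"
    store.put(key, list(domains))  # write the data first
    notifier.set(key)              # then announce the new version


store = BlobStore()
notifier = Notifier()
clients = [ConfigClient(store, notifier) for _ in range(3)]
publish(store, notifier, 1, ["spam.example"])
publish(store, notifier, 2, ["spam.example", "junk.example"])
```

When every client refetches at once, the burst of reads lands on the storage layer rather than the coordination service, which fits the load-spike reasoning in the quote.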

Ben Stopford comments on 10 or so interesting papers in the Best of VLDB 2014.

Niket Patel tells the story of how three services (PDF, Alert, Gradebook) moved From a Ruby monolith to microservices in Go, lessons learned. Moving to Go microservices halved memory usage and improved performance by 10x to 50x. They also liked the easier deployment model with Go over Ruby.

Are open source libraries a kind of horizontal gene transfer?

I've always thought of Live Migration as a bit of snake oil, but this sounds useful: Google Compute Engine uses Live Migration technology to service infrastructure without application downtime. You can roll out security fixes; deal with bricked memory, disk drives, and machines; upgrade software, hardware, and configuration. Examples: Heartbleed bug, flapping network cards, cascading battery/power supply issues, a buggy update pushed to production, unexpected host memory consumption.

Here's a great recap of the Mobile World Congress 2015 by Chetan Sharma. Some of the topics: Facebook and internet.org, Google MVNO and broadband connectivity, Apple - the invisible participant in the show, 5G, Net Neutrality, Samsung S6 launch, Industrial IoT, Lack of consumer IoT, and even more. Lots and lots of commentary.

Understanding the Causes of Consistency Anomalies in Apache Cassandra: We show that the staleness spikes exhibited by Cassandra are strongly correlated with garbage collection, particularly the “stop-the-world” phase which pauses all application threads in a Java virtual machine. We show experimentally that the staleness spikes can be virtually eliminated by delaying read operations artificially at servers immediately after a garbage collection pause.
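The mitigation in that last sentence can be sketched as a small helper. This is a guess at the shape of the idea, not the paper's code: the function name `read_delay` and the `settle_window` parameter are hypothetical, and the abstract quoted here gives no value for how long reads should be held back.

```python
def read_delay(now, last_gc_end, settle_window):
    """Seconds a read should wait so that it starts no earlier than
    last_gc_end + settle_window. Returns 0.0 outside that window."""
    if last_gc_end is None:  # no GC pause observed yet
        return 0.0
    return max(0.0, (last_gc_end + settle_window) - now)


# A server could call this at the top of its read path, for example:
#   import time
#   time.sleep(read_delay(time.monotonic(), last_gc_end, settle_window=0.05))
# where last_gc_end is recorded whenever a stop-the-world pause finishes.
no_pause_yet = read_delay(10.0, None, 0.5)
just_after_pause = read_delay(10.2, 10.0, 0.5)
long_after_pause = read_delay(11.0, 10.0, 0.5)
```

The trade is a little extra read latency right after a pause in exchange for giving writes that queued up during the pause a chance to apply before reads are served.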
