Stuff The Internet Says On Scalability For August 16, 2013

High Scalability

16 Aug 2013 — 6 min read

Hey, it's HighScalability time:

1 trillion: edges in Facebook's search graph
Quotable Quotes:
- Miguel de Icaza: Callbacks as our Generations' Go To Statement
- @kaleidic: "Haskell is a great language as a way of thinking, but I prefer programming in a language where I can cheat."--Meijer
- T.S. Eliot, who totally got the Internet: Distracted from distraction by distraction

The argument eternal: Why Some Startups Say the Cloud Is a Waste of Money. The argument follows a getting-back-to-nature pattern. Amazon is the glittering city of compromised values and colos are the familiar places of refuge and virtue. But be honest, don't we all really know the score by now?

Fred Wilson: The Similarities Between Building and Scaling a Product and a Company: The system you and your team built will break if you don't keep tweaking it as demand grows. Greg Pass, who was VP Engineering at Twitter during the period where Twitter really scaled, talks about instrumenting your service so you can see when it's reaching a breaking point, and then fixing the bottleneck before the system breaks. He taught me that you can't build something that will never break. You have to constantly be rebuilding parts of the system and you need to have the data and processes to know which parts to focus on at what time. The team is the same way.

On the power of focus. Why not a little more Fred Wilson while we're at it: My favorite part so far is how Jobs turned around Apple and did it pretty quickly. He did two primary things as far as I can tell. First, he got his people into the top jobs and got rid of the executives who had been calling the shots before he showed up. And second, he brought focus to the product line, and thus everything else.

Pretty funny: Beating the CAP Theorem Checklist

Wireless devices go battery-free with new communication technique. If recycling is good then recycling wireless signals must be great. Especially if it makes distributed sensors a reality. As it stands the grand vision of the Internet of Things is a non-starter. These things need power and they need to communicate, and both requirements make them expensive and large. You may have thought, as I have, that the world is bathed in signals, so why not use that for power? These guys have gone one step further: messaging without relying on batteries or wires for power. Two devices communicate with each other by reflecting the existing TV and cellular transmissions to exchange information. Ambient Backscatter to the rescue.

Gilt on Designing Distributed Systems With ZooKeeper. For distributed service development they use Apache ZooKeeper, RabbitMQ, and Kafka. They like ZooKeeper for helping with agreement between nodes and serializing work, which take the form of two tools: Leader Election and Partitioning. Good discussion and enough details to get started.
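To make those two tools a little more concrete, here is a minimal in-memory sketch. It assumes the common ZooKeeper election recipe (every worker registers a sequentially numbered node and the lowest number leads) and a simple round-robin split of partitions across live workers. It does not talk to ZooKeeper, and `FakeEnsemble`, `assign_partitions`, and the worker names are hypothetical, not Gilt's code.

```python
class FakeEnsemble:
    """In-memory stand-in for the small part of ZooKeeper this sketch needs."""

    def __init__(self):
        self._next_seq = 0
        self.members = {}  # sequence number -> worker name

    def join(self, worker):
        """Register a worker, like creating an ephemeral sequential node."""
        seq = self._next_seq
        self._next_seq += 1
        self.members[seq] = worker
        return seq

    def leave(self, seq):
        """Drop a worker, like its ephemeral node vanishing with its session."""
        self.members.pop(seq, None)

    def leader(self):
        """The worker holding the lowest sequence number leads."""
        if not self.members:
            return None
        return self.members[min(self.members)]


def assign_partitions(ensemble, partition_count):
    """Spread partition ids round-robin over the live workers, in join order."""
    workers = [ensemble.members[seq] for seq in sorted(ensemble.members)]
    if not workers:
        return {}
    assignment = {worker: [] for worker in workers}
    for partition in range(partition_count):
        assignment[workers[partition % len(workers)]].append(partition)
    return assignment


ensemble = FakeEnsemble()
seq_a = ensemble.join("worker-a")
seq_b = ensemble.join("worker-b")
seq_c = ensemble.join("worker-c")

leader_before = ensemble.leader()
plan_before = assign_partitions(ensemble, 6)

# The leader goes away: the next-lowest node takes over and work is re-split.
ensemble.leave(seq_a)
leader_after = ensemble.leader()
plan_after = assign_partitions(ensemble, 6)
```

In a real deployment the membership changes would arrive through ZooKeeper watches, and every worker would recompute the same assignment from the same sorted member list.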

Bitmessage. Secure messaging that doesn't leak metadata. It's a p2p network for sending messages. Absolutely secure. Uses bitcoin block chain technology. Everyone gets everything. You receive all messages from every user. Your local client extracts what is meant for you. Nobody can monitor who is sending to whom.

It's funny how the immune system for a website can mirror the human immune system. In Facebook's Summary of the August 13th App Outage they talk about how they detected a pattern of attack that caused a lot of good applications to be disabled. Like an auto-immune disease.

You and I know web development is way too hard. Matt Debergalis explains why with a great tour of the web development landscape in Meteor - Web Development Like You Never Seen. We are building tomorrow's apps with yesterday's tools. HTTP is the wrong protocol. We need new primitives that realize we are creating distributed applications with client and server slices unified in a cloud. JavaScript is the new C and what is needed is a single API that works across the client and server. Looks interesting, but the examples on their site didn't work and none of them looked very complex.

Two good articles on Facebook's graph search: The Making of Facebook’s Graph Search, Scaling Apache Giraph to a trillion edges. Facebook ended up choosing Giraph to help solve problems that couldn't be solved with Hive/Hadoop. They liked it because it interfaces with HDFS and Hive, uses MapReduce, and supports a wide variety of graph applications. Excellent story of how Facebook made it work for them. "On 200 commodity machines we are able to run an iteration of page rank on an actual 1 trillion edge social graph formed by various user interactions in under four minutes with the appropriate garbage collection and performance tuning."
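For a feel of what "an iteration of page rank" computes, here is a minimal single-machine sketch. It is not Giraph's API: the function name, the three-vertex graph, and the 0.85 damping factor are assumptions for illustration, and vertices with no outgoing edges are simply ignored.

```python
def pagerank_iteration(out_edges, ranks, damping=0.85):
    """Run one PageRank step.

    Each vertex splits its current rank evenly among its out-neighbours,
    then every vertex combines what it received with a constant base share.
    Assumes every vertex has at least one outgoing edge.
    """
    vertex_count = len(ranks)
    incoming = {vertex: 0.0 for vertex in ranks}
    for vertex, targets in out_edges.items():
        if not targets:
            continue  # dangling vertices are not handled in this sketch
        share = ranks[vertex] / len(targets)
        for target in targets:
            incoming[target] += share
    return {
        vertex: (1 - damping) / vertex_count + damping * incoming[vertex]
        for vertex in ranks
    }


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {vertex: 1 / 3 for vertex in out_edges}
new_ranks = pagerank_iteration(out_edges, ranks)
```

At a trillion edges the interesting part is not this arithmetic but spreading the "send my share to my neighbours" step across 200 machines, which is the job Giraph does.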

Stack ranking is evil: What killed 20% time [at Google]? Stack ranking. Google's perf management is basically an elaborate game where using 20% time is a losing move. In my time there, this has become markedly more the case. I have done many engineering/coding 20% projects and other non-engineering projects, with probably 20-40% producing "real" results (which over 7 years I think has been more than worth it for the company). But these projects are generally not rewarded. Part of the problem is that you actually need 40% time now at Google -- 20% to do stuff, then 20% to tell everyone what you did (sell it). Promotion optimizes for depth and not breadth. Breadth -- connecting disparate ideas -- is almost invariably what's needed for groundbreaking innovation.

You've got less than a second. Google on Making smartphone sites load fast: Server must render the response (< 200 ms), Number of redirects should be minimized, Number of roundtrips to first render should be minimized, Avoid external blocking JavaScript and CSS in above-the-fold content, Reserve time for browser layout and rendering (200 ms), Optimize JavaScript execution and rendering time.

Good explanation by Olivier Pomel of how to use StatsD: Beyond the technical problem it solves – getting data from point A to point B efficiently, StatsD’s biggest contributions are organizational in nature. It allows for a culture where Developers don’t have to ask anyone’s permission to instrument their application.
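Part of why no permission is needed is that emitting a metric is tiny. A minimal sketch, assuming the plain-text StatsD line format (`name:value|type`) sent over UDP to the default port 8125; the helper functions and metric names are hypothetical:

```python
import socket


def format_metric(name, value, metric_type):
    """Encode one metric in the StatsD line format: name:value|type."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget one UDP datagram; no reply is expected or awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, for illustration only.
counter = format_metric("checkout.completed", 1, "c")   # increment a counter
timer = format_metric("checkout.latency", 320, "ms")    # record a timing

# To actually emit them, assuming a StatsD daemon is listening locally:
# send_metric(counter)
# send_metric(timer)
```

Because UDP is fire-and-forget, an instrumented code path does not block or fail when the collector is down, which is much of the appeal.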

AWS EBS latency and IOPS: The surprising truth: Even if your EC2 instances were using dedicated volumes (known as “provisioned IOPS volumes” in AWS parlance), the physical disks behind EBS may still be shared with other AWS customers. Their workloads may consume a great share of disk bandwidth when you need it most.

Videos from the Backbone Conf are now available.

Power-law distributions are everywhere. Keith Rabois of Khosla Ventures: The tech sector as a whole has created more than a trillion dollars in value over the past decade. Yet that value creation is incredibly concentrated. Nearly two-thirds of the increase reaped by investors and employees comes from Apple and Google alone, with the likes of Amazon, Facebook, LinkedIn, eBay, Yandex and Baidu rounding out the list.

Awesome Google Groups thread on Coordinated Omission: the measurement error which is introduced by naively recording synchronous requests, sorting them and reporting the result as the percentile distribution of the request latency.
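A small simulation shows how large that error can get. This is a sketch under stated assumptions, not anything from the thread: a hypothetical load generator wants to send one request every 10 ms but waits for each response first, and a single request stalls for a full second.

```python
import math


def simulate(service_times_ms, interval_ms):
    """Return (naive, corrected) latencies for a synchronous load generator."""
    naive, corrected = [], []
    clock = 0.0  # earliest time the generator is free to send again
    for i, service in enumerate(service_times_ms):
        intended_start = i * interval_ms
        actual_start = max(clock, intended_start)
        finish = actual_start + service
        naive.append(finish - actual_start)        # what naive recording sees
        corrected.append(finish - intended_start)  # includes waiting behind the stall
        clock = finish
    return naive, corrected


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, pct given as an integer."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 requests of 1 ms each, except request 50, which stalls for 1000 ms.
service_times = [1] * 100
service_times[50] = 1000
naive, corrected = simulate(service_times, interval_ms=10)

naive_p99 = percentile(naive, 99)
corrected_p99 = percentile(corrected, 99)
```

Here the naive 99th percentile comes out at 1 ms because the requests that should have been sent during the stall were never measured as slow, while measuring from the intended send time gives 991 ms.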

SQL Case Study: Removing bias from customer analysis. A great example of how to do complicated yet practical stuff in SQL.

Scott Hanselman reveals Penny Pinching in the Cloud: When do Azure Websites make sense? Agree with one of the comments, which says you'll learn more from reading this article than from reading the official docs.

Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads: One of the lessons I learned early is that scaling a machine learning system is a different undertaking than scaling a database or optimizing the experiences of concurrent users. Thus most of the scalability advice on the web doesn’t apply. This is because the scarce resources in machine learning systems aren’t the I/O devices, but the compute devices: CPU and GPU.

Mio: A High-Performance Multicore IO Manager for GHC: We show that with Mio, McNettle (an SDN controller written in Haskell) can scale effectively to 40+ cores, reach a throughput of over 20 million new requests per second on a single machine, and hence become the fastest of all existing SDN controllers.

On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus: The main findings from this study are that the end-to-end control of the EIB which is influenced by the software running on the Cell has inherent scaling problems and serves as the main limiter to overall network performance. Thus, end-to-end effects must not be overlooked when designing efficient networks on chip.

How To Build a User-Level CPU Profiler: When a program is running, writing to a control file in the /proc file system enables profiling. At that point, the operating system allocates an array of counts with one count for every 8-byte section of the program code, and attaches it to the program. Then, each time a timer interrupt happens, the handler uses the program counter—the address of the instruction in the code that the program is currently executing—divided by 8 as an index into that array and increments that entry. Later, another program (on Plan 9, called tprof) can be run to read the profile from the kernel and determine the function and specific line in the original source code corresponding to each counter and summarize the results.
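The counting scheme is simple enough to sketch in a few lines. This is a toy model, not the Plan 9 kernel code: the sampled program counters and the symbol table are made up, the text is assumed to start at offset 0, and the summary stops at function level rather than source lines.

```python
import bisect

BUCKET_BYTES = 8  # one counter per 8-byte section of program code


def new_profile(code_size):
    """Allocate the array of counts, rounding up to cover a partial last section."""
    return [0] * ((code_size + BUCKET_BYTES - 1) // BUCKET_BYTES)


def record_sample(counts, pc, text_base=0):
    """What the timer interrupt handler does: bump the counter for this PC."""
    index = (pc - text_base) // BUCKET_BYTES
    if 0 <= index < len(counts):
        counts[index] += 1


def summarize(counts, symbols):
    """Attribute each counter to the function that starts at or before it.

    symbols is a list of (start_offset, name) pairs sorted by start_offset.
    """
    starts = [start for start, _ in symbols]
    totals = {}
    for index, count in enumerate(counts):
        if count == 0:
            continue
        pos = bisect.bisect_right(starts, index * BUCKET_BYTES) - 1
        if pos >= 0:
            name = symbols[pos][1]
            totals[name] = totals.get(name, 0) + count
    return totals


# Hypothetical 256-byte program with three functions and seven timer samples.
symbols = [(0, "main"), (64, "parse"), (160, "render")]
counts = new_profile(256)
for pc in [4, 70, 72, 79, 165, 200, 68]:
    record_sample(counts, pc)
totals = summarize(counts, symbols)
```

The per-sample work is a divide and an increment, which is what makes it cheap enough to do from an interrupt handler; all the symbol lookup happens afterwards in a separate pass.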