Stuff The Internet Says On Scalability For March 23, 2012

High Scalability

23 Mar 2012 — 4 min read

Plop, Plop, Fizz, Fizz, Oh, What a HighScalability it is:

$1.5 billion: The cost of cutting London-Tokyo latency by 60ms; 9 days: It took AOL 9 years to hit 1 million users. Facebook 9 months. Draw Something 9 days; ~362 sq ft solar array: powers 1 sq ft of data center.
Is Amazon trying to marginalize OpenStack by partnering with Eucalyptus?
As the DevOps turns. You won't see this on TMZ. Adrian Cockcroft: There is no central control, the teams do it for themselves in the cloud; John Allspaw: s/NoOps/OpsDoneMaturelyButStillOps/g; Edward Capriolo: Trust developers not. Good thing is we all agree DevOps is necessary, the differences are in the how and whom.
In a word (or two), Wordnik has gone cloud. Gone is their big iron, in are envious EC2 instances. Driving the move was HA in multi-datacenters, elasticity for traffic bursts, and incremental cluster upgrades. There has been almost no reduction in performance.
Inside Zynga's Big Move To Private Cloud. Zynga does the same work with one-third the number of servers they used on EC2. Another win is that their servers are optimized for different roles: database, web server, game logic. Their servers can also be located closer to Facebook to reduce latency. But they'll still be in Amazon too: We own the base, rent the spike. We want a hybrid operation. We love knowing that shock absorber is there.
People Make Poor Monitors for Computers. Ashwin Parameswaran makes a fascinating observation about how we can expect more Black Swans as we increase automation because humans are incapable of monitoring these complex beasts. Numerous examples are shown. He concludes with: Although it seems logical that the same process of increased productivity that has occurred during the modern ‘Control Revolution’ will continue during the creation of the “vast, automatic and invisible” ‘second economy’, the incompatibility of human cognition with near-fully automated systems suggests that it may only do so by taking on an increased risk of rare but catastrophic failure.
Fun "discussion" on Why was Tanenbaum wrong in the Tanenbaum-Torvalds debates? Also on reddit. This comment sounds about right: When predicting the future, favor entropy and luck over innovation and ideals.
A cartoon on the true meaning of the cloud.
Getting Real About Distributed System Reliability. Jay Kreps calling BS on the supposed higher reliability of systems like Cassandra and Hadoop over traditional alternatives: the problem is the assumption that failures are independent. The actual reliability of your system depends largely on how bug free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. Part of the difficulty is that distributed system software is actually quite complicated in comparison to single-server code. I think we should insist on a little more rigor and empiricism in this area.
Cassandra Indexing: The good, the bad and the ugly. Brian O'Neill with a practical explanation of partitions and secondary indexes. He's open sourced a generic server-side trigger mechanism that indexes data as it's written to Cassandra.
Scaling in the Linux Networking Stack: You've worked on your inner threads, now it's time to work on your network. This document describes a set of complementary techniques in the Linux networking stack to increase parallelism and improve performance for multi-processor systems. It covers Receive Side Scaling (RSS), Receive Packet Steering (RPS), Receive Flow Steering (RFS), and Transmit Packet Steering (XPS).
Why upgrading your Linux Kernel will make your customers much happier. After you've tuned your network, Sam Saffron advises it's time to blunt TCP slow start by raising the initial congestion window to 10 segments, potentially halving your page download times. Great explanation of the whole flow.
10GigE delivers mobile backhaul scalability. If you've ever wondered why your cell network is so slow, you might find the answer in this article on cell tower bandwidth requirements: Consider an example network where 16 towers are passed by a fiber route with a monthly fiber lease cost of $3,000 per pair. A 10GigE solution is more cost-effective than multiple 1GigE rings on separate fiber pairs once the average bandwidth per tower exceeds 125 Mb to 150 Mb.
Implications of Non Volatile Memory on Software Architectures: Flash based storage is virtually addressed, embrace it!; NAND-flash is a cost effective way to build large memory systems, but software work is required to reap the benefits.
Storm - Distributed and fault-tolerant realtime computation. Guaranteed data processing, Horizontal scaling, Fault-tolerance, No intermediate message brokers!, Higher level abstraction than message passing, “Just works”
SSH Tunneling Explained. A great explanation of what's under the hood by Buddhika Chamith.
Has anyone done any comparative study on syslog-ng, rsyslog, Scribe, and Flume in regards to throughput? Looks at some of the tradeoffs. "Flume handles terabytes of data ingestion per day although we haven't explicitly focused on improving performance." "I've used scribe for almost-realtime applications (latency on the order of seconds) with on the order of 10MB/s+ throughput."
Query Rewriting in Search Engines. Hugh E. Williams with a spectacular explanation of what happens under the hood with query rewrites on searches: query rewriting has the potential to deliver as much improvement to search as core ranking.
Cloud server IO performance comparisons. CloudHarmony ran a benchmark and found Storm Cloud servers rated 361 IOPS, Amazon 212 IOPS, GoGrid 211 IOPS, Joyent 181 IOPS, Rackspace 92 IOPS. Discussion on Hacker News.
Network, Interrupted. Derick Winkworth asks: If you can buy a 64-core server with 768GB of RAM and 8TB of storage from HP for a mere $57k, do you really need appliances from Cisco and Juniper anymore? The traditional answer is solutions built around ASICs have higher density, lower power, and more rigorous environmental hardening. Is that still true?
Leverage Splunk for MapReduce & Big Data Analysis. Living up to their name, Socialize shares how they provide an intelligent dashboard using Splunk. They like how easy it is to analyze a billion events using Splunk's query language.
If you are into DevOps and you aren't listening to DevOps Cafe, then automate it to make sure you do.
What Larry Page really needs to do to return Google to its startup roots. It was impossible to miss this article by Steve Lacy, but just in case you did, it's an epic post on not just Google really, but what happens when there's a hyper focus on fiduciary responsibility.
Quick links: Greg Linden with what has caught his attention lately.
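
A footnote on the TCP slow start item above: here's a rough back-of-the-envelope sketch of why a bigger starting window can matter. It is not taken from Sam Saffron's post. It assumes a toy model where the congestion window doubles every round trip, nothing is lost, the MSS is 1460 bytes, and the handshake and receive window are ignored. The function name round_trips and the older starting window of 3 segments are assumptions for illustration only.

```python
def round_trips(payload_bytes, initcwnd, mss=1460):
    """Round trips to deliver a payload under an idealized slow start.

    Toy model (assumption): the window starts at `initcwnd` segments and
    doubles every round trip, with no loss and no receive-window limit.
    """
    segments = -(-payload_bytes // mss)  # ceiling division
    cwnd = initcwnd
    sent = 0
    rtts = 0
    while sent < segments:
        sent += cwnd   # one window's worth of segments per round trip
        cwnd *= 2      # slow start doubles the window each round trip
        rtts += 1
    return rtts


if __name__ == "__main__":
    for size in (14600, 65536):
        print(size, "bytes:",
              round_trips(size, initcwnd=3), "RTTs at initcwnd=3,",
              round_trips(size, initcwnd=10), "RTTs at initcwnd=10")
```

In this model a 14,600-byte response fits in a single round trip with a window of 10 but needs three with a window of 3. The gap narrows as responses get bigger, so the win is mostly for small pages on high-latency links.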
