Stuff The Internet Says On Scalability For September 23, 2011
I'd walk a mile for HighScalability:
- 1/12th of the world population on Facebook in one day; 1.8 zettabytes of data in 2011; 1 billion Foursquare check-ins; 2 million on Spotify; 1 million on GitHub; a $1,279-per-hour, 30,000-core cluster built on EC2; patent trolls cost half a trillion dollars; 235 terabytes of data collected by the U.S. Library of Congress in April.
- Potent quotables:
- @jstogdill : Corporations over protect low value info assets (which screws up collaboration) and under protects high value assets. #strataconf
- @sbtourist : I think BigMemory-like approaches based on large put-and-forget memory cans, are rarely a solution to performance/scalability problems.
- 1 Million TCP Connections. Remember when 10K was a real limit and you had to build out boxes just to handle the load? Amazing. We don't know how much processing can be attached to these connections, how much memory the apps use, or what the response latency is to requests over these connections, but still, that's cool. Good discussion on Hacker News. Also, A Million-user Comet Application with Mochiweb, Part 3, Optimising TCP/IP connectivity: An exploratory study in network-intensive Erlang systems, Linux Kernel Tuning for C500k.
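To make that concrete, here is a minimal, purely illustrative sketch (not taken from any of the linked posts) of the single-threaded epoll event loop pattern that makes huge connection counts possible. Kernel limits such as fs.file-max and the per-process file descriptor limit still have to be raised separately to get anywhere near C500k.

```python
# Illustrative sketch: one thread, one epoll loop, many idle TCP connections.
import socket
import select

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen(1024)
listener.setblocking(False)

epoll = select.epoll()
epoll.register(listener.fileno(), select.EPOLLIN)
connections = {}  # fd -> socket

while True:
    for fd, event in epoll.poll(1):
        if fd == listener.fileno():
            # new connection: register it, but spawn no thread for it
            conn, _ = listener.accept()
            conn.setblocking(False)
            epoll.register(conn.fileno(), select.EPOLLIN)
            connections[conn.fileno()] = conn
        elif event & select.EPOLLIN:
            data = connections[fd].recv(4096)
            if data:
                connections[fd].send(data)  # echo the bytes back
            else:
                # peer closed: clean up the descriptor
                epoll.unregister(fd)
                connections[fd].close()
                del connections[fd]
```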
- Yahoo! News Activity Backend Architecture. Yahoo! shows how they are integrating with Facebook to power deeper personalization. Components include: Mixer, Jetty, Task Execution Engine, Facebook, Memcache, and Sherpa. Mixer is a fully asynchronous service that requires significantly fewer threads than thread-per-connection or thread-per-request architectures; it uses an aggressive dynamic caching scheme to reduce load on the data store while balancing response time against data freshness.
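As a rough illustration of that freshness-versus-load trade-off, here's a toy read-through cache with a TTL; the class, names, and 30-second TTL are assumptions for the sketch, not Yahoo's actual design.

```python
# Illustrative only: serve from cache while fresh, hit the data store when stale.
import time

class TTLCache:
    def __init__(self, ttl_seconds, fetch):
        self.ttl = ttl_seconds
        self.fetch = fetch      # function that hits the backing data store
        self.store = {}         # key -> (value, expires_at)

    def get(self, key):
        value, expires_at = self.store.get(key, (None, 0))
        if time.time() < expires_at:
            return value        # fresh enough: no load on the data store
        value = self.fetch(key)  # stale or missing: refresh from the store
        self.store[key] = (value, time.time() + self.ttl)
        return value

# Hypothetical usage: cache a user's activity feed for 30 seconds.
cache = TTLCache(ttl_seconds=30, fetch=lambda user_id: {"user": user_id, "items": []})
print(cache.get("alice"))
```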
- Lots more good stuff on Storm, Twitter's complex event processing system. On GitHub. Mailing list. Hacker News. Documentation in a wiki. One-click deploy for Storm on EC2. Starter projects. Slides from the launch presentation. On Twitter Engineering. Storm does not have hundreds of jars of dependencies, nor does it require much memory to run. In fact, it's quite straightforward to spin up a Storm cluster, and Storm has a local mode where you can develop and test topologies on your local machine completely in-process.
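Storm itself is Java/Clojure; purely to show the spout-to-bolt topology idea and why in-process local-mode testing is convenient, here is a toy pipeline in Python. This is not Storm's API, just the shape of the data flow.

```python
# Toy topology: a spout emits tuples, a bolt consumes them and emits new ones.
def word_spout():
    # a spout is a source of an (in principle endless) stream of tuples
    for word in ["storm", "twitter", "storm", "stream"]:
        yield word

def count_bolt(stream):
    # a bolt transforms the stream; here, running counts per word
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# "local mode": run the whole topology in one process for development/testing
for word, count in count_bolt(word_spout()):
    print(word, count)
```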
- StrataConf slide decks. Stephen Wolfram with a fascinating talk on how they Compute the World. Hard AI is sidestepped with pure computation. Humans format the data, and the equations blast through to answers for questions nobody has ever asked before.
- Now that the public key system is toast, can Convergence, a Firefox addon, fix it? Instead of requiring browser users to trust an anointed group of certificate authorities, it gives users the ability to pick a group they trust (e.g., the EFF, Google, their company, their university, their group of friends, etc.) and trust no one else. Based on the Perspectives project. Steve Gibson gives a great explanation on his Security Now show on TWiT. It's a beautiful architecture. Here's a video. Steve seemed to think this project had no chance. Why?
- How GitHub Uses GitHub to Build GitHub. Build features fast. Ship them. That's what we try to do at GitHub. Our process is the anti-process: what's the minimum overhead we can put up with to keep our code quality high, all while building features *as quickly as possible*?
- A Graph-Based Movie Recommender Engine. Marko Rodriguez creates a sweet example of how to build a graph-based movie recommender engine using the publicly available MovieLens dataset, the graph database Neo4j, and the graph traversal language Gremlin. It shows how to compute answers to questions like "How many of Toy Story's highly co-rated movies are unique?" It's a little strange, but makes sense after a while.
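Here's a toy version of the co-rating idea in plain Python dicts rather than Neo4j/Gremlin; the ratings are invented, not MovieLens data, so treat it as a sketch of the traversal logic only.

```python
# Sketch: movies highly co-rated with a given movie by the same users.
from collections import Counter

ratings = {  # user -> {movie: stars} (made-up data)
    "u1": {"Toy Story": 5, "Aladdin": 5, "Heat": 2},
    "u2": {"Toy Story": 5, "Aladdin": 4, "The Lion King": 5},
    "u3": {"Toy Story": 4, "Heat": 5},
}

def co_rated(movie, min_stars=4):
    # walk user -> movie edges: find users who rated `movie` highly,
    # then count the other movies those users also rated highly
    recs = Counter()
    for user_ratings in ratings.values():
        if user_ratings.get(movie, 0) >= min_stars:
            for other, stars in user_ratings.items():
                if other != movie and stars >= min_stars:
                    recs[other] += 1
    return recs

print(co_rated("Toy Story"))  # Counter({'Aladdin': 2, 'The Lion King': 1, 'Heat': 1})
```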
- Sherpa is Yahoo's massively scalable and elastic storage for structured data. Some fun facts: 500+ Sherpa tables; 50,000 Sherpa tablets (shards); one app services 75,000 requests per second; thousands of Sherpa servers in more than a dozen data centers around the world; automated load-balancing capabilities enable it to run on server farms composed of heterogeneous servers; SSDs and HDDs are supported. And lots more.
- If you are a startup pondering which service stack to use, PipelineDeals has shared what they believe in enough to pay for: DataDog, Hipchat, Pingdom, AWS, DNSMadeEasy, FogBugz, Workflowy, Google Apps for Business (Email, Documents, Calendar, Chat), Campaign Monitor, Google+, Opscode Platform, MailGun, NewRelic, Rightscale, Squarespace, Viddler.
- Real intelligence not doing it for you? Google Tech Talks has some new videos from The Fourth Conference on Artificial General Intelligence. It's hard to be smart.
- Neo4j's Emil Eifrem predicts NoSQL will cross the chasm from a niche/web product to become broadly adopted by the enterprise. This adoption is driven by: Support for transactions; Support for durability; Support for Java. And they have $10+ million more in funding to prove it. Good luck!
- Jared Carrol thinks Testing Doesn't Scale. But failure does? Develop small testable components. Make tests run in parallel on a cluster. Run integration tests on check-in. Run system tests on builds. Run acceptance tests on deployment. It all works.
- Some Best Practices for Apps For Domains with GAE. Brandon Wirtz with much needed help navigating the choppy waters of Apps for Domains. Also, a discussion of why New [GAE] Pricing ROCKS! for Small apps. Also also, to see how the new pricing scheme is getting programmers to think about their architectures, read Entity Sizing / Grouping wrt New Pricing and Question about datastore cost optimization by reusing old entities for new data. We always knew this, but size matters.
- Publications by Microsofties. Publications by Googlers.
- How to get Speed out of Amazon's EBS volumes: Software RAID it! Dathan Vance Pattishall finds that with eight 125 GB EBS volumes in a RAID 10 array with a 256 KB chunk size, he can get 22-25 MB per second of random I/O from 20 threads.
- Velocity 2011, recent trends on improving Web performance. Jae-wan Jang with a great writeup of technologies he saw at Velocity: WebPagetest; YSlow; PageSpeed; Chrome Developer Tools; HTTP Archive; Weinre. From his perspective, the issues of web speed and performance have just begun to evolve.
- Spot Instances, Big Clusters, & the Cloud at Work. James Hamilton on why selling spare infrastructure capacity on the spot market makes so much economic sense in the cloud. He doesn't go into it here, but James has previously noted that as long as the spot price is higher than the marginal cost of power, selling spot cycles is profitable, which is an interesting way of looking at things.
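A back-of-the-envelope version of that argument, with made-up numbers (the real marginal costs and spot prices are obviously different):

```python
# If a spot bid covers the marginal cost of powering an otherwise idle server,
# running it earns money, regardless of the sunk capital cost of the hardware.
marginal_power_cost_per_hour = 0.03   # assumed $/hour to power and cool one server
spot_price_per_hour = 0.05            # assumed current spot price for that capacity

profit_per_hour = spot_price_per_hour - marginal_power_cost_per_hour
print("worth selling" if profit_per_hour > 0 else "leave it idle", profit_per_hour)
```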
- Curt Monash bravely disentangles all the different Salesforce properties: salesforce.com, force.com, database.com, data.com, heroku.com.
- Implementing a queue in Cassandra: when the queue size reaches 50K there's high CPU usage and constant GC. Also, Cassandra Write Performance – A quick look inside by Michael Kopp. Cool deep dive into Cassandra. NoSQL and BigData solutions are very, very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O, and most importantly how they are used! Although Cassandra is lightning fast and mostly I/O bound, it's still Java and you have the usual problems – e.g., GC needs to be watched.
- Free book on Mining Massive Datasets by Anand Rajaraman (@anand_raj) and Jeff Ullman. Covers Data Mining; Large-Scale File Systems and Map-Reduce; Finding Similar Items; Mining Data Streams; Link Analysis; Frequent Itemsets; Clustering; Advertising on the Web; Recommendation Systems.
- Pallet - Automates controlling and provisioning cloud server instances.
- Achieving Fast Joins in Distributed Data-Stores through the application of Snowflake Schemas and the Connected-Replication Pattern. Ben Stopford describes a novel mechanism for storing data across a distributed architecture so that joins can be performed efficiently without key-shipping. Snowflake schemas are used to define what gets replicated and what gets partitioned.
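A rough sketch of the idea (my own toy code, not Ben Stopford's): partition the big fact data across nodes and replicate the small dimension data to every node, so each node can join its slice locally and no keys ever cross the network.

```python
# Small dimension table: replicated to every node.
dimensions = {"products": {1: "book", 2: "lamp"}}

# Large fact data (product_id, price): partitioned across nodes.
node_a_facts = [(1, 10.0), (2, 25.0)]
node_b_facts = [(1, 12.5)]

def local_join(facts, products):
    # each node joins its own facts against its local copy of the dimension
    return [(products[pid], price) for pid, price in facts]

results = local_join(node_a_facts, dimensions["products"]) + \
          local_join(node_b_facts, dimensions["products"])
print(results)  # [('book', 10.0), ('lamp', 25.0), ('book', 12.5)]
```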
- SPDY: What I Like About You. Patrick McManus reflects on his experience implementing SPDY for Firefox, concluding: SPDY is good for the Internet beyond faster page load times. Compared to HTTP, it is more scalable, plays nicer with other Internet traffic, and brings web security forward. Reasons: infinite parallelism with shared congestion control; SPDY is over SSL every time; header compression.
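To see why header compression pays off, here's a quick demo using plain zlib (SPDY's real scheme uses a dictionary-primed zlib stream, so this is only an approximation): HTTP headers repeat almost verbatim across requests on one connection, so a shared compression stream shrinks everything after the first request dramatically.

```python
# Compress three near-identical request headers through one shared zlib stream.
import zlib

headers = (b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"
           b"User-Agent: Mozilla/5.0\r\nAccept: text/html\r\nCookie: session=abc\r\n\r\n")
compressor = zlib.compressobj()

sizes = []
for _ in range(3):  # three requests on the same connection
    chunk = compressor.compress(headers) + compressor.flush(zlib.Z_SYNC_FLUSH)
    sizes.append(len(chunk))

# Later requests compress far better because the stream already saw the same bytes.
print(len(headers), sizes)
```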
- BigData hits the Big Screen: The Lessons of Moneyball for Big Data Analysis. Rich Miller ties together the seemingly different worlds of sports and BigData in a thoughtful and well written article.
- LinkedIn delivers: 3 Big Data Tech Talks You Can’t Miss and more of their tech talks on YouTube.
- Geeking with Greg has some quick links worth quicking.