
Stuff The Internet Says On Scalability For June 24, 2011

High Scalability

24 Jun 2011 — 5 min read

Submitted for your scaling pleasure:

Achievements:
- Watson uses 10,000's of watts, the computer between the ears uses 20. With only 200 million pages and 2TB of data, Watson is BigInsights, not BigData.
- That Google is pretty big: 1 billion unique monthly visitors
- tweetimages: We peaked at 22m avatars yesterday. Bandwidth peaked at 9GB of @twitter avatars in a single hour.
- Foursquare Surpasses 10 Million Users
- Reddit Hits 1.2B Monthly Pageviews, More Than Doubles Its Engineering Staff
- Twitter: 185 million tweets are posted daily; 1.6 billion search queries daily; indexing latency is less than 10 seconds.
Quotable quotes:
- skr: OH: "people wait their whole lives for a situation where they can use bloom filters"
- joeweinman: @Werner at #structureconf : as of Nov 10, 2010, all Amazon.com traffic was served from AWS. <-- The child surpasses the parent.
- bbatsov: A compiled language does not scalability make -- Yoda
- swredman: If i read the marketing buzzwords 'scalability' or 'leverage your data' one more time, gonna lose my sh*t.
- ipeirotis: Most tasks are too arbitrary to even decompose in atomic steps, while handling quality, cost, scalability and interactions.
- ArmonDadgar: Some people. You had to use NoSQL for 2 TPS and a DB that can fit on my iPhone? This is some serious #BigData.
- aphethean: For a guy who started Dev life programming applications on hierarchical databases #nosql looks remarkably familiar
- "Software is an entropic system whose arrow of time flows in the direction of failure, aided and abetted by human bullsh*t"
- bstg: Wonder if the best way to improve insights from #bigdata isn't better analytics, but a fundamental change in the way we capture it? #in
Apple has made their WWDC 2011 Videos available. Apple is normally as closed as a counter-insurgency cell, but their WWDC videos are always top notch.
James Hamilton hits on another big change in the database landscape, moving away from crazy enterprise pricing schemes to more sustainable and rational models.
Great discussion on Reddit of Eben Moglen's fascinating The alternate net we need, and how we can build it ourselves. Our net has been turned against us. How do we get it back? Without anonymity the human race will not be human anymore. We need smart routers that work for us.
The NoSQL Fad says Alex Popescu won't be countered by a relational database with relaxed semantics, as that's just recreating NoSQL in the first place.
Spark, in-memory cluster computing that aims to make data analytics fast — both fast to run and fast to write.
In The State of Management Scalability at Stack Exchange Kyle Brandt talks about scaling their ability to manage their Linux and Windows environment through automation. The idea is that if you have to do something more than once on multiple servers, automate it. The cool part is they have a chart of which parts of their current process don't meet this goal, and a plan for how to get there.
PortLand: Scaling Data Center Networks to 100,000 Ports and Beyond, a great discussion of how we need to be able to scale the network layer as easily as we currently scale the CPU and storage layers. We are being held back by IO.
Disruptor - a Concurrent Programming Framework, is a general-purpose mechanism that solves a complex problem in concurrent programming in a way that maximizes performance.
Content Delivery Summit Videos Now Available For Viewing. Secrets behind the real heart of the web, CDNs.
Velocity 2011 Speaker Slides & Video are available.
Tired of the NoSQL love fest? Here you go: Scaling with MongoDB (or how Urban Airship abandoned it for PostgreSQL). They found MongoDB was fast until data and indexes no longer fit in memory, and that Auto-sharding and Replica Sets were too scary to trust, so they decided to move their data to a manually partitioned PostgreSQL. To learn how Foursquare uses Mongo take a look at Practical Data Storage: MongoDB at foursquare.
Nice summary (in Japanese) of How do I improve the scalability of database? Hard to get over the auto translation capability in Chrome. Ain't the web grand?
When Watson needs to be fed it dines at chez Hadoop. A Hadoop backend is used to crunch through the documents to prep for the interactive Jeopardy matches. There is no other system flexible enough to allow for the flexible knowledge extraction that we need.
More videos for you. NDC 2011 Video Torrent, a torrent of all the NDC 2011 videos (Norwegian Developers Conference) is now available. If that's not enough here are videos from Jfokus 2011. Emil Eifrem talks NoSQL and there are talks on GWT, Scala, Java EE 6, and TDD.
Performance is a Feature says Jeff Atwood. To be fast: Follow Yahoo's Guidelines, Optimize for Anonymous Users, Make Performance a Point of Public Pride.
Intel takes wraps off 50-core supercomputing coprocessor plans reports Jon Stokes. It's the age-old general-purpose (slower, easier to use) vs. specialized (faster, harder to use) tradeoff, and Intel is betting that since Tesla has so far been the only real option there are plenty of potential users out there who are in the market for something less specialized.
Lift - a web framework built on Scala to create concise, secure, scalable, highly interactive web applications that provide a full set of layered abstractions on top of HTTP and HTML.
Greg Weber talks High Performance Ruby Part 3: non-blocking IO and web application scalability. What I am hoping the Ruby community can achieve is the same ease of programming of Rails, but with easier deployment and much better scalability. But let's step back and look at the situation we are in.
Windows Azure Storage Abstractions and their Scalability Targets. A single queue is targeted to be able to process up to 500 messages per second. The target throughput of a single blob is up to 60 MBytes/sec. The throughput target for a single partition is up to 500 entities per second.
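To make those targets concrete, here is a rough sizing sketch. It takes the per-queue and per-partition numbers quoted above and works out how many queues or partitions a workload would need, assuming the load spreads evenly across them (real traffic rarely does). The workload figures and the `units_needed` helper are made up for illustration and are not part of any Azure API.

```python
import math

# Targets quoted in the item above (per single queue / partition).
QUEUE_MSGS_PER_SEC = 500
PARTITION_ENTITIES_PER_SEC = 500


def units_needed(required_rate: float, per_unit_target: float) -> int:
    """Smallest number of units whose combined target covers required_rate.

    Assumes load is spread evenly across units, which is optimistic.
    """
    if required_rate < 0 or per_unit_target <= 0:
        raise ValueError("rate must be non-negative and target positive")
    return math.ceil(required_rate / per_unit_target)


# Hypothetical workload: 1,800 messages/sec and 2,600 entity operations/sec.
queues = units_needed(1800, QUEUE_MSGS_PER_SEC)
partitions = units_needed(2600, PARTITION_ENTITIES_PER_SEC)
print(f"queues: {queues}, partitions: {partitions}")
```

The takeaway is that anything beyond a few hundred operations per second has to be spread over multiple queues or partition keys up front.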
Jeremiah Peschka with a good overview of Resolving Conflicts in the Database. Some options: Manual intervention, Logging conflicts, Master write server, Last write wins, Write partitioning.
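Two of those options, last write wins and logging conflicts, are easy to sketch together. This is a minimal illustration and not code from Peschka's article: the `Version` record, the `node_id` tie-breaker, and the `resolve` function are hypothetical names, and it assumes writer timestamps are comparable across nodes, which clock skew can break in practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # writer's clock in seconds (assumed comparable across nodes)
    node_id: str      # hypothetical tie-breaker when timestamps are equal


def resolve(a: Version, b: Version, conflict_log: List[Version]) -> Version:
    """Last write wins: keep the later write, log the one that was discarded."""
    winner = max(a, b, key=lambda v: (v.timestamp, v.node_id))
    loser = b if winner is a else a
    if loser.value != winner.value:
        # Keep the overwritten value around so a human can review it later.
        conflict_log.append(loser)
    return winner


old = Version("alice@old.example", 100.0, "node-a")
new = Version("alice@new.example", 105.0, "node-b")
log: List[Version] = []
winner = resolve(old, new, log)
```

Last write wins is simple but silently discards data, which is why pairing it with a conflict log is a common compromise.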
An unspoken law of the Internet is that all of Google's infrastructure must be recreated outside Google in open source form. GoldenOrb is doing their part by creating an open source version of Pregel, used for massive-scale graph processing. If you are unsure what Pregel is or how to use it, Michael Nielsen has a very good article on Pregel that's worth a look.
Riak Pipe details shared. Pipe allows you to specify work in the form of a chain of function pairs. It will be used to supercharge Riak's MapReduce feature.
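As a rough picture of what a "chain of function pairs" could look like, here is a toy, single-process sketch. It assumes each pair consists of one function that picks which worker handles an input and one function that processes it; that split, and every name below, is an assumption for illustration and not Riak Pipe's actual API (which is Erlang and distributed).

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# One stage = (route, process). Both roles are hypothetical names:
#   route(item)   -> integer used to pick a worker
#   process(item) -> zero or more outputs passed to the next stage
Stage = Tuple[Callable[[Any], int], Callable[[Any], Iterable[Any]]]


def run_pipeline(stages: List[Stage], inputs: Iterable[Any], workers: int = 4) -> List[Any]:
    """Push inputs through each stage in order, grouping work by worker."""
    current = list(inputs)
    for route, process in stages:
        # Group items by the worker index the routing function selects.
        buckets: Dict[int, List[Any]] = {}
        for item in current:
            buckets.setdefault(route(item) % workers, []).append(item)
        # Each "worker" handles its bucket; outputs feed the next stage.
        current = [
            out
            for worker in sorted(buckets)
            for item in buckets[worker]
            for out in process(item)
        ]
    return current


stages: List[Stage] = [
    (lambda line: len(line), lambda line: line.split()),   # split lines into words
    (lambda word: len(word), lambda word: [word.upper()]),  # uppercase each word
]
result = run_pipeline(stages, ["map reduce", "riak pipe"])
```

Output order depends on how items are routed, so callers that care about ordering would need to sort or tag results themselves.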
Learn more about Parallel Algorithms from Guy Blelloch's 15-499: Parallel Algorithms course.
A diagram of Amazon's Multi-AZ setup.
