Stuff The Internet Says On Scalability For July 29, 2011

Submitted for your end of July scaling pleasure:

  • YouTube: 3 billion videos viewed a day; 48 hours of footage uploaded every minute. 64 core Tilera chip.
  • Google wants to be your CDN. They figure the only way to make the web faster...is to host it. Page Speed Service - Web Performance, Delivered. An eventually for-pay service that caches your website and distributes it around the world. No cost information. Your speed may vary. See the longish list of limitations.
  • Nobody said anything interesting on scalability this week! A disaster of non-quotable proportions. If I missed something, now is your chance.
  • Moving an Elephant: Large Scale Hadoop Data Migration at Facebook. Paul Yang describes the greatest westward expansion since the land bridge across the Bering Strait. It's a story of moving a 30PB Hadoop cluster from an overpopulated datacenter to the wide open spaces of a new continent. Unlike the early settlers, Facebook did not move the boxes over, since that would disrupt service; instead they mirrored the data to their new datacenter.
  • Video: PBS Professional Looks Out 20 Years on Scalability. What's really cool is managing work for a million constantly failing machines. Systems are now sized in megawatts, not petaflops. A large system is 5 megawatts. A power envelope is given and the goal is to stay within that power envelope, not keep all the cores going. Run the right work in the right way.
  • StorageMojo breaks down storage options for comparing disk storage hardware on a per-slot cost metric (PSC). PSC should track with the value of the stored data. Expect to see segments range from Bulk (the Backblaze segment) to Heavy Transactional (traditional big iron) with yet-to-be-named segments between.
  • Even handed: Comparing Mongo DB and Couch DB. CouchDB is MVCC based, and MongoDB is more of a traditional update-in-place store; Couch uses replication as a way to scale, Mongo uses replication as a way to gain reliability/failover; Couch uses a clever index building scheme to generate indexes which support particular queries; Mongo uses traditional dynamic queries; both MongoDB and CouchDB support concurrent modifications of single documents; CouchDB has a "crash-only" design, MongoDB offers durability in the MongoDB storage engine; Couch uses REST as its interface to the database, MongoDB relies on language-specific database drivers; Mongo is very oriented toward performance; nothing was given for CouchDB on this point.
  • YOSHINORI MATSUNOBU with an epic tutorial. Linux and H/W optimizations for MySQL. Incredibly thorough and detailed. And it ain't just for MySQL.
  • Solid State Silliness and Efficiency, Performance, and Locality by Jeff Darcy. Basically what it all comes down to is that you might not need all those IOPS for all of your data. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.
  • Notes from geekSessions – Network and Infrastructure Scalability. Markus Klems shares some highlights from geekSessions 2.2: If you want to build a hybrid cloud solution, better make sure that it integrates with EC2; OpenFlow; focus on impact duration, not incident duration, i.e. being able to fail over traffic from one DC to another within minutes, use DNS-based Global Server Load Balancing, degrade service gracefully; not everybody needs a cloud; next generation of monitoring tools.
  • A Big Data Inflection Point in Life Science Computing: Until quite recently, life sciences research would not typically have been described as 'data intensive', certainly not in comparison with other scientific disciplines, such as high energy physics or weather modeling. In the last few years, however, new data-intensive modalities such as spectrometry, next-gen sequencing, and digital microscopy have entered the mainstream, thus unleashing an unprecedented tsunami of unstructured data.
  • Scalability for Dummies - Part 1: Clones and Scalability for Dummies - Part 2: Database. Sebastian Kreutzberger with his answer on what it takes to make a web service massively scalable. This first post addresses horizontal scaling using load balancing and sessions. The second covers your options when scaling a database. Stick with MySQL or go NoSQL.
  • We may not know who your daddy is, but now we know who is your UnQL: Richard Hipp (SQLite) and Damien Katz (CouchDB) are proposing UnQL (Unstructured Query Language), an open query language for JSON, semi-structured, and document databases. It's very SQL-like, but being document oriented, not everything must be a relation; the data are documents. More here. Not sure if it's related to this UnQL.
  • Architecture of Tankster – Introduction to Game Play Part 1 and Scale Part 2 by Nathan Totten. A good look at using Azure to build and scale a game. A single Windows Azure storage account supports 100 TB, 5,000 entities/messages/blobs per second, 3 gigabits per second; an Azure queue can handle up to 500 messages per second; a single blob can handle up to 60 megabytes per second. To scale beyond this: use multiple Windows Azure storage accounts; split the queues out into n queues and distribute the messages across them; and manually federate the data across several SQL Azure databases.
  • Mat Keep reminds us MySQL Cluster is a viable database option with a nice pair of articles: Scaling Web Databases: Auto-Sharding with MySQL Cluster and Scaling Web Databases, Part 2: Adding Nodes, Evolving Schema with Zero Downtime. The first article is on how to scale writes through better auto-sharding. Lots of detail and a good explanation. The second article looks at scaling operational agility with live on-line operations. More on the MySQL Cluster Architecture.
  • IO IO IO. Dan Fruehauf with How To Improve Server Performance by IO Tuning – Part 1. Advocates a top down process: Characterize your IO; Tune your application; Tune the operating system; Choose the right filesystem; Disk configuration; Test; Monitor.
  • A Non-Foolish Consistency. Kyle Brandt takes us through the evolution of improving page response times. Your first badge is to minimize page load times. For a second badge, use a CDN to reduce load times. Master status is minimizing the variance of page response times, not just minimizing the average.
  • NoSQL standouts: New databases for new applications - Cassandra, CouchDB, MongoDB, Redis, Riak, Neo4J, and FlockDB reinvent the data store. Peter Wayner writes a good general survey of the different options.
  • NoSQL Databases: What, Why And When. Excellent presentation by Lorenzo Alberton. Here's a post introducing the ideas.
  • App Engine Fan with a well-done multi-part tutorial on Python for non-programmers. The goal was to help his friend get up and running as fast as possible; it might help your friends too.
  • Curt Monash always does an excellent job on the database front. Here's a look at McObject and eXtremeDB, an in-memory database system. And in MongoDB users and use cases we learn the largest MongoDB databases are 20-30 TB and 100 nodes.
  • Characterizing the Scalability of Erlang VM on Many-core Processors. Results show that the current version of Erlang VM achieves good scalability on the processor with most benchmarks used. The maximum speedup is from about 40 to 50 on 60 cores. Synchronization overhead caused by contention is a major bottleneck of the system.
  • The ScaleDB blog has a lot of good content, worth a look.
  • Google App Engine now has a community update summarizing key GAE developments.
  • A MongoDB user really likes RamSan. Using RamSan the system is fully optimized to its max whether you do 90/10 or 10/90 or 50/50 - you still get the full performance. And you can use a fiber switch and connect multiple nodes.
  • MongoDB vs Cassandra. David Mytton takes a look at Structure, Indexes, Deployment, Consistency/Replication, Provenance, Support, Ongoing Development, Documentation, Community, Drivers. The conclusion is a diplomatic non-conclusion.
  • Learn the JVM for Dummies with Charles Oliver Nutter. A very cool and detailed look at the assembly language of the JVM, JIT, how to emit and read byte codes, and more. returnvoid.
  • Time travel may be dead, but we'll always have parallel universes. Maybe time travel works in one of those?
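
A footnote to the Tankster item above, which suggests splitting one queue into n queues once a single queue's limit of roughly 500 messages per second is reached. The sketch below shows one way that arithmetic and routing might look. It is only an illustration under stated assumptions: the function names, the target rate of 1,800 messages per second, and the idea of routing by a game id are all hypothetical and not taken from the Tankster articles, and no Azure API is called.

```python
import math
import zlib

# Per-queue limit quoted in the Tankster item above.
QUEUE_MSGS_PER_SEC = 500


def queues_needed(target_msgs_per_sec, per_queue_limit=QUEUE_MSGS_PER_SEC):
    """Smallest number of queues whose combined limit covers the target rate."""
    return max(1, math.ceil(target_msgs_per_sec / per_queue_limit))


def pick_queue(partition_key, n_queues):
    """Map a key (hypothetical, e.g. a game id) to a queue index in [0, n_queues).

    crc32 is used because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(partition_key.encode("utf-8")) % n_queues


n = queues_needed(1800)          # 1800 / 500 = 3.6, rounded up to 4
print(n)                         # 4
print(pick_queue("game-42", n))  # some index between 0 and 3
```

Hashing on a key keeps all messages for one game on the same queue, but it only spreads load evenly when there are many distinct keys; a handful of very busy keys could still push a single queue past its limit.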