Stuff The Internet Says On Scalability For July 15, 2011

Submitted for your scaling pleasure:

  • That's a lot of data...CERN: ATLAS produces up to 320M bytes per second, followed by CMS with 220M Bps. Amazon Cloud Now Stores 339 Billion Objects. CERN also has an open source hardware effort.
  • Domas Mituzas on why Facebook may just outlast their MySQL heritage: I feel somewhat sad that I have to put this truism out here: disks are way more cost efficient, and if used properly can be used to facilitate way more long-term products, not just real time data. Think Wikipedia without history, think comments that disappear on old posts, together with old posts, think all 404s you hit on various articles you remember from the past and want to read. Building the web that lasts is completely different task from what academia people imagine building the web is. What happens in real world if one gets 2x efficiency gain? Twice more data can be stored, twice more data intensive products can be launched.
  • Quotes that are quoted because they are quotable:
    • @Werner - If you have never developed anything of that scale you cannot be taken serious if you call for the reengineering of facebook's data store
    • @Werner - Acaling data systems in real life has humbled me. I would not dare criticize an architecture that the holds social graphs of 750M and works
    • Dwight Merriman -  I'm not smart enough to do distributed joins that scale horizontally, widely, and are super fast. You have to choose something else. We have no choice but to not be relational.
    • @MichaelStal - My new idea: A confessional box for software engineers would be great but needs high scalability. 
    • @gandamanurung - Scalability solutions aren't magic. They involve partitioning, indexing and replication 
    • @HiveTheory - Today I fixed a scalability problem in my code by making it simpler. As @9len said, deleted code is tested code...
    • @DundeeGroup - Scalability does not mean skip steps. It means make them smaller.
    • @ChuckJolley  - 'Renegade farmer' focus of 'American Meat.' Gotta ask about scalability, though. Can Salatin's system feed the world?
  • APIv2: Woulda, coulda, shoulda. Foursquare with a detailed and honest look at their API design experiences. JSON: good; REST lite: good; Generic structures and indirection: good; CamelCase: meh; Images: meh; Versioning: jury is still out; Category representation: oops; Object consistency and level of detail: meh; Envelope: good; Grouping versus ranking: meh.
  • Speed matters: how Ethernet went from 3Mbps to 100Gbps... and beyond.  Iljitsch van Beijnum with a nice trip down ethernet cable lane. SFW pictorial on Tom's Hardware.
  • Karl Seguin has made his The Little MongoDB Book for free. Looks like a good place to start.
  • Philopator discusses your sharding options: while sharding is very powerful and scalable technique for dealing with large data volumes, it involves a lot of complexity and side-effects. A careful examination of alternatives is in order. Be it NewSQL, NoSQL, or plain OldSQL (archival, DB-level partitioning), it’s worthwhile to carefully analyze tradeoffs of each solution.
  • If you are thinking of using Scala this a interesting interview with Martin Odersky in Dr. Dobbs. Follow up discussion on Reddit. Is scala to not opinionated enough? Is it too complex? Is it too tied to the Java environment? Only The Developer knows for sure...
  • I thought there were 101 things architects should know, but apparently there are 97: 97 Things Every Software Architect Should Know - The Book. A good list, I'll say this again, but Fight Repetition is my favorite. 
  • The Three Monitoring Perspectives: Part 1, Ground Level. Kyle Brandt explains the 3 levels of StackExchange's monitoring infrastructure:  Micro: “Ground Level”, Meso: “Day to Day”Macro: “Seasonal”. “ Ground Level” or micro monitoring is high resolution monitoring. By this I mean that you take a lot of samples in short periods of time — generally every second or multiple times a second. These tools are often run from the machines themselves. They also return lots of information.
  • Search is always a problem. Cloudant has released an interesting new search product combining Apache Lucene and CouchDB. It's fault-tolerant, federated, full-text search and scalable. 
  • Peter Schuller on the value of using flashcache: Regardless of whether the cache is memory, ssd, a combination of both, or anything else, most workloads tend to be subject to diminishing returns. Doubling cache from 5 gb to 10 gb might get you from 10% to 50% cache hit ratio, but doubling again to 20 gb might get you to 60% and doubling to 40 gig to 65% (to use some completely arbitrary random numbers for demonstration purposes). The reason a cache can be more effective than the ratio of its sizes. the total data set, is that there is a hotspot/working set that is smaller than the total data set. If you have completely random access this won't be the case, and an cache of size n% of total size willgive you a n% cache hit ratio.
  • Thoonk! - A persistent (and fast!) system for push feeds, queues, and jobs, leveraging Redis.
  • A Cassandra block: Revealing discussion of reads on Cassandra. Strongly consider denormalizing so that you can read ranges from a single row instead of creating secondary indexes. Cassandra is designed to use DAS on commodity hardware, not use a SAN. Ed Anuff with an awesome article on Indexing in Cassandra: The basic property of Cassandra that a lot of new users just blow past is the fact that it's a column-oriented database.  They quickly see that they can create a column family (CF) as an analogue to a traditional database table and so create, for example, a CF called "Users" and the columns in each row in that CF are effectively used the same way that they'd use any database column, storing fields like "name", "location", etc.
  • We have another entrant in to the database market: MemSQL - a scalable in-memory database that is up to 30-times faster than relational databases on disk. Hacker News discussion, which answered my immediate question, it's 10 times faster than MySQL running out of a RAM disk. VoltDB has covered a lot of reasons why a program specialized for RAM can outperform a similar product running out of RAM that has not been speficically tuned for RAM.
  • Migrating to Google App Engine's new High Replication Store is not that easy. Good discussion on Google Groups.
  • Comparison of application virtual machines. A lot more choices than you might expect.
  • Greg Ferro on why Soft Switching Fails at Scale. The idea about software switching is that you can develop some software on in the hypervisor platform that performs all the frame forwarding. It fails: higher latency, higher load, higher complexity, and poorer security.
  • StackOverflow: Ruby on Rails scalability/performance? A reasoned and useful discussion of the topic.
  • NimbusDB with a review of GigaOM Structure 2011. Netflix intends to move to masterless database instances; delivering elastic cloud solutions cannot be done by simply repurposing traditional inelastic database technologies; SSD next big infrastructure revolution; Existing RDBMS just flat don’t scale. Big ramifications; if you define consistency in a distributed database as a requirement for a known, global state at all points in the system at all times then you have defined a synchronous distributed system. Obviously a synchronous distributed system is not going to scale; the big challenge in cloud databases is elasticity and multi-tenancy.