Stuff The Internet Says On Scalability For December 23, 2011

A merry HighScalability to all and to all a good night:

  • Santa: 3.7 million appointments; iPad2 == 1986 Cray 2 6 processor super computer; Watson: 200 million pages of natural language content
  • Funny: a cautionary tale about storage and backup. Where is my data? I’m kinda big deal after all! I should have listened to my postdoc, he can build cheaper storage than you can.
  • Nothing stirs up more energy than when someone says they are abandoning an old beloved framework for a newer sexier model. Feelings of betrayal and abandonment leak over everything, which is always a good draw for reality social networking. Here Paul Querna explains why in The Switch: Python to Node.js. And here we see the response on Hacker News. Python got the job done for Cloudkick, they were acquired, but they wanted something more going forward, a trophy wife if you will, after the first wife put them through law school. Good discussion all around. You may find something that helps in your own platform decision. Or you may just find it entertaining.
  • Wired with a nicely written origin story on Dropbox and how they will stomp iCloud into bits. Strategy: simplicity and ubiquity.
  • So you dumped your RDBMS and you're wondering how to aggregate with HBase? This thread has some options: do rollups and ditch the raw data, use the hbase-lattice project, archive the tables for a given year and start fresh the next, use a separate OLAP system, or keep live counts using atomic increments.
  • Oyster shows How to Build a 40TB File Server. Building your own 40TB box costs 1/10th of the $60K/year the same storage would cost on S3. That's a lot of Oysters. On Reddit. Cons: it's not geographically distributed, it uses unproven tech, lacks wide bandwidth to customers, lacks redundancy, and you have to build and manage it. Pros: S3 has gone down, and the S3 cost is an annual cost.
  • CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space). Riyad Kalla concludes: I have been reading up on log structured file systems, efficient data formats, database storage engines and copy-on-write semantics for a while now trying to get a feel for next-generation file storage optimized for flash-based technology... reading about the pros and cons of different approaches and seeing it all come together so smoothly in a single design like Couch's really deserves a hat-tip to the Couch team.
  • Write Scalability. Robert Haas chronicles the adventure of trying to get linear write speedups for PostgreSQL. A difficult task: I think I was lucky, when working on read scalability, to find that there was basically only one bottleneck. In the area of write scalability, there are three: ProcArrayLock, WALInsertLock, and CLogControlLock. All of these affect each other. Anything that reduces the pressure on one lock (and thereby speeds up the system) increases pressure on the other two (and thereby slows down the system).
  • Show and Tell: MongoDB at foursquare. Foursquare uses MongoDB for almost all of their storage needs. Cooper Bethea, Foursquare Site Reliability Engineer, gives the talk and does a really good job, fairly balancing the upsides and downsides of using MongoDB. They are very happy with MongoDB, but the list of workarounds seems quite extensive.
  • Murat Demirbas writes his usual thorough technology review. This time it's on: Pregel: a system for large-scale graph processing. Pros: the key to the scalability of Pregel is batch messaging, and vertices/edges can be modified on-the-fly. Cons: it offloads a lot of responsibility to the programmer, and the model leads to some race conditions.
  • How Moviepilot Walks The Graph. Nice tips and tricks on using Neo4j for graph processing. Just because you can use legacy relational models doesn’t mean you should. Your system will be faster, cleaner and will be much easier to understand for everybody.
  • Raima’s High-Availability Embedded Database. Another option in the embedded database space. It uses a network data model, supports transactions, and offers five-nines (99.999%) availability.
  • Far-ranging discussion on the Cloud Computing news group on the Requirements for Applications to be "Cloud-Ready".
  • LinkedIn has released IndexTank, a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. It also has a REST API and a multitenant framework to host and manage an unlimited number of indexes.
  • Why is RAID So Important for Databases? Suprotim Agarwal with a good summary of the age-old question of which RAID level to use for what. Short and to the point. Oh, and remember, RAID is not backup.
  • European Research on Future Internet Design – White Paper. Greg Ferro links to a paper that discusses the sources of ossification and innovation on the Internet. Ossification: TCP, IP, HTTP. Innovation: Everything above HTTP and everything below IP.
  • Partychat — migrating from Google App Engine to EC2. Detailed discussion of the reasons behind their move to EC2. It's more than just a price/stability tradeoff. The problem is, as an App Engine user, one is totally at the mercy of any future price changes on App Engine because it is nearly impossible to seamlessly migrate away.
  • DB2 10–The Secrets of Scalability. If you are into DB2 this presentation has a pretty detailed list of scalability enhancements for DB2 10.
  • Watson's DeepQA is developed using Apache UIMA, a framework implementation of the Unstructured Information Management Architecture. UIMA was designed to support interoperability and scale-out of text and multimodal analysis applications. All of the components in DeepQA are implemented as UIMA annotators. These are components that analyze text and produce annotations or assertions about the text. Over time Watson has evolved so that the system now has hundreds of components.
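
Two of the HBase aggregation options from the thread above, rollups and live counts kept with atomic increments, fit together naturally. Here is a minimal sketch of the idea in plain Python. It does not use any HBase client: the `RollupCounters` class, its method names, and the bucket key formats are all hypothetical, and a lock stands in for the store's atomic increment. Each event bumps an hour, a day, and a year counter, so the raw event never has to be kept.

```python
import threading
from collections import defaultdict
from datetime import datetime


class RollupCounters:
    """Hypothetical in-memory stand-in for a counter table keyed by (metric, bucket)."""

    def __init__(self):
        # The lock emulates the atomicity a real store's increment would provide.
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, metric, ts, amount=1):
        """Bump the hour, day, and year buckets for one event."""
        buckets = (
            ts.strftime("%Y-%m-%d %H"),  # hourly rollup
            ts.strftime("%Y-%m-%d"),     # daily rollup
            ts.strftime("%Y"),           # yearly rollup
        )
        with self._lock:
            for bucket in buckets:
                self._counts[(metric, bucket)] += amount

    def get(self, metric, bucket):
        """Read a live count; buckets never written read as zero."""
        with self._lock:
            return self._counts.get((metric, bucket), 0)


def demo():
    """Record three made-up events and return the counter object."""
    counters = RollupCounters()
    counters.increment("checkins", datetime(2011, 12, 23, 9, 15))
    counters.increment("checkins", datetime(2011, 12, 23, 9, 45))
    counters.increment("checkins", datetime(2011, 12, 22, 18, 0))
    return counters
```

Reads are then a single key lookup per bucket, for example `demo().get("checkins", "2011-12-23")`. The tradeoff is the one the thread names: once the raw data is gone, you can only answer questions at the granularities you chose to roll up.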