Stuff The Internet Says On Scalability For November 11, 2011

You got performance in my scalability! You got scalability in my performance! Two great tastes that taste great together:

  • Quotable quotes:
    • @jasoncbooth : Tired of the term #nosql. I would like to coin NRDS (pronounced "nerds"), standing for Non Relational Data Store. 
    • @zenfeed : One lesson I learn about scalability, is that it has a LOT to do with simplicity and consistency.
    • Ray Walters : Quad-core chips in mobile phones is nothing but a marketing snow job
  • Flickr: Real-time Updates on the Cheap for Fun and Profit. How Flickr added a real-time push feed on the cheap. Events happen all over Flickr, uploads and updates (around 100/s depending on the time of day), all of them inserting tasks. Implemented with Cache, Tasks, & Queues: PubSubHubbub; async task system Gearman; use async EVERYWHERE; use Redis Lists for queues; cron to consume events off the queue.
  • Cloud Event Processing - Big Data, Low Latency Use Cases at LinkedIn by Colin Clark. It covers some big data, low latency use cases and highlights a distributed, streaming map/reduce architecture and the SAX algorithm. SAX enables real-time (or near-real-time) reactions by way of an inverted index. Imagine for a moment that you're receiving streaming prices at a very high rate. Due to the dimensionality of that stream, it's very difficult to search a database for the last time a pattern happened. However, a SAX word can provide an immediate index point, in addition to finding patterns that are close to the current situation. Why? Because the distance between SAX words is lower bounded, which means it can be used in traditional distance measuring algorithms. To summarize, you can use SAX to encode real-time information and provide a constant reference point to historical occurrences.
  • Riyad Kalla with a great summary of From “Overnight” to “Real-time”: A Two-Year NoSQL Case Study. Cons: performance, static query model, disk usage, database choice impacts every other design choice, early system teething pains. Pros: scaled to 14 nodes with no problems, stable API, stability allowed them to concentrate on their app, cluster performs well as an aggregate, good support from Cloudant. 
  • How Heroku Works - Maker's Day. Maker’s Day ensures that engineers get a full day of uninterrupted time to focus on making things. Maker’s Day is meant for making shit. Meetings don’t happen on Maker’s Day.
  • The Darknet Project: netroots activists dream of global mesh network. A group of Internet activists gathered last week in an Internet Relay Chat (IRC) channel to begin planning an ambitious project: they hope to overcome electronic surveillance and censorship by creating a whole new Internet. Wasn't the original Internet a Darknet? We need more redundancy, more options, more paths, more routes. That's the only way to protect us from the centralizers. On Hacker News.
  • Java garbage collection pauses got you down, bucky? Azul, high priests of all things Java, may have a solution for you with Zing, a pauseless garbage collector that works on an unmodified Linux distro on 512GB of RAM. Applications can completely separate heap size from response times for predictable, consistent Java GC behavior.
  • Pushing the Limits of Amazon S3 Upload Performance. Mark Rasmussen with excellent research and discussion of S3's upload characteristics. He found: Parallelization - the easiest way to scale is to just parallelize your workload. Locality & bandwidth - being close to the S3 bucket servers is of utmost importance. Operating system - 15% better performance on a 2008 R2 server over a 2003 server. Saturating the S3 service - not going to happen, simple as that.
  • Let’s be consistent about consistency – a post for the relational mind by Billy Bosworth. When NoSQL people talk about consistency, they are most likely not talking about the “C” in ACID. What they are really talking about is “data consistency”, which largely has to do with concurrent reads. It refers to whether the data itself is going to be consistent for everyone reading from the database. With data consistency, each user sees a consistent view of the data, including visible changes made by the user’s own transactions and transactions of other users.
  • DarkStar. A cloud event processing system used at LinkedIn. 
  • Carriers desperately seeking higher mobile data prices by Stacey Higginbotham. Instead of squeezing us for profits with poorer service, how about they encourage data ubiquity by investing and radically increasing bandwidth? Just a thought.
  • Commentary: The GPU computing fallacy. Lars Juhl Jensen busts some myths, concluding: The message then is clear. If you are a bioinformatician who likes to live on the bleeding edge while wasting money and electricity, get a GPU compute server. If on the other hand you want something generally useful and well tested and quite a lot faster than a GPU compute server … get yourself some computers.
  • If you've wanted to play with the effect SSDs will have on your architecture, CloudSigma has added SSDs to their cloud.
  • eBay tells a little more about how they handle personalized search. Put your math hat on, because it gets dense in there.
  • The Collections Cache. Ben Stopford talks about the design for a common problem: a cache that appends values to a collection using a Trigger. Also, Coherence Implementation Patterns – Slides from Coherence SIG.
  • Isn't it amazing that counting stuff is still a problem? MapReduce vs count vs find
  • In Clustrix benchmarks under tpcc-mysql workload Percona finds Clustrix provides very good scalability by adding more nodes and in supporting a multi-threaded workload. Interestingly, when using few threads SSD cards perform better.
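
For the curious, the SAX idea from the Cloud Event Processing item above can be sketched in a few lines of Python. This is a minimal illustration, not DarkStar's or Colin Clark's code: every function name here (sax_word, mindist, build_index) is made up for the example, and the window of 8 points, 4 segments, and 4-letter alphabet are arbitrary assumptions. Each window of prices is z-normalized, averaged into segments, and mapped to letters; the resulting word keys an inverted index of where that shape last occurred, and mindist gives a distance between words that never exceeds the Euclidean distance between the normalized windows.

```python
import math
from bisect import bisect_left
from collections import defaultdict
from statistics import NormalDist, fmean, pstdev


def breakpoints(alphabet_size):
    """Cut points splitting a standard normal into equal-probability bins."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def znorm(series):
    """Rescale to zero mean and unit variance (flat series become all zeros)."""
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Mean of each equal-width segment; assumes len(series) % segments == 0."""
    width = len(series) // segments
    return [fmean(series[i * width:(i + 1) * width]) for i in range(segments)]


def sax_word(series, segments=4, alphabet_size=4):
    """Encode a numeric window as a short string of letters."""
    cuts = breakpoints(alphabet_size)
    means = paa(znorm(series), segments)
    return "".join(chr(ord("a") + bisect_left(cuts, m)) for m in means)


def mindist(word_a, word_b, series_len, alphabet_size=4):
    """Distance between two SAX words; lower-bounds the Euclidean distance
    between the z-normalized windows they were built from."""
    cuts = breakpoints(alphabet_size)
    total = 0.0
    for ca, cb in zip(word_a, word_b):
        lo, hi = sorted((ord(ca) - ord("a"), ord(cb) - ord("a")))
        if hi - lo > 1:  # identical or adjacent letters contribute zero
            total += (cuts[hi - 1] - cuts[lo]) ** 2
    return math.sqrt(series_len / len(word_a)) * math.sqrt(total)


def build_index(prices, window=8, segments=4, alphabet_size=4):
    """Inverted index: SAX word -> start offsets of windows with that shape."""
    index = defaultdict(list)
    for start in range(len(prices) - window + 1):
        word = sax_word(prices[start:start + window], segments, alphabet_size)
        index[word].append(start)
    return index


if __name__ == "__main__":
    prices = [1, 2, 3, 4, 5, 6, 7, 8, 3, 1, 2, 3, 4, 5, 6, 7, 8]
    index = build_index(prices)
    current = sax_word([10, 20, 30, 40, 50, 60, 70, 80])
    print(current)         # abcd
    print(index[current])  # offsets where a steady rise was seen before
```

Looking up a word is a single dictionary hit, which is the "immediate index point" the talk describes. To find near matches rather than exact ones, one option is to scan the index keys and keep those whose mindist to the current word falls under a threshold.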