10 NoSQL Systems Reviewed

Jonathan Ellis reviews in the NoSQL Ecosystem the origin of the NoSQL movement and 10 different NoSQL products and how their 1) support for multiple datacenters,  2) the ability to add new machines to a live cluster transparently to the your applications, 3) Data Model, 4) Query API, 5) Persistence Design. The 10 systems reviewed are: Cassandra, CouchDB, HBase, MongoDB, Neo4J, Redis, Riak, Scalaris, Tokyo Cabinet, Voldemort.

A very thorough and thoughtful article on the entire NoSQL space. It's clear from the article that NoSQL is not monolithic, there is a very wide variety of approaches to not being a relational database.

Click to read more ...


Product: Resque - GitHub's Distrubuted Job Queue

Queuing work for processing in the background is a time tested scalability strategy. Queuing also happens to be one of those much needed tools where it easy enough to forge for your own that we see a lot of different versions made. Resque is GitHub's take on a job queue and they've used it to process million and millions of jobs so far.

What is Resque?

Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later. Background jobs can be any Ruby class or module that responds to perform. Your existing classes can easily be converted to background jobs or you can create new classes specifically to do work. Or, you can do both.

GitHub tried and considered many other systems: SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, beanstalkd, AMQP,  and Kestrel, but found them all wanting in one way are another. The latency for SQS was too high. Others didn't make full use of Ruby. Others still had a lot of overhead. Some didn't have enough features. And still others weren't reliable enough.

Click to read more ...


A Yes for a NoSQL Taxonomy

NorthScale's Steven Yen in his highly entertaining NoSQL is a Horseless Carriage presentation has come up with a NoSQL taxonomy that thankfully focuses a little more on what NoSQL is, than what it isn't:

  • key‐value‐cache
    • memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracoqa
  •  key‐value‐store
    • keyspace, flare, schema‐free, RAMCloud
  • eventually‐consistent key‐value‐store
    • dynamo, voldemort, Dynomite, SubRecord, Mo8onDb, Dovetaildb
  • ordered‐key‐value‐store
    • tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
  • data‐structures server
    •  redis
  • tuple‐store
    • gigaspaces, coord, apache river
  • object database
    • ZopeDB, db4o, Shoal
  • document store
    •  CouchDB, Mongo, Jackrabbit, XML Databases, ThruDB, CloudKit, Perservere, Riak Basho, Scalaris
  • wide columnar store
    • BigTable, Hbase, Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

"Who will win?" Steven asks. He answers:  the most approachable API with enough power will win. Steven touts the contender with the most devastating knock out punch will be document stores because "everyone groks documents." Though the thought is there will be just a few winners and products will converge in functionality.

Steven is banking on the "worse is better" model of dominance, which is hard to argue with as it has been so successful an adoption pattern in our field. The convergence idea is something I also agree with. What we have now are a lot features masquerading as products. Over time they will merge together to become more full featured offerings.

The key question though is what is enough power to win? Just getting a value back for a key won't be enough. Who are you putting your money on?

Click to read more ...


Damn, Which Database do I Use Now?

With so many database options available these days, like for the rest of life, it's natural to wonder how it all fits together. Amazon complicated, or rather expanded the available options by introducing RDS, their relational database service. RDS is MySQL safely cocooned as a manageable cloud element, resting boldly within an energy providing elastic CPU pool, supported by a virtually infinite supply of very capable virtualized storage .

MySQL in AWS is now easy to start, stop, monitor, backup, snapshot, expand, and effortlessly move up and down the instance hierarchy. What it's not, contrary to what you might expect, is a scale-out solution, it's a scale-up solution. You get more by buying a bigger instance, not by horizontally adding more instances. There's a limit. Admittedly a larger limit now with Amazon's new high memory instances.

That's OK, well maybe not for people who helped grow Amazon's ecosystem by offering a similar product, but so many projects use MySQL that this is a big win for a lot of people. It makes life easier even if the promise of infinite relational database storage is yet to be realized.

If one of the reasons you were considering using a Platform as a Service is to knock the database item off your worry list, RDS is one more reason to consider playing your own general contractor and orchestrating all the elements together yourself. As more services become packaged into cloud capable components this is likely how many systems will be bolted together in the future.

But we are left wondering, how RDS fits together with SimpleDB and all the other database options?

Click to read more ...


Squeeze more performance from Parallelism

In many posts, such as: The Future of the Parallelism and its Challenges I mentioned that synchronization the access to the shared resource is the major challenge to write parallel code.

The synchronization and coordination take long time from the overall execution time, which reduce the benefits of the parallelism; the synchronization and coordination also reduce the scalability.

There are many forms of synchronization and coordination, such as:

  • Create Task object in frameworks such as: Microsoft TPL, Intel TDD, and Parallel Runtime Library. Create and enqueue task objects require synchronization that it’s takes long time especially if we create it into recursive work such as: Quick Sort algorithm.
  • Synchronization the access to shared data.

But there are a few techniques to avoid these issues, such as: Shared-Nothing, Actor Model, and Hyper Object (A.K.A. Combinable Object). Simply if we reduce the shared data by re-architect our code this will gives us a huge benefits in performance and scalability.

Continue >



Hot Scalabilty Links for October 30 2009


Paper: No Relation: The Mixed Blessings of Non-Relational Databases

This excellent survey of the field was written by Ian Thomas Varley as part of his Master of Science in Engineering program.

The aim of this paper is to explore the conceptual design space of non-relational databases as compared to traditional relational databases. It is clear that the design needs of the two paradigms are different, but how fundamental are the differences, and what strategies can we use to transition our conceptual designs from one to the other?
There are a few things to like about this paper. A running a example is used to show the different ways to model data depending on which type of solution you are targeting, especially covering how many-to-many relationships are modeled, data integrity, and how to support optional attributes. There's also a brief survey of some of the major systems.
The most interesting section of the report is where it tackles the problem of design for non-relational systems. The approach has two different phases: design questions and design strategies.
The questions you should ask yourself about your problem are:

Click to read more ...


Digg - Looking to the Future with Cassandra

Digg has been researching ways to scale our database infrastructure for some time now. We’ve adopted a traditional vertically partitioned master-slave configuration with MySQL, and also investigated sharding MySQL with IDDB. Ultimately, these solutions left us wanting. In the case of the traditional architecture, the lack of redundancy on the write masters is painful, and both approaches have significant management overhead to keep running.

Since it was already necessary to abandon data normalization and consistency to make these approaches work, we felt comfortable looking at more exotic, non-relational data stores. After considering HBase, Hypertable, Cassandra, Tokyo Cabinet/Tyrant, Voldemort, and Dynomite, we settled on Cassandra.

Each system has its own strengths and weaknesses, but Cassandra has a good blend of everything. It offers column-oriented data storage, so you have a bit more structure than plain key/value stores. It operates in a distributed, highly available, peer-to-peer cluster. While it’s currently lacking some core features, it gets us closer to where we want to be than the other solutions.



GemFire: Solving the hardest problems in data management

GemStone's website recently recieved a major facelift over at I felt that the users of this site might find our detailed description of how we solve the hardest problems in data management interesting. This can be viewed at: (PDF available for download).

Click to read more ...


And the winner is: MySQL or Memcached or Tokyo Tyrant?

Matt, from the ever excellent MySQL Performance Blog, decided to run a test using a simple scenario drawn from his client experience in the gaming space. The scenario: read a row based on a primary key, update the row, write it to disk, and use the row to lookup another row. Matt ran three different tests explained in a series of three different articles: MySQL and MySQL + Memcached, Memcached Only, and Tokyo Tyrant.

The lovingly compiled details along with many cool graphs are in the articles, but in general the lessons learned are:

Click to read more ...