Finding the Right Data Solution for Your Application in the Data Storage Haystack

The InfoQ article "Finding the Right Data Solution for Your Application in the Data Storage Haystack" makes a series of concrete recommendations for users who want to find the right storage solution for their application.

A few years back, SQL RDBMSs were the solution for almost all storage needs, but we all know how scaling came along and shattered that perfect dream. Then NoSQL happened, and now we have ended up with a haystack of solutions. Local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph DBs, service registries, queues, and tuple spaces are some classes of such solutions.

We often discuss how to find the right storage solution, and we make such choices whenever we design a system. But when it comes to describing how to select the right one, we often end up giving very high-level guidelines. The article argues that the way to make more concrete recommendations is to drill down into a bit more detail and consider the choices case by case.

To that end, the article takes four parameters of an application or use case (scale, consistency, type of data, and queries needed), then takes some 40+ cases that arise from different value combinations of those parameters and makes one or more concrete recommendations on the right storage solution for each case.

What follows are the four parameters with the potential values they can take, and then the recommendations for structured, semi-structured, and unstructured data:

| Parameter name | Potential values taken by the parameter |
|---|---|
| Types of data stored | structured, unstructured, semi-structured |
| Scalability requirements | small (1-4 nodes), medium (10s of nodes), and very large (100s of nodes) |
| Nature of data retrieval | types of queries: key lookup, WHERE, JOIN, offline |
| Consistency requirements | ACID, single atomic operation, loose consistency |
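
To make the parameter space concrete, here is a minimal Python sketch that models the four parameters as enumerations and lists every value combination. The class and function names (`DataType`, `Scale`, `Query`, `Consistency`, `UseCase`, `all_use_cases`) are illustrative assumptions and do not come from the article. The full cross-product is only an upper bound; the article discusses a subset of these combinations.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import List


class DataType(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class Scale(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    VERY_LARGE = "very large"


class Query(Enum):
    KEY_LOOKUP = "key lookup"
    WHERE = "WHERE"
    JOIN = "JOIN"
    OFFLINE = "offline"


class Consistency(Enum):
    ACID = "ACID"
    SINGLE_ATOMIC_OPERATION = "single atomic operation"
    LOOSE = "loose consistency"


@dataclass(frozen=True)
class UseCase:
    """One point in the parameter space (hypothetical helper type)."""
    data_type: DataType
    scale: Scale
    query: Query
    consistency: Consistency


def all_use_cases() -> List[UseCase]:
    """Return every combination of the four parameter values."""
    return [
        UseCase(*combo)
        for combo in product(DataType, Scale, Query, Consistency)
    ]


if __name__ == "__main__":
    # 3 data types x 3 scales x 4 query types x 3 consistency levels
    print(len(all_use_cases()))
```

Treating each use case as a plain value makes it easy to key a recommendation table on it later.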

Finally, the article presents results using a table that summarizes all recommendations, followed by a discussion of how each recommendation was made.

The following three tables summarize the recommendations.

Recommendations for Structured Data

KV = Key-Value Systems, CF = Column Families, Doc = Document Based Systems, DB = Database

Recommendations for Semi-Structured Data

|  | Low Scalability (1-3 nodes) | Scalable (10 nodes) | Highly Scalable |
|---|---|---|---|
| XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ?? |
| Graphs | Graph DBs | Graph DBs if graph can be partitioned | ?? |
| Data Structures | Data Structure Servers, Object Databases | ?? | ?? |
| Queues | Distributed Queues | Distributed Queues | Distributed Queues |
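
The semi-structured recommendations above can be read as a simple lookup keyed by data kind and scale. The following Python sketch encodes them that way; the names `SCALES`, `SEMI_STRUCTURED`, and `recommend` are hypothetical, and cells marked "??" in the table are represented as `None` (no recommendation given).

```python
from typing import Dict, Optional, Tuple

# Column order of the table: low scalability, scalable, highly scalable.
SCALES = ("low", "scalable", "highly scalable")

# One row per kind of semi-structured data; None stands for a "??" cell.
SEMI_STRUCTURED: Dict[str, Tuple[Optional[str], ...]] = {
    "xml": (
        "XML DB or convert to a structured model",
        "XML DB or convert to a structured model",
        None,
    ),
    "graphs": (
        "Graph DBs",
        "Graph DBs if graph can be partitioned",
        None,
    ),
    "data structures": (
        "Data Structure Servers, Object Databases",
        None,
        None,
    ),
    "queues": (
        "Distributed Queues",
        "Distributed Queues",
        "Distributed Queues",
    ),
}


def recommend(kind: str, scale: str) -> Optional[str]:
    """Look up the table cell for a data kind and a scale column."""
    return SEMI_STRUCTURED[kind][SCALES.index(scale)]


if __name__ == "__main__":
    print(recommend("graphs", "scalable"))
    print(recommend("xml", "highly scalable"))  # None: the table has "??"
```

Returning `None` rather than guessing keeps the open cells visible to the caller.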

Recommendations for Unstructured Data

|  | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable |
|---|---|---|---|
| Key-based retrieval | Distributed file systems, key-value systems | Distributed file systems, key-value systems | Distributed file systems (Lustre), key-value systems |
| Metadata search | Metadata catalogs | Metadata catalogs | ?? |
| Content search | File system + indexing | File system + indexing | HDFS/Google FS + indexing |
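
As a rough illustration of how the unstructured-data table might be applied, the Python sketch below maps a cluster size to a scale column and returns the matching cell. The cut-offs between columns are assumptions made for this example only: the table says "1-3 nodes" and "10 nodes" but does not define exact boundaries. The names `scale_tier`, `UNSTRUCTURED`, and `recommend_unstructured` are likewise hypothetical.

```python
from typing import Dict, Optional

# ASSUMED cut-offs: the table does not state where one column ends
# and the next begins, so these values are placeholders.
SMALL_MAX_NODES = 3
SCALABLE_MAX_NODES = 99

# Rows and cells of the unstructured-data table; None stands for "??".
UNSTRUCTURED: Dict[str, Dict[str, Optional[str]]] = {
    "key-based retrieval": {
        "small": "Distributed file systems, key-value systems",
        "scalable": "Distributed file systems, key-value systems",
        "highly scalable": "Distributed file systems (Lustre), key-value systems",
    },
    "metadata search": {
        "small": "Metadata catalogs",
        "scalable": "Metadata catalogs",
        "highly scalable": None,
    },
    "content search": {
        "small": "File system + indexing",
        "scalable": "File system + indexing",
        "highly scalable": "HDFS/Google FS + indexing",
    },
}


def scale_tier(nodes: int) -> str:
    """Map a node count to a table column using the assumed cut-offs."""
    if nodes < 1:
        raise ValueError("node count must be at least 1")
    if nodes <= SMALL_MAX_NODES:
        return "small"
    if nodes <= SCALABLE_MAX_NODES:
        return "scalable"
    return "highly scalable"


def recommend_unstructured(retrieval: str, nodes: int) -> Optional[str]:
    """Return the table cell for a retrieval style and cluster size."""
    return UNSTRUCTURED[retrieval][scale_tier(nodes)]


if __name__ == "__main__":
    print(recommend_unstructured("content search", 200))
    print(recommend_unstructured("metadata search", 500))  # None ("??")
```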