
Finding the Right Data Solution for Your Application in the Data Storage Haystack

Srinath Perera

01 Nov 2011

The InfoQ article "Finding the Right Data Solution for Your Application in the Data Storage Haystack" makes a series of concrete recommendations for anyone who needs to choose the right storage solution for an application.

A few years back, SQL RDBMSs were the solution for almost all storage needs, but we all know how scaling came along and shattered that perfect dream. Then NoSQL happened, and now we have ended up with a haystack of solutions. For example, local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph DBs, service registries, queues, and tuple spaces are some classes of such solutions.

We often discuss how to find the right storage solution, and we make such choices whenever we design a system. But when it comes to describing how to select the right one, we often end up giving only very high-level guidelines. The article argues that the way to make more concrete recommendations is to drill down into a bit more detail and consider the choices case by case.

To that end, the article takes four parameters of an application or use case (scale, consistency, type of data, and queries needed), then takes some 40+ cases that arise from different value combinations of those parameters and makes one or more concrete recommendations on the right storage solution for each case.

What follows are the four parameters and the potential values they can take, followed by the recommendations for structured, semi-structured, and unstructured data:

Parameter name	Potential values taken by the parameter
Types of data stored	structured, semi-structured, unstructured
Scalability requirements	small (1-4 nodes), medium (10s of nodes), very large (100s of nodes)
Nature of data retrieval	types of queries: key lookup, WHERE, JOIN, offline
Consistency requirements	ACID, single atomic operation, loose consistency
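
To make the parameter space concrete, here is a minimal Python sketch that models the four parameters as enumerations and lists every possible combination. The class and member names (`DataType`, `Scale`, `Query`, `Consistency`, `UseCase`) are my own shorthand, not something the article defines. Note that the full cross product is only an upper bound: the article works through some 40+ of these cases, not all of them.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


# Hypothetical names; the values mirror the parameter table above.
class DataType(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class Scale(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    VERY_LARGE = "very large"


class Query(Enum):
    KEY_LOOKUP = "key lookup"
    WHERE = "WHERE"
    JOIN = "JOIN"
    OFFLINE = "offline"


class Consistency(Enum):
    ACID = "ACID"
    SINGLE_ATOMIC_OPERATION = "single atomic operation"
    LOOSE = "loose consistency"


@dataclass(frozen=True)
class UseCase:
    """One point in the parameter space (an assumed representation)."""
    data_type: DataType
    scale: Scale
    query: Query
    consistency: Consistency


# Every combination of the four parameters: 3 * 3 * 4 * 3 = 108.
all_cases = [
    UseCase(*combo)
    for combo in product(DataType, Scale, Query, Consistency)
]

print(len(all_cases))  # 108
```

A use case such as "structured data, small scale, JOIN queries, ACID" then becomes a single `UseCase` value that can serve as a key when looking up a recommendation.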

Finally, the article presents its results in a table that summarizes all recommendations, followed by a discussion of how each recommendation was made.

The following three tables summarize the recommendations.

Recommendations for Structured Data

KV = Key-Value Systems, CF = Column Families, Doc = Document Based Systems, DB = Database

Recommendations for Semi-Structured Data

	Low Scalability (1-3 nodes)	Scalable (10 nodes)	Highly Scalable
XML (Queried through XPath)	XML DB or convert to a structured model	XML DB or convert to a structured model	??
Graphs	Graph DBs	Graph DBs if graph can be partitioned	??
Data Structures	Data Structure Servers, Object Databases	??	??
Queues	Distributed Queues	Distributed Queues	Distributed Queues

Recommendations for Unstructured Data

	Small Scale (1-3 nodes)	Scalable (10 nodes)	Highly Scalable
Key-based retrieval	Distributed File systems, Key-Value systems	Distributed File systems, Key-Value systems	Distributed File systems (Lustre), Key-Value systems
Metadata search	Metadata Catalogs	Metadata Catalogs	??
Content search	File System + Indexing	File System + Indexing	HDFS/Google FS + Indexing
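
As a rough illustration of how such tables could be used in practice, the sketch below encodes the semi-structured and unstructured tables above as a Python lookup. The names `Scale`, `RECOMMENDATIONS`, and `recommend`, and the row keys, are hypothetical shorthand of mine. Cells marked "??" in the tables are stored as `None`, meaning no recommendation is given for that case. The structured-data table is left out because its rows are not reproduced here.

```python
from enum import Enum
from typing import Optional


class Scale(Enum):
    """Column index into each table row (hypothetical naming)."""
    SMALL = 0   # 1-3 nodes
    MEDIUM = 1  # 10 nodes
    HIGH = 2    # highly scalable


# Rows copied from the two tables above; None stands for a "??" cell.
RECOMMENDATIONS = {
    "semi-structured": {
        "XML": (
            "XML DB or convert to a structured model",
            "XML DB or convert to a structured model",
            None,
        ),
        "Graphs": (
            "Graph DBs",
            "Graph DBs if graph can be partitioned",
            None,
        ),
        "Data Structures": (
            "Data Structure Servers, Object Databases",
            None,
            None,
        ),
        "Queues": (
            "Distributed Queues",
            "Distributed Queues",
            "Distributed Queues",
        ),
    },
    "unstructured": {
        "Key-based retrieval": (
            "Distributed File systems, Key-Value systems",
            "Distributed File systems, Key-Value systems",
            "Distributed File systems (Lustre), Key-Value systems",
        ),
        "Metadata search": (
            "Metadata Catalogs",
            "Metadata Catalogs",
            None,
        ),
        "Content search": (
            "File System + Indexing",
            "File System + Indexing",
            "HDFS/Google FS + Indexing",
        ),
    },
}


def recommend(data_type: str, kind: str, scale: Scale) -> Optional[str]:
    """Return the table cell for this case, or None if the table has '??'."""
    return RECOMMENDATIONS[data_type][kind][scale.value]


print(recommend("semi-structured", "Graphs", Scale.MEDIUM))
print(recommend("unstructured", "Metadata search", Scale.HIGH))
```

The first call prints the medium-scale graph recommendation, and the second prints `None` because the table leaves that cell open.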