Finding the Right Data Solution for Your Application in the Data Storage Haystack
The InfoQ article Finding the Right Data Solution for Your Application in the Data Storage Haystack makes a series of concrete recommendations for users who want to find the right storage solution for their application.
A few years ago, SQL RDBMSs were the solution for almost all storage needs, but we all know how scaling came along and shattered that perfect dream. Then NoSQL happened, and now we have ended up with a haystack of solutions. Local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph DBs, service registries, queues, and tuple spaces are some classes of such solutions.
We often discuss how to find the right storage solution, and we make such choices whenever we design a system. But when it comes to describing how to select the right one, we often end up giving only very high-level guidelines. The article argues that the way to make more concrete recommendations is to drill down into a bit more detail and consider the choices case by case.
To that end, the article takes four parameters of an application or use case (scale, consistency, type of data, and queries needed), then takes some 40+ cases that arise from different value combinations of those parameters and makes one or more concrete recommendations on the right storage solution for each case.
What follows are the four parameters and the potential values they can take, and then the recommendations for structured, semi-structured, and unstructured data:
Parameter Name | Potential values taken by the parameter
Types of data stored | structured, unstructured, semi-structured
Scalability requirements | small (1-4 nodes), medium (10s of nodes), and very large (100s of nodes)
Nature of data retrieval | Types of queries: key lookup, WHERE, JOIN, offline
Consistency requirements | ACID, single atomic operation, loose consistency
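One way to make the parameter space concrete is to model each parameter as an enumeration and a use case as a record of one value per parameter. The following is a minimal sketch in Python; the names `DataType`, `Scale`, `Query`, `Consistency`, `UseCase`, and `cases_for` are illustrative assumptions and do not come from the article.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class DataType(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class Scale(Enum):
    SMALL = "small (1-4 nodes)"
    MEDIUM = "medium (10s of nodes)"
    VERY_LARGE = "very large (100s of nodes)"


class Query(Enum):
    KEY_LOOKUP = "key lookup"
    WHERE = "WHERE"
    JOIN = "JOIN"
    OFFLINE = "offline"


class Consistency(Enum):
    ACID = "ACID"
    SINGLE_ATOMIC_OPERATION = "single atomic operation"
    LOOSE = "loose consistency"


@dataclass(frozen=True)
class UseCase:
    """One combination of the four parameters (hypothetical record type)."""
    data_type: DataType
    scale: Scale
    query: Query
    consistency: Consistency


def cases_for(data_type):
    """Enumerate every scale/query/consistency combination for one data type."""
    return [
        UseCase(data_type, scale, query, consistency)
        for scale, query, consistency in product(Scale, Query, Consistency)
    ]


if __name__ == "__main__":
    structured = cases_for(DataType.STRUCTURED)
    # 3 scales x 4 query types x 3 consistency levels
    print(len(structured))
```

For a single data type, the three remaining parameters yield 3 x 4 x 3 = 36 combinations, which shows why a case-by-case treatment quickly needs a table rather than a rule of thumb.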
Finally, the article presents its results in tables that summarize all recommendations, followed by a discussion of how each recommendation was made.
The following three tables summarize the recommendations.
Recommendations for Structured Data
KV = Key-Value Systems, CF = Column Families, Doc = Document Based Systems, DB = Database
Recommendations for Semi-Structured Data
Data type | Low Scalability (1-3 nodes) | Scalable (10 nodes) | Highly Scalable
XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ??
Graphs | Graph DBs | Graph DBs if graph can be partitioned | ??
Data Structures | Data Structure Servers, Object Databases | ?? | ??
Queues | Distributed Queues | Distributed Queues | Distributed Queues
Recommendations for Unstructured Data
Retrieval type | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable
Key based retrieval | Distributed File systems / Key-Value systems | Distributed File systems, Key-Value systems | Distributed File systems (Lustre), Key-Value systems
Metadata search | Metadata Catalogs | Metadata Catalogs | ??
Content Search | File System + Indexing | File System + Indexing | HDFS / Google FS + Indexing
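A summary table like the one for unstructured data maps naturally onto a nested lookup. The sketch below encodes that table as a Python dictionary; the function name `recommend` and the short scale keys are hypothetical, and cells the table marks with "??" are represented as `None` rather than filled in.

```python
# Scale tiers, named after the column headers of the unstructured-data table.
SMALL = "small scale"
SCALABLE = "scalable"
HIGHLY_SCALABLE = "highly scalable"

# Recommendations copied from the unstructured-data table.
# None stands for a cell the table leaves open ("??").
UNSTRUCTURED = {
    "key based retrieval": {
        SMALL: "Distributed file systems / key-value systems",
        SCALABLE: "Distributed file systems, key-value systems",
        HIGHLY_SCALABLE: "Distributed file systems (Lustre), key-value systems",
    },
    "metadata search": {
        SMALL: "Metadata catalogs",
        SCALABLE: "Metadata catalogs",
        HIGHLY_SCALABLE: None,
    },
    "content search": {
        SMALL: "File system + indexing",
        SCALABLE: "File system + indexing",
        HIGHLY_SCALABLE: "HDFS / Google FS + indexing",
    },
}


def recommend(retrieval, scale):
    """Return the table's recommendation, or None if the table gives none."""
    row = UNSTRUCTURED.get(retrieval.lower())
    if row is None:
        raise KeyError(f"unknown retrieval type: {retrieval!r}")
    if scale not in row:
        raise KeyError(f"unknown scale: {scale!r}")
    return row[scale]


if __name__ == "__main__":
    print(recommend("Content Search", HIGHLY_SCALABLE))
    print(recommend("Metadata search", HIGHLY_SCALABLE))
```

Keeping the open cells as `None` makes the gaps in the recommendations explicit to callers instead of hiding them behind a default.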