databases

Against all the odds

HFadeel

17 Jul 2009 — 3 min read

This article not about Mariah Carey, or its song. It's about Storing System, Database.

First let's describe what means by odds: In my social network, I found 93% of the mainstream developers sanctify the database, or at least consider it in any data persistence challenge as the ultimate, superhero, and undefeatable solution.

I think this problem come from the education, personally, and some companies also I think it's involved in this.

To start to fix this bad thinking, we all should agree in the following points:

Every challenge have its own solutions, so whatever you want to save/persistent, there are always many solutions. For example the Web search engines, such as: Google, Kngine, Yahoo, Bing don't use database at all instead we use Indexes (Index file) for better performance.

The Database in general whatever the vendor it's slow compared with other solutions such as: Key-Value storing system, Index file, DHT.

The Database currently employ Relation Data model, or Object relational data model, so don't convince yourself to save non-relation data into relation data model store system such as: Database.

The Database system architecture didn't changed very much in last 30 years, and it's content a lot of limits, and fails in its performance, scalability character. If you don't believe me check out this papers:

I hope if you agreed with me in the previous points. So the question do we really need Database in every application?

There are many scenario shouldn't use Database resisters, such as: Web search engine, Caching, File sharing system, DNS system, etc. In the other hand there many of scenarios should use Database, such as: Customer database, Address book, ERP, etc.

Tiny URL services for example, shouldn't use Database at all because it's require very simple needs, just map a small/tiny URL to the real/big URL. If you start agreed with me, you likely want ask: But what we can use beside or instead of Databases?

There are a lot of tools that fallowing CAP, BASE model, instead of ACID model. But first let's describe ACID:

Atomicity: A transaction is all or nothing

Consistency: Only valid data is written to the database

Isolation: Pretend all transactions are happening serially and the data is correct

Durability: what you write is what you get

The problem with ACID is that it gives you too much; it trips you up when you are trying to scale a system across multiple nodes.
Down time is unacceptable. So your system needs to be reliable. Reliability requires multiple nodes to handle machine failures.
To make scalable systems that can handle lots and lots of reads and writes you need many more nodes.
Once you try to scale ACID across many machines you hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.

In other hand CAP model is about:

Consistency: Your data is correct all the time. What you write is what you read.
Availability: You can read and write and write your data all the time.
Partition Tolerance: If one or more nodes fails the system still works and becomes consistent when the system comes on-line.

CAP is easy to scale, distribute. CAP is scalable by nature.
Everyone who builds big applications builds them on CAP. Who use CAP: Google, Yahoo, Facebook, Kngine, Amazon, eBay, etc.

For example in any in-memory or in-disk caching system you will never need all the Database features. You just need CAP like system. Today there are a lot of: column oriented, and key-value oriented systems. But first let's describe Column oriented:

A column-oriented is a database management system (DBMS) which stores its content by column rather than by row. This has advantages for databases such as data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items. This approach is contrasted with row-oriented databases and with correlation databases, which use a value-based storage structure. For more information check Wikipedia page.

Distributed key-value stores:

Distributed column stores (Bigtable-like systems):

Something a little different: