advertise
« 10 Golden Principles For Building Successful Mobile/Web Applications | Main | C is for Compute - Google Compute Engine (GCE) »
Wednesday
Jul042012

Top Features of a Scalable Database

This is a guest post by Douglas Wilson, EMEA Field Application Engineer at Raima, based on insights from biulding their Raima Database Manager.

Scalability and Hardware

Scalability is the ability to maintain performance as demands on the system increase, by adding further resources. Normally those resources will be in the form of hardware. Since processor speeds are no longer increasing much, scaling up the hardware normally means adding extra processors or cores, and more memory.

Scalability and Software

However, scalability requires software that can utilize the extra hardware effectively. The software must be designed to allow parallel processing. In the context of a database engine this means that the server component must be multi-threaded, to allow the operating system to schedule parallel tasks on all the cores that are available. Not only that, but the database engine must provide an efficient way to break its workload into as many parallel tasks as there are cores. So, for example, if the database server always uses only four threads then it will make very little difference whether this server runs on a four-core machine or an eight-core machine.

Distributed Design

Splitting up the workload of a database engine to take full advantage of the available hardware is non-trivial, and not all data management systems do this well. There are limits to the scalability of a monolithic server. Not just the database engine, but also data and other system resources need to be split up to avoid interdependencies. The whole system needs to have a distributed design.

For example, most databases store indexes in B-Tree structures. B-Trees allow indexed values to be located quickly, and they also allow relatively efficient insertion and deletion, but for this they need to be “balanced”, i.e. the tree structure must have the same depth across all its branches. The need to keep the tree balanced means that a single insertion or deletion may trigger changes that ripple all the way to the root of the tree. This makes it difficult to share the management of a B-Tree between multiple threads, and therefore between multiple cores. The threads may frequently compete for access to the root of the tree, which becomes a bottleneck.

Minimize Shared Resources

Scalability is all about minimizing the number of such shared resources, so that different threads can run on different cores without ever having to wait for each other to release shared resources. Without this independence, adding extra cores will not greatly improve performance.

An example of this concept can be shown through experience in our own database management system, RDM.  RDM has intelligent support for distributed databases and allows the application to distribute data across the available hardware and minimize contention between different threads and processes. Since this objective cannot usually be achieved without knowledge of the data structure and use cases, the database engine must allow the application writer to specify which data will be handled by which server. RDM’s server process (called the Transactional File Server, or TFS) is a relatively lightweight process and multiple instances of the TFS can run on multi-core systems; each assigned to different databases, and completely independent of each other. The client is able to connect to all of these and retrieve data from them as if it was just reading one database. RDM therefore provides building blocks that allow the creation of a truly scalable application.

Multi Version Concurrency Control - Simultaneous Access

In situations where simultaneous read and write access to the same data is required, support for Multi Version Concurrency Control (MVCC) allows this simultaneous access without blocking the threads or processes involved. MVCC allows a reader to view a snapshot of the data before the writer’s changes, allowing the read and write operations to continue in parallel.

Replication - Efficient Distribution

Replication may also be used as a way to improve scalability, by providing multiple read-only replicas of a master database. This may enable processes running on remote processors to read a local copy of the database. It may also help by reducing the number of processes trying to access the master database.

Top Scalable features of a DBMS

To summarize, the following features of a database management system allow a highly scalable, distributed data architecture:

  • Lightweight server process that does not use much CPU time – by running multiple instances you can use as many processors or cores as you have.
  • Client side that can connect to multiple servers and present the data from those servers to the application as if it was all in one database.
  • Replication – the client side can retrieve data from a master database or a replicated slave database – this is transparent to the application.

Reader Comments (1)

Thanks for this article. Replication can only be taken so far. If you replicate a large DB (> 1 TB) it is hard to recover from any replication problem\disaster or get the read only subscribers back in synch. Smaller datasets can just be "snapshot" to the subscribers but that becomes impossible with larger tables. The only way to get everything working again after replication breaks is to put up splash pages, backup your OLTP database, copy it to all the subscribers, restore it, set up relication THEN you can take the splash pages down. If you don't do all these your subscribers will be out of synch before you set them up again. This is a big pain point for us. Thanks.

July 6, 2012 | Unregistered CommenterKirby Repko

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>