Thursday, November 29, 2012

Performance Data for LevelDB, Berkeley DB and BangDB for Random Operations

This is a guest post by Sachin Sinha, Founder of Iqlect and developer of BangDB.

The goal of this paper is to provide performance data for the following embedded databases under various scenarios for random operations such as writes and reads. The data is presented graphically to make it largely self-explanatory.

  • LevelDB:

    LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. LevelDB is based on the LSM-tree (Log-Structured Merge-tree) and uses SSTables and a MemTable for its implementation. It's written in C++ and available under a BSD license. LevelDB treats keys and values as arbitrary byte arrays and stores keys in order. It uses Snappy compression for the data. Writes and reads are concurrent, but writes perform best with a single thread whereas reads scale with the number of cores.

  • Berkeley DB:

    Berkeley DB (BDB) is a library that provides a high-performance embedded database for key/value data. It's the most widely used database library, with millions of deployed copies. BDB can be configured to run as anything from a concurrent data store to a transactional data store to a fully ACID-compliant db. It's written in C and available under the Sleepycat Public License. BDB treats keys and values as arbitrary byte arrays and stores keys either in order using BTREE or unordered using HASH. Writes and reads are concurrent and scale well with the number of cores, especially reads.

  • BangDB:

    BangDB is a high-performance embedded database for key-value data. It's a new entrant in the embedded db space. It's written in C++ and available under a BSD license. BangDB treats keys and values as arbitrary byte arrays and stores keys either in order using BTREE or unordered using HASH. Writes and reads are concurrent and scale well with the number of cores.

The comparison has been done on similar grounds (as far as possible) for all the dbs, to measure the data as accurately as possible.

The results of the tests show BangDB to be faster in both reads and writes:

Since BangDB may not be familiar to most people, here's a quick summary of its features: 

  1. Highly concurrent operations on a B-link tree
    1. Manipulation of the tree by any thread uses only a small constant number of page locks at any time
    2. Searching the tree does not prevent reading any node; the search procedure in fact does no locking most of the time
    3. Based on the Lehman and Yao paper, extended further for performance
  2. Concurrent Buffer Pools
    1. Separate pools for different types of data. This gives flexibility and better performance when it comes to managing data in the buffer pool in different scenarios
    2. Semi-adaptive data flush to ensure performance degrades gracefully when data overflows the buffer
    3. Access to individual buffer headers in the pool is heavily optimized, allowing multiple threads to get the right headers efficiently for better performance
    4. A two-LRU-list algorithm for better temporal locality
  3. Other
    1. Write of data/log is always sequential
    2. Vectored read and write as far as possible
    3. The ARIES algorithm for WAL has been extended for performance; for example, only index pages carry logging metadata, and data pages are entirely free of it
    4. Key pages of the index are additionally cached to shortcut various steps, which means less locking on highly accessed pages and hence better performance
    5. Slab allocator for most of the memory requirements 
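The two-LRU-list idea in point 2.4 above can be sketched in general terms. The following is an illustrative Python version of the technique (a probationary "cold" list plus a protected "hot" list, so one-time scans cannot flush frequently used pages), not BangDB's actual buffer-pool code:

```python
from collections import OrderedDict

class TwoListLRU:
    """Illustrative two-list LRU: new pages enter a probationary 'cold'
    list, and a page touched again while cold is promoted to a protected
    'hot' list. This sketches the general technique only; it is not
    BangDB's actual buffer-pool implementation."""

    def __init__(self, capacity, hot_fraction=0.5):
        self.capacity = capacity
        self.hot_cap = max(1, int(capacity * hot_fraction))
        self.hot = OrderedDict()    # most recently used at the end
        self.cold = OrderedDict()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            value = self.cold.pop(key)      # second touch: promote
            self.hot[key] = value
            if len(self.hot) > self.hot_cap:
                k, v = self.hot.popitem(last=False)
                self.cold[k] = v            # demote coldest hot page
            return value
        return None

    def put(self, key, value):
        if self.get(key) is not None:
            self.hot[key] = value           # get() promoted it to hot
            return
        self.cold[key] = value              # first touch goes to cold
        while len(self.hot) + len(self.cold) > self.capacity:
            victim = self.cold if self.cold else self.hot
            victim.popitem(last=False)      # evict least recently used
```

A page that is scanned once lands in the cold list and is evicted first; a page referenced twice survives in the hot list, which is the temporal-locality benefit the list above refers to.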

Since LevelDB always writes data in sorted order (based on a user-defined comparator), I have used BTREE (and not HASH) for BDB and BangDB across all tests (though note that a B-tree organizes data differently than an LSM-tree does). Since LevelDB achieves its best write performance with a single thread and its best read performance with 4 threads (on a 4-core machine), I have used the best possible numbers for each of these dbs unless stated otherwise for a particular test.

For BangDB and BDB we have used 4 threads for both reads and writes. These are the active threads performing the writes and reads; each db may run additional background threads. For example, LevelDB uses a background thread for compaction, and BangDB uses background threads for log flushing, buffer-pool management, disk reads/writes, checkpointing, etc. I have used the same machine and similar db configurations (as far as possible) for all the dbs in the performance analysis.
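As a rough illustration of the harness shape described above (4 active worker threads doing random writes, then random reads), here is a hedged Python sketch. It drives a lock-guarded dict as a stand-in for any of the three databases; the function name and structure are mine, not from the actual test files:

```python
import os
import random
import string
import threading
import time

def run_workload(store, lock, n_ops, n_threads=4):
    """Drive n_ops random puts, then n_ops random gets, split across
    n_threads workers (4 by default, matching the tests above).
    'store' is a plain dict guarded by 'lock': a stand-in used only to
    show the harness shape, not a real database binding."""
    # 24-character random keys, matching the key size in the tests
    keys = [''.join(random.choices(string.ascii_letters, k=24))
            for _ in range(n_ops)]

    def writer(chunk):
        for k in chunk:
            value = os.urandom(random.randint(100, 400))  # 100-400B values
            with lock:
                store[k] = value

    def reader(chunk):
        for k in chunk:
            with lock:
                _ = store.get(k)

    step = (n_ops + n_threads - 1) // n_threads
    for phase in (writer, reader):
        threads = [threading.Thread(target=phase,
                                    args=(keys[i * step:(i + 1) * step],))
                   for i in range(n_threads)]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        elapsed = time.perf_counter() - start
        print(f'{phase.__name__}: {n_ops / elapsed:,.0f} ops/sec')
    return store
```

With a real embedded db, the dict and lock would be replaced by the db handle and its own concurrency control; the per-phase timing is what produces the ops/sec points plotted in the graphs.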

 

The following machine ($750 commodity hardware) was used for the tests:

  • Model: 4 CPU cores, Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 64-bit
  • CPU cache: 6MB
  • OS: Linux 3.2.0-32-generic, Ubuntu, x86_64
  • RAM: 8GB
  • Disk: 500GB, 7200 RPM, 16MB cache
  • File System: ext4

The following configuration is kept constant throughout the analysis (unless restated before a test):

  • Assertion: OFF
  • Compression for LevelDB: ON
  • Write and Read: Random
  • Write Ahead Log for BangDB and BDB: ON
  • Single Process and multiple threads
  • Transactions for BDB and BangDB: ON, with in-memory updates only; writes are not synchronous
  • Checkpointing, overflow to disk, log flush, write ahead logging: ON
  • Access method: Tree/Btree
  • Key Size: 24 bytes, random
  • Value Size: 100 - 400 bytes, random
  • Page Size: 8KB
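For concreteness, the key and value sizes fixed above can be generated as follows. The helper names are mine for illustration, not from any of the benchmarked libraries:

```python
import os
import random

def random_key():
    """24-byte random key, per the fixed configuration above."""
    return os.urandom(24)

def random_value():
    """Random value of 100-400 bytes, per the configuration above."""
    return os.urandom(random.randint(100, 400))

def workload_bytes(n_ops):
    """Rough raw-payload estimate: an average ~250-byte value plus a
    24-byte key means 100K operations carry about 27 MB of raw
    key+value data, consistent in order of magnitude with the ~50 MB
    per 100K operations (including per-record overhead) cited in the
    text."""
    return n_ops * (24 + 250)
```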

There are six test cases overall, described below. Before each test case, the configuration parameters specific to that test are listed. For most test cases, large cache sizes (around 1-2GB) are used, simply because most tests use a relatively large number of operations (in the range of a couple of million, and 100K operations here consume around 50MB including keys and values), and for many tests I needed to ensure that data doesn't overflow the buffer. For a few test cases (A and B) I have repeated the tests with a small cache as well (around 64MB) to show the numbers at that size. The other test cases can be repeated with small (or even smaller) cache sizes, and similar trends relative to the large-cache tests can be expected.

The reason for showing a graph for each test, rather than a single number, is to show how performance varies with the number of operations and other parameters. The typical operation counts run from thousands into the millions.
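A minimal sketch of how such a curve can be produced, assuming only a generic `put(key, value)` callable (pass a dict's `__setitem__` as a stand-in, or a real db binding's put method):

```python
import time

def throughput_curve(put, op_counts=(1_000, 10_000, 100_000)):
    """Measure ops/sec at several operation counts, mirroring how each
    graph plots performance against the number of operations rather
    than reporting a single headline number. 'put' is any callable
    taking (key, value); this is an illustrative harness, not the
    actual test code."""
    results = {}
    for n in op_counts:
        start = time.perf_counter()
        for i in range(n):
            # fixed-width 24-byte key, constant 200-byte value: enough
            # to show how ops/sec varies with the operation count
            put(str(i).zfill(24).encode(), b'x' * 200)
        results[n] = n / (time.perf_counter() - start)
    return results
```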

Random Write and Read using single thread

No overflow of data (out of memory), 100K - 2M operations; the DB configurations for the test are:

  • Write Buffer size for LevelDB: 512MB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB


The same test case with a smaller cache size is shown below:

  • Write Buffer size for LevelDB: 64MB
  • Cache size for all the dbs: 64MB

Random Write and Read using Multiple Threads

No overflow of data (out of memory), 100K - 2M operations; the DB configurations for the test are:

  • Write Buffer size for LevelDB: 512MB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB
  • LevelDB - 1 Thread for Write and 4 Threads for Read
  • 4 Threads for Read and Write for BDB and BangDB


The same test case with a smaller cache size is shown below:

  • Write Buffer size for LevelDB: 64MB
  • Cache size for all the dbs: 64MB

Random Write and Read of 3GB of data using 1.5GB Buffer Cache

50% overflow of data (50% to/from disk), 1M - 10M operations; the DB configurations for the test are:

  • Write Buffer size for LevelDB: 1GB
  • Cache size for all the dbs: 1.5GB
  • Log Buffer size for BDB and BangDB: 256MB
  • LevelDB - 1 Thread for Write and 4 Threads for Read
  • 4 Threads for Read and Write for BDB and BangDB


Random Write and Read vs the number of concurrent threads

No overflow of data (out of memory), 1M operations; the DB configurations for the test are:

  • Write Buffer size for LevelDB: 512MB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB


Random Write and Read for Large Values

No overflow of data (out of memory), 10K - 100K operations; the DB configurations for the test are:

  • Value size: 10,000 bytes
  • Write Buffer size for LevelDB: 1.5GB
  • Cache size for all the dbs: 2GB
  • Log Buffer size for BDB and BangDB: 512MB
  • Page Size: 32KB


Random Write and Read simultaneously in overlapped manner

No overflow of data (out of memory), 1M operations. The test first inserts around (100-x)% of the data, then writes the remaining x% while 4 concurrent threads simultaneously read the (100-x)% already present; the elapsed time is computed only for this later phase of x% writes overlapped with (100-x)% reads. If data is already present, the existing data is updated. The DB configurations for the test are:

  • Write Buffer size for LevelDB: 1GB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB
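The overlapped phase described above can be sketched as follows. This is an illustrative stand-in using a lock-guarded dict and 4 threads (2 writers, 2 readers), with timing restricted to the overlapped phase as in the test description; it is not the actual test harness:

```python
import threading
import time

def overlapped_phase(store, lock, keys, write_pct):
    """Times write_pct% writes overlapped with (100-write_pct)% reads
    using 4 threads (2 writers, 2 readers), after an untimed pre-load
    of the first (100-write_pct)% of the keys. 'store' is a dict
    guarded by 'lock', standing in for a real database."""
    n = len(keys)
    cut = n * (100 - write_pct) // 100
    for k in keys[:cut]:                    # pre-load phase (untimed)
        store[k] = b'old'

    def writer(chunk):
        for k in chunk:
            with lock:
                store[k] = b'new'           # insert (or update)

    def reader(chunk):
        for k in chunk:
            with lock:
                _ = store.get(k)

    wk, rk = keys[cut:], keys[:cut]         # fresh writes vs. reads
    threads = [
        threading.Thread(target=writer, args=(wk[:len(wk) // 2],)),
        threading.Thread(target=writer, args=(wk[len(wk) // 2:],)),
        threading.Thread(target=reader, args=(rk[:len(rk) // 2],)),
        threading.Thread(target=reader, args=(rk[len(rk) // 2:],)),
    ]
    start = time.perf_counter()             # time only the overlapped phase
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```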


 

Summary

I have covered only a few use cases in this performance analysis; for example, sequential reads and writes and batch operations are not covered. Changes to the various configurable parameters may also change the graphs completely. BDB and BangDB support hash-based access to the data as well, but it is not covered here because LevelDB stores data in sorted order only.

Looking at the amount of data written by the individual dbs, BangDB writes more data than the other two. BangDB doesn't use compression at the moment, so this could improve in future if compression is supported.

BDB write performance drops drastically when writing more than the allocated buffer. In the multi-threaded tests the dbs perform best (as expected) when the number of threads is close to the number of cores in the machine, with the exception of LevelDB writes, which work best with a single thread.

All the dbs are freely available for download, so anyone can cover more scenarios comparing these three or other such dbs. The test files for the performance analysis are available for download.


Reader Comments (20)

Unless BangDB also has compression and had it enabled for these benchmarks, the numbers are invalid. I suggest disabling Snappy in LevelDB and running them again.

November 29, 2012 | Unregistered CommenterBenjamin Black

Why are they invalid? I have used LevelDB extensively and I have yet to find a situation where disabling compression gives better performance. Compression gives smaller data, better memory utilization, and smaller I/O operations (although that is not much of an issue with these tiny DB sizes).

November 29, 2012 | Unregistered CommenterMorten

It is not an apples to apples comparison. If LevelDB with compression enabled gives better performance, that should be shown along with compression disabled. Benchmarking remains hard.

November 29, 2012 | Unregistered CommenterBenjamin Black

It would be more interesting to see how many bytes/sec you are able to achieve [raw disk I/O speed] when you write/read data. That would give some idea of which method uses disk I/O more efficiently (and converts random I/O to sequential I/O better).

November 29, 2012 | Unregistered CommenterDhruv

If you check the LevelDB benchmark at http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html, in section D the random-write numbers are lower when compression is disabled. Effectively, LevelDB with compression on gives better performance. Here too, LevelDB's better numbers with compression on are used. There are many scenarios to consider, but you have to stop somewhere.

November 29, 2012 | Unregistered CommenterMani

It does not make sense that the benchmark is performed over only 2M records. You should at least make the test corpus much larger than memory, say 100M records; without that guarantee, any benchmark is ridiculous.

November 29, 2012 | Unregistered CommenterYY

Benjamin - Switching compression ON actually improves LevelDB's perf a bit; apart from that I haven't seen any noticeable change in the pattern. My intention was to show the best numbers for each DB under the broad configurations, which were non-negotiable, and if I have to use LevelDB I would switch compression ON just to get better perf even if the others don't support compression. As you said, benchmarking remains hard, but as Mani mentioned above, somewhere we have to stop.

Dhruv - I agree with you, a test case could be written to gauge that solely. But in some sense when we overflow out of the buffer in one of the test cases above, we are essentially writing pages back to the disk. We know the amount of overflow hence roughly the number of bytes as well. Same in case of Read as we need to read so much data from the disk. But a separate test case could be useful.

YY - There is a test case where we go up to 10M operations and overflow 100%. Typically people benchmark with 100-byte values or less, and here we are using 400-500 bytes. The number of ops was restricted for the sake of time and the limited amount of memory on the machine. Someone can run the test for 100M if required, but with 4-5 GB of memory, randomly writing 50GB (100M * 500 bytes) takes time; BDB, for example, just doesn't respond in such cases.

November 29, 2012 | Unregistered Commentersachin

A 1M-operation test? You have tested the in-memory cache and the Linux page cache; your working data set fits nicely in RAM in ALL tests. There is not a single word on how on-disk data compaction and garbage collection are implemented. Latency matters as well: I am not interested in bulk throughput for in-memory data sets, but in how query latencies are distributed. LevelDB is known to have serious latency issues; what about BangDB? You want to impress the world with your new product? Please provide realistic benchmark results:

1) Data set sizes up to 20-30x available RAM (a couple hundred GB for your box).
2) Run tests for at least 1 hour.
3) Measure SUSTAINED throughput, not max values.
4) Calculate the query latency distribution, for example: max, 99.9%, 99%, mean.

November 30, 2012 | Unregistered CommenterVladimir Rodionov

What Vladimir said.

November 30, 2012 | Unregistered CommenterBenjamin Black

@Sachin: Posting raw I/O numbers would partially help level the playing field even if some databases have compression enabled; IMHO it is a more useful number to have when comparing. It would also be great if you could add TokuDB into the mix, since it is write-optimized.

November 30, 2012 | Unregistered CommenterDhruv

@Vladimir: Thanks for your suggestions. We will post values for what you have mentioned. Though not as extensive, we already have some data at http://www.iqlect.com/bangdb_embedded_stress_data.php where we have done 200M ops and wrote ~75GB using a 6GB buffer; it shows the variability of TPS over time (in seconds). For the blog, it was just a matter of aligning with how other dbs have posted their perf data, so we selected similar grounds; the grounds were the same for all the dbs compared. In future we will post data for different scenarios, including what you have suggested.

@Dhruv: Point taken; as I said, I could only cover this much in the blog, more will come later. We will include TokuDB in the mix; we have also received a request to include nessDB, which we will. Thanks for your points.

November 30, 2012 | Unregistered Commentersachin


Consistency of performance as the DB grows in size is definitely very important. However, it doesn't matter whether we achieve it by adding RAM or by moving to solid-state drives. Keeping in mind that hardware costs have nosedived in the last decade, it makes sense to consider new technology that can leverage such opportunities. We all know the C language is very efficient and computationally economical, but do we really worry about that today, when computing power is so cheap? We have welcomed high-level languages to speed up development. I hope BangDB is filling that niche in the small-database space.

December 1, 2012 | Unregistered CommenterJyoti

Fully agreed with previous posters, the benchmark is fairly incomplete. I am looking forward to an amended version.

December 2, 2012 | Unregistered CommenterFooBar

Errata: the right name is Berkeley DB, not Berkley DB as it appears in the title and in the body of the article.

December 3, 2012 | Unregistered CommenterOscar

Hey Sachin, I can't demonstrate how clever I, too, am on the subject, as I don't have a database to my credit! Just wanted to say: great article, much appreciated - Cheers! Satish :-)

January 18, 2013 | Unregistered CommenterSatish

When will the source code be available, and will it work on FreeBSD?
Many thanks!

February 5, 2013 | Unregistered CommenterKenji Chan

Please check out the one-billion-operation perf test for BangDB at http://www.iqlect.com/bangdb_embedded_one_billion.php
This is also in line with what @Vladimir requested.

@Kenji: We will announce source code availability by the end of the year. The DB has not been tested on FreeBSD, but we plan to do so and release accordingly soon.

Also, BangDB is now available natively on Windows, and support for C# and Java has been released at http://www.iqlect.com/download.php

April 27, 2013 | Unregistered Commentersachin

BangDB shows great potential. I have tested it thoroughly (native Windows C++ only) and have some feedback. Primarily, I do not think (as of writing, 1/8/2014) that it is ready for prime time. Notably:
1) Any sort of inconsistency in the database files causes the DLL library functions to hang.
2) Stopping a program that uses the database without manually closing tables/connections/the database WILL cause database inconsistency, so (1) above is a constant problem.
3) No source for the library itself has been provided, so although it is promised under a BSD license, there is at present no way to fix bugs or work collaboratively.
4) The library DLL depends on functionality provided in cpp/h files shipped with the test harness. It is (IMO) undesirable to (a) split functionality the DLL depends on into support cpp files, and (b) have to add extra cpp files to your vcxproj. All interdependent functionality should be self-contained in one DLL and its associated header files.

I look forward to seeing this project develop as it shows great promise. It is very simple to use (although some STL container support would be welcomed), and has excellent performance.

January 8, 2014 | Unregistered CommenterPaul

How does it compare to MonetDB or Aerospike?

September 19, 2014 | Unregistered Commenterjuan
