In-Memory Computing at Aerospike Scale: When to Choose and How to Effectively Use JEMalloc
This is a guest post by Psi Mankoski, Principal Engineer, Aerospike.
When a customer's business really starts gaining traction and their web traffic ramps up in production, they know to expect increased server resource load. But what do you do when memory usage keeps growing beyond all expectations? Have you found a memory leak in the server? Or is memory being lost to fragmentation? While you may be able to throw hardware at the problem for a while, DRAM is expensive, and real machines have finite capacity. At Aerospike, we have encountered these scenarios along with our customers as they continue to press through the frontiers of high scalability.
In the summer of 2013 we faced exactly this problem: big-memory (192 GB RAM) server nodes were running out of memory and crashing within days of being restarted. We wrote an innovative memory accounting instrumentation package, ASMalloc [13], which revealed there was no discernible memory leak. We were being bitten by fragmentation.
This article focuses specifically on the techniques we developed for combating memory fragmentation, first by understanding the problem, then by choosing the best dynamic memory allocator for the problem, and finally by strategically integrating the allocator into our database server codebase to take best advantage of the disparate life-cycles of transient and persistent data objects in a heavily multi-threaded environment. For the benefit of the community, we are sharing our findings in this article, and the relevant source code is available in the Aerospike server open source GitHub repo. [12]
Executive Summary
Memory fragmentation can severely limit scalability and stability by wasting precious RAM and causing server node failures.
Aerospike evaluated memory allocators for its in-memory database use-case and chose the open source JEMalloc dynamic memory allocator.
Effective allocator integration must consider memory object life-cycle and purpose.
Aerospike optimized memory utilization by using JEMalloc extensions to create and manage per-thread (private) and per-namespace (shared) memory arenas.
Using these techniques, Aerospike saw a substantial reduction in fragmentation, and the production systems have been running non-stop for over 1.5 years.
Introduction
Aerospike prides itself on creating systems that "Just Work." As a foundational component of Internet-scale, in-memory computing systems, the Aerospike database must deliver consistent performance under extreme load conditions of 1 million transactions per second per node and just keep working. Moreover, Aerospike must continue to perform 24x7 without surprising the system operator or requiring much manual intervention. Year-long uptimes on individual server nodes are not uncommon.
The Graphite image above shows the system free memory (vertical axis = % free memory) on the individual nodes of an actual 14-node Aerospike database cluster in production over a 4-month period (horizontal axis = time), both before and after implementing the techniques discussed in this paper. Each server node had an 8-core Intel Xeon E5-2690 CPU @ 2.90GHz running the CentOS 6 GNU/Linux operating system and was provisioned with 192 GB of RAM. The steep downward curves on the left-hand side, the "before" case, show the inexorable trend of memory exhaustion, which was shown to be due to main memory fragmentation. After selecting a better dynamic memory allocator and changing the server to use the allocator effectively in a multi-threaded context, the system memory use per node quickly "flatlines" on the right side (with some residual "bouncing" as additional memory is transiently allocated and given back to the system.)
The Challenge of In-Memory Computing
In-memory computing, using both Flash and DRAM, has the potential to deliver the highest possible performance, in contrast with systems relying solely on secondary storage, such as rotational drives. The high cost per gigabyte of DRAM relative to secondary storage means in-memory computing systems must use every byte effectively and efficiently to realize their full potential in terms of consistently-high TPS and consistently-low latency, while minimizing TCO.
The cornerstone of in-memory computing is therefore advanced memory management. Traditionally there are two approaches to memory management: automatic and manual. Understanding the characteristics and limitations of these alternatives is crucial for creating world-class in-memory computing systems.
Aerospike's in-memory computing solution effectively leverages system resources by keeping the index packed in RAM and by using a highly-concurrent architecture that matches transaction and storage threads to maximize Flash IOPS.
Automatic Memory Management
Automatic storage management, generally embodied in a Garbage Collector (GC), is used by most modern dynamic languages, such as Java. In these languages, the run-time system attempts to provide the illusion of unlimited storage by automatically cleaning up unreferenced data structures and retiring them to the free store, ideally compacting as it goes. A major downside of automatic methods, however, tends to be the occasional necessity for the GC to kick in at unexpected times. The result is unpredictable response times. Operations staff and developers often end up expending effort manually tuning the GC system, defeating the purpose of an automatic system. While much research and system development has gone into optimizing GC-based systems, the resulting variability is still not acceptable for a database required to have consistently predictable behavior at high scale.
Manual Memory Management
Under the manual approach to memory management, the individual program - rather than the language or run-time system - handles all memory allocation. Manual memory management is typified by the "malloc()"/"free()" interface of the C standard library. Accordingly, the programmer must explicitly allocate and release each data object. Failure to release unneeded memory creates a memory leak that is eventually fatal to a long-lived server process. On the path toward its final destruction, the process may first start swapping to secondary storage as physical RAM is exhausted, resulting in increasingly sluggish behavior. Finally, in the Linux environment, the dreaded OOM Killer may terminate the process with extreme prejudice. At this point, the program must be restarted afresh, but even if this is handled automatically, the performance of the system will most likely already have been degraded, and data may even have been lost.
Fragmentation: Internal and External
Even if there are no actual memory leaks, the above dire scenario is still quite possible due to the effects of memory fragmentation. There are two common types of fragmentation: internal and external. Internal fragmentation is due to the inherent granularity of the memory allocator. If the program requests an odd number of bytes, the allocator will most likely provide a chunk of memory rounded up to the next multiple of the machine's word size, e.g., 8 or 16 bytes on a 64-bit architecture. (See Figure 1.) The actual cost of internal fragmentation is usually not too high, since it generally brings with it the advantages of providing nicely-tiled segments of memory that work well with buddy system allocators [1, p. 442] (more on that later) and also of automatically cache-aligning objects in memory for efficient loading into the CPU.
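As a rough illustration, the glibc extension "malloc_usable_size()" reports how much memory the allocator actually reserved for a request; the gap between the requested and usable sizes is internal fragmentation. This is a minimal sketch, and the exact sizes depend on the allocator and platform:

```c
#include <malloc.h>   /* malloc_usable_size() (glibc extension) */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Request "odd" numbers of bytes and see what the allocator
     * actually hands back; the difference is internal fragmentation. */
    for (size_t req = 1; req <= 81; req *= 3) {
        void *p = malloc(req);
        printf("requested %2zu bytes, usable size %3zu bytes\n",
               req, malloc_usable_size(p));
        free(p);
    }
    return 0;
}
```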
External fragmentation arises when the pattern of object allocation and deallocation leads to an irregular "Swiss cheese"-like structure in memory, where the allocator is unable to re-use the holes created by freed objects when they lie between objects that are still alive. If a new allocation request is made that is larger than any existing contiguous, allocated-but-currently-unused "hole," then the allocator must resort to requesting additional memory from the system rather than re-using its free space. (See Figure 2.) Various strategies, such as memory "handles" (a pointer to a pointer) and compacting garbage collectors, have been used to address external fragmentation, but each of these approaches either complicates the application or suffers the allocation latency of automatic memory management.
Finally, in multi-threaded systems, there is a third possible type of fragmentation, in which the memory allocator cannot reclaim unused memory because memory is allocated by one set of threads and freed by another. Addressing this source of fragmentation is one of the main points of this paper.
Selecting a Memory Allocator
The standard approach for application programming is to delegate the main memory management responsibility to one of the various available dynamic memory allocator libraries. The allocator library, in turn, uses operating system calls to reserve and release ranges of pages of virtual memory from the kernel. In addition, server programs like Aerospike implement various special-purpose sub-allocators, such as slab allocators, to handle different object types within the server application. The selection of a memory allocator library, however, is especially critical, since it lives between the application and the operating system and therefore shoulders a major portion of the burden in in-memory computing systems.
On the GNU/Linux platform, the GNU C Standard Library ("GLibC") [7] is the default choice for the main memory allocator for C programs. This allocator conforms to the standard C "malloc()" / "free()" interface and works well for a wide variety of use cases. The actual memory allocator used by GLibC today (1Q2015) is called "PTMalloc2" [9, 10], which is the second "Per-Thread" modification to the well-respected Doug Lea allocator [8] (known as "dlmalloc".)
While PTMalloc2 is easy to use with the GNU Compiler Collection (GCC) toolchain, which includes the GCC C compiler used by Aerospike for the X86_64 platform across the major Linux distributions (CentOS, Ubuntu, Debian), we encountered various difficulties with it when building highly-scalable systems. The first area of difficulty was that when things began to go wrong, there were few tools that could be used to analyze memory usage under GLibC and get to the root of the trouble.
There are some memory debugging features in GLibC, but they are either too indiscriminate in how they let you introspect the memory use of your program (e.g., the "mtrace" feature), do not properly account for all of the memory used by the allocator (the "mallinfo" feature), or are not thread-safe and lead to program crashes when used. This makes them unusable at high scale, which is frequently when memory allocator problems first begin to manifest.
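For instance, the "mallinfo()" interface illustrates the accounting problem: its counters are plain C "int"s, so on a 64-bit server managing more than 2 GB of heap they silently wrap. A minimal glibc sketch:

```c
#include <malloc.h>   /* mallinfo() (glibc extension) */
#include <stdio.h>

int main(void) {
    struct mallinfo mi = mallinfo();
    /* All fields are plain "int": once the heap exceeds 2 GB these
     * counters wrap, so they cannot account for all allocator memory
     * at the scale where problems typically begin to manifest. */
    printf("allocated (uordblks): %d bytes\n", mi.uordblks);
    printf("free      (fordblks): %d bytes\n", mi.fordblks);
    return 0;
}
```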
Choosing JEMalloc
In 2003, Jason Evans (now of Facebook and HHVM fame) created a new allocator originally for the FreeBSD operating system [2]. His allocator, "JEMalloc" [3, 4, 5, 6], is a completely new implementation of a modern C memory allocator library meeting the standard C library interface and also providing additional non-standard features for debuggability and controllability. JEMalloc, like most other modern multithread-friendly dynamic memory allocators, supports multiple, independent arenas. (Note that the term "heap" is sometimes used for what JEMalloc calls an arena. "Heap" has an unfortunate collision with the name of an important family of efficient data structures related to priority queues [1, p. 435], which term apparently pre-dates the memory allocator usage, so we will use the term "arena" throughout this paper.) Unlike many other allocators, JEMalloc also provides an extended API [4] permitting, among other things, creating and switching a thread into an arena. (Using these extended APIs will be discussed in the next section.)
Each JEMalloc arena (using the default build settings) comprises three distinct allocation size class categories: "small" (8 bytes up to 3,584 bytes), "large" (4 KB up to just under 4 MB), and "huge" (over 4 MB.) (See Figure 3.) An arena segregates objects of the same size [3] by allocating them in contiguous runs of pages. Internal fragmentation is minimized by using the nearest size class into which an allocation request fits. For example, the "small" class has 8-byte, 16-byte, 24-byte, …, chunks, while "large" has 4 KB, 8 KB, 12 KB, …, chunks. The "small" and "large" classes differ in how chunk metadata is handled: shared for "small," independent for "large." Red-black trees are used to track the pages used for each of these allocation granularities. The "huge" size class uses "mmap()" to allocate and deallocate bigger chunks of memory.
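This rounding can be observed directly. In recent JEMalloc versions, the non-standard "nallocx()" entry point reports the size class a request would map to, without performing an allocation. A small sketch (builds configured with a function prefix expose this as "je_nallocx()"):

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>  /* assumes a non-prefixed jemalloc build */

int main(void) {
    /* nallocx() returns the real size jemalloc would use for each
     * request, i.e., the size class it rounds up to. */
    size_t requests[] = { 1, 9, 100, 3000, 3585, 5000000 };
    for (size_t i = 0; i < sizeof(requests) / sizeof(requests[0]); i++) {
        printf("request %7zu -> size class %7zu\n",
               requests[i], nallocx(requests[i], 0));
    }
    return 0;
}
```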
To begin experimenting with JEMalloc, the dynamic shared library, "libjemalloc.so", may be pre-loaded via the linker using the "LD_PRELOAD" environment variable. This method is the easiest for doing an initial integration with JEMalloc. In our production system, however, we have chosen to statically link the JEMalloc library into our Aerospike server executable. Static linking removes some flexibility but provides the guarantee that the version of the library that has been qualified via our SQA process is the same as what is running in production. (We have, however, left the capability of dynamic linking with JEMalloc as a separate option in our build system so we can retain those additional features for debugging, instrumenting, etc.)
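Either way, it is worth verifying at run time that JEMalloc really is the active allocator. One way, sketched below for a build without a function prefix, is to read JEMalloc's "version" string through its "mallctl()" control interface; the call succeeds only if the library is linked in or preloaded (e.g., run with LD_PRELOAD=/path/to/libjemalloc.so):

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    /* Read jemalloc's "version" string via the mallctl() interface. */
    const char *version;
    size_t len = sizeof(version);
    if (mallctl("version", &version, &len, NULL, 0) == 0) {
        printf("jemalloc %s is active\n", version);
    } else {
        printf("jemalloc mallctl unavailable\n");
    }
    return 0;
}
```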
Debuggability -- How to See What Your Server is Doing with Your Memory
A specific debuggability feature of JEMalloc we are using in the Aerospike server is the ability to request an ASCII log of all the memory currently managed by JEMalloc. This includes the global allocator options, as well as a roll-up of the statistics across all of the memory regions known as arenas. In the Aerospike server, the Info command "jem-stats:" causes the server to log the JEMalloc statistics to the file "/tmp/aerospike-console.<PID>". We have also created tools that parse key elements of this log output and answer questions such as which arenas are the largest, which have grown the most over time, and what the efficiency (the opposite of fragmentation) is in selected arenas. (Future work includes graphical display of these detailed system statistics in a more human-comprehensible form.) Two excellent properties of JEMalloc's detailed statistics output are, first, that it is comprehensive (and, as far as we can tell, correct), and second, that there is no noticeable impact on system performance from dumping this log information with modest frequency.
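Under the hood, such a dump can be produced with JEMalloc's "malloc_stats_print()" entry point. The sketch below is illustrative rather than the actual server code; it appends the full report, global options plus per-arena roll-ups, to a given file, much as the "jem-stats:" Info command does:

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Callback invoked repeatedly by jemalloc with chunks of the report. */
static void stats_write_cb(void *opaque, const char *msg) {
    fputs(msg, (FILE *)opaque);
}

/* Append the full jemalloc statistics report to the given file. */
static void dump_jemalloc_stats(const char *path) {
    FILE *f = fopen(path, "a");
    if (f != NULL) {
        malloc_stats_print(stats_write_cb, f, NULL); /* NULL opts = full report */
        fclose(f);
    }
}
```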
Data Locality and Multi-Thread Concerns
The next feature of JEMalloc critically important for controlling memory usage is the ability to set the arena to be used by the current thread for dynamic memory allocation. In today's multi-core, multi-CPU computers, multi-threaded programming is generally used to achieve high scalability. The Aerospike database is no exception, and it is carefully architected to use threading to best advantage. While a single-threaded design has the potential to be highly performant, in reality there are good reasons for sharing resources across threads to make the best use of limited resources, e.g., to avoid replicating all data for all threads. In addition, an immediate-consistency guarantee by its very nature requires data objects to be synchronized appropriately across threads handling concurrent client requests. Therefore, while Aerospike does its best to stay single-threaded on a transaction as long as it can, there are certain major data structures that must be accessed by multiple threads. In some cases, such as system start-up, these threads handle loading persisted data, then exit cleanly to allow the transaction threads to "play through."
In modern memory allocators, to achieve high performance in multi-threaded programs, there are usually a fixed number of arenas created at start-up time (how many is heuristically determined by the number of CPU cores) so threads should rarely need to wait to allocate memory. (GLibC and JEMalloc, as well as Google's TCMalloc [11], all essentially operate this way.) And a thread is generally "sticky" to the last arena it used (it was available last time, so it might be available next time as well.) In addition, there is generally a provision for handling allocations up to a certain size on a smaller arena that is private to the thread itself. For certain transaction-based application patterns, such as web servers, this approach works quite well, and is much more efficient than having the threads share arenas.
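With JEMalloc, the arena pool is visible through the "mallctl()" interface. A quick sketch for inspecting how many arenas the allocator has created (historically, JEMalloc's default maximum is four arenas per CPU core):

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    /* "arenas.narenas" reports how many arenas currently exist. */
    unsigned narenas;
    size_t sz = sizeof(narenas);
    if (mallctl("arenas.narenas", &narenas, &sz, NULL, 0) == 0) {
        printf("jemalloc is managing %u arenas\n", narenas);
    }
    return 0;
}
```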
In the Aerospike server, however, the patterns of access are more complex. The thread cache (remember the "PT" in GLibC and "TC" in Google's allocator?) feature is great for allocating relatively small, short-lived, temporary data objects on a per-thread basis. The switch to per-thread allocation provided a big speed improvement for most multi-threaded programs by eliminating the need for locks. This works well with a purely-transactional model where the data is allocated and relatively quickly freed on the same thread.
The main database object store, however, follows entirely different allocation patterns. These objects are usually persistent, potentially much larger, and may be accessed over time by many different threads (and sometimes are even accessed concurrently.) Therefore, the best pattern we have found is for objects with the same general characteristics (i.e., those within the same Aerospike namespace, which roughly corresponds to a database (containing an ensemble of tables) in a standard RDBMS) to be stored in the same arena. The thread-level arena-setting feature of JEMalloc [4] is used at key points in the Aerospike transaction life-cycle to switch the thread into (and back out of) the desired per-namespace arena. This preserves an important type of locality: when an object is modified, if it is still the same size, it can remain in the same place in the arena. If the object gains more data and therefore expands, it will be written back to the same arena, freeing up its old storage for re-use in place. And finally, when the object is deleted, the memory it formerly occupied becomes available for the next similar object requiring that amount of space in the same namespace. (See Figure 4.)
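The sketch below shows what such per-namespace arena plumbing can look like, in the spirit of (but not identical to) the thin wrappers in "cf/src/jem.c"; the function names are illustrative. (JEMalloc 3.x creates an arena via the "arenas.extend" mallctl; jemalloc 4+ renamed it "arenas.create".)

```c
#include <jemalloc/jemalloc.h>

/* Create a new shared arena, e.g., one per namespace at startup.
 * Returns the new arena index, or -1 on failure. */
static int ns_arena_create(void) {
    unsigned arena;
    size_t sz = sizeof(arena);
    if (mallctl("arenas.extend", &arena, &sz, NULL, 0) != 0) {
        return -1;
    }
    return (int)arena;
}

/* Bind the calling thread to the given arena, so its subsequent
 * malloc() calls are served from that arena. Returns the previous
 * arena index so the caller can switch back afterward. */
static int ns_arena_set(unsigned arena) {
    unsigned prev;
    size_t sz = sizeof(prev);
    if (mallctl("thread.arena", &prev, &sz, &arena, sizeof(arena)) != 0) {
        return -1;
    }
    return (int)prev;
}
```

A transaction thread would switch into the namespace's arena just before allocating or resizing a persistent object and restore its previous arena immediately afterward, so that transient per-transaction allocations stay out of the shared arena.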
Note that the above applies to Aerospike "data in memory" namespaces, which are one major use case. In another common use case, only the object metadata is stored in memory, and the object's actual data is stored on secondary storage. In this case, arena switching is not as critical for preventing fragmentation, since the fixed-size metadata is stored compactly using a custom sub-allocator. Storing widely variable-sized object data is the root cause of fragmentation, whether in main memory or on secondary storage; Aerospike's flash memory defragmentation subsystem handles the latter case.
For more details on integrating with JEMalloc, refer to the files "cf/src/jem.c" and "cf/include/jem.h" in the Aerospike Database Server open source Git repository [12], which provide the minimal, thin APIs used on top of JEMalloc's API to control object placement in shared arenas.
Conclusions
Aerospike has developed extreme-performance, in-memory computing technology suitable for a wide range of use cases. By carefully considering the patterns of object memory use and matching memory management strategies to disparate object life-cycles, we have achieved consistently high performance and stability.
Integrating the JEMalloc memory allocator library was a major factor in building performance into our system. Beyond simply relying on the allocator to be internally efficient, we have used a few key extensions of JEMalloc over the standard C library memory allocation interface to direct the library to store classes of data objects according to their characteristics. Specifically, by grouping data objects related by the major category of namespace in the same arena, the long-term pattern of object creation, access, modification, and deletion is optimized and fragmentation is minimized. Finally, the built-in instrumentation capability of JEMalloc is used to monitor the number, distribution, and packing efficiency of data objects in the precious DRAM resource of our in-memory computing nodes. All of these features make the Aerospike database the most reliable, efficient, and trustworthy foundation for Internet-scale systems, and it "Just Works."
References
[1] Knuth, Donald E., "The Art of Computer Programming, Volume 1 (3rd ed.): Fundamental Algorithms", 1997.
[2] Evans, Jason, JEMalloc paper: "A Scalable Concurrent malloc(3) Implementation for FreeBSD", 2006.
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
[3] Evans, Jason, JEMalloc Facebook paper: "Scalable memory allocation using jemalloc", 2011.
[4] JEMalloc "man" page:
http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html
[5] JEMalloc Wiki:
https://github.com/jemalloc/jemalloc/wiki
[6] JEMalloc Git Repository:
https://github.com/jemalloc/jemalloc
[Points to: http://www.canonware.com/jemalloc/ ]
[7] Free Software Foundation, Inc., "The GNU C Library (glibc)", 2015.
http://www.gnu.org/software/libc/libc.html
[8] Lea, Doug, "A Memory Allocator", 1996.
http://gee.cs.oswego.edu/dl/html/malloc.html
[9] Gloger, Wolfram, "Wolfram Gloger's malloc homepage", 2006.
[10] Douglas, Niall, "ptmalloc2 homepage", 2013.
http://www.nedprod.com/programs/Win32/ptmalloc2/
[11] Ghemawat, Sanjay; Menage, Paul, "TCMalloc : Thread-Caching Malloc", 2005.
http://goog-perftools.sourceforge.net/doc/tcmalloc.html
[12] The Aerospike Database Server, Open Source Git Repository, 2015.
https://github.com/aerospike/aerospike-server
[13] ASmalloc: Memory Allocation Tracking, Open Source Git Repository, 2015.
https://github.com/aerospike/asmalloc