Monday, September 10, 2012

Russ’ 10 Ingredient Recipe for Making 1 Million TPS on $5K Hardware

My name is Russell Sullivan, I am the author of AlchemyDB: a highly flexible NoSQL/SQL/DocumentStore/GraphDB-datastore built on top of redis. I have spent the last several years trying to find a way to sanely house multiple datastore-genres under one roof while (almost paradoxically) pushing performance to its limits.

I recently joined the NoSQL company Aerospike (formerly Citrusleaf) with the goal of incrementally grafting AlchemyDB’s flexible data-modeling capabilities onto Aerospike’s high-velocity horizontally-scalable key-value data-fabric. We recently completed a peak-performance TPS optimization project: starting at 200K TPS, pushing to the recent community edition launch at 500K TPS, and finally arriving at our 2012 goal: 1M TPS on $5K hardware.

Getting to one million over-the-wire client-server database requests per second on a single machine costing $5K is a balance between trimming overhead on many axes and using a shared-nothing architecture to isolate the paths taken by unique requests.

Even if you aren't building a database server, the techniques described in this post might still be interesting: they are not database-server specific and could be applied to an FTP server, a static web server, or even a dynamic web server.

Here is my personal recipe for getting to this TPS per dollar.

The Hardware

Hardware is important, but pretty cheap at 200 TPS per dollar spent:

  1. Dual Socket Intel motherboard
  2. 2*Intel X5690 Hexacore @3.47GHz
  3. 32GB DRAM 1333
  4. 2 NIC ports of an Intel quad-port NIC (each NIC has 8 queues)

Select the Right Ingredients

The architecture, software, and OS ingredients used to reach optimal peak performance rely on combining and tweaking ALL of them together to hit the sweet spot: a VERY stable 1M database-read-requests per second over the wire.

It is difficult to quantify the importance of each ingredient, but in general they are listed in descending order of importance.

Select the Right Architecture

First, it is imperative to start out with the right architecture: both vertical and horizontal scalability (which are essential for peak performance on modern hardware) flow directly from architectural decisions:

1. 100% shared nothing architecture. This is what allows you to parallelize/isolate. Without this, you are eventually screwed when it comes to scaling.

2. 100% in-memory workload. Don’t even think about hitting disk for even 0.0001% of these requests. SSDs are better than HDDs, but nothing beats DRAM per dollar for this type of workload.

3. Data lookups should be dead-simple, i.e.:

  1. Get packet from event loop (event-driven)
  2. Parse action
  3. Lookup data in memory (this is fast enough to happen in-thread)
  4. Form response packet
  5. Send packet back via non-blocking call

4. Data-Isolation. The previous lookup is lockless and requires no hand-off from thread-to-thread: this is where a shared-nothing architecture helps you out. You can determine which core on which machine a piece of data will be written-to/served-from and the client can map a tcp-port to this core and all lookups go straight to the data. The operating system will provide the multi-threading & concurrency for your system.

Select the Right OS, Programming Language, and Libraries

Next, make sure your operating system, programming language, and libraries are the ones proven to perform: 

5. Modern Linux kernel. Anything less than CentOS 6.3 (kernel 2.6.32) has serious problems w/ software interrupts. This is also the space where we can expect a 2X improvement in the near future; the Linux kernel is currently being upgraded to improve multi-core efficiency.

6. The C language. Java may be fast, but not as fast as C, and more importantly: Java is less in your control and control is the only path to peak performance. The unknowns of garbage collection frustrate any and all attempts to attain peak performance.

7. Epoll. Event-driven/non-blocking I/O, single threaded event loop for high-speed code paths.

Tweak and Taste Until Everything is Just Right

Finally, use the features of the system you have designed. Tweak the Hardware & OS to isolate performance critical paths:

8. Thread-Core-Pinning. Event-loop threads reading and writing tcp packets should each be pinned to their own core, and no other threads should be allowed on those cores. These threads are so critical to performance that any context switching on their designated cores will significantly degrade peak performance.

9. IRQ affinity from the NIC. Spread IRQ affinity so that soft interrupts (generated by tcp packets) don't all bottleneck on a single core. The methodology differs depending on the number of cores you have:

  1. For QuadCore CPUs: round-robin the IRQ affinity of the NIC's queues across the network-facing event-loop threads (e.g. with 8 queues, map 2 queues to each core)
  2. On Hexacore (and greater) CPUs: reserve 1+ cores to do nothing but IRQ processing (i.e. send IRQs to these cores and don't let any other thread run on them) and use ALL other cores for network-facing event-loop threads (similarly running without competition on their own designated cores). The core receiving the IRQ then signals the recipient core, and the packet has a near-100% chance of being in L3 cache, so the transport of the packet from core to core is near optimal.

10. CPU-Socket-Isolation via PhysicalNIC/PhysicalCPU pairing. Multiple CPU sockets holding multiple CPUs should be used like multiple machines. Avoid inter-CPU communication; it is dog-slow when compared to communication between cores on the same CPU die. Pairing a physical NIC port to a PhysicalCPU is a simple means to attain this goal and can be achieved in 2 steps:

  1. Use IRQ affinity from this physical NIC port to the cores on its designated PhysicalCPU
  2. Configure IP routing on each physical NIC port (interface) so packets are sent from its designated CPU back to the same interface (instead of to the default interface)

This technique isolates CPU/NIC pairs; when the client respects this, a Dual-CPU-socket machine works like 2 single-CPU-socket machines (at a much lower TCO).

That is it. The 10 ingredients are fairly straightforward, but putting them all together, and making your system really hum, turns out to be a pretty difficult balancing act in practice. The basic philosophy is to isolate on all axes.

The Proof is Always in the Pudding

Any 10-step recipe is best illustrated via an example: the client knows (via multiple hashings) that dataX presently lives on core8 of ipY, and core8 has a predefined mapping to ipY:portZ.

The connection from the client to ipY:portZ has previously been created, the request goes from the client to ipY:(NIC2):portZ.

  • NIC2 sends all of its IRQs to CPU2, where the packet gets to core8 w/ minimal hardware/OS overhead.  
  • The packet creates an event, which triggers a dedicated thread that runs w/o competition on core8.
  • The packet is parsed; the operation is to look up dataX, which will be in its local NUMA memory pool.
  • DataX is retrieved from local memory, which is a fast enough operation to not benefit from context switching.
  • The thread then replies with a non-blocking packet that goes back through only cores on the local CPU2 and out via NIC2 (the NIC that directs ALL of its IRQs to CPU2).

Everything is isolated and nothing collides (e.g. w/ NIC1/CPU1). Software interrupts are handled locally on a CPU. IRQ affinity ensures software interrupts don’t bottleneck on a single core and that they go to and come from their designated NIC. Core-to-core communication happens ONLY within the CPU die. There are no unnecessary context switches on performance-critical code paths. TCP packets are processed as events by a single thread running dedicated on its own core. Data is looked up in the local memory pool. This isolated path is the closest software path to what actually physically happens in a computer and the key to attaining peak performance.

At Aerospike, I knew I had it right when I watched the output of the “top” command (viewing all cores) and saw near-zero idle CPU and a very uniform balance across cores. Each core had exactly the same signature, something like: us%39 sy%35 id%0 wa%0 si%22.

Which is to say: software interrupts from tcp packets were using 22% of each core, system time spent passing tcp packets back and forth to the operating system was taking up 35%, and our software was taking up 39% to do the database transaction.

When this perfect balance across cores was achieved, performance was optimal from an architectural standpoint. We can still streamline our software, but at least the flow of packets to and from Aerospike is near optimal.

Data is Served

Those are my 10 ingredients that got Aerospike’s server to one million over-the-wire database requests per second on a $5K commodity machine. Mixed correctly, they not only give you incredible raw speed, they also give you stability, predictability, and over-provisioning for spikes at lower speeds. Enjoy!


Reader Comments (14)

Maybe I'm naive, but I have to ask:

For RAM DBs like that, how do you avoid losing data when the server crashes or goes down for some reason? I didn't see anything about replicas.

September 10, 2012 | Unregistered CommenterSony Santos

>> For RAM DBs like that, how do you avoid losing data when the server crashes or goes down for some reason? I didn't see anything about replicas.

This is a very relevant question. Any RAM DB needs to synchronously replicate data to ensure durability in case of a server crash. This is commonly referred to as K-safety. Aerospike supports synchronous replication with a default replication factor of 2. Different read/write loads are reported here

Since tcp packet marshaling is the major bottleneck in this type of performance benchmark, writes with synchronous replication do not perform as well as reads. As a (very) general rule, write operations with synchronous replication have half the throughput (double the packets) and twice the latency (double the hops) of read operations.

September 10, 2012 | Unregistered CommenterRussell Sullivan

Several in memory databases (VoltDB, MemSQL, Redis) support replication and disk based persistence. Aerospike supports replication and disk based persistence according to their docs. They also support larger than memory working sets where indexes fit in memory.

The huge speed gain for an in memory database is mostly from eliminating random reads that cause all kinds of issues when you want to do hundreds of thousands of queries/sec per node. There is nothing particularly slow about persistence, especially when you are logging asynchronously to query execution.

Flash mitigates the random read issue to a certain extent but doesn't solve the concurrency issues (exacerbated by transaction support) created by disk based data structures that typically assume that they will store data sets that are larger than memory.

September 10, 2012 | Unregistered CommenterAriel Weisberg

Aerospike has two different configuration modes. The first is the 100% in-memory mode used to post the 1M TPS on $5K hardware. The second is an SSD-optimized mode where indexes are kept in memory and data is kept on SSD. SSDs are very strong at random reads, but writing to them while maintaining peak performance (especially as the months go by) requires specially optimized software.
Aerospike has invested a huge amount of effort in optimizations focused on writing to SSDs and maintaining consistent read/write speeds over time.
SSD benchmark speeds can be found in the second box, here. This topic has tons of technical details and definitely warrants its own blog post, so I will get on that :)

September 10, 2012 | Unregistered CommenterRussell Sullivan

Thank you. Very interesting reading. CitrusLeaf does not support range queries?

September 11, 2012 | Unregistered CommenterVladimir Rodionov

Everything I see online says that IRQ affinity to a group of CPUs *hurts* performance, but this article implies the opposite. Having all CPUs perform IRQ processing seems to be the recommendation.

September 11, 2012 | Unregistered CommenterJeremy Wilson

>> Thank you. Very interesting reading. CitrusLeaf does not support range queries?

Aerospike (formerly Citrusleaf) has been a distributed highly available key-value store running in production since 2009.

We have recently built in distributed secondary indexes and are looking for beta customers this fall (2012). The secondary indexes support range queries, compound indexes, and simple aggregation operators.

These new features are all part of Aerospike's acquisition of AlchemyDB.

September 11, 2012 | Unregistered CommenterRussell Sullivan

>> Everything I see online says that IRQ affinity to a group of CPUs *hurts* performance, but this article implies the opposite. Having all CPUs perform IRQ processing seems to be the recommendation.

Yes and no.

With quadcores, we found having all cores do IRQ processing yielded optimal performance. With hexacores (plus), we found that leaving 1 (or more) core(s) free (meaning no processes running on them {via taskset}) to do IRQ processing yielded optimal performance.

Both results surprised me in different ways. The quadcore simply couldn't spare an entire core to ONLY do IRQ processing w/o suffering serious performance degradation. The hexacore(plus) findings point to a very clever way to leverage more cores in peak performance benchmarks of this sort.

Needless to say, we tried every possible configuration we could think of. Thankfully, our software's thread-responsibilities & their process-ids are query-able, so the different configurations can be made entirely w/in the OS and even dynamically :)

September 11, 2012 | Unregistered CommenterRussell Sullivan

How much did thread and IRQ binding specifically buy you? Automating that sort of thing looks annoying, and error prone when you take into account things like hyper-threading (or what AMD does). It seems like getting it wrong could be worse than letting the OS manage threading. Binding groups of cooperating threads to a NUMA node seems like an easier way to get the NUMA and L3 benefits.

Once you mix in compression, replication, disk IO, and start looking at response times at lower concurrency over peak throughput, it becomes a tougher sell to do exactly one thread per core, because that means thread-exclusive data is inaccessible while a task is running, even if the task doesn't involve the thread-exclusive data.

>> SSD benchmark speeds can be found in the second box, here. This topic has tons of technical details and definitely warrants its own blog post, so I will get on that :)

I am looking forward to it. I am definitely interested in seeing what can be done when you focus on SSDs and assume indexes fit in memory.

September 11, 2012 | Unregistered CommenterAriel Weisberg

>> How much did thread and IRQ binding specifically buy you?

At peak speed, avoiding unneeded context switching plus controlling which cores did IRQ processing gave us roughly a 2-2.5X throughput increase, with the added benefit of almost boring stability: the variability of the linux scheduler was largely removed.

>> Automating that sort of thing looks annoying, and error prone when you take into account things like hyper-threading (or what AMD does). It seems like getting it wrong could be worse than letting the OS manage threading.

Again, at peak speed, the linux scheduler is much worse than manually tweaking a system. What we strived to do with these performance improvements was to expose our threading internals, so our database can be dynamically dialed between peak performance mode and the default settings (which are more predictable/stable during node-failures/node-additions for the reasons you mention above).

>> Binding groups of cooperating threads to a NUMA node seems like an easier way to get the NUMA and L3 benefits.

It is easier, and it is low-hanging fruit, but the 10 ingredients above, all used together, give a very significant BOOST.

>> http://www.afewmoreamps.com/2012/04/jitcask.html

Good stuff, I will read through this tonight. SSDs are a tricky marriage of hardware and software indeed :)

September 11, 2012 | Unregistered CommenterRussell Sullivan

Silly question here. What do you do with all default OS processes? Do you pin them all with something like taskset to one core? Thanks for an excellent post!

September 19, 2012 | Unregistered CommenterZilvinas Saltys

>> What do you do with all default OS processes? Do you pin them all with something like taskset to one core?

This question is pretty generic, so ...
My philosophy is to pin the default OS processes AWAY from any bottleneck, which in this type of workload is largely IRQ processing.
So on Quadcores, just let the linux task scheduler take care of OS processes.
On Hexacore+ where IRQ processing is taskset to run on 1+ dedicated cores, use taskset to keep default OS processes AWAY from these dedicated cores.

September 20, 2012 | Unregistered CommenterRussell Sullivan

Excellent article - thanks for sharing!

35% kernel to user mode context switch overhead. Damn! I'm looking forward to the kernel mode version of your software... hehe!

July 28, 2013 | Unregistered CommenterMat Dodgson

Terms corrections:

key/value, not "key-value"
client/server, not "client-server"

July 29, 2013 | Unregistered CommenterWhat Haveyou
