A Toolkit to Measure Basic System Performance and OS Jitter

Jean Dagenais published a great response on a mechanical-sympathy thread to Gil Tene's article, The Black Magic Of Systematically Reducing Linux OS Jitter. It's full of helpful tools for tracking down jitter problems. I apologize for the incomplete attribution. I did not find a web presence for Jean.

To complement the great information I got on the “Systematic Way to Find Linux Jitter”, I have created a toolkit that I now use to evaluate current and future trading platforms.

In case this can be useful, I have listed these tools, along with the URLs to get the source code and a description of their usage. I am learning a lot by reading the source code and the associated blog entries.

This is far from an exhaustive list, as every week I find either a new problem area or a new tool that improves my understanding of this beautiful problem domain ;)

These tools are grouped into these categories:

  1. CPU, Memory, Disk, Network
  2. X86, Linux, and Java time resolution
  3. Context Switches & Inter Thread Latency
  4. System Jitter
  5. Application Building Blocks: Disruptor, OpenHFT, Aeron & Workload Generator
  6. Application Performance Testing

Happy Benchmarking and Jitter Chasing!

1. CPU, Memory, Disk, Network

1.1 lmbench - http://www.bitmover.com/lmbench/

lmbench provides basic performance data for multiple aspects of the server: instruction latency, fork, context switches, disk, network, memory, and more.

1.2.1 Memory - lmbench – lat_mem_rd

The article describes how to measure memory access latency, and I used this approach to measure the performance of the memory subsystem under local and remote access (i.e. NUMA effects).

1.2.2 Java Memory Access Patterns – Martin Thompson

Great article and a useful tool that highlights the impact of local/remote memory access and NUMA effects.
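As an illustration of the idea (this is a minimal sketch, not Martin Thompson's original code), the fragment below contrasts a prefetcher-friendly sequential walk with random access over a large array; the array size and iteration count are arbitrary choices, and the random branch includes the cost of the RNG itself, so treat the numbers as rough indications only.

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch: compare sequential vs random access over a large array.
    public class MemoryAccessPatterns {
        private static final int SIZE = 1 << 25;          // 32M longs, ~256 MB
        private static final long[] memory = new long[SIZE];

        public static void main(String[] args) {
            for (int i = 0; i < SIZE; i++) memory[i] = i;  // touch every page once

            System.out.printf("sequential: %.2f ns/op%n", run(false));
            System.out.printf("random    : %.2f ns/op%n", run(true));
        }

        private static double run(boolean random) {
            long sum = 0;
            long start = System.nanoTime();
            for (int i = 0; i < SIZE; i++) {
                int index = random
                        ? ThreadLocalRandom.current().nextInt(SIZE)  // defeats the prefetcher
                        : i;                                         // prefetcher-friendly walk
                sum += memory[index];
            }
            long elapsed = System.nanoTime() - start;
            if (sum == 42) System.out.println();           // keep the loop from being eliminated
            return (double) elapsed / SIZE;
        }
    }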

1.3 Disk Performance - fio - FIO Source Code

1.4 Network

There are many tools available; these are the ones I find the most useful. They help when tuning network parameters (e.g. kernel-bypass drivers) while measuring throughput, latency, and jitter.

Java Ping - Nitsan Wakart's network-measuring tools

Note: the C/C++ network tools behave differently from the Java ones (no JIT, no garbage generation), so they are very complementary and offer many options that can be tested easily.
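To show the general shape of such a tool (this is a minimal sketch in the spirit of the Java ping tools, not Nitsan's actual code), the example below runs a loopback TCP echo server and measures single-byte round-trip times; real tools need warmup, thread pinning, and proper percentile reporting.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.Arrays;

    // Sketch: measure TCP round-trip time over loopback.
    public class TcpPing {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(0);      // ephemeral port
            new Thread(() -> {
                try (Socket s = server.accept()) {          // echo everything back
                    s.setTcpNoDelay(true);
                    InputStream in = s.getInputStream();
                    OutputStream out = s.getOutputStream();
                    int b;
                    while ((b = in.read()) != -1) out.write(b);
                } catch (Exception ignored) { }
            }).start();

            try (Socket client = new Socket()) {
                client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()));
                client.setTcpNoDelay(true);
                InputStream in = client.getInputStream();
                OutputStream out = client.getOutputStream();

                final int iterations = 100_000;
                long[] rtts = new long[iterations];
                for (int i = 0; i < iterations; i++) {
                    long start = System.nanoTime();
                    out.write(1);                            // send one byte
                    in.read();                               // wait for the echo
                    rtts[i] = System.nanoTime() - start;
                }
                Arrays.sort(rtts);
                System.out.printf("median RTT: %d ns, 99%%: %d ns%n",
                        rtts[iterations / 2], rtts[(int) (iterations * 0.99)]);
            }
            server.close();
        }
    }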

2. X86, Linux, and Java time resolution

2.1 Measuring granularity and latency of currentTimeMillis() & nanoTime() - Aleksey Shipilev

Great article that uses JMH to test granularity and latency.
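For a quick-and-dirty probe of the same question (this is a sketch, not Aleksey's JMH benchmarks, which are the right way to measure this properly), the snippet below calls System.nanoTime() back-to-back and reports the smallest non-zero step observed plus a rough per-call cost.

    // Sketch: probe the granularity and rough cost of System.nanoTime().
    public class NanoTimeGranularity {
        public static void main(String[] args) {
            final int samples = 1_000_000;
            long minDelta = Long.MAX_VALUE;

            long begin = System.nanoTime();
            long last = begin;
            for (int i = 0; i < samples; i++) {
                long now = System.nanoTime();
                long delta = now - last;
                if (delta > 0 && delta < minDelta) minDelta = delta;
                last = now;
            }
            long totalNs = last - begin;

            System.out.println("smallest observed tick : " + minDelta + " ns");
            System.out.println("avg cost per call      : " + (totalNs / samples) + " ns");
        }
    }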

2.2 Measuring Latency in Linux

Provides information about how Linux measures time, and a test program that displays the different timers available and their resolution.

3. Context Switches & Inter Thread Latency

3.1 This is a good write-up on Stack Overflow

3.2 Java and C++ Inter Thread Latency – Martin Thompson
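In the spirit of those inter-thread latency tests (this is a minimal sketch, not Martin Thompson's code), the example below ping-pongs a counter between two busy-spinning threads through volatile fields and reports the average round trip; real tests pin each thread to a core (e.g. via taskset or an affinity library) and report full histograms rather than an average.

    // Sketch: measure average inter-thread round-trip latency with busy-spinning.
    public class PingPongLatency {
        static volatile long ping = -1;
        static volatile long pong = -1;

        public static void main(String[] args) throws Exception {
            final long iterations = 10_000_000;

            Thread responder = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    while (ping != i) { }   // busy-spin until the pinger publishes i
                    pong = i;               // reply
                }
            });
            responder.start();

            long start = System.nanoTime();
            for (long i = 0; i < iterations; i++) {
                ping = i;                   // publish
                while (pong != i) { }       // busy-spin until the responder replies
            }
            long elapsed = System.nanoTime() - start;
            responder.join();

            System.out.printf("avg round trip: %d ns%n", elapsed / iterations);
        }
    }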

4. System Jitter

4.1 Sysjitter - SysJitter from Solarflare - OpenOnload

sysjitter measures the extent to which the system impacts user-level code by causing jitter. It runs a thread on each processor core, and when a thread is "knocked off" its core it measures for how long. At the end of the run it outputs summary statistics for each core and, optionally, the full raw data.

4.2 jHiccup – Gil Tene – Azul - jHiccup

jHiccup is an open source tool designed to measure the pauses and stalls (or “hiccups”) associated with an application’s underlying Java runtime platform. The new tool captures the aggregate effects of the Java Virtual Machine (JVM), operating system, hypervisor (if used) and hardware on application stalls and response time.
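The core idea can be shown in a few lines (this is a toy sketch, not jHiccup itself, which uses HdrHistogram and much more careful recording): a thread asks to sleep for a fixed interval and records by how much the wake-up was late; the large excesses are the "hiccups".

    // Toy sketch of the jHiccup idea: measure how much a sleeping thread oversleeps.
    public class HiccupSketch {
        public static void main(String[] args) throws Exception {
            final long intervalNs = 1_000_000;   // ask to sleep 1 ms
            long worstNs = 0;

            for (int i = 0; i < 10_000; i++) {
                long start = System.nanoTime();
                Thread.sleep(1);
                long excess = (System.nanoTime() - start) - intervalNs;
                if (excess > worstNs) {
                    worstNs = excess;
                    System.out.printf("new worst hiccup: %.3f ms%n", worstNs / 1_000_000.0);
                }
            }
        }
    }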

4.3 MicroJitterSampler - Peter Lawrey - Micro Jitter Busy Waiting and Binding

The micro jitter sampler looks at interruptions to a running thread. It is similar to jHiccup, but instead of measuring how delayed a thread is in waking up, it measures what delays a thread experiences once it has started running. Surprisingly, how you run your threads impacts the sort of delays they will see once they are awake.
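A toy version of that busy-spin approach (a sketch, not Peter Lawrey's actual MicroJitterSampler) looks like this: the thread never yields, so any large gap between two consecutive nanoTime() readings means it was interrupted or descheduled. Note that printing inside the loop adds its own delays; the real tool records into a histogram and reports at the end.

    // Toy sketch: detect interruptions to a busy-spinning thread.
    public class BusySpinJitter {
        public static void main(String[] args) {
            final long thresholdNs = 10_000;            // report gaps above 10 us
            final long runNs = 10_000_000_000L;         // spin for ~10 seconds

            long start = System.nanoTime();
            long last = start;
            while (last - start < runNs) {
                long now = System.nanoTime();
                long gap = now - last;
                if (gap > thresholdNs) {
                    System.out.printf("interrupted for %.1f us%n", gap / 1_000.0);
                }
                last = now;
            }
        }
    }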

5. Application Building Blocks: Disruptor, OpenHFT, Aeron, etc.

The Disruptor, OpenHFT, and Aeron contain programs that are useful to measure the “raw” performance of each building block/library/framework before we incorporate them into our applications.

It’s a lot more difficult to figure out the contribution/limitations of each component when doing a macro benchmark of 20 JVMs, 5 physical servers, too many cores to count, and tens of thousands of operations per second.

I am quite impressed by Aeron, which is both an eye-opener (e.g. it can process 11,000,000 messages per second at 500 MB/second on our server) and a major source of learning (from design to implementation) for me. It shows what’s possible in Java when you understand and apply “current best practices”.

Great work by Martin Thompson and team.
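To give a feel for what measuring the “raw” performance of a building block can look like, here is a minimal latency sketch using the Disruptor (it assumes the LMAX Disruptor 3.x jar is on the classpath; the event type, ring size, and iteration count are arbitrary choices, and a real benchmark would pin threads, warm up the JIT, and record into a histogram such as HdrHistogram):

    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    // Sketch: publish nanoTime() stamps through a Disruptor and sample the
    // producer-to-handler latency.
    public class DisruptorLatencySketch {
        // simple mutable event carrying a timestamp
        static final class TimestampEvent { long sendTimeNs; }

        public static void main(String[] args) {
            Disruptor<TimestampEvent> disruptor = new Disruptor<>(
                    TimestampEvent::new, 1024, DaemonThreadFactory.INSTANCE);

            disruptor.handleEventsWith((event, sequence, endOfBatch) -> {
                long latency = System.nanoTime() - event.sendTimeNs;
                if (sequence % 1_000_000 == 0) {
                    System.out.printf("seq %d latency %d ns%n", sequence, latency);
                }
            });

            RingBuffer<TimestampEvent> ringBuffer = disruptor.start();

            for (long i = 0; i < 10_000_000; i++) {
                long seq = ringBuffer.next();                  // claim a slot
                try {
                    ringBuffer.get(seq).sendTimeNs = System.nanoTime();
                } finally {
                    ringBuffer.publish(seq);                   // make it visible to the handler
                }
            }
            disruptor.shutdown();                              // drain outstanding events
        }
    }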

5.2 stress-ng

This tool is used to generate a “synthetic workload” on a server. It’s useful for measuring “interference” with a running app, e.g. what happens if we add this kind of workload to the server.

stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.

6. Application Performance Testing

It takes me about one day to run most of the tools mentioned above, collect the performance data, and analyze it. The baseline data is used to assess whether the platform/server performance and stability (e.g. jitter) are good enough to go to the next level of testing. In general the answer is no, and getting to an acceptable level requires various tricks (e.g. stopping services, BIOS setup, numactl, taskset, OS settings, and everything mentioned in the other threads).

The application benchmark turns out to be the most valuable but also the most challenging, as any performance limitations/jitter of the underlying platform and services will now be amplified by the complexity of the application components and their interactions.

For example, these are a few things that are challenging to observe and understand:

  • 100’s of app & GC threads running on different cores, sockets, and servers
  • The Linux scheduler moving threads around cores/sockets/NUMA nodes
  • Different application threading models for reader, worker, and sender threads (BUSY_SPIN, WAIT, BLOCKED, ASYNC)
  • Off-heap memory access and the bdflush process cleaning dirty pages, causing disk I/O bottlenecks
  • Messaging infrastructure and appliances …