Dramatically Improving Performance by Debugging Brutally Complex Problems

Debugging complex problems is 90% persistence and 50% cool tools. Brendan Gregg, in 10 Performance Wins, tells a fascinating story of how a team at Joyent solved some weird and challenging performance issues deep in the OS. It took lots of effort, DTrace, Flame Graphs, the USE Method, and writing custom tools when necessary. Here's a quick summary of the solved cases:

  • Monitoring. 1000x improvement. An application blocked while paging anonymous memory back in. It was also blocked during file system fsync() calls. The application was misconfigured and sometimes briefly exceeded available memory, getting paged out.
  • Riak. 2x improvement. The Erlang VM used only half the CPUs it was supposed to, so the rest remained unused. Fix was a configuration change.
  • MySQL. 380x improvement. Reads were slow. Cause was correlated writes. Fix was to tune the cache flush interval on the storage controller.
  • Various. 2800x improvement. On large systems, calls to getvmusage() could take a few seconds. Cause was a priority inversion that caused packets not to be processed. Fix was to use kernel preempt priority (kpreemptpri).
  • Network Stack. 4.5x improvement. A single CPU became hot in the system. Cause was a kernel function that had become expensive. Fix was changing the code.
  • Database. 20% improvement. Mutex contention for malloc from multiple threads slowed down performance. Fix was to use libumem instead of libc.
  • Database. 10% improvement. Memory fragmentation caused endless heap growth. Fix was changing the allocator.
  • Riak. 100x improvement. Calls to getvmusage() held an address space lock while bitcask blocked on mmap. This caused TCP listen drops and slow query responses. Fix was to use less expensive techniques.
  • Various. 2x improvement. Programs would not get the expected portion of CPU because the scheduler wasn't adjusting priorities fast enough. Fix was to change the code.
  • KVM. 8x improvement. Slow network performance was traced to packets not being coalesced. Fix was to change the timer.

The article contains many juicy details, but the takeaway is a general process for dramatically improving performance by debugging brutally complex problems.
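
One of the methods named above, the USE Method, asks the same three questions of every resource: how utilized is it, is it saturated, and is it reporting errors? Here is a minimal sketch of that checklist in Python. The class name, field names, sample numbers, and the 0.7 utilization threshold are all illustrative assumptions, not anything taken from the article or from Joyent's tooling.

```python
from dataclasses import dataclass


@dataclass
class ResourceMetrics:
    """One snapshot for a single resource (CPU, disk, network, ...)."""
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # queued or waiting work, e.g. run-queue length
    errors: int         # error events counted in the interval


def use_check(metrics, util_threshold=0.7):
    """Return (resource, finding) pairs worth investigating.

    The default threshold is an illustrative assumption,
    not a value taken from the article.
    """
    findings = []
    for m in metrics:
        # Errors first: they are usually the quickest to interpret.
        if m.errors > 0:
            findings.append((m.name, "errors"))
        # Any queued work means the resource cannot keep up.
        if m.saturation > 0:
            findings.append((m.name, "saturation"))
        # High utilization alone can already hurt latency.
        if m.utilization >= util_threshold:
            findings.append((m.name, "utilization"))
    return findings


if __name__ == "__main__":
    # Made-up sample numbers, for illustration only.
    snapshot = [
        ResourceMetrics("cpu", utilization=0.95, saturation=4.0, errors=0),
        ResourceMetrics("disk", utilization=0.20, saturation=0.0, errors=0),
        ResourceMetrics("network", utilization=0.10, saturation=0.0, errors=3),
    ]
    for resource, finding in use_check(snapshot):
        print(f"{resource}: check {finding}")
```

The point of the checklist is coverage rather than cleverness: walking every resource in turn surfaces problems like a single hot CPU or an idle half of the machine, which can then be chased down with DTrace or a flame graph.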