Paper: An Analysis of Linux Scalability to Many Cores

An Analysis of Linux Scalability to Many Cores, by a number of MIT researchers, is a refreshingly practical paper on what it takes to scale Linux and common applications like Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce to run on 48 core systems. A very timely paper given moderately massive multicore systems are reportedly the near future of computing.

This paper must have taken a lot of work. They both tracked down bottlenecks in a number of applications and the Linux kernel and they also tried to fix them. Modestly speaking the authors said they made "modest" changes to the kernel and applications, but there's nothing modest about what they did. It's excellent work.

After the next bit, which is the abstract, there is a list of the problems they found and how they fixed them.

The abstract:

This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48-core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques— this paper introduces one new technique, sloppy counters—these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.

Here's a summary of Linux scalability problems encountered and their corresponding fixes:

  • Parallel accept: Apache:  Concurrent accept system calls contend on shared socket fields. ⇒ User per-core backlog queues for listening sockets.
  • dentry reference counting: Apache, Exim:  File name resolution contends on directory entry reference counts. ⇒ Use sloppy counters to reference count directory entry objects.
  • Mount point (vfsmount) reference counting:  Apache, Exim: Walking file name paths contends on mount point reference counts. ⇒ Use sloppy counters for mount point objects.
  • IP packet destination (dst entry) reference counting: memcached, Apache: IP packet transmission contends on routing table entries. ⇒ Use sloppy counters for IP routing table entries.
  • Protocol memory usage tracking: memcached, Apache: Cores contend on counters for tracking protocol memory consumption. ⇒ Use sloppy counters for protocol usage counting.
  • Acquiring directory entry (dentry) spin locks: Apache, Exim: Walking file name paths contends on per-directory entry spin locks. ⇒ Use a lock-free protocol in dlookup for checking filename matches.
  • Mount point table spin lock: Apache, Exim: Resolving path names to mount points contends on a global spin lock. ⇒ Use per-core mount table caches.
  • Adding files to the open list: Apache, Exim:  Cores contend on a per-super block list that tracks open files. ⇒ Use per-core open file lists for each super block that has open files.
  • Allocating DMA buffers: memcached, Apache: DMA memory allocations contend on the memory node 0 spin lock. ⇒ Allocate Ethernet device DMA buffers from the local memory node.
  • False sharing in net device and device: memcached, Apache, PostgreSQL: False sharing causes contention for read-only structure fields. ⇒ Place read-only fields on their own cache lines.
  • False sharing in page: Exim: False sharing causes contention for read-mostly structure fields. ⇒ Place read-only fields on their own cache lines.
  • inode lists: memcached, Apache: Cores contend on global locks protecting lists used to track inodes. ⇒ Avoid acquiring the locks when not necessary.
  • Dcache lists: memcached, Apache:  Cores contend on global locks protecting lists used to track dentrys. ⇒ Avoid acquiring the locks when not necessary.
  • Per-inode mutex: PostgreSQL: Cores contend on a per-inode mutex in lseek. ⇒ Use atomic reads to eliminate the need to acquire the mutex.
  • Super-page fine grained locking: Metis: Super-page soft page faults contend on a per-process mutex. ⇒ Protect each super-page memory mapping with its own mutex.
  • Zeroing super-pages: Metis: Zeroing super-pages flushes the contents of on-chip caches. ⇒ Use non-caching instructions to zero the contents of super-pages.
  • Receive Packet Steering:

A quick summary might be that sharing is good in kindergarten, but it's not so good when programming multiple processors. Keep separate things separate.