Paper: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Awesome paper on how particular synchronization mechanisms scale on multi-core architectures: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.

The goal is to pick a locking approach that doesn't degrade as the number of cores increases. Like everything else in life, that doesn't appear to be generally possible:

None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.

Abstract:

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket – uniform and non-uniform – to multi-socket – directory and broadcast-based – many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.

Some Findings:

  1. Synchronization scales much better within a single socket, irrespective of the contention level. Crossing sockets significantly impacts synchronization, regardless of the layer, e.g., cache coherence, atomic operations, locks. In order to be able to scale, synchronization should better be confined to a single socket, ideally a uniform one.
  2. Even on a single-socket many-core such as the TILE-Gx36, a system should reduce the amount of highly contended data to avoid performance degradation (due to the hardware).
  3. Each of the nine state-of-the-art lock algorithms we evaluate performs the best on at least one workload/platform combination. Nevertheless, if we reduce the context of synchronization to a single socket (either one socket of a multi-socket, or a single-socket many-core), then our results indicate that spin locks should be preferred over more complex locks.
  4. Implementing multi-socket coherence using broadcast or an incomplete directory (as on the Opteron) is not favorable to synchronization.
  5. The behavior of the TPC-H benchmarks on MonetDB is similar to the get workload of Memcached: synchronization is not a bottleneck.
  • SSYNC is a cross-platform synchronization suite. It contains libslock, a library that abstracts lock algorithms behind a common interface, and libssmp, a library with fine-tuned implementations of message passing for each of the supported platforms.
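
To make the "spin locks behind a common interface" idea concrete, here is a minimal C++ sketch of two simple spin locks (test-and-test-and-set and ticket) that expose the same `lock()`/`unlock()` pair, so a workload can swap one for the other. This is not libslock's API. The names `TtasLock`, `TicketLock`, and `count_with_lock` are made up for illustration, and the `yield()` calls are an assumption to keep the sketch well-behaved when there are more threads than cores.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Test-and-test-and-set spin lock (hypothetical name).
class TtasLock {
public:
    void lock() {
        for (;;) {
            // Spin on a plain load first so waiters do not keep
            // invalidating the cache line with atomic writes.
            while (held_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
            // The lock looked free: try to take it atomically.
            if (!held_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() { held_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> held_{false};
};

// Ticket lock (hypothetical name): threads are served in FIFO order.
class TicketLock {
public:
    void lock() {
        // Take the next ticket, then wait until it is being served.
        const std::uint32_t my_ticket =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        while (now_serving_.load(std::memory_order_acquire) != my_ticket) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Only the current holder writes now_serving_, so a load
        // followed by a release store is sufficient here.
        const std::uint32_t current =
            now_serving_.load(std::memory_order_relaxed);
        now_serving_.store(current + 1, std::memory_order_release);
    }

private:
    std::atomic<std::uint32_t> next_ticket_{0};
    std::atomic<std::uint32_t> now_serving_{0};
};

// Illustrative workload: each thread increments one shared counter
// under the chosen lock. Any type with lock()/unlock() can be used.
template <typename Lock>
long count_with_lock(int num_threads, int iterations_per_thread) {
    assert(num_threads > 0 && iterations_per_thread >= 0);
    Lock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations_per_thread; ++i) {
                std::lock_guard<Lock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter;
}
```

Because both classes satisfy the standard `lock()`/`unlock()` requirements, `count_with_lock<TtasLock>(4, 5000)`, `count_with_lock<TicketLock>(4, 5000)`, and `count_with_lock<std::mutex>(4, 5000)` all return 20000. Only the template argument changes, which is the kind of per-platform, per-workload swap the findings above argue for. Measuring which one scales better on a given machine is left out of this sketch.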