Parallel programs running on shared memory multiprocessors
coordinate via shared
data objects/structures. To ensure the consistency of the
shared data structures,
programs typically rely on some form of software synchronisation.
Typical software synchronisation mechanisms usually result
in poor performance
because they produce large amounts of memory and
interconnection network contention
and, more importantly, because they produce convoy effects
that degrade performance significantly
in multiprogramming environments: if one process holding
a lock is preempted, other
processes on different processors waiting for the lock will
not be able to proceed.
Researchers have introduced non-blocking synchronisation to
address the above problems.
Non-blocking implementations allow multiple tasks to access a shared
object at the same time without enforcing mutual exclusion. However, the
performance implications of non-blocking synchronisation are not well
understood on modern systems or on real applications.
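As an illustration of the idea only, and not the synchronisation schemes evaluated in this paper, a minimal sketch of a non-blocking update in C++ is given here. It assumes a hypothetical shared counter and uses the standard compare-and-swap operation of std::atomic; the names nonblocking_increment and run_increments are invented for the example. A preempted thread holds no lock, so it cannot stall the others.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: increment a shared counter without taking a lock.
// A thread reads the current value, computes the new one, and publishes it
// with compare-and-swap; if another thread got there first, it retries.
void nonblocking_increment(std::atomic<long>& counter) {
    long seen = counter.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak stores the current value in `seen`,
    // so the loop simply retries with fresh data.
    while (!counter.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
    }
}

// Runs `threads` workers, each performing `per_thread` increments,
// and returns the final counter value.
long run_increments(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                nonblocking_increment(counter);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

Every successful compare-and-swap adds exactly one, so run_increments(4, 10000) returns 40000 regardless of how the threads are scheduled. A failed attempt means that some other thread succeeded, which is the sense in which the scheme is non-blocking.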
In this paper we study the impact of non-blocking synchronisation on
parallel applications running on a modern, 64-processor,
cache-coherent, shared memory multiprocessor system: the SGI Origin 2000.
Cache-coherent non-uniform memory access (ccNUMA) shared memory
multiprocessor systems have attracted
considerable research and commercial interest in recent years.
In addition to the performance results on a modern system,
we also investigate the key synchronisation schemes that are used in
multiprocessor applications and their efficient transformation to