Re: JavaMemoryModel: A memory model for the masses

From: Doug Lea (dl@altair.cs.oswego.edu)
Date: Tue Nov 09 1999 - 08:13:06 EST


> > A simple rule that requires extra synchronizations on machines that
> > do not really need it will be difficult to maintain unless VMs on
> > those machines are able to simply remove this overhead when it is
> > not needed. People running on pentiums, sparcs, and other machines
> > with relatively strong memory models should not be penalized.
>
> I agree that this is an important requirement. I think the "always
> synchronize" methodology combined with something like Bill's strawman
> memory model (minus the global action ordering) achieves this goal.

I wasn't expecting this answer! Maybe you folks know of cleverer synch
removal algorithms than I know (or can even conceive of).

As an example, here's a popular design for an unbounded linked queue,
published in Maged Michael and Michael Scott's "Simple, Fast, and
Practical Non-Blocking and Blocking Concurrent Queue Algorithms",
Proceedings, 15th ACM Symposium on Principles of Distributed
Computing, ACM, 1996 -- see
  http://www.cs.rochester.edu/u/michael/PODC96.html.

This is the fastest lock-based queue design that I know of. It is used
in a lot of systems. A Java version that conforms to both the current and
proposed revised models looks like this. (This is from pp. 130-131 of the
2nd edition of CPJ, and is the basis of my util.concurrent.LinkedQueue class.)

  class LinkedQueue {
    protected Node head = new Node(null);
    protected Node last = head;

    protected final Object pollLock = new Object();
    protected final Object putLock = new Object();

    public void put(Object x) {
      Node node = new Node(x);
      synchronized (putLock) { // insert at end of list
        synchronized (last) { // (*)
          last.next = node; // extend list
          last = node;
        }
      }
    }

    public Object poll() { // returns null if empty
      synchronized (pollLock) {
        synchronized (head) { // (**)
          Object x = null;
          Node first = head.next; // get to first real node
          if (first != null) {
            x = first.object;
            first.object = null; // forget old object to enable GC
            head = first; // first becomes new head
          }
          return x;
        }
      }
    }

    static class Node { // local node class for queue
      Object object;
      Node next = null;

      Node(Object x) { object = x; }
    }
  }

But, actually, the inner locks at (*) and (**) are not present in
Michael and Scott's original paper or sample C code, and are not needed
on any machine/VM that can guarantee "publication safety" of new nodes
(i.e., if you can see the ref to a new node, you can see its contents,
and transitively so), which is true of VMs on sparcs, pentiums,
etc. (It takes some thought to see that this is true here -- see the
Michael and Scott paper.)
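
For concreteness, here is roughly what the version without the inner
locks looks like. This is just my transliteration of their two-lock
algorithm into Java (the class name is only for this post); take it as
a sketch, and note that it is only correct on a platform/VM that
happens to provide publication safety:

  class LinkedQueueNoInnerLocks { // sketch only; relies on publication safety
    protected Node head = new Node(null);
    protected Node last = head;

    protected final Object pollLock = new Object();
    protected final Object putLock = new Object();

    public void put(Object x) {
      Node node = new Node(x);
      synchronized (putLock) {  // note: no synchronized(last) here
        last.next = node;       // relies on any thread that can see node
        last = node;            // also seeing node's contents
      }
    }

    public Object poll() {      // returns null if empty
      synchronized (pollLock) { // note: no synchronized(head) here
        Object x = null;
        Node first = head.next; // get to first real node
        if (first != null) {
          x = first.object;
          first.object = null;  // forget old object to enable GC
          head = first;         // first becomes new head
        }
        return x;
      }
    }

    static class Node {
      Object object;
      Node next = null;
      Node(Object x) { object = x; }
    }
  }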

This secondary lock adds a constant overhead factor. This could be a
performance concern in some applications -- such queues are often used
in performance-critical contexts. It would be nice to remove this
overhead on platforms that do not require it.

Under the current model, you cannot replace the locks with any set of
volatile declarations here since this would not guarantee visibility
of the returned object on a system that did not already guarantee
it. So, on some systems, the thread obtaining the object via poll
could not reliably do anything with it. (BTW, thanks to Josh Bloch
for originally pointing this out to me!) You would have to rely on an
optimizer to realize that the inner locks could be removed if running
on the right kind of system. Does anyone know of an approach to synch
removal that could infer this properly?

Note that volatiles could come to the rescue here if dereferencing a
volatile reference did guarantee visibility of contents.
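
Under that hypothetical semantics, marking the link field volatile would
be enough to drop the inner locks, since poll's read of head.next would
then guarantee visibility of the new node's contents:

  static class Node {
    Object object;
    volatile Node next; // hypothetical "publication-safe" volatile:
                        // reading this ref would make the referenced
                        // node's fields visible as well
    Node(Object x) { object = x; }
  }

Again, this is not what volatile means today; it is what it would have
to mean for this trick to work.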

I suppose it is worth mentioning here that the non-blocking version
of this queue presented in the Michael & Scott paper cannot be
efficiently implemented in Java, because it requires a CAS
(compare-and-swap) operation, which Java does not provide.
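
Just to illustrate what is missing, their non-blocking enqueue boils
down to something like the following, assuming hypothetical atomic
compare-and-swap operations on the next and last fields (the casNext
and casLast methods below do not exist, and there is no way to write
them efficiently in pure Java):

  // Sketch of Michael & Scott's non-blocking enqueue, assuming
  // hypothetical CAS primitives (not available in Java):
  //   node.casNext(expect, update) -- atomically update node.next
  //   casLast(expect, update)      -- atomically update this.last
  public void put(Object x) {
    Node node = new Node(x);
    for (;;) {
      Node t = last;
      Node s = t.next;
      if (t == last) {                 // is t still the tail?
        if (s == null) {               // tail points at the last node
          if (t.casNext(null, node)) { // try to link node at the end
            casLast(t, node);          // swing tail; harmless if it fails
            return;
          }
        } else {
          casLast(t, s);               // tail is lagging; help advance it
        }
      }
    }
  }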

I have a bunch more examples if you don't like this one :-)

Reminder: I am NOT trying to argue for complicated memory models,
and absolutely not arguing for keeping the old model. I just want
to make sure that performance concerns are understood. In many
concurrent programs, one of the main goals is to reduce concurrency
overhead enough to make the concurrency worthwhile.

-Doug


