Re: JavaMemoryModel: Worries about JIT compilation and class loading on multiprocessors

From: sanjay@pa.dec.com
Date: Wed Jul 14 1999 - 19:57:43 EDT


Bill Pugh writes:
> I am still seriously concerned about JIT compilation and class loading
> on shared memory multiprocessors.
>
> On the Alpha for example, consider the case where processor 1
> * loads a class Bar,
> * creates all of the internal data structures for Bar,
> * generates native code for some of the methods in Bar
> * runs the static initializer for Bar
> * creates an instance of Bar
> * does a memory barrier (possibly at other points above)
> * stores a reference to that instance in Foo.x
> Processor 2
> (while not calling any synchronized methods, and thus
> not needing memory barriers, except to prohibit behavior
> I am about to describe)
> * reads the reference in Foo.x, happens to see the value
> written by processor 1
> * Invokes a virtual method on that reference
> * reads the vtbl pointer in the instance
> * reads the contents of the vtbl
> * jumps to native code generated by processor 1
> * does an instanceof on "this", reading C-level fields
> of the class data structure
> * reads instance and static fields of the instance
>
> All of the stuff done on processor 2 after reading Foo.x
> might see stale values. We've discussed ways processor 2
> could detect and recover from reading stale values for the
> vtbl pointer from the object header. But in order to be sure that
> all the class data structures and vtbl entries are valid, processor
> 2 must do a memory barrier instruction. To ensure that the
> native code generated by processor 1 isn't stale,
> processor 2 needs to do a CALL_PAL IMB (instruction
> memory barrier).

I have run into this bug in a real program: the instruction cache
of one processor held a stale copy of memory that another processor
had since filled with generated code.

Solution overview
-----------------
Here is a solution (one actually used in the JVM I have been working
on): After the thread running on processor 1 has initialized the class
data structures, but before it has marked the class as initialized
(i.e. published the results), it arranges to perform a memory-barrier
on every processor in the system. If some code has also been generated
or modified, then a "CALL_PAL IMB" should also be executed on every
processor before a pointer to the generated code is published.
The initialization sequence becomes:

Initializing thread:
        run the class initializer
        for every processor p in the system {
                execute an mb (and maybe a CALL_PAL IMB) on p
        }
        mark the class as initialized

Now if a thread observes that a class has been initialized,
it is guaranteed that the processor on which it is running has
performed a memory-barrier since the class was initialized.
Therefore the thread will not be able to read stale values
out of the class. The race condition you described in your
example has also disappeared because between the time the
class contents are initialized and the time processor 2
reads Foo.x, processor 2 is guaranteed to have executed a
memory-barrier.
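
To make the ordering concrete, here is a minimal Java sketch of
the publication protocol. It is only a model, not code from any
JVM: ClassInitProtocol, ProcessorBarrier and executeOn are
hypothetical names, and the callback stands in for whatever
mechanism really runs an mb (and, if code changed, a CALL_PAL IMB)
on processor p. What it shows is that the "initialized" mark is
set only after every processor has been visited.

```java
// Hypothetical model of the publication ordering; not real JVM code.
final class ClassInitProtocol {
    /** Stand-in for "execute an mb (and maybe a CALL_PAL IMB) on p". */
    interface ProcessorBarrier {
        void executeOn(int processor, boolean codeChanged);
    }

    private final int processorCount;
    private final ProcessorBarrier barrier;
    // The "class is initialized" mark that other threads test.
    private volatile boolean initialized;

    ClassInitProtocol(int processorCount, ProcessorBarrier barrier) {
        this.processorCount = processorCount;
        this.barrier = barrier;
    }

    void initialize(Runnable classInitializer, boolean codeChanged) {
        // 1. Run the class initializer.
        classInitializer.run();
        // 2. Have every processor execute a barrier.
        for (int p = 0; p < processorCount; p++) {
            barrier.executeOn(p, codeChanged);
        }
        // 3. Publish: mark the class as initialized only now.
        initialized = true;
    }

    boolean isInitialized() {
        return initialized;
    }
}
```

A thread that sees isInitialized() return true therefore runs on
a processor that was visited in step 2 after the initializer
finished, assuming the callback really does reach every processor.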

> Even without class loading, if a HotSpot compiler decided
> to generate optimized native code for a method, it couldn't
> just substitute the new, optimized code: other processors
> might see stale, invalid versions.

The same technique applies here, except that instead of just
performing a memory-barrier on every processor, a "CALL_PAL IMB"
is also needed (the "CALL_PAL IMB" makes the dcache and the
icache of the processor that executes it consistent).

Implementation
--------------
The simplest way to implement this memory barrier broadcast would
be with some operating system support. I think many OSes have
a mechanism like this anyway to ensure that changes in page
mappings made on one processor are reflected on other processors.
(Think about support for dynamically loaded shared libraries
on multi-processors as analogous to generation of native code
in a JVM.)

Another implementation would be to create a high priority "bound"
thread on each processor and communicate with it whenever a memory
barrier needs to be done on that processor.
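
The bound-thread variant could be modelled roughly as follows.
This is a sketch under heavy assumptions: standard Java has no
portable way to bind a thread to a processor, so the workers here
are ordinary threads and the binding itself is left out;
VarHandle.fullFence() (Java 9 and later) stands in for the Alpha
mb, and nothing here corresponds to CALL_PAL IMB. BarrierBroadcast
and its methods are made-up names.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one high-priority worker thread per processor.
// ASSUMPTION: a real implementation would bind each worker to its
// processor with an OS-specific call; that step is omitted here.
final class BarrierBroadcast {
    private final ExecutorService[] workers;
    private final AtomicInteger fences = new AtomicInteger();

    BarrierBroadcast(int processorCount) {
        workers = new ExecutorService[processorCount];
        for (int p = 0; p < processorCount; p++) {
            workers[p] = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "barrier-worker");
                t.setDaemon(true);
                t.setPriority(Thread.MAX_PRIORITY);
                return t;
            });
        }
    }

    /** Asks every worker to run a full fence and waits for all of them. */
    void broadcast() {
        CountDownLatch done = new CountDownLatch(workers.length);
        for (ExecutorService w : workers) {
            w.execute(() -> {
                VarHandle.fullFence();      // stand-in for mb
                fences.incrementAndGet();
                done.countDown();
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during broadcast", e);
        }
    }

    /** Total number of fences executed by the workers so far. */
    int fencesExecuted() {
        return fences.get();
    }

    void shutdown() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
    }
}
```

The initializing thread would call broadcast() after running the
class initializer and before marking the class as initialized.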

If there are far fewer threads than processors, the JVM
could arrange to have each thread perform a barrier instead
of having each processor do so. The end effect will be the
same.

Performance
-----------
> The only reasonable way I see to fix this, is to not make
> newly loaded classes or newly compiled code available
> until the next garbage collection (at which time, everyone does
> a global memory barrier).

Class initializations typically happen rarely enough that
it is okay to execute the global memory barrier every time a
class initialization finishes (which is what you were trying
to achieve by the garbage collection). Of course the same
technique would be prohibitively expensive for instance
initializations because those happen much more frequently.

For generated native code, some amount of batching of global
memory barriers might be beneficial, but only if the JVM is
aggressively recompiling code.
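
One way such batching might look, as a hedged Java sketch:
publication steps (for example, installing a pointer to newly
generated code) are queued, and a single global barrier is issued
for the whole batch before any of them runs. BatchedPublisher, the
batch size, and the globalBarrier callback are all assumptions
made for illustration; the callback stands in for an mb plus
CALL_PAL IMB on every processor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batching global barriers; not real JVM code.
final class BatchedPublisher {
    private final Runnable globalBarrier;  // stand-in: barrier on every processor
    private final int batchSize;
    private final List<Runnable> pending = new ArrayList<>();
    private int barriersIssued;

    BatchedPublisher(Runnable globalBarrier, int batchSize) {
        this.globalBarrier = globalBarrier;
        this.batchSize = batchSize;
    }

    /** Queues a publication step; flushes once the batch is full. */
    synchronized void enqueue(Runnable publish) {
        pending.add(publish);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Issues one global barrier, then runs every queued publication. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        globalBarrier.run();
        barriersIssued++;
        for (Runnable publish : pending) {
            publish.run();
        }
        pending.clear();
    }

    synchronized int barriersIssued() {
        return barriersIssued;
    }
}
```

The cost is latency: newly compiled code stays unpublished until
the batch fills or flush() is called, so this only pays off when
recompilation is frequent.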

-Sanjay



