Re: JavaMemoryModel: cost of initialization-safety on an Alpha SMP

Date: Thu Aug 26 1999 - 14:38:27 EDT

Bill asked me to collect some more information related to the
cost of initialization safety. These numbers won't make sense
unless you have read my earlier message, which can be found
in the mailing-list archive.

Here are the extra statistics I collected. These were for a run of
the JVM where in one run it executed each SPEC benchmark k times. I
have given the running time (user, system, and elapsed), as well as
static and dynamic counts of memory-barriers and write-memory-barriers.
I have also included a static count of total number of instructions
emitted (not including any barriers or counting code). The four
configurations are:

none    Base case (no initialization safety)
unopt   mb before every getfield
dflow   dataflow optimization
both    dataflow and "this" optimization

Machine configuration (same as before):
        #Processors: 2
        Processor: 21264a
        Speed: 667 MHz
        On-chip-caches: 64KB D + 64KB I
        Board-cache: 4MB per processor
        RAM: 512MB
        OS: Tru64 Unix 4.0f

                                        static counts          dynamic counts
                                    ------------------    ----------------------
config  user(s)  sys(s)  elapsed     insts    mb   wmb             mb       wmb
none      213.6    11.1    226.8    222416     0     0              0         0
unopt     721.2    12.6    732.4    223221  7858  1988    12987952675  67180227
dflow     408.6    11.0    421.2    222545  2254  1988     4029712036  67180227
both      353.6    11.0    357.8    227165  5881  1988     3128366384  67180245

Some interesting points and clarifications:

1. user + sys is sometimes bigger than elapsed because one of the benchmarks
   is multi-threaded.
2. The static count of instructions changes because of alignment effects
   caused by the insertion of memory barriers.
3. The static count of memory-barriers is higher in "this" configuration
   as compared to "dflow", but the dynamic count is lower.
4. The dynamic count of wmb changes in the "this" configuration. I think
   this is because the SPEC harness's execution depends on timing:
   it measures the elapsed time and prints it out, and the printing code
   may print a varying number of characters across the four configurations.
5. The ratio of the user times for "dflow" and "none" does not match
   exactly the 63% difference in SPEC numbers reported earlier because the
   SPEC number is a geometric mean of scaled inverses of running times,
   whereas the measure here is a sum of the running times.

I did one further calculation to find the cycle cost of each executed
memory-barrier. This calculation involves taking the increase in
user-time over the user-time for the "none" configuration, converting
it to cycles, and dividing by the dynamic count of memory-barriers.
(Remember, the clock frequency is 667 MHz.)

   avg_mb_cycles = ((user_time - none_user_time) * 667e6) / dynamic_mb_count

(This equation slightly undercounts the cycles per mb because one of
the benchmarks is multi-threaded and so can consume more than 667e6
cycles of user time per elapsed second on this two-processor machine.)

config  cycles
unopt       26
dflow       32
both        30
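As a check, the per-mb figures above can be reproduced directly from the
earlier table; this minimal sketch just plugs the measured user times and
dynamic mb counts into the equation:

```python
# User times (s) and dynamic mb counts copied from the table above.
NONE_USER = 213.6   # user time of the "none" configuration
CLOCK_HZ = 667e6    # 667 MHz clock

runs = {
    "unopt": (721.2, 12987952675),
    "dflow": (408.6, 4029712036),
    "both":  (353.6, 3128366384),
}

avg_mb_cycles = {}
for config, (user_time, mb_count) in runs.items():
    # Extra user time relative to "none", converted to cycles per mb.
    avg_mb_cycles[config] = round((user_time - NONE_USER) * CLOCK_HZ / mb_count)

print(avg_mb_cycles)  # {'unopt': 26, 'dflow': 32, 'both': 30}
```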

These numbers seem to be in the right ballpark because a tight loop
that executes a memory-barrier over and over again takes about 35
cycles per iteration. (Which is very close to the latency of an
access to the board-cache on this machine.)

-Sanjay Ghemawat
 Compaq Systems Research Center

JavaMemoryModel mailing list

This archive was generated by hypermail 2b29 : Thu Oct 13 2005 - 07:00:18 EDT