JavaMemoryModel: Information about IA-64 memory operations

From: Bill Pugh
Date: Fri Aug 13 1999 - 12:33:54 EDT

I exchanged some email with people at Intel and got back the following.

My notes are at the end of the quote.

At 1:09 PM -0700 8/5/99, Willis, Thomas E wrote:
>I work with Allan Knies in the IA-64 architecture group and deal
>with the memory ordering model in IA-64. He passed along your
>question regarding the performance of strong and weak memory
>ordering operations.
>The ordering model allows *any* later memory (acquire, release,
>or unordered) operation to become visible before earlier release
>operations have been made visible. This allows unordered
>load/stores or acquire loads to hit in caches even if prior
>release operations (e.g., stores) are still pending (e.g., cache
>misses).
>Furthermore, the local processor is allowed to bypass local data
>prior to it becoming globally observed, i.e. local processor loads
>can bypass from a local st.rel *before* it has been globally
>observed.
>These things help with performance.
>As long as you are hitting cache, there is essentially minimal
>performance difference between ld.acq/st.rel and ld/st.
>There can be significant performance differences if memory
>operations around an ordered operation miss cache. For example,
> st4 [a] = r1 <- store hits cache
> st4 [b] = r2 <- store misses cache
> st4.rel [c] = r3 <- store hits cache but may not be made
> visible until *all* prior memory ops
> are visible.
>The st.rel to "c" cannot be made visible until the st to "b" is
>made visible.
>Again, keep in mind that subsequent memory operations can be made
>visible before the st4.rel above becomes visible and that it is
>OK for *subsequent* local loads to bypass data from not-yet
>visible stores, i.e., from a st or st.rel for which there is a
>cache miss pending.
>If you are after sequential consistency, acquire/release semantics
>are not strong enough and you will have to use a fence (mf) either
>before a ld.acq or after st.rel. For example, consider the code
> st4.rel [a] = r1 // miss
> ld4 r2 = [b]
>versus
> st4.rel [a] = r1 // miss
> mf
> ld4 r2 = [b]
>In the first code sequence, the ld4 can be made visible before the
>st4.rel and may bypass data from the st4.rel (assuming [a] and [b]
>completely overlap). The fence in the second code sequence
>prevents the processor from executing the load before the store is
>globally visible.
>In the sequentially consistent case, the performance penalty
>depends on what happens next in the reference stream, where in the
>memory hierarchy the miss is serviced from, and the system fabric
>(e.g., how long does it take the platform to perform an
>invalidation?). The worst case occurs where you have a situation
>like that shown in the second example above where an L2 miss is
>followed by a fence that prevents the hardware from re-ordering
>memory operations after the miss.
>It is difficult to predict what the bottom line will be since it is
>a strong function of the platform.
>Hope this helps,
>Thomas E. Willis / Intel, IA-64 Processor Division Architecture Group

On the read side, it appears that making the read of a reference be
an ld.acq will suffice to guarantee initialization safety for the
fields of the object referenced, assuming we do things correctly when
the object is created. According to the note above, if that hits in
cache, there is little or no cost for doing so. If it misses the
cache, it could be expensive.

Of course, if you read 2 references, and then need initialization
safety for the fields of both, you need to make _both_ reads of the
references be ld.acq; alternatively, a _single_ fence, placed between
the reads of the references and the reads of their fields, would
suffice. But I suspect that using fences will be too expensive.


JavaMemoryModel mailing list -

This archive was generated by hypermail 2b29 : Thu Oct 13 2005 - 07:00:18 EDT