Performance of techniques for correctly implementing lazy initialization

by Doug Lea

[This note was originally sent as email by Doug Lea on the Java Memory Model mailing list in response to questions about the performance of a technique implementing lazy initialization using ThreadLocals -- Bill Pugh]

The main concern, which I should have mentioned before, is that ThreadLocal varies tremendously in speed across JVMs and JDK versions. On most 1.2.x JVMs, performance is so bad in this context that you'd never want to use it. (The main reason is that until 1.3, ThreadLocal internally used WeakHashMaps, which are needlessly heavy. The 1.4 version will in turn be faster than 1.3.)
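For reference, the ThreadLocal-based technique being measured can be sketched roughly as follows. This is only an illustrative sketch, not the code that was benchmarked: the names ThreadLocalLazyDemo, Holder and Singleton are made up for the example, and it uses the generic ThreadLocal of later JDKs rather than the 1.2/1.3 API. Each thread synchronizes once, then records in a ThreadLocal that it has done so, so later calls from that thread skip the lock.

```java
public class ThreadLocalLazyDemo {
    /** Hypothetical singleton type, used only for illustration. */
    static final class Singleton {
        final int value = 42;
    }

    static final class Holder {
        // Non-null once the current thread has gone through the synchronized block.
        private static final ThreadLocal<Boolean> synced = new ThreadLocal<>();

        // Written only while holding the lock on Holder.class.
        private static Singleton instance;

        static Singleton get() {
            if (synced.get() == null) {
                // First call from this thread: synchronize, creating the
                // instance if no other thread has done so yet.
                synchronized (Holder.class) {
                    if (instance == null) {
                        instance = new Singleton();
                    }
                }
                synced.set(Boolean.TRUE);
            }
            // This thread has synchronized after the instance was written,
            // so the unsynchronized read here sees the initialized object.
            return instance;
        }
    }

    public static void main(String[] args) {
        Singleton a = Holder.get();
        Singleton b = Holder.get();
        System.out.println(a == b);
    }
}
```

The cost of every call after the first is one ThreadLocal lookup, which is why the speed of ThreadLocal itself dominates the results.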

You can usually avoid this uncertainty, though, if you need to.

If you can create and use your own thread subclass, you can implement your own variants of ThreadLocals. (See Section 2.3.2.1 of the 2nd edition of my CPJ book.) In fact, if you know in advance all of the singletons you'll use, you don't need a table; fields in the thread subclass will do. You can squeeze times even further if you can pass in Thread refs rather than looking them up each time via Thread.currentThread. The attached file shows examples/hacks. I'm not sure I recommend any of this, but if you are going to go this route, you might as well make it both fast and correct.
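A minimal sketch of the thread-subclass idea, under some assumptions: the names ThreadFieldDemo, SingletonThread, Holder and Singleton are hypothetical and are not taken from the attached file. The one-argument get corresponds to passing the thread reference in directly; the no-argument version looks it up via Thread.currentThread and only works when called from a SingletonThread.

```java
public class ThreadFieldDemo {
    /** Hypothetical singleton type, used only for illustration. */
    static final class Singleton {
        final int value = 42;
    }

    /** Hypothetical thread subclass with one field per known singleton. */
    static class SingletonThread extends Thread {
        // Per-thread cache; read and written only by the owning thread.
        Singleton cached;

        SingletonThread(Runnable r) {
            super(r);
        }
    }

    static final class Holder {
        // Written only while holding the lock on Holder.class.
        private static Singleton instance;

        // Looks up the current thread on every call. Throws
        // ClassCastException if the caller is not a SingletonThread.
        static Singleton get() {
            return get((SingletonThread) Thread.currentThread());
        }

        // Variant for callers that already hold a reference to their own
        // thread. Assumes 'self' really is the calling thread.
        static Singleton get(SingletonThread self) {
            Singleton s = self.cached;
            if (s == null) {
                synchronized (Holder.class) {
                    if (instance == null) {
                        instance = new Singleton();
                    }
                    s = instance;
                }
                self.cached = s; // later calls from this thread skip the lock
            }
            return s;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final Singleton[] seen = new Singleton[2];
        SingletonThread t = new SingletonThread(() -> {
            seen[0] = Holder.get();
            seen[1] = Holder.get((SingletonThread) Thread.currentThread());
        });
        t.start();
        t.join();
        System.out.println(seen[0] == seen[1]);
    }
}
```

The fast path is a single field read on the thread object, with no table lookup at all, at the price of requiring every calling thread to be an instance of the subclass.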

Due to the nice folks at http://www.testdrive.compaq.com, I did test out some of this on alphas. (Testdrive is a very nice service! Anyone can register. It would be great if other MP vendors did this too.)

The fastest versions of Java I could find on MP alphas at testdrive were 1.2.2 VMs on a 2X500 running Tru64 and a 4X667 running Linux. The 4-CPU box failed some of Bill's "volatile" tests (at http://www.cs.umd.edu/~pugh/java/memoryModel/). I gather that these JVMs don't use enough barriers even for "old" volatile (which is itself insufficient to guarantee that double-check works).

The machines were NOT idle (load average was usually around one), but repeated tests gave about the same ratios, so these figures are probably in the right ballpark.

Here are the results. (The 3rd and 4th columns are 4-CPU sparc, and the last 2 columns are results on basically the same tests, taken from the last post.) Table entries are ratios relative to the "Eager" version of Singleton.

CPUs            4-CPU     2-CPU     4-CPU     4-CPU     2-CPU     1-CPU
chip            alpha     alpha     sparc     sparc     x86       sparc
OS              linux     Tru64     sol 8     sol 8     ?         Sol 98
JDK             1.2.2     1.2.2     1.3       1.2.2_07  1.3       1.3

Eager           1.00      1.00      1.00      1.00      1.00      1.00
Volatile(DCL)   1.09      1.01      1.22      1.34      1.31      1.18
ThreadLocal     300.80    17.84     6.32      240.74    6.50      5.01
SimThreadLocal  4.43      4.19      4.81      2.39      ?         ?
Synch           189.26    5.73      69.03     66.41     32.12     9.64
Thread Field    2.16      2.71      4.16      2.00      ?         ?
Direct Field    1.00      1.25      1.18      1.29      ?         ?

Notes: