> > Does that really buy you performance at user level? Unlike thread locals,
> > the semantics of "CPULocals" seem nontrivial, since you can get preempted
> > halfway through an update ...
> that should not be a problem for atomic writes to pre-allocated storage.
Yes, but it seems to add lots of additional corner cases that would need to
be defined. And CPU locals don't seem to be very intuitive for the

> and i also think that there should not be any problem to add full mutex
> based synchronization (still atomic writes but to dynamically allocated
> storage using internal DCL w/o memory barriers) so that get/lookup calls
> would still _not_ need any synchronization / memory barriers - that is
> what really buys performance. however, it is really important that set()
> after get() should not assume that it updates the same CPULocal variable
> which was checked via get() - CPU may change - a thread could be running
> on a different CPU after get() (fortunately that will make _no_
> difference with respect to memory visibility).
I guess I'm still confused as to what we're trying to accomplish. Why are
CPU locals better than thread locals?

Thread locals seem to be implementable with moderate overhead, though I
would clearly like to see faster implementations. But I'm having trouble
coming up with an implementation of user-level CPU locals that's much
faster. My understanding is that there are substantial costs associated
with having either per-thread or per-CPU address mappings. Thus in either
case, you would have to look up a thread or CPU id, and then use that
either to get to a local storage pointer in the thread or CPU descriptor, or
as an index into a multiple-concurrent-reader hash table of some sort. In
the per-CPU case, the thread may get moved to another CPU between the CPU id
lookup and the actual read, so unless you somehow inhibit preemption, you
may still get the wrong one.
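
To make the concern concrete, here is a rough sketch in C of the "look up
an id, then index a table" scheme, next to a plain thread local. Everything
in it is an assumption for illustration: the names cpu_local_get,
cpu_local_set, and MAX_CPUS are made up, and sched_getcpu() is a
glibc/Linux call, not something any proposal here specifies. It is not
meant as a real implementation, only to show where the window is.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>      /* sched_getcpu(): glibc/Linux-specific */
#include <stdatomic.h>

#define MAX_CPUS 1024   /* hypothetical upper bound on CPU ids */

/* Pre-allocated storage, one slot per CPU. Atomic so that a single
   write cannot be observed half-done after a preemption. */
_Atomic long cpu_slot[MAX_CPUS];

/* Thread-local counterpart: no id lookup visible at source level. */
_Thread_local long thread_slot;

/* Step 1: look up the id of the CPU we are running on right now. */
int current_cpu(void) {
    int cpu = sched_getcpu();
    return (cpu < 0) ? 0 : cpu % MAX_CPUS;  /* fall back to slot 0 on error */
}

/* Step 2: use the id as an index. The thread may be moved to another
   CPU between the two steps, so the slot read may belong to a CPU the
   thread is no longer running on. */
long cpu_local_get(void) {
    int cpu = current_cpu();
    assert(cpu >= 0 && cpu < MAX_CPUS);
    /* <-- preemption or migration can happen here */
    return atomic_load_explicit(&cpu_slot[cpu], memory_order_relaxed);
}

/* A set() after a get() repeats the lookup, so it may update a
   different slot than the one the earlier get() examined. */
void cpu_local_set(long value) {
    int cpu = current_cpu();
    assert(cpu >= 0 && cpu < MAX_CPUS);
    atomic_store_explicit(&cpu_slot[cpu], value, memory_order_relaxed);
}

long thread_local_get(void)        { return thread_slot; }
void thread_local_set(long value)  { thread_slot = value; }
```

Note that a cpu_local_set(5) followed by cpu_local_get() is not guaranteed
to return 5, whereas the thread-local pair always reads back what the same
thread wrote. The per-CPU path also pays for the id lookup on every access,
which is why I don't see where the speedup would come from.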

