Chandler Carruth
2007-Jul-12 06:14 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
Hello,

This is the second major revision of the atomic proposal for LLVM. I will try to give a brief overview of the motivating changes, but a greater portion of the text has changed, along with some changes to the proposed additions.

http://chandlerc.net/llvm_atomics.html

- The proposal has been rewritten to better delineate the goals and purposes of LLVM, and of these additions to LLVM. The why and to what purpose has been a source of confusion, and this is now addressed directly.
- The explanation of memory barriers and why this representation was chosen has been reworked to address the concerns raised.
- The explanation of load-linked, store-conditionally atomic operations has been expanded and reworded to be more clear and precise. The paper investigating the theoretical details has been added to the references.
- Both the hardware survey and all implementation tables have been updated to include the Alpha architecture.
- The entire set of additions has been moved to intrinsic functions instead of instructions. If they merit and can benefit from moving up to instructions, they may later, but this eases the implementation burden significantly.
- The types supported have been narrowed to integers only. The sizes of integers supported are determined by the target implementation.
- GCC builtins have been added, with rough outlines of how they could be lowered to the proposed intrinsic functions.
- Other minor (and not-so-minor) cosmetic cleanups took place to make these additions integrate smoothly and present a more cohesive proposal.

A few things were not changed because they can easily be added later if and when desired, and would complicate the initial implementation. Incremental is good!

- Vector type support
- Other type support
- Acquire and release implied memory constraints on atomic operations

This revision grew directly out of the feedback provided. Please continue to review it, and hopefully we will soon see a first-rate representation of these constructs in LLVM.

Thank you again,
-Chandler Carruth
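[For reference while reading the replies below, the intrinsics under discussion look roughly as follows. These declarations are reconstructed from calls quoted later in the thread; anything not quoted there, in particular the cas signature and the meaning of the membarrier flags, is an illustrative guess rather than the proposal's exact definition.]

    ; Barrier selectively ordering the four load/store pairings
    ; (assumed order: load-load, load-store, store-load, store-store).
    declare void @llvm.atomic.membarrier( i1 %ll, i1 %ls, i1 %sl, i1 %ss )

    ; Read-modify-write "load, add, store" (fetch-and-add); shown for i32,
    ; though the supported integer widths are up to the target.
    declare i32 @llvm.atomic.las( i32* %ptr, i32 %value )

    ; Atomic exchange of the value at %ptr with %value.
    declare i32 @llvm.atomic.swap( i32* %ptr, i32 %value )

    ; Compare-and-swap; name and operand order are assumed from the
    ; thread's mention of 'cas'.
    declare i32 @llvm.atomic.cas( i32* %ptr, i32 %cmp, i32 %swap )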
Torvald Riegel
2007-Jul-12 12:23 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
Here are some comments; quotes are from the draft.

> an operation based constraint cannot guard other operations

I think constraints associated with a particular instruction usually apply to this instruction and to previous/subsequent instructions, so this wouldn't be true. This is the case in the atomic_ops model, and also on ia64 I think.

> The single instruction constraints can, at their most flexible, constrain
> any set of possible pairings of loads from memory and stores to memory

I'm not sure about this, but can we get issues due to "special" kinds of data transfers (such as vector stuff, DMA, ...)? Memcpy implementations could be one thing to look at. This kind of breaks down to how universal you want the memory model to be.

Regarding swap(): Which uses do you have in mind? To me, support for TAS would be more useful, as it is used for spinlocks (and no concurrent algorithm immediately comes to my mind that uses swap() and couldn't use TAS). The difference is that TAS can have a weaker interface, because it operates on a boolean value only, making it easier to map onto different hardware (and TAS is faster than CAS on some architectures, I think). For example, for TAS a byte is sufficient, whereas with swap you would probably require that the exchange have machine-word size.

> These implementations assume a very conservative interpretation.
> result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
>   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
>   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )

Shouldn't you have a second membar after the las() to be very conservative (i.e., if las() is supposed to really be linearizable)? Otherwise, the effects of the las() can be reordered with respect to the effects of subsequent instructions.

This also shows that you get additional overhead in the codegen if barriers are not associated with an operation: to emit efficient code, codegen would have to check whether membars are next to an operation, and whether they can be merged with the operation. If codegens don't do it, then runtime overhead will be higher due to the unnecessary barriers. This also implies that there is no reordering of las() with other operations until the codegen phase, so you would at least have to have some sort of compiler barrier, or require the las() target to be volatile (do LLVM's volatile ordering guarantees apply to non-volatile loads/stores also, or just to volatile ones?).

If you use the other approach instead (ordering constraints attached to instructions), then you have to support more intrinsics, but you don't need this kind of analysis, and you wouldn't need compiler reordering barriers, assuming that they are implicit for anything that carries any reordering constraint. I would guess that with the second approach (constraints for operations), codegen implementations could actually be easier, because you can concentrate on just one operation with its constraints.

> result = __sync_fetch_and_or( <ty>* ptr, <ty> value )

or/and/... should be added to the list of supported intrinsics at some time. x86 has built-in support for that and doesn't need the CAS loop (a sketch of the generic CAS-loop lowering follows below).

Torvald
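[The generic CAS-loop lowering mentioned above might look like the sketch below, written for i32. It uses the membarrier call quoted earlier for the conservative "full barrier" interpretation and assumes a three-operand cas intrinsic returning the prior memory contents; only the name 'cas' appears in the thread, so the exact signature is illustrative.]

    define i32 @fetch_and_or_i32( i32* %ptr, i32 %value ) {
    entry:
      ; Conservative full barrier before the atomic sequence.
      call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
      br label %loop

    loop:
      ; Read the current value, compute the new one, and try to publish it.
      %old = load i32* %ptr
      %new = or i32 %old, %value
      %prev = call i32 @llvm.atomic.cas( i32* %ptr, i32 %old, i32 %new )
      %ok = icmp eq i32 %prev, %old
      br i1 %ok, label %done, label %loop

    done:
      ; Conservative full barrier after the sequence; return the fetched value.
      call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
      ret i32 %old
    }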
David Greene
2007-Jul-12 15:06 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 07:23, Torvald Riegel wrote:
> > The single instruction constraints can, at their most flexible, constrain
> > any set of possible pairings of loads from memory and stores to memory
>
> I'm not sure about this, but can we get issues due to "special" kinds of
> data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> could be one thing to look at.
> This kind of breaks down to how universal you want the memory model to be.

Right. For example, the Cray X1 has a much richer set of memory ordering instructions than anything on the commodity micros:

http://tinyurl.com/3agjjn

The memory ordering intrinsics in the current llvm proposal can't take advantage of them because they are too coarse-grained.

Now, I don't expect we'll see an llvm-based X1 code generator, but looking at what the HPC vendors are doing in this area will go a long way toward informing the kind of operations we may want to include in llvm. The trend is for vendors to include ever more finely targeted semantics to allow scaling to machines with millions of cores.

If we can incrementally refine the size of the memory ordering hammers, I'm ok with that. If it's simply a matter of adding finer-grained intrinsics later, that's cool. But I don't want to get us into a situation where llvm requires stricter memory ordering than is strictly necessary and we can't get out from under the stone.

-Dave
Chandler Carruth
2007-Jul-12 17:48 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On 7/12/07, Torvald Riegel <torvald at se.inf.tu-dresden.de> wrote:
> Here are some comments, quotes are from the draft.
>
> > an operation based constraint cannot guard other operations
>
> I think constraints associated with a particular instruction usually apply
> to this instruction and previous/subsequent instructions, so this wouldn't
> be true. This is the case in the atomic_ops model, and also on ia64 I
> think.

"Guard other operations" means guarding operations other than the atomic intrinsic functions. Specifically, loads from and stores to shared memory are perfectly legal, and inherently atomic. However, they might need memory synchronization routines. Tying synchronization only to operations would force tying it to nearly every operation.

All that aside, as I stated in the proposal, if the _hardware_ demands it, these atomic operations could be extended to provide built-in memory synchronization semantics. A standalone memory barrier would still be required (precisely as it is in hardware), but lowerings could use the potentially better-optimized versions of the operations. I see this as an extension of the existing proposal, not a problem.

> > The single instruction constraints can, at their most flexible, constrain
> > any set of possible pairings of loads from memory and stores to memory
>
> I'm not sure about this, but can we get issues due to "special" kinds of
> data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> could be one thing to look at.
> This kind of breaks down to how universal you want the memory model to be.

Issues? Perhaps. On the target architectures, these barriers would be treated conservatively: *all* memory accesses are protected. If the hardware provides atypical memory access mechanisms, it should also provide mechanisms for protecting those, and these memory barriers would then be implemented to utilize them.

The model proposed is extremely universal because it is extremely general. Implementations for specialized hardware should take a conservative approach. If target architectures need more specialized memory constraint constructs, these could again be added incrementally in order to better support fine-tuning those architectures' performance characteristics.

> Regarding swap(): Which uses do you have in mind? To me, support for TAS
> would be more useful as it is used for spinlocks (and no concurrent algo
> immediately comes to my mind that uses swap() and couldn't use TAS). The
> difference is that TAS can have a weaker interface, because it operates on
> a boolean value only, making it easier to map on different hardware (and
> TAS is faster than CAS on some architectures I think). For example, for TAS
> a byte is sufficient, whereas with swap you probably would require that the
> exchange has machine-word size.

I cannot express this enough: this is *not*, repeat *not*, an API. Programmers will *not* be using these constructs directly. This is an interface to the hardware. So what is more useful to algorithms is not the primary consideration; rather, what the hardware provides for algorithms is.

Furthermore, TAS on most architectures is implemented with a swap. Specifically, the architecture with the closest thing to a "more efficient" test-and-set is x86, and the atomic_ops package, which I did look at, uses "xchgb". This is precisely the lowering x86 would provide for the LLVM intrinsic function "i8 @llvm.atomic.swap( i8* %ptr, i8 %swap )". A rough sketch of a test-and-set lock built on that intrinsic follows below.
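[Such a test-and-set spin lock, built on the i8 swap intrinsic above, might look like this; barrier placement follows the conservative interpretation used elsewhere in the thread, so treat it as a sketch rather than the proposal's prescribed lowering.]

    define void @spin_lock( i8* %lock ) {
    entry:
      br label %try

    try:
      ; Atomically store 1 into the lock byte and fetch its previous value;
      ; 0 means the lock was free and is now ours.
      %old = call i8 @llvm.atomic.swap( i8* %lock, i8 1 )
      %acquired = icmp eq i8 %old, 0
      br i1 %acquired, label %crit, label %try

    crit:
      ; Keep critical-section memory accesses from being hoisted above the lock.
      call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
      ret void
    }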
Thus it makes perfect sense for LLVM to support a "swap" intrinsic function in order to implement an API which provides test-and-set locks. I cannot see the benefit of limiting the LLVM representation to that of the API when a common abstraction of how to implement that API is available on every architecture targeted.

Lastly, most of the architectures which support any size selection at all support it for all of the instructions necessary to implement these operations; 'cas' and 'swap' will lower to single-byte instructions on x86, for example.

> > These implementations assume a very conservative interpretation.
> > result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
> >   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
> >   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
>
> Shouldn't you have a second membar after the las() to be very conservative
> (i.e., if las() is supposed to really be linearizable)? Otherwise, the
> effects of the las() can be reordered with respect to effects of subsequent
> instructions.

You are probably right here. It was very late, and as mentioned, the GCC spec is extremely ambiguous on the precise semantics of these intrinsics. I'm going to move to what I think are more appropriate semantics for the instructions, rather than trusting their descriptions.
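Concretely, the corrected conservative lowering would bracket the operation with a barrier on each side, along these lines (the calls are the ones quoted above, just with the second membar added):

    ; result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
    call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
    %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
    call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )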
> This also shows that you get additional overhead in the codegen if barriers
> are not associated with an operation: To emit efficient code, codegen would
> have to check whether membars are next to an operation, and whether they
> can be merged with the operation. If codegens don't do it, then runtime
> overhead will be higher due to the unnecessary barriers. This also implies
> that there is no reordering of las() with other operations until codegen
> phase, so you would at least have to have some sort of compiler barrier, or
> require the las() target to be volatile (does LLVM volatile ordering
> guarantees apply to non-volatile loads/stores also, or just to volatile
> ones?)

This is not true in any situation for any architecture but Itanium. Furthermore, given the specs of GCC's intrinsics, it is not true even for Itanium in this specific situation. The GCC specification (such as it is) for these intrinsics indicates that they must form a "full barrier". Itanium's instruction-based barriers cannot provide this functionality, forcing it to fall back on a full fence on either side of the instruction. Is this poor code? Yes, but it is required to match the semantics provided by GCC.

On every other architecture, there _are_ no instruction-based barriers. x86 almost has instruction-based barriers, as for some chips _every_ instruction is a full barrier, but this is not what you are aiming for either, and it is slowly changing in newer x86 implementations. For the other architectures, memory barriers _are_ separate instructions, and x86 likewise has separate instructions for memory barriers. The memory barriers, as proposed, will provide all the necessary barriers against compiler reordering.

> If you use the other approach instead (ordering constraint attached to
> instructions), then you have to support more intrinsics, but you don't need
> this kind of analysis, and you wouldn't need compiler reordering barriers
> assuming that they are implicit for anything that carries any reordering
> constraint.
>
> I would guess that with the second approach (constraints for operations),
> codegen implementations could be actually easier because you can
> concentrate on just one operation with constraints.

As stated above, this assumes that all architectures work this way, which they simply do not. This is a hardware interface, and as such it follows the hardware. The codegen will not be easier or harder in either of these cases.

> > result = __sync_fetch_and_or( <ty>* ptr, <ty> value )
>
> or/and/... should be added to the list of supported intrinsics at some time.
> x86 has built-in support for that and doesn't need the CAS loop.

It doesn't have built-in support. These are "fetch and *" operations. x86 has support for an atomic "or" to memory which does not include any fetch. x86 has "xadd", which allows a fetch-and-add (and a fetch-and-sub through negation) without the compare and swap; however, neither "or" nor "and" is available in this fashion. Perhaps I am missing something in the x86 instruction set, but I can't find any other way to perform these operations. The many non-fetching atomic operations in x86 are geared more toward dispatching and joining of threads than toward synchronizing contending threads.

-Chandler Carruth
Torvald Riegel
2007-Jul-13 13:54 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 14:23, Torvald Riegel wrote:
> Regarding swap(): Which uses do you have in mind? To me, support for TAS
> would be more useful as it is used for spinlocks (and no concurrent algo
> immediately comes to my mind that uses swap() and couldn't use TAS). The

Please ignore this statement; keep get-and-set. I should implement more locks ... Nevertheless, I'd still like to see test-and-set.

torvald