Chandler Carruth
2007-Jul-12 06:14 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
Hello,

This is the second major revision of the atomic proposal for LLVM. I will try to give a brief overview of the motivating changes, but a greater portion of the text has changed, along with some changes to the proposed additions.

http://chandlerc.net/llvm_atomics.html

- The proposal has been rewritten to better delineate the goals and purposes of LLVM, and of these additions to LLVM. The why and to what purpose has been a source of confusion, and this is addressed directly.
- The explanation of memory barriers and why that representation was chosen has been reworked to address concerns raised.
- The explanation of load-linked, store-conditional atomic operations has been expanded and reworded to be more clear and precise. The paper in which to investigate the theoretical details has been added to the references.
- Both the hardware survey and all implementation tables have been updated to include the Alpha architecture.
- The entire set of additions has been moved to intrinsic functions instead of instructions. If these merit and can benefit from moving up to instructions, they may, but this eases the implementation burden significantly.
- The types supported have been narrowed to integers only. The sizes of integers supported are determined by the target implementation.
- GCC builtins have been added, with rough outlines of how they could be lowered to the proposed intrinsic functions.
- Other minor (and not-so-minor) cosmetic cleanup took place to make these additions integrate smoothly and present a more cohesive proposal.

A few things were not changed, because they can easily be added later if and when desired, and would complicate the initial implementation. Incremental is good!

- Vector type support
- Other type support
- Acquire and release implied memory constraints on atomic operations

This revision grew directly out of the feedback provided. Please continue to review this, and hopefully we will see a first-rate representation of these constructs in LLVM soon.

Thank you again,
-Chandler Carruth
Torvald Riegel
2007-Jul-12 12:23 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
Here are some comments; quotes are from the draft.

> an operation based constraint cannot guard other operations

I think constraints associated with a particular instruction usually apply to this instruction and previous/subsequent instructions, so this wouldn't be true. This is the case in the atomic_ops model, and also on ia64 I think.

> The single instruction constraints can, at their most flexible, constrain
> any set of possible pairings of loads from memory and stores to memory

I'm not sure about this, but can we get issues due to "special" kinds of data transfers (such as vector stuff, DMA, ...)? Memcpy implementations could be one thing to look at. This kind of breaks down to how universal you want the memory model to be.

Regarding swap(): Which uses do you have in mind? To me, support for TAS would be more useful as it is used for spinlocks (and no concurrent algo immediately comes to my mind that uses swap() and couldn't use TAS). The difference is that TAS can have a weaker interface, because it operates on a boolean value only, making it easier to map onto different hardware (and TAS is faster than CAS on some architectures I think). For example, for TAS a byte is sufficient, whereas with swap you probably would require that the exchange has machine-word size.

> These implementations assume a very conservative interpretation.
> result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
>
>   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
>   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )

Shouldn't you have a second membar after the las() to be very conservative (i.e., if las() is supposed to really be linearizable)? Otherwise, the effects of the las() can be reordered with respect to effects of subsequent instructions.

This also shows that you get additional overhead in the codegen if barriers are not associated with an operation: to emit efficient code, codegen would have to check whether membars are next to an operation, and whether they can be merged with the operation. If codegens don't do it, then runtime overhead will be higher due to the unnecessary barriers. This also implies that there is no reordering of las() with other operations until the codegen phase, so you would at least have to have some sort of compiler barrier, or require the las() target to be volatile (do LLVM volatile ordering guarantees apply to non-volatile loads/stores also, or just to volatile ones?)

If you use the other approach instead (ordering constraints attached to instructions), then you have to support more intrinsics, but you don't need this kind of analysis, and you wouldn't need compiler reordering barriers, assuming that they are implicit for anything that carries any reordering constraint. I would guess that with the second approach (constraints for operations), codegen implementations could actually be easier, because you can concentrate on just one operation with its constraints.

> result = __sync_fetch_and_or( <ty>* ptr, <ty> value )

or/and/... should be added to the list of supported intrinsics at some time. x86 has built-in support for that and doesn't need the CAS loop.

Torvald
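(As a concrete illustration of the TAS argument above: a minimal byte-wide spinlock sketch using GCC's real test-and-set builtins, which are the same __sync_* family this thread discusses. The lock_t, spin_lock, and spin_unlock names are purely illustrative, not from the draft.)

    /* TAS needs only a single byte and acquire semantics -- a weaker
       contract than a full machine-word swap. */
    typedef unsigned char lock_t;

    static void spin_lock(lock_t *l)
    {
        /* __sync_lock_test_and_set atomically stores 1 and returns the
           previous value; it is an acquire barrier, not a full barrier. */
        while (__sync_lock_test_and_set(l, 1))
            ;  /* spin until the previous value was 0 (lock was free) */
    }

    static void spin_unlock(lock_t *l)
    {
        /* stores 0 with release-barrier semantics */
        __sync_lock_release(l);
    }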
David Greene
2007-Jul-12 15:06 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 07:23, Torvald Riegel wrote:

> > The single instruction constraints can, at their most flexible, constrain
> > any set of possible pairings of loads from memory and stores to memory
>
> I'm not sure about this, but can we get issues due to "special" kinds of
> data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> could be one thing to look at.
> This kind of breaks down to how universal you want the memory model to be.

Right. For example, the Cray X1 has a much richer set of memory ordering instructions than anything on the commodity micros:

http://tinyurl.com/3agjjn

The memory ordering intrinsics in the current llvm proposal can't take advantage of them because they are too coarse-grained.

Now, I don't expect we'll see an llvm-based X1 code generator, but looking at what the HPC vendors are doing in this area will go a long way toward informing the kinds of operations we may want to include in llvm. The trend is for vendors to include ever more finely targeted semantics to allow scaling to machines with millions of cores.

If we can incrementally refine the size of the memory ordering hammers, I'm ok with that. If it's simply a matter of adding finer-grained intrinsics later, that's cool. But I don't want to get us into a situation where llvm requires stricter memory ordering than is strictly necessary and we can't get out from under the stone.

-Dave
Chandler Carruth
2007-Jul-12 17:48 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On 7/12/07, Torvald Riegel <torvald at se.inf.tu-dresden.de> wrote:

> Here are some comments; quotes are from the draft.
>
> > an operation based constraint cannot guard other operations
>
> I think constraints associated with a particular instruction usually apply
> to this instruction and previous/subsequent instructions, so this wouldn't
> be true. This is the case in the atomic_ops model, and also on ia64 I
> think.

"guard other operations" means guarding operations other than the atomic intrinsic functions. Specifically, loads from and stores to shared memory are perfectly legal, and inherently atomic. However, they might need memory synchronization routines. Tying synchronization to only operations would force tying them to nearly every operation.

All that aside, as I stated in the proposal, if the _hardware_ demands it, these atomic operations could be extended to provide built-in memory synchronization semantics. A standalone memory barrier would still be required (precisely as it is in hardware), but lowerings could use the potentially better optimized versions of the operations. I see this as an extension of the existing proposal, not a problem.

> > The single instruction constraints can, at their most flexible, constrain
> > any set of possible pairings of loads from memory and stores to memory
>
> I'm not sure about this, but can we get issues due to "special" kinds of
> data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> could be one thing to look at.
> This kind of breaks down to how universal you want the memory model to be.

Issues? Perhaps. On the target architectures, these barriers would be treated conservatively: *all* memory accesses are protected. If the hardware provides atypical memory access mechanisms, it should also provide mechanisms for protecting those, and these memory barriers would then be implemented to utilize them.

The model proposed is extremely universal because it is extremely general. Implementations for specialized hardware should take a conservative approach. If target architectures need more specialized memory constraint constructs, these could again be added incrementally in order to better support fine-tuning those architectures' performance characteristics.

> Regarding swap(): Which uses do you have in mind? To me, support for TAS
> would be more useful as it is used for spinlocks (and no concurrent algo
> immediately comes to my mind that uses swap() and couldn't use TAS). The
> difference is that TAS can have a weaker interface, because it operates on
> a boolean value only, making it easier to map onto different hardware (and
> TAS is faster than CAS on some architectures I think). For example, for TAS
> a byte is sufficient, whereas with swap you probably would require that the
> exchange has machine-word size.

I cannot express this enough: this is *not*, repeat *not*, an API. Programmers will *not* be using these constructs directly. This is an interface to the hardware. So what is more useful to algorithms is not the primary consideration; rather, what the hardware provides for algorithms is.

Furthermore, TAS on most architectures is implemented with a swap. Specifically, the architecture with the closest thing to a "more efficient" test-and-set is x86, and the atomic_ops package, which I did look at, uses "xchgb". This is precisely the lowering x86 would provide for the LLVM intrinsic function "i8 @llvm.atomic.swap( i8* %ptr, i8 swap )".
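(For concreteness, that xchgb lowering is roughly the following inline-assembly sketch; the function name is illustrative only. On x86, an xchg with a memory operand is implicitly locked, so no lock prefix is needed.)

    static unsigned char swap_i8(unsigned char *ptr, unsigned char val)
    {
        /* byte-wide atomic exchange -- the xchgb lowering described above */
        __asm__ __volatile__("xchgb %0, %1"
                             : "+q"(val), "+m"(*ptr)
                             :
                             : "memory");  /* also a compiler reordering barrier */
        return val;  /* the byte previously stored at *ptr */
    }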
Thus it makes perfect sense for LLVM to support a "swap" intrinsic function in order to implement an API which provides test-and-set locks. I cannot see the benefit of limiting the LLVM representation to that of the API when a common abstraction of how to implement that API is available on every architecture targeted.

Lastly, most of the architectures which support any size selection at all support it for all of the instructions necessary to implement these operations. 'cas' and 'swap' will lower to single-byte instructions on x86, for example.

> > These implementations assume a very conservative interpretation.
> > result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
> >
> >   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
> >   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
>
> Shouldn't you have a second membar after the las() to be very conservative
> (i.e., if las() is supposed to really be linearizable)? Otherwise, the
> effects of the las() can be reordered with respect to effects of subsequent
> instructions.

You are probably right here. It was very late, and as mentioned, the GCC spec is extremely ambiguous on the precise semantics for these intrinsics. I'm going to move to what I think are more appropriate semantics for the instructions, rather than trusting their descriptions.

> This also shows that you get additional overhead in the codegen if barriers
> are not associated with an operation: to emit efficient code, codegen would
> have to check whether membars are next to an operation, and whether they
> can be merged with the operation. If codegens don't do it, then runtime
> overhead will be higher due to the unnecessary barriers. This also implies
> that there is no reordering of las() with other operations until the codegen
> phase, so you would at least have to have some sort of compiler barrier, or
> require the las() target to be volatile (do LLVM volatile ordering
> guarantees apply to non-volatile loads/stores also, or just to volatile
> ones?)

This is not true in any situation for any architecture but Itanium. Furthermore, given the specs of GCC's intrinsics, it is not true even for Itanium in this specific situation. The GCC specification (such as it is) for these intrinsics indicates that they must form a "full barrier". Itanium's instruction-based barriers cannot provide this functionality, forcing it to fall back on a full fence on either side of the instruction. Is this poor code? Yes it is, but it is required to match the semantics provided by GCC.

On every other architecture, there _are_ no instruction-based barriers. x86 almost has instruction-based barriers, as for some chips _every_ instruction is a full barrier, but this is not what you are aiming for either, and it is slowly changing in newer x86 implementations. For the other architectures, memory barriers _are_ separate instructions, and x86 has separate instructions for memory barriers as well. The memory barriers, as proposed, will provide all necessary barriers against compiler reordering.

> If you use the other approach instead (ordering constraints attached to
> instructions), then you have to support more intrinsics, but you don't need
> this kind of analysis, and you wouldn't need compiler reordering barriers,
> assuming that they are implicit for anything that carries any reordering
> constraint.
>
> I would guess that with the second approach (constraints for operations),
> codegen implementations could actually be easier, because you can
> concentrate on just one operation with its constraints.

As stated above, this assumes that all architectures work this way, which they simply do not. This is a hardware interface, and as such it follows the hardware. The codegen will be neither easier nor harder in either of these cases.

> > result = __sync_fetch_and_or( <ty>* ptr, <ty> value )
>
> or/and/... should be added to the list of supported intrinsics at some time.
> x86 has built-in support for that and doesn't need the CAS loop.

It doesn't have built-in support. These are "fetch and *" operations. x86 has support for an atomic "or" to memory which does not include any fetch. x86 has "xadd", which allows fetch-and-add (and fetch-and-sub through negation) without the compare-and-swap; however, neither "or" nor "and" is available in this fashion. Perhaps I am missing something in the x86 instruction set, but I can't find any other way to perform these operations. The many non-fetching atomic operations in x86 are geared more toward dispatching and joining of threads than toward synchronizing contending threads.

-Chandler Carruth
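(For concreteness, the CAS loop in question would look roughly like this in C, using GCC's real __sync_val_compare_and_swap builtin; the fetch_and_or wrapper name merely mirrors the __sync_fetch_and_or interface being discussed.)

    static unsigned fetch_and_or(unsigned *ptr, unsigned value)
    {
        unsigned old = *ptr;  /* initial guess; may already be stale */
        for (;;) {
            /* returns the value actually found at *ptr at the time of the CAS */
            unsigned seen = __sync_val_compare_and_swap(ptr, old, old | value);
            if (seen == old)
                return old;   /* CAS succeeded; old is the fetched value */
            old = seen;       /* lost a race; retry with the fresh value */
        }
    }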
Torvald Riegel
2007-Jul-13 13:54 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 14:23, Torvald Riegel wrote:

> Regarding swap(): Which uses do you have in mind? To me, support for TAS
> would be more useful as it is used for spinlocks (and no concurrent algo
> immediately comes to my mind that uses swap() and couldn't use TAS). The

Please ignore this statement, keep get-and-set. I should implement more locks ... Nevertheless, I'd still like to see test-and-set.

torvald