David Greene
2007-Jul-12 15:06 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 07:23, Torvald Riegel wrote:
> > The single instruction constraints can, at their most flexible,
> > constrain any set of possible pairings of loads from memory and
> > stores to memory
>
> I'm not sure about this, but can we get issues due to "special" kinds
> of data transfers (such as vector stuff, DMA, ...)? Memcpy
> implementations could be one thing to look at.
> This kind of breaks down to how universal you want the memory model
> to be.

Right. For example, the Cray X1 has a much richer set of memory ordering
instructions than anything on the commodity micros:

http://tinyurl.com/3agjjn

The memory ordering intrinsics in the current llvm proposal can't take
advantage of them because they are too coarse-grained (see the sketch
following this message).

Now, I don't expect we'll see an llvm-based X1 code generator, but
looking at what the HPC vendors are doing in this area will go a long
way toward informing the kind of operations we may want to include in
llvm. The trend is for vendors to include ever more finely targeted
semantics to allow scaling to machines with millions of cores.

If we can incrementally refine the size of the memory ordering hammers,
I'm ok with that. If it's simply a matter of adding finer-grained
intrinsics later, that's cool. But I don't want to get us into a
situation where llvm requires stricter memory ordering than is strictly
necessary and we can't get out from under the stone.

-Dave
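For concreteness, here is a minimal sketch of the coarse-grained barrier
being discussed: a single intrinsic whose boolean flags select which
load/store pairings may not be reordered across it. The intrinsic name
and signature below are assumptions inferred from the thread, not text
taken from the v2 proposal.

    ; Assumed shape of the pairing-flag memory barrier (not authoritative).
    ; Each i1 flag, when true, forbids reordering the corresponding pairing
    ; across the barrier: ll = load-load, ls = load-store,
    ; sl = store-load, ss = store-store.
    declare void @llvm.atomic.membarrier(i1, i1, i1, i1)

    define void @full_fence() {
    entry:
      ; Constrain all four pairings -- the biggest hammer available.
      call void @llvm.atomic.membarrier(i1 true, i1 true, i1 true, i1 true)
      ret void
    }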
Dan Gohman
2007-Jul-12 15:56 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thu, Jul 12, 2007 at 10:06:04AM -0500, David Greene wrote:
> On Thursday 12 July 2007 07:23, Torvald Riegel wrote:
> > > The single instruction constraints can, at their most flexible,
> > > constrain any set of possible pairings of loads from memory and
> > > stores to memory
> >
> > I'm not sure about this, but can we get issues due to "special"
> > kinds of data transfers (such as vector stuff, DMA, ...)? Memcpy
> > implementations could be one thing to look at.
> > This kind of breaks down to how universal you want the memory model
> > to be.
>
> Right. For example, the Cray X1 has a much richer set of memory
> ordering instructions than anything on the commodity micros:
>
> http://tinyurl.com/3agjjn
>
> The memory ordering intrinsics in the current llvm proposal can't take
> advantage of them because they are too coarse-grained.

I guess the descriptions on that page are, heh, a little terse ;-). The
Cray X1 has a dimension of synchronization that isn't covered in this
proposal, and that's the set of observers that need to observe the
ordering. For example, you can synchronize a team of streams in a
multi-streaming processor without requiring that the ordering of memory
operations be observed by the entire system. That's what motivates most
of the variety in that list.

There's one other specific aspect I'd like to point out here. There's an
"acquire" which orders prior *scalar* loads with *all* subsequent memory
accesses, and a "release" which orders *all* prior accesses with
subsequent *scalar* stores. The Cray X1's interest in distinguishing
scalar accesses from vector accesses is specific to its architecture,
but in general, it is another case that motivates having more
granularity than just "all loads" and "all stores" (see the sketch after
this message).

Overall though, I'm quite happy to see that the newest revision of the
proposal has switched from LLVM instructions to LLVM intrinsics. That
will make it easier to experiment with extensions in the future. And
having the string "atomic" right there in the names of each operation is
very much appreciated :-).

Dan

--
Dan Gohman, Cray Inc.
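To illustrate the point above, here is how the closest approximations of
the X1's scalar-only acquire and release look when only "all loads" and
"all stores" are available. This reuses the assumed pairing-flag
intrinsic sketched earlier and is not code from the proposal.

    ; Reusing the assumed pairing-flag intrinsic from the earlier sketch.
    declare void @llvm.atomic.membarrier(i1, i1, i1, i1)

    define void @acquire_like() {
    entry:
      ; Order *all* prior loads before *all* subsequent loads and stores
      ; (ll, ls) -- strictly stronger than the X1's scalar-load acquire.
      call void @llvm.atomic.membarrier(i1 true, i1 true, i1 false, i1 false)
      ret void
    }

    define void @release_like() {
    entry:
      ; Order *all* prior loads and stores before *all* subsequent stores
      ; (ls, ss) -- strictly stronger than the X1's scalar-store release.
      call void @llvm.atomic.membarrier(i1 false, i1 true, i1 false, i1 true)
      ret void
    }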
Chandler Carruth
2007-Jul-12 17:59 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On 7/12/07, David Greene <dag at cray.com> wrote:
> On Thursday 12 July 2007 07:23, Torvald Riegel wrote:
> > > The single instruction constraints can, at their most flexible,
> > > constrain any set of possible pairings of loads from memory and
> > > stores to memory
> >
> > I'm not sure about this, but can we get issues due to "special"
> > kinds of data transfers (such as vector stuff, DMA, ...)? Memcpy
> > implementations could be one thing to look at.
> > This kind of breaks down to how universal you want the memory model
> > to be.
>
> Right. For example, the Cray X1 has a much richer set of memory
> ordering instructions than anything on the commodity micros:
>
> http://tinyurl.com/3agjjn

Thanks for this link! Very interesting to see an architecture which pays
much more attention to its memory ordering.

> The memory ordering intrinsics in the current llvm proposal can't take
> advantage of them because they are too coarse-grained.

From what I can glean, this coarseness comes in two flavors -- global
vs. local memory access, and type-based granularities. Is this a correct
interpretation? (I'm clearly not going to be an expert on the X1. ;])

> Now, I don't expect we'll see an llvm-based X1 code generator, but
> looking at what the HPC vendors are doing in this area will go a long
> way toward informing the kind of operations we may want to include in
> llvm. The trend is for vendors to include ever more finely targeted
> semantics to allow scaling to machines with millions of cores.

Absolutely! Like I said, it's great to see this kind of information. A
few points about the current proposal:

1) It currently only deals with integers, in order to keep it simple to
implement and representable across all architectures. While this is
limiting, I think it remains a good starting point, and it shouldn't
cause any problems for later expansion to more type-aware
interpretations. (A sketch of these integer-only operations follows
this message.)

2) The largest assumption made is that all memory is just "memory".
Beyond that, the most fine-grained interpretation of barriers available
was chosen (note that only SPARC can do all the various combinations...
most architectures only use one big fence...). The only major thing I
can see that would increase this granularity is to treat different types
differently, or to treat them as going into different parts of "memory".
I'm really not sure here, but it definitely is something to look into.
However, I think this may require a much later proposal, once the
hardware is actively being used at this level and we can try to find a
more fine-grained way of targeting all the available architectures. For
the time being, it seems that the current proposal hits all the
architectures very neatly.

> If we can incrementally refine the size of the memory ordering
> hammers, I'm ok with that. If it's simply a matter of adding
> finer-grained intrinsics later, that's cool. But I don't want to get
> us into a situation where llvm requires stricter memory ordering than
> is strictly necessary and we can't get out from under the stone.

With the current version you can specify exactly what ordering you
desire; the only thing ignored is the type of the various loads and
stores. I think adding that level of granularity to the existing highly
granular pairing selection would be a smooth incremental update. Is
there another update you see needed that would be less smooth?

Again, thanks for the information on the X1's memory architecture, very
interesting... I'm going to try to get into it a bit more in a response
to Dan Gohman's email below... =]

-Chandler
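As a rough illustration of the integer-only operations mentioned in
point 1 above, the sketch below shows the general shape of the proposed
atomic intrinsics; the exact names and signatures are my approximation
and may not match the proposal text.

    ; Approximate shapes of the integer-only atomic intrinsics (assumed,
    ; not quoted from the proposal). Each performs its read-modify-write
    ; atomically and returns the value that was in memory beforehand.
    declare i32 @llvm.atomic.lcs.i32(i32*, i32, i32)   ; load, compare, store
    declare i32 @llvm.atomic.las.i32(i32*, i32)        ; load, add, store
    declare i32 @llvm.atomic.swap.i32(i32*, i32)       ; unconditional exchange

    define i32 @increment(i32* %counter) {
    entry:
      ; Atomically add 1 to a shared counter; the old value is returned.
      %old = call i32 @llvm.atomic.las.i32(i32* %counter, i32 1)
      ret i32 %old
    }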
Chandler Carruth
2007-Jul-12 18:08 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On 7/12/07, Dan Gohman <djg at cray.com> wrote:
> I guess the descriptions on that page are, heh, a little terse ;-).

A bit. ;] I was glad to see your clarification.

> The Cray X1 has a dimension of synchronization that isn't covered in
> this proposal, and that's the set of observers that need to observe
> the ordering. For example you can synchronize a team of streams in a
> multi-streaming processor without requiring that the ordering of
> memory operations be observed by the entire system. That's what
> motivates most of the variety in that list.

This is fascinating to me, personally. I don't know how reasonable it is
to implement directly in LLVM; however, could a codegen for the X1 in
theory establish whether the "shared memory" was part of a stream in a
multi-streaming processor, and use those local synchronization routines?
I'm not sure how reasonable this is.

Alternatively, to target an architecture this specific, perhaps the LLVM
code could be annotated to show where it is operating on streams versus
across processors, and allow that to guide the codegen decision as to
which type of synchronization to utilize. As LLVM doesn't really
understand the parallel implementation the code is running on, it seems
like it might be impossible to build this into LLVM without it being
X1-type-system specific... but perhaps you have better ideas of how to
do such things from working on it for some time?

> There's one other specific aspect I'd like to point out here. There's
> an "acquire" which orders prior *scalar* loads with *all* subsequent
> memory accesses, and a "release" which orders *all* prior accesses
> with subsequent *scalar* stores. The Cray X1's interest in
> distinguishing scalar accesses from vector accesses is specific to its
> architecture, but in general, it is another case that motivates having
> more granularity than just "all loads" and "all stores".

This clarifies some of those instructions. Here is my thought on how to
fit this behavior in with the current proposal: you're still ordering
load-store pairings, there is just the added dimensionality of types.
This seems like an easy extension to the existing proposal: combine the
load and store pairings with a type dimension to achieve finer-grained
control (a hypothetical sketch follows this message). Does this make
sense as an incremental step from your end, with much more experience
comparing your hardware to LLVM's IR?

> Overall though, I'm quite happy to see that the newest revision of the
> proposal has switched from LLVM instructions to LLVM intrinsics. That
> will make it easier to experiment with extensions in the future. And
> having the string "atomic" right there in the names of each operation
> is very much appreciated :-).

The atomic in the name is nice. It does make the syntax a bit less
elegant, but it'll get the ball rolling faster, and that's far more
important!

Thanks for the input, and I really love the X1 example as a radically
different memory model from the architectures LLVM is currently
targeting.

-Chandler Carruth
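To make the suggested incremental step concrete, here is a purely
hypothetical extension of the pairing-flag barrier with an access-class
operand on each side; nothing like this exists in the proposal or in
LLVM, and the name and encoding are invented solely for illustration.

    ; Hypothetical only: pairing flags plus an access class for each side
    ; of the barrier, so a target like the X1 could order just its scalar
    ; loads against all subsequent accesses.
    ; Access classes (invented encoding): 0 = all, 1 = scalar only,
    ; 2 = vector only.
    declare void @llvm.atomic.membarrier.classed(i1, i1, i1, i1, i32, i32)

    define void @x1_style_acquire() {
    entry:
      ; Prior *scalar* loads ordered before *all* subsequent accesses:
      ; ll and ls pairings, prior class = scalar (1), later class = all (0).
      call void @llvm.atomic.membarrier.classed(i1 true, i1 true, i1 false,
                                                i1 false, i32 1, i32 0)
      ret void
    }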