Chandler Carruth
2007-Jul-12 18:08 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On 7/12/07, Dan Gohman <djg at cray.com> wrote:
> On Thu, Jul 12, 2007 at 10:06:04AM -0500, David Greene wrote:
> > On Thursday 12 July 2007 07:23, Torvald Riegel wrote:
> > > > The single instruction constraints can, at their most flexible,
> > > > constrain any set of possible pairings of loads from memory and
> > > > stores to memory
> > >
> > > I'm not sure about this, but can we get issues due to "special"
> > > kinds of data transfers (such as vector stuff, DMA, ...)? Memcpy
> > > implementations could be one thing to look at. This kind of breaks
> > > down to how universal you want the memory model to be.
> >
> > Right. For example, the Cray X1 has a much richer set of memory
> > ordering instructions than anything on the commodity micros:
> >
> > http://tinyurl.com/3agjjn
> >
> > The memory ordering intrinsics in the current llvm proposal can't take
> > advantage of them because they are too coarse-grained.
>
> I guess the descriptions on that page are, heh, a little terse ;-).

A bit. ;] I was glad to see your clarification.

> The Cray X1 has a dimension of synchronization that isn't covered in
> this proposal, and that's the set of observers that need to observe the
> ordering. For example you can synchronize a team of streams in a
> multi-streaming processor without requiring that the ordering of memory
> operations be observed by the entire system. That's what motivates most
> of the variety in that list.

This is fascinating to me, personally. I don't know how reasonable it is
to implement directly in LLVM, but could a codegen for the X1 in theory
establish whether the "shared memory" was part of a stream in a
multi-streaming processor, and use those local synchronization routines?
I'm not sure how reasonable this is. Alternatively, to target an
architecture this specific, perhaps the LLVM code could be annotated to
show where it is operating on streams versus across processors, and allow
that to guide the codegen decision as to which type of synchronization to
utilize. As LLVM doesn't really understand the parallel implementation
the code is running on, it seems like it might be impossible to build
this into LLVM without it being X1-type-system specific... but perhaps
you have better ideas on how to do such things from working on it for
some time?

> There's one other specific aspect I'd like to point out here. There's an
> "acquire" which orders prior *scalar* loads with *all* subsequent memory
> accesses, and a "release" which orders *all* prior accesses with
> subsequent *scalar* stores. The Cray X1's interest in distinguishing
> scalar accesses from vector accesses is specific to its architecture,
> but in general, it is another case that motivates having more
> granularity than just "all loads" and "all stores".

This clarifies some of those instructions. Here is my thought on how to
fit this behavior in with the current proposal:

You're still ordering load-store pairings; there is just the added
dimensionality of types. This seems like an easy extension to the
existing proposal: combine the load and store pairings with a type
dimension to achieve finer-grained control. Does this make sense as an
incremental step, from your end with much more experience comparing your
hardware to LLVM's IR?
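Purely as a strawman -- the intrinsic name and the scalar/vector encoding
below are invented for illustration, not anything from the v2 proposal --
that extra dimension might be bolted onto the pairing constraints roughly
like so:

    ; Hypothetical sketch only.  The four i1 flags stand in for the
    ; proposal's load/store pairing constraints (ll, ls, sl, ss); the
    ; two i8 arguments narrow each side of the pairing to a class of
    ; access, say 1 = scalar, 2 = vector, 3 = either.
    declare void @llvm.atomic.barrier.classed(i1 %ll, i1 %ls, i1 %sl,
                                              i1 %ss, i8 %prior, i8 %post)

    ; The X1-style "acquire" -- prior *scalar* loads ordered against all
    ; subsequent memory accesses -- might then be spelled:
    call void @llvm.atomic.barrier.classed(i1 true, i1 true, i1 false,
                                           i1 false, i8 1, i8 3)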
> Overall though, I'm quite happy to see that the newest revision of the
> proposal has switched from LLVM instructions to LLVM intrinsics. That
> will make it easier to experiment with extensions in the future. And
> having the string "atomic" right there in the names of each operation
> is very much appreciated :-).

The "atomic" in the names is nice. It does make the syntax a bit less
elegant, but it'll get the ball rolling faster, and that's far more
important!

Thanks for the input, and I really love the X1 as an example of a
radically different memory model from the architectures LLVM is currently
targeting.

-Chandler Carruth

> Dan
> --
> Dan Gohman, Cray Inc.
David Greene
2007-Jul-12 21:51 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 13:08, Chandler Carruth wrote:
> > > Right. For example, the Cray X1 has a much richer set of memory
> > > ordering instructions than anything on the commodity micros:
> > >
> > > http://tinyurl.com/3agjjn
> > >
> > > The memory ordering intrinsics in the current llvm proposal can't
> > > take advantage of them because they are too coarse-grained.
> >
> > I guess the descriptions on that page are, heh, a little terse ;-).
>
> A bit. ;] I was glad to see your clarification.

Yeah, sorry. I was heading out the door to a meeting when I posted the
link. I'm glad Dan clarified some things. Unfortunately, I could only
link to publicly-available documents. Our internal ISA book explains this
all much better. :)

> > The Cray X1 has a dimension of synchronization that isn't covered in
> > this proposal, and that's the set of observers that need to observe
> > the ordering. For example you can synchronize a team of streams in a
> > multi-streaming processor without requiring that the ordering of
> > memory operations be observed by the entire system. That's what
> > motivates most of the variety in that list.
>
> This is fascinating to me, personally. I don't know how reasonable it
> is to implement directly in LLVM, but could a codegen for the X1 in
> theory establish whether the "shared memory" was part of a stream in a
> multi-streaming processor, and use those local synchronization
> routines?

Absolutely. The X1 compiler is responsible for partitioning loops to run
on multiple streams and synchronizing among the streams as necessary.
That synchronization is at a level "above" general system memory
ordering.

The X1 has multiple levels of parallelism:

- Vectorization
- Decoupled vector/scalar execution (this is where the lsyncs come in)
- Multistreaming (the msync operations)
- Multiprocessing (global machine-wide synchronization via gsync)

The compiler is basically responsible for the first three levels, while
the user does the fourth via MPI, OpenMP, CAF, UPC, etc. In general the
user sometimes inserts directives to help the compiler with the first
three, but the compiler gets a lot of cases on its own automatically.

> I'm not sure how reasonable this is. Alternatively, to target an
> architecture this specific, perhaps the LLVM code could be annotated to
> show where it is operating on streams versus across processors, and
> allow that to guide the codegen decision as to which type of
> synchronization to utilize. As LLVM doesn't really understand the
> parallel implementation the code is running on, it seems like it might
> be impossible to build this into LLVM without it being X1-type-system
> specific... but perhaps you have better ideas on how to do such things
> from working on it for some time?

In a parallelizing compiler, the compiler must keep track of where it
placed data when it parallelized the code, since it must know how to
handle dependencies and insert synchronizations. In the case of the X1,
the compiler partitioned a loop to run on multiple cores, so it knows to
use msyncs when that code accesses data shared among the cores. The
compiler determined which data to share among the cores and which to keep
private in each core. Similarly, when it vectorizes, it knows the
dependencies between vector and scalar operations and inserts the
necessary lsyncs.

PGI, Pathscale and Intel, for example, are starting to talk about
automatic OpenMP. They will need to insert synchronizations across cores
similarly to what's done on the X1. Those will probably be some form of
MFENCE.
The abstraction here is the "level" of parallelism. Vectorization is very
fine-grained; most implementations in hardware do not need explicit
software syncs between scalar and vector code. The next level up is
multithreading (we call that multistreaming on the X1 for historical
reasons). Depending on architecture, this could happen within a single
core (MTA style) or across multiple cores (X1 style), providing two
distinct levels of parallelism and possibly two distinct sets of sync
instructions in the general case. Then you've got a level of parallelism
around the set of sockets that are cache coherent (so-called "local"
processors, or a "node" in X1 parlance). You might have another set of
sync instructions for this (the X1 does not). Then you have the most
general case of parallelism across "the system," where communication time
between processors is extremely long. This is the "gsync" level on the
X1. Other, more sophisticated architectures may have even more levels of
parallelism.

So in thinking about extending your work (which again, I want to stress
is not immediately necessary, but still good to think about), I would
suggest we think in terms of level of parallelization, or perhaps
"distance among participants." It's not a good idea to hard-code things
like "vector-scalar sync," but I can imagine intrinsics that say "order
memory among these participants," or "order memory at this level," or
"order memory between these levels," where the levels are defined by the
target architecture. If a target doesn't have as many levels as used in
the llvm code, then it can just choose to use a more expensive sync
instruction. In X1 terms, a gsync is a really big hammer, but it can
always be used in place of an lsync.
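Just to make the shape of that concrete -- the intrinsic and the level
numbering below are invented for illustration, not anything from the
proposal -- it might look something like:

    ; Hypothetical sketch only.  The four i1 flags stand in for the
    ; proposal's load/store pairing constraints; %level says how far the
    ; ordering must be visible: 0 = within one stream/thread, 1 = among
    ; the streams of one multi-streaming processor (msync territory on
    ; the X1), 2 = a cache-coherent node, 3 = the whole machine (gsync
    ; on the X1).  A target that lacks a given level just falls back to
    ; its next, more expensive sync instruction.
    declare void @llvm.memory.barrier.level(i1 %ll, i1 %ls, i1 %sl,
                                            i1 %ss, i32 %level)

    ; e.g. order prior stores against subsequent loads, visible only
    ; among the streams of one multi-streaming processor:
    call void @llvm.memory.barrier.level(i1 false, i1 false, i1 true,
                                         i1 false, i32 1)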
I don't know if any plans exist to incorporate parallelizing
transformations into llvm, but I can certainly imagine building an
auto-parallelizing infrastructure above it. That infrastructure would
have to communicate information down to llvm so it could generate code
properly. How to do that is an entirely different can of worms. :)

> > There's one other specific aspect I'd like to point out here. There's
> > an "acquire" which orders prior *scalar* loads with *all* subsequent
> > memory accesses, and a "release" which orders *all* prior accesses
> > with subsequent *scalar* stores. The Cray X1's interest in
> > distinguishing scalar accesses from vector accesses is specific to
> > its architecture, but in general, it is another case that motivates
> > having more granularity than just "all loads" and "all stores".
>
> This clarifies some of those instructions. Here is my thought on how to
> fit this behavior in with the current proposal:
>
> You're still ordering load-store pairings; there is just the added
> dimensionality of types. This seems like an easy extension to the
> existing proposal: combine the load and store pairings with a type
> dimension to achieve finer-grained control. Does this make sense as an
> incremental step, from your end with much more experience comparing
> your hardware to LLVM's IR?

This would work for X1-style lsyncs, but we should think about whether it
is too architecture-specific. Decoupled execution doesn't fit completely
snugly into the "levels of parallelism" model I outlined above, so it's a
bit of an oddball. It's parallelism, but of a different form. Commodity
micros have decoupled execution, but they handle the syncs in hardware
(thus moving to/from a GPR and an XMM register is expensive).

The X1 fsync falls into the same category. It's there because the X1 does
not have precise traps for floating point code, and it doesn't really
have anything to do with parallelization. Ditto isync (all modern
processors have some form of this to guard against self-modifying code).

The bottom line is that I don't have easy cut-and-dried answers. I
suspect this will be an organic process, and we'll learn how to abstract
these things in an architecture-independent manner as we go.

-Dave
David A. Greene
2007-Jul-13 02:07 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 16:51, David Greene wrote:
> > You're still ordering load-store pairings; there is just the added
> > dimensionality of types. This seems like an easy extension to the
> > existing proposal: combine the load and store pairings with a type
> > dimension to achieve finer-grained control. Does this make sense as an
> > incremental step, from your end with much more experience comparing
> > your hardware to LLVM's IR?
>
> This would work for X1-style lsyncs

I take that back. Maybe. If by "type" you literally mean the data type of
the value (int, float, etc.) as well as the extent of the data (vector or
scalar), then it won't handle the X1 case where integer scalar
instructions feed floating point vector instructions, and similar
combinations. If by "type" you only mean the extent of the data, then it
would work fine.

-Dave
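P.S. A tiny invented example of the case I mean (the IR below is only a
sketch, not proposed syntax):

    ; A scalar *integer* load whose result feeds *floating point vector*
    ; work: the two sides don't even share a data type, which is exactly
    ; the combination I mean.  What the X1 needs ordered here is "prior
    ; scalar accesses before subsequent vector accesses"; the element
    ; types (i64 vs. double) are irrelevant, so extent-only granularity
    ; is enough.
    %n  = load i64, i64* %count                  ; scalar integer access
    %p  = getelementptr double, double* %base, i64 %n
    ; <ordering needed here: prior scalar accesses before subsequent
    ;  vector accesses>
    %vp = bitcast double* %p to <2 x double>*
    %v  = load <2 x double>, <2 x double>* %vp   ; vector FP access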