Sahasrabuddhe, Sameer
2014-Nov-14 19:09 UTC
[LLVMdev] memory scopes in atomic instructions
On 11/15/2014 12:08 AM, Tom Stellard wrote:
> Can you send a plain-text version of this email. It's easier to read
> and reply to.

Sorry about that! Here's the plain text (I hope!):

Hi all,

OpenCL 2.0 introduced the notion of memory scope in atomic operations to global memory. These scopes are a hint to the underlying platform to optimize how synchronization is achieved. HSAIL also has a notion of memory scopes that is compatible with OpenCL 2.0. Currently, the LLVM IR uses a binary value (SingleThread/CrossThread) to represent synchronization scope on atomic instructions. This makes it difficult to translate OpenCL 2.0 atomic operations to LLVM IR, and also to implement HSAIL memory scopes in the proposed HSAIL backend for LLVM.

We would like to enhance the representation of memory scopes in LLVM IR to allow more values than just the current two. The intention of this email is to invite comments before we start prototyping. Here's what we have in mind:

1. Update the synchronization scope field in atomic instructions from a single bit to a wider field, say 32-bit unsigned integer.

2. Retain the current default of zero as "system scope", replacing the current "cross thread" scope.

3. All other values are target-defined.

4. The use of "single thread scope" is not clear. If it is required in target-independent transforms, then it could be encoded as just "1", or as "all ones" in the wider field. The latter option is a bit weird, because most targets will have very few scopes. But it is useful in case the next point is included in LLVM IR.

5. Possibly add the following constraint on memory scopes: "The scope represented by a larger value is nested inside (is a proper subset of) the scope represented by a smaller value." This would also imply that the value used for single-thread scope must be the largest value used by the target.

   This constraint on "nesting" is easily satisfied by HSAIL (and also OpenCL), where synchronization scopes increase from a single work-item to the entire system. But it is conceivable that other targets do not have this constraint. For example, a platform may define synchronization scopes in terms of overlapping sets instead of proper subsets.

6. The impact of this change is limited to increasing the number of bits used to store synchronization scope. Future optimizations on atomics may need to interpret scopes in target-defined ways. When the synchronization scopes of two atomic instructions do not match, these optimizations must query the target for validity.

Relation with SPIR:

SPIR defines an enumeration for memory scopes, but it does not support LLVM atomic instructions. So memory scopes in SPIR are independent of the representation finally chosen in LLVM IR. A compiler that translates SPIR to native LLVM IR will have to translate memory scopes wherever appropriate.

Sameer.
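The encoding in points 1 to 5 of the proposal above can be sketched roughly as follows. This is only an illustration, not LLVM's representation: the type alias, the constant names, the two example target-defined values, and the helper functions are all assumptions made up for the sketch. It assumes the "all ones" option from point 4 and the nesting constraint from point 5.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical wider scope field (point 1); not an actual LLVM type.
using SyncScope = std::uint32_t;

// Point 2: zero stays the default and denotes the widest ("system") scope.
constexpr SyncScope SystemScope = 0;

// Point 4, "all ones" option: single-thread scope is the largest value.
constexpr SyncScope SingleThreadScope = std::numeric_limits<SyncScope>::max();

// Point 3: everything else is target-defined. These two values and their
// names are invented examples, listed in nesting order.
constexpr SyncScope ExampleDeviceScope = 1;
constexpr SyncScope ExampleWorkGroupScope = 2;

// Point 5: a larger value is a proper subset of a smaller value.
constexpr bool isProperlyNestedIn(SyncScope inner, SyncScope outer) {
    return inner > outer;
}

// Under the nesting constraint, the wider of two scopes is simply the
// smaller value, so no target query is needed to combine them.
constexpr SyncScope widerOf(SyncScope a, SyncScope b) {
    return a < b ? a : b;
}
```

With this ordering, single-thread scope is nested inside every other scope without the target having to say so, which is why the "all ones" choice pairs naturally with point 5.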
On Fri, Nov 14, 2014 at 1:09 PM, Sahasrabuddhe, Sameer <sameer.sahasrabuddhe at amd.com> wrote:
> 1. Update the synchronization scope field in atomic instructions from a
> single bit to a wider field, say 32-bit unsigned integer.

I think this should be an arbitrary bit width integer. I think baking any size into this is a mistake unless that size is "1".

> 2. Retain the current default of zero as "system scope", replacing the
> current "cross thread" scope.

I would suggest, address-space scope.

> 3. All other values are target-defined.

You need to define single-thread scope.

> 4. The use of "single thread scope" is not clear.

Consider trying to read from memory written in a thread from a signal handler delivered to that thread. Essentially, there may be a need to write code which we know will execute in a single hardware thread, but where the compiler optimizations precluded by atomics need to be precluded as the control flow within the hardware thread may arbitrarily move from one sequence of instructions to another.

> If it is required in
> target-independent transforms,

Yes, it is. sig_atomic_t.

> then it could be encoded as just "1",
> or as "all ones" in the wider field. The latter option is a bit
> weird, because most targets will have very few scopes. But it is
> useful in case the next point is included in LLVM IR.

If we go with your proposed constraint below, I think it is essential to model single-thread-scope as the maximum integer. It should be a strict subset of all inter-thread scopes.

> 5. Possibly add the following constraint on memory scopes: "The scope
> represented by a larger value is nested inside (is a proper subset
> of) the scope represented by a smaller value." This would also imply
> that the value used for single-thread scope must be the largest
> value used by the target.
> This constraint on "nesting" is easily satisfied by HSAIL (and also
> OpenCL), where synchronization scopes increase from a single
> work-item to the entire system. But it is conceivable that other
> targets do not have this constraint. For example, a platform may
> define synchronization scopes in terms of overlapping sets instead
> of proper subsets.

I think this is the important thing to settle on in the design. I'd really like to hear from a diverse set of vendors and folks operating in the GPU space to understand whether having this constraint is critically important or problematic for any reasons.

I think (unfortunately) it would be hard to add this later...

-Chandler
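The signal-handler case described above is what `std::atomic_signal_fence` is for in C++: it constrains the compiler between a thread and a handler running on that same thread. As far as I know, Clang lowers it to a single-thread-scoped fence. A minimal sketch, where the function and variable names and the choice of `SIGINT` are assumptions made only for this example:

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag published to the handler
std::atomic<int> observed{-1};    // what the handler saw

// Runs on the same thread that raised the signal.
extern "C" void onSignal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Keeps the compiler from hoisting the payload read above the flag read.
        std::atomic_signal_fence(std::memory_order_acquire);
        observed.store(payload, std::memory_order_relaxed);
    }
}

// Hypothetical helper: writes a value, then lets a handler on this
// thread read it back.
int publishAndSignal(int value) {
    // Only lock-free atomics are safe to touch from a signal handler.
    assert(ready.is_lock_free() && observed.is_lock_free());
    ready.store(false, std::memory_order_relaxed);
    auto previous = std::signal(SIGINT, onSignal);

    payload = value;
    // Keeps the compiler from sinking the payload write below the flag write.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);

    std::raise(SIGINT);  // the handler returns before raise() does
    if (previous != SIG_ERR)
        std::signal(SIGINT, previous);  // restore the earlier disposition
    return observed.load(std::memory_order_relaxed);
}
```

No hardware barrier is needed here because only one thread is involved; the fences exist purely to restrict compiler reordering, which is the role of single-thread scope in the IR.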
> On Nov 18, 2014, at 2:35 PM, Chandler Carruth <chandlerc at google.com> wrote:
>
> On Fri, Nov 14, 2014 at 1:09 PM, Sahasrabuddhe, Sameer <sameer.sahasrabuddhe at amd.com> wrote:
> > 1. Update the synchronization scope field in atomic instructions from a
> > single bit to a wider field, say 32-bit unsigned integer.
>
> I think this should be an arbitrary bit width integer. I think baking any size into this is a mistake unless that size is "1".

...

> If we go with your proposed constraint below, I think it is essential to model single-thread-scope as the maximum integer. It should be a strict subset of all inter-thread scopes.

These seem mutually contradictory.

> > 5. Possibly add the following constraint on memory scopes: "The scope
> > represented by a larger value is nested inside (is a proper subset
> > of) the scope represented by a smaller value." This would also imply
> > that the value used for single-thread scope must be the largest
> > value used by the target.
> > This constraint on "nesting" is easily satisfied by HSAIL (and also
> > OpenCL), where synchronization scopes increase from a single
> > work-item to the entire system. But it is conceivable that other
> > targets do not have this constraint. For example, a platform may
> > define synchronization scopes in terms of overlapping sets instead
> > of proper subsets.
>
> I think this is the important thing to settle on in the design. I'd really like to hear from a diverse set of vendors and folks operating in the GPU space to understand whether having this constraint is critically important or problematic for any reasons.

I am not aware of any systems (including GPUs) that would need non-nested memory scopes. If such exist, I might expect them to be some kind of clustered NUMA HPC machine.

-Owen
Sahasrabuddhe, Sameer
2014-Nov-19 17:54 UTC
[LLVMdev] memory scopes in atomic instructions
On 11/19/2014 4:05 AM, Chandler Carruth wrote:
> On Fri, Nov 14, 2014 at 1:09 PM, Sahasrabuddhe, Sameer
> <sameer.sahasrabuddhe at amd.com> wrote:
>
> > 1. Update the synchronization scope field in atomic instructions
> > from a single bit to a wider field, say 32-bit unsigned integer.
>
> I think this should be an arbitrary bit width integer. I think baking
> any size into this is a mistake unless that size is "1".

I noticed that the LRM never specifies a width for address spaces, but the implementation uses "unsigned" everywhere, which is clearly not an arbitrary width integer. Is this how memory scopes should also be implemented?

> > 4. The use of "single thread scope" is not clear.
>
> Consider trying to read from memory written in a thread from a signal
> handler delivered to that thread. Essentially, there may be a need to
> write code which we know will execute in a single hardware thread, but
> where the compiler optimizations precluded by atomics need to be
> precluded as the control flow within the hardware thread may
> arbitrarily move from one sequence of instructions to another.
>
> > If it is required in
> > target-independent transforms,
>
> Yes, it is. sig_atomic_t.

Thanks! This also explains why SingleThread is baked into tsan. I couldn't find a way to work around __tsan_atomic_signal_fence if I removed SingleThread as a well-known memory scope.

> > 5. Possibly add the following constraint on memory scopes: "The scope
> > represented by a larger value is nested inside (is a proper subset
> > of) the scope represented by a smaller value." This would also
> > imply that the value used for single-thread scope must be the largest
> > value used by the target.
> > This constraint on "nesting" is easily satisfied by HSAIL (and also
> > OpenCL), where synchronization scopes increase from a single
> > work-item to the entire system. But it is conceivable that other
> > targets do not have this constraint. For example, a platform may
> > define synchronization scopes in terms of overlapping sets instead
> > of proper subsets.
>
> I think this is the important thing to settle on in the design. I'd
> really like to hear from a diverse set of vendors and folks operating
> in the GPU space to understand whether having this constraint is
> critically important or problematic for any reasons.

I think "heterogeneous systems" (in general, and not just HSA) might be a better term since it covers more than just GPU devices. Also, I don't see why this constraint in the general LLVM IR could be critically important to some target. But I can see why it could be problematic for a target!

If I understand this correctly, the main issue is that if we do not build nested scopes into the IR, then we can never have target-independent optimizations that work with multiple memory scopes. Is that correct? Is that really so important? What happens when we do have a target that does not have nested memory scopes? Will it not be harder to remove this assumption from the target-independent optimizations?

> I think (unfortunately) it would be hard to add this later...

I am not sure I understand this part. The only effect I see is that targets might use enumerations that do not follow a strict order in their list of memory scopes. We can always encourage a future-looking convention to list the memory scopes in nesting order. And in the worst case, the enumerations can be reordered when the need arises, right?

Sameer.
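To make the trade-off in this exchange concrete, here is a rough sketch of what "query the target" could look like when nesting is not built into the IR. Nothing here is an existing LLVM interface: `TargetScopeInfo`, both target classes, the scope numbers, and the agent bitmasks are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

using SyncScope = std::uint32_t;  // hypothetical scope identifier

// Hypothetical target hook answering whether one scope lies within another.
struct TargetScopeInfo {
    virtual ~TargetScopeInfo() = default;
    // True if every agent in `inner` also belongs to `outer`.
    virtual bool isNestedIn(SyncScope inner, SyncScope outer) const = 0;
};

// A target whose scopes follow the proposed numeric nesting order.
struct NestedTarget final : TargetScopeInfo {
    bool isNestedIn(SyncScope inner, SyncScope outer) const override {
        return inner >= outer;
    }
};

// An invented target whose scopes are overlapping sets of three agents,
// one bit per agent.
struct OverlappingTarget final : TargetScopeInfo {
    static std::uint32_t members(SyncScope scope) {
        switch (scope) {
        case 0: return 0x7;  // system: agents 0, 1, 2
        case 1: return 0x3;  // agents 0 and 1
        case 2: return 0x6;  // agents 1 and 2
        default:
            assert(false && "scope not defined by this target");
            return 0;
        }
    }
    bool isNestedIn(SyncScope inner, SyncScope outer) const override {
        return (members(inner) & ~members(outer)) == 0;
    }
};

// A target-independent pass could only reason about two differing scopes
// when one of them contains the other.
inline bool scopesComparable(const TargetScopeInfo &target, SyncScope a,
                             SyncScope b) {
    return target.isNestedIn(a, b) || target.isNestedIn(b, a);
}
```

On the nested target every pair of scopes is comparable, so the numeric comparison alone would be enough. On the overlapping target, scopes 1 and 2 share an agent but neither contains the other, so no renumbering of the enumeration makes an integer comparison correct, and the pass has to rely on the hook.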