I notice that there is not currently any intrinsic support for atomics in the PTX backend. Is this on the roadmap? Should it be as easy to add as it seems (plumbing through just like the thread ID instructions, &c.)? The obvious difference is that these ops have side effects.
On Mon, Oct 31, 2011 at 3:15 PM, Jonathan Ragan-Kelley <jrk at csail.mit.edu> wrote:

> I notice that there is not currently any intrinsic support for atomics in
> the PTX backend. Is this on the roadmap? Should it be as easy to add as it
> seems (plumbing through just like the thread ID instructions, &c.)? The
> obvious difference is that these ops have side effects.

It should be just a matter of defining these as back-end intrinsics. Patches are always welcome. :)

> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

--
Thanks,
Justin Holewinski
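As a rough picture of the "back-end intrinsics" route suggested above, such an intrinsic would surface in IR as a target-specific call that the backend lowers directly to a PTX atomic instruction. This is a hypothetical sketch only: the intrinsic name `@llvm.ptx.atom.add.u32` and the PTX shown in the comments are illustrative assumptions, not anything that exists in the tree.

```llvm
; Hypothetical sketch: a PTX back-end intrinsic for atomic add.
; Nothing here exists in LLVM; names and lowering are assumptions.
declare i32 @llvm.ptx.atom.add.u32(i32*, i32)

define i32 @bump(i32* %counter) {
entry:
  ; Would lower to something like: atom.global.add.u32 %old, [%counter], 1
  ; Returns the pre-increment value, as PTX atom.add does.
  %old = call i32 @llvm.ptx.atom.add.u32(i32* %counter, i32 1)
  ret i32 %old
}
```

The follow-up message below argues for mapping the generic `atomicrmw`/`cmpxchg` instructions instead of minting per-operation intrinsics like this one.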
Looking further during downtime at the dev meeting today, it actually seems that the PTX atom.* and red.* instructions map extremely naturally onto the LLVM atomicrmw and cmpxchg instructions. The biggest issue is that a subset of things expressible with these LLVM instructions does not trivially map to PTX, and the range of things naturally supported depends on the features of a given target. With sufficient effort, all possible uses of these instructions could be emulated on all targets, at the cost of efficiency, but this would significantly complicate the codegen and probably produce steep performance cliffs.

The basic model:

    %d = cmpxchg {T}* %a, {T} %b, {T} %c
        --> atom.{space of %a}.cas.{T} d, [a], b, c

    %d = atomicrmw {OP} {T}* %a, {T} %b
        --> atom.{space of %a}.{OP}.{T} d, [a], b
        for {OP} in { add, and, or, xor, min, max, xchg }

with the special cases:

    %d is unused
        --> red.{space of %a}.{OP}.{T} [a], b
        # i.e. use the reduce instr instead of the atom instr

    {OP} == {add, sub} && b == 1
        --> use the PTX inc/dec ops

I think the right answer for the indefinite future is to map exactly those operations and types which trivially map onto the given PTX and processor versions, leaving the other cases unsupported. (Even on the SSE and NEON backends, after all, select with a vector condition has barfed for years.) In the longer run, it could be quite useful for portability to map the full range of atomicrmw behaviors to all PTX targets, using emulation when necessary; but relative to the current state of the art (manually writing different CUDA code paths with different sync strategies for different machine generations), only supporting what maps trivially is not a regression.

Thoughts?
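To make the proposed mapping concrete, here is a rough sketch in the LLVM 3.0-era IR syntax current on this thread (cmpxchg returns the loaded value directly and takes a single ordering), with the PTX each form might lower to shown in comments. The global address space and the u32/b32 suffixes are illustrative assumptions, not output from any existing backend.

```llvm
; Sketch of the proposed 1:1 lowering; the PTX in comments is hypothetical.
define void @examples(i32* %a, i32 %b, i32 %old, i32 %new) {
entry:
  ; atomicrmw add          -->  atom.global.add.u32  d0, [a], b
  %d0 = atomicrmw add i32* %a, i32 %b monotonic

  ; cmpxchg                -->  atom.global.cas.b32  d1, [a], old, new
  %d1 = cmpxchg i32* %a, i32 %old, i32 %new monotonic

  ; if %d2 is never used   -->  red.global.add.u32   [a], b
  %d2 = atomicrmw add i32* %a, i32 %b monotonic

  ret void
}
```

Anything outside this table (say, an atomicrmw type or operation the target's PTX version lacks) would simply be rejected, per the "map only what maps trivially" position above.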