Jingyue Wu via llvm-dev
2015-Aug-21 22:11 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
I'm using 7.0. I am attaching the reduced example.

nvcc sync.cu -arch=sm_35 -ptx

gives

        // .globl       _Z3foov
.visible .entry _Z3foov(

)
{
        .reg .pred      %p<2>;
        .reg .s32       %r<3>;


        mov.u32         %r1, %tid.x;
        and.b32         %r2, %r1, 1;
        setp.eq.b32     %p1, %r2, 1;
        @!%p1 bra       BB7_2;
        bra.uni         BB7_1;

BB7_1:
        bar.sync        0;
        bra.uni         BB7_3;

BB7_2:
        bar.sync        0;

BB7_3:
        ret;
}

As you see, bar.sync is duplicated.

On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <jholewinski at nvidia.com> wrote:

> That’s an interesting case. AFAIK, inlining should be restricted here for
> the reason you mention. Inlining should only be valid if it doesn’t
> duplicate the barrier. Which nvcc shows this behavior?
>
> Adding Vinod and Yuan for comment.
>
> > On Aug 21, 2015, at 4:24 PM, Jingyue Wu <jingyue at google.com> wrote:
> >
> > Hi Justin,
> >
> > Is a compiler allowed to inline a function that calls __syncthreads()?
> > I saw nvcc do that, but I'm not sure it's valid. For example, given
> >
> >   void foo() {
> >     __syncthreads();
> >   }
> >
> >   if (threadIdx.x % 2 == 0) {
> >     ...
> >     foo();
> >   } else {
> >     ...
> >     foo();
> >   }
> >
> > before inlining, all threads meet at one __syncthreads(). After
> > inlining,
> >
> >   if (threadIdx.x % 2 == 0) {
> >     ...
> >     __syncthreads();
> >   } else {
> >     ...
> >     __syncthreads();
> >   }
> >
> > the __syncthreads() call is duplicated, and it is no longer guaranteed
> > that all threads meet at one __syncthreads().
> >
> > Any thoughts?
> >
> > Jingyue
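[The sync.cu attachment was scrubbed from the archive. A hypothetical
reconstruction, consistent with the pseudocode quoted above and with the
generated PTX, might look like the sketch below; the helper's name and the
exact kernel body are assumptions, not the original file:]

  // Hypothetical reconstruction of the reduced example (not the original
  // sync.cu). nvcc inlines the __device__ helper into both branches,
  // producing the duplicated bar.sync seen in BB7_1 and BB7_2 above.
  __device__ void helper() {
    __syncthreads();
  }

  __global__ void foo() {          // mangled as _Z3foov in the PTX above
    if (threadIdx.x % 2 == 0) {
      helper();                    // inlined copy of the barrier
    } else {
      helper();                    // second inlined copy of the barrier
    }
  }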
Jingyue Wu via llvm-dev
2015-Aug-21 23:51 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Looking at this section
<http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
in the PTX ISA, there's a sentence saying:

> In conditionally executed code, a bar instruction should only be used if
> it is known that all threads evaluate the condition identically (the warp
> does not diverge).

Does that mean __syncthreads() should only be called uniformly, when no
threads diverge? If so, my sync.cu example is undefined. The reason is
that, although every thread reaches a __syncthreads(), the threads reach
them divergently:
1. the threads diverge at the "if" statement
2. the warp runs __syncthreads() with half of the threads enabled
3. the warp jumps back to the "else" branch
4. the warp runs __syncthreads() with the other half of the threads enabled

If my understanding is correct (__syncthreads() can only be called when the
warp doesn't diverge), then unrolling a loop that contains a
__syncthreads() and inlining a function that may call __syncthreads() are
both fine. Am I right?

Jingyue

On Fri, Aug 21, 2015 at 3:11 PM, Jingyue Wu <jingyue at google.com> wrote:

> I'm using 7.0. I am attaching the reduced example.
> [...]
> As you see, bar.sync is duplicated.
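[For the loop-unrolling case mentioned above, a minimal sketch of how the
barrier gets duplicated; the trip count, loop body, and the names buf and
step are assumptions, not taken from the thread:]

  // Before unrolling: one __syncthreads(), reached on every iteration.
  for (int i = 0; i < 4; ++i) {
    buf[threadIdx.x] = step(buf, i);
    __syncthreads();
  }

  // After unrolling by 2: the barrier is textually duplicated, but since
  // the loop bounds are uniform across the block, every thread still
  // executes the same sequence of barriers, so the transformation is
  // safe under the reading above.
  for (int i = 0; i < 4; i += 2) {
    buf[threadIdx.x] = step(buf, i);
    __syncthreads();
    buf[threadIdx.x] = step(buf, i + 1);
    __syncthreads();
  }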
Bjarke Roune via llvm-dev
2015-Sep-09 21:46 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Hi Justin, Yuan and Vinod,

It seems that what __syncthreads() requires in CUDA C++ (as opposed to PTX)
is to be executed uniformly across all threads in the block, not just
within the warp. If so, it would be helpful if there were a precise
statement about when a statement is considered to be executed uniformly in
CUDA C++. Is there a precise statement somewhere from NVIDIA about this? I
haven't found one so far.

In particular, it's not clear to me at what point diverging threads are
considered to have joined up again in CUDA C++. My best guess is that this
is at the immediate post-dominator of the statement that starts the
divergence, with the caveat that there is an implicit shared CFG node
following each return statement in a function.

Bjarke

On Fri, Aug 21, 2015 at 4:51 PM, Jingyue Wu <jingyue at google.com> wrote:

> Looking at this section
> <http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
> in the PTX ISA, there's a sentence saying:
>
> > In conditionally executed code, a bar instruction should only be used
> > if it is known that all threads evaluate the condition identically
> > (the warp does not diverge).
>
> [...]
>
> If my understanding is correct (__syncthreads() can only be called when
> the warp doesn't diverge), then unrolling a loop that contains a
> __syncthreads() and inlining a function that may call __syncthreads() are
> both fine. Am I right?
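[As an illustration of the reconvergence question above, consider this
hypothetical kernel; the name and body are for illustration only, not from
the thread. Under the immediate-post-dominator guess, the barrier after the
if/else is executed uniformly, whereas a barrier placed inside either
branch would not be:]

  __global__ void kernel(int *out) {
    int v;
    if (threadIdx.x % 2 == 0) {    // threads diverge at this branch
      v = 1;
    } else {
      v = 2;
    }
    // The statement below is the immediate post-dominator of the branch:
    // every thread reaches it, so under Bjarke's guess the threads are
    // considered rejoined and the barrier is executed uniformly.
    __syncthreads();
    out[threadIdx.x] = v;
  }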