Jingyue Wu via llvm-dev
2015-Aug-21 22:11 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
I'm using 7.0. I am attaching the reduced example.
nvcc sync.cu -arch=sm_35 -ptx
gives
// .globl _Z3foov
.visible .entry _Z3foov(
)
{
.reg .pred %p<2>;
.reg .s32 %r<3>;
mov.u32 %r1, %tid.x;
and.b32 %r2, %r1, 1;
setp.eq.b32 %p1, %r2, 1;
@!%p1 bra BB7_2;
bra.uni BB7_1;
BB7_1:
bar.sync 0;
bra.uni BB7_3;
BB7_2:
bar.sync 0;
BB7_3:
ret;
}
As you see, bar.sync is duplicated.
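The attached sync.cu was scrubbed from the archive; a minimal hypothetical
reconstruction, consistent with the PTX above (the helper name and the
exact condition are guesses, and the actual file may differ), is:

__device__ void helper() {
  __syncthreads();  // a single barrier before inlining
}

__global__ void foo() {
  if (threadIdx.x % 2 == 0) {
    helper();  // after inlining, one bar.sync is emitted here...
  } else {
    helper();  // ...and a duplicate is emitted here
  }
}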
On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <jholewinski at nvidia.com> wrote:
> That’s an interesting case. AFAIK, inlining should be restricted here for
> the reason you mention. Inlining should only be valid if it doesn’t
> duplicate the barrier. Which nvcc version shows this behavior?
>
> Adding Vinod and Yuan for comment.
>
> > On Aug 21, 2015, at 4:24 PM, Jingyue Wu <jingyue at google.com> wrote:
> >
> > Hi Justin,
> >
> > Is a compiler allowed to inline a function that calls __syncthreads? I
> > saw nvcc do that, but I'm not sure it's valid. For example,
> >
> > void foo() {
> > __syncthreads();
> > }
> >
> > if (threadIdx.x % 2 == 0) {
> > ...
> > foo();
> > } else {
> > ...
> > foo();
> > }
> >
> > Before inlining, all threads meet at one __syncthreads(). After inlining
> >
> > if (threadIdx.x % 2 == 0) {
> > ...
> > __syncthreads();
> > } else {
> > ...
> > __syncthreads();
> > }
> >
> > The __syncthreads call is duplicated, and it's no longer guaranteed
> > that all threads can meet one __syncthreads().
> >
> > Any thoughts?
> >
> > Jingyue
Jingyue Wu via llvm-dev
2015-Aug-21 23:51 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Looking at this section
<http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
in the PTX ISA, there's a sentence saying:

> In conditionally executed code, a bar instruction should only be used if
> it is known that all threads evaluate the condition identically (the warp
> does not diverge).

Does that mean __syncthreads should only be called uniformly when no
threads diverge? If so, my sync.cu example is undefined. The reason is
that, although every thread reaches a __syncthreads(), they reach them
divergently:
1. threads diverge at the "if" statement
2. the warp runs __syncthreads() with half of the threads enabled
3. the warp jumps back to the "else" branch
4. the warp runs __syncthreads() with the other half of the threads enabled

If my understanding is correct (__syncthreads() can only be called when
the warp doesn't diverge), then unrolling a loop that contains a
__syncthreads() and inlining a function that may call __syncthreads() are
both fine. Am I right?

Jingyue
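For contrast, here is an illustrative sketch (not from the original
thread) of the uniform pattern this reading would require: the divergent
work stays inside the branches, and the barrier sits at the reconvergence
point, so every thread in the block executes the same __syncthreads() with
the warp converged.

__global__ void foo_uniform() {
  if (threadIdx.x % 2 == 0) {
    // ... even-thread work ...
  } else {
    // ... odd-thread work ...
  }
  // All threads have reconverged here, so the barrier executes uniformly.
  __syncthreads();
}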
Bjarke Roune via llvm-dev
2015-Sep-09 21:46 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Hi Justin, Yuan and Vinod,

It seems that what __syncthreads() requires in CUDA C++ (as opposed to
PTX) is to be executed uniformly across all threads in the block, not just
the warp. If so, it would be helpful if there were a precise statement
about when a statement is considered to be executed uniformly in CUDA C++.
Is there a precise statement somewhere from NVIDIA about this? I haven't
found one so far.

In particular, it's not clear to me at what point diverging threads are
considered to have joined up again in CUDA C++. My best guess is that this
is at the immediate post-dominator of the statement that starts the
divergence, with the caveat that there is an implicit shared CFG node
following each return statement in a function.

Bjarke
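As a hypothetical illustration of the return-statement caveat (not from
the original message): under the immediate-post-dominator reading, the
divergence below is only considered joined at the implicit function-exit
node, so the barrier runs while some threads of the block have already
returned.

__global__ void early_exit(int n) {
  if (threadIdx.x >= n)
    return;  // these threads never reach the barrier below
  // The divergence that began at the "if" is only joined at the implicit
  // exit node, so this __syncthreads() is not executed uniformly across
  // the block.
  __syncthreads();
}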