Jingyue Wu via llvm-dev
2015-Aug-21 22:11 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
I'm using 7.0. I am attaching the reduced example.

nvcc sync.cu -arch=sm_35 -ptx

gives

        // .globl       _Z3foov
.visible .entry _Z3foov(

)
{
        .reg .pred      %p<2>;
        .reg .s32       %r<3>;


        mov.u32         %r1, %tid.x;
        and.b32         %r2, %r1, 1;
        setp.eq.b32     %p1, %r2, 1;
        @!%p1 bra       BB7_2;
        bra.uni         BB7_1;

BB7_1:
        bar.sync        0;
        bra.uni         BB7_3;

BB7_2:
        bar.sync        0;

BB7_3:
        ret;
}

As you see, bar.sync is duplicated.

On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <jholewinski at nvidia.com> wrote:

> That’s an interesting case. AFAIK, inlining should be restricted here for
> the reason you mention. Inlining should only be valid if it doesn’t
> duplicate the barrier. Which nvcc shows this behavior?
>
> Adding Vinod and Yuan for comment.
>
> > On Aug 21, 2015, at 4:24 PM, Jingyue Wu <jingyue at google.com> wrote:
> >
> > Hi Justin,
> >
> > Is a compiler allowed to inline a function that calls __syncthreads()?
> > I saw nvcc do that, but I'm not sure it's valid. For example, given
> >
> >   void foo() {
> >     __syncthreads();
> >   }
> >
> >   if (threadIdx.x % 2 == 0) {
> >     ...
> >     foo();
> >   } else {
> >     ...
> >     foo();
> >   }
> >
> > before inlining, all threads meet at one __syncthreads(). After
> > inlining,
> >
> >   if (threadIdx.x % 2 == 0) {
> >     ...
> >     __syncthreads();
> >   } else {
> >     ...
> >     __syncthreads();
> >   }
> >
> > the __syncthreads() call is duplicated, and it is no longer guaranteed
> > that all threads meet at one __syncthreads().
> >
> > Any thoughts?
> >
> > Jingyue
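[The sync.cu attachment was scrubbed from the archive. A hypothetical
reconstruction, consistent with the pseudocode quoted above and with the
generated PTX, might look like the sketch below; the helper's name and the
exact kernel body are assumptions, not the original file:]

  // Hypothetical reconstruction of the reduced example (not the original
  // sync.cu). nvcc inlines the __device__ helper into both branches,
  // producing the duplicated bar.sync seen in BB7_1 and BB7_2 above.
  __device__ void helper() {
    __syncthreads();
  }

  __global__ void foo() {          // mangled as _Z3foov in the PTX above
    if (threadIdx.x % 2 == 0) {
      helper();                    // inlined copy of the barrier
    } else {
      helper();                    // second inlined copy of the barrier
    }
  }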
Jingyue Wu via llvm-dev
2015-Aug-21 23:51 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Looking at this section
<http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
in the PTX ISA, there's a sentence saying:

> In conditionally executed code, a bar instruction should only be used if
> it is known that all threads evaluate the condition identically (the warp
> does not diverge).

Does that mean __syncthreads() should only be called uniformly, when no
threads diverge? If so, my sync.cu example is undefined. The reason is
that, although every thread reaches a __syncthreads(), the threads reach
them divergently:
1. the threads diverge at the "if" statement
2. the warp runs __syncthreads() with half of the threads enabled
3. the warp jumps back to the "else" branch
4. the warp runs __syncthreads() with the other half of the threads enabled

If my understanding is correct (__syncthreads() can only be called when the
warp doesn't diverge), then unrolling a loop that contains a
__syncthreads() and inlining a function that may call __syncthreads() are
both fine. Am I right?

Jingyue

On Fri, Aug 21, 2015 at 3:11 PM, Jingyue Wu <jingyue at google.com> wrote:

> I'm using 7.0. I am attaching the reduced example.
> [...]
> As you see, bar.sync is duplicated.
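[For the loop-unrolling case mentioned above, a minimal sketch of how the
barrier gets duplicated; the trip count, loop body, and the names buf and
step are assumptions, not taken from the thread:]

  // Before unrolling: one __syncthreads(), reached on every iteration.
  for (int i = 0; i < 4; ++i) {
    buf[threadIdx.x] = step(buf, i);
    __syncthreads();
  }

  // After unrolling by 2: the barrier is textually duplicated, but since
  // the loop bounds are uniform across the block, every thread still
  // executes the same sequence of barriers, so the transformation is
  // safe under the reading above.
  for (int i = 0; i < 4; i += 2) {
    buf[threadIdx.x] = step(buf, i);
    __syncthreads();
    buf[threadIdx.x] = step(buf, i + 1);
    __syncthreads();
  }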
Bjarke Roune via llvm-dev
2015-Sep-09 21:46 UTC
[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Hi Justin, Yuan and Vinod,

It seems that what __syncthreads() requires in CUDA C++ (as opposed to PTX)
is to be executed uniformly across all threads in the block, not just
within the warp. If so, it would be helpful if there were a precise
statement about when a statement is considered to be executed uniformly in
CUDA C++. Is there a precise statement somewhere from NVIDIA about this? I
haven't found one so far.

In particular, it's not clear to me at what point diverging threads are
considered to have joined up again in CUDA C++. My best guess is that this
is at the immediate post-dominator of the statement that starts the
divergence, with the caveat that there is an implicit shared CFG node
following each return statement in a function.

Bjarke

On Fri, Aug 21, 2015 at 4:51 PM, Jingyue Wu <jingyue at google.com> wrote:

> Looking at this section
> <http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
> in the PTX ISA, there's a sentence saying:
>
> > In conditionally executed code, a bar instruction should only be used
> > if it is known that all threads evaluate the condition identically
> > (the warp does not diverge).
>
> [...]
>
> If my understanding is correct (__syncthreads() can only be called when
> the warp doesn't diverge), then unrolling a loop that contains a
> __syncthreads() and inlining a function that may call __syncthreads() are
> both fine. Am I right?
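[As an illustration of the reconvergence question above, consider this
hypothetical kernel; the name and body are for illustration only, not from
the thread. Under the immediate-post-dominator guess, the barrier after the
if/else is executed uniformly, whereas a barrier placed inside either
branch would not be:]

  __global__ void kernel(int *out) {
    int v;
    if (threadIdx.x % 2 == 0) {    // threads diverge at this branch
      v = 1;
    } else {
      v = 2;
    }
    // The statement below is the immediate post-dominator of the branch:
    // every thread reaches it, so under Bjarke's guess the threads are
    // considered rejoined and the barrier is executed uniformly.
    __syncthreads();
    out[threadIdx.x] = v;
  }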