thr3ads.net - llvm dev - [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend [Jun 2017]

If this information is useful, please help other people find it:
Share via:

Tom Stellard via llvm-dev

2017-Jun-13 00:23 UTC

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

On 06/12/2017 08:03 PM, Connor Abbott wrote:> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at
redhat.com> wrote:
>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>> cc some people who have worked on this.
>>>
>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> I've been looking into how to implement the more advanced
Shader Model
>>>> 6 reduction operations in radv (and obviously most of the work
would
>>>> be useful for radeonsi too). They're explained in the spec
for
>>>> GL_AMD_shader_ballot at
>>>>
https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>> but I'll summarize them here. There are two types of
operations:
>>>> reductions that always return a uniform value, and prefix scan
>>>> operations. The reductions can be implemented in terms of the
prefix
>>>> scan (although in practice I don't think we want to
implement them in
>>>> exactly the same way), and the concerns are mostly the same, so
I'll
>>>> focus on the prefix scan operations for now. Given an operation
`op'
>>>> and an input value `a' (that's really a SIMD array with
one value per
>>>> invocation, even though it's a scalar value in LLVM), the
prefix scan
>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation
1, a[0]
>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix
scan will also
>>>> work for non-uniform control flow: it simply skips inactive
>>>> invocations.
>>>>
>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>> low-level shuffle intrinsics implemented that you need to do
this, but
>>>> I can think of a few concerns/questions. First of all, to
implement
>>>> the prefix scan, we'll need to do a code sequence that
looks like
>>>> this, modified from
>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
(replace
>>>> v_foo_f32 with the appropriate operation):
>>>>
>>>> ; v0 is the input register
>>>> v_mov_b32 v1, v0
>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3
>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>
>>>> The problem is that the way these instructions use the DPP word
isn't
>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>> intrinsic, but it isn't enough. For example, take the first
>>>> instruction:
>>>>
>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>
>>>> What it's doing is shifting v0 right by one within each row
and adding
>>>> it to v1. v1 stays the same in the first lane of each row,
however.
>>>> With llvm.amdgcn.mov_dpp, we could try to express it as
something like
>>>> this, in LLVM-like pseduocode:
>>>>
>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>> %result = foo %tmp, %input
>>>>
>>>> but this is incorrect. If I'm reading the source correctly,
this will
>>>> make %tmp garbage in lane 0 (since it just turns into a normal
move
>>>> with the dpp modifier, and no restrictions on the destination).
We
>>>> could set bound_ctrl to 0 to work around this, since it will
make %tmp
>>>> 0 in lane 0, but that won't work with operations whose
identity is
>>>> non-0 like min and max. What we need is something like:
>>>>
>>
>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
>> or %input (bound_ctrl = 1)?
> 
> Oh, maybe it is... for that to happen the underlying move would need
> to have the source and destination constrained to be the same. I
> couldn't see that constraint anywhere I looked, but I'm not an
expert,
> so I may have overlooked it. In any case, that behavior still isn't
> what we want if we want to implement the prefix scan operations
> efficiently.
> 
Ok, I see what you are saying now.  I think the best option here is to
document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
copy its src operand to dst when bound_ctrl = 1 and it reads from an invalid
thread, and then when bound_ctrl=1, lower the intrinsic to a special tied
version
of V_MOV_B32_dpp where the src and dst are the same register.

-Tom
> Connor
> 
>>
>> -Tom
>>
>>
>>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result
row_shr:1
>>>>
>>>> where llvm.amdgcn.foo_dpp copies the first argument to the
result,
>>>> then applies the DPP swizzling to the second argument and does
'foo'
>>>> to the second and third arguments. It would mean that we'd
have a
>>>> separate intrinsic for every operation we care about, but I
can't
>>>> think of a better way to express it. Is there a better way that
>>>> doesn't involve creating an intrinsic for each operation?
>>>>
>>>> Next, there's the fact that this code sequence only works
when the
>>>> active lanes are densely-packed, but we have to make this work
even
>>>> when control flow is non-uniform. Essentially, we need to
"skip over"
>>>> the inactive lanes by setting them to the identity, and then we
need
>>>> to enable them in the exec mask when doing the reduction to
make sure
>>>> they pass along the correct result. That is, to handle
non-uniform
>>>> control flow, we need something like:
>>>>
>>>> invert EXEC
>>>> result = identity
>>>> set EXEC to ~0
>>>> <original code sequence>
>>>> restore original EXEC
>>>>
>>>> I imagine we'd need to add some special
llvm.amdcgn.set_inactive_lanes
>>>> intrinsic that returns the first argument with inactive lanes
set to
>>>> the second argument. We'd also need something like WQM to
make all the
>>>> lanes active during the sequence. But that raises some hairy
>>>> requirements for register allocation. For example, in something
like:
>>>>
>>>> foo = ...
>>>> if (...) {
>>>>     bar = minInvocationsInclusiveScanAMD(...)
>>>> } else {
>>>>     ... = foo;
>>>> }
>>>>
>>>> we have to make sure that foo isn't allocated to the same
register as
>>>> one of the temporaries used inside
minInvocationsInclusiveScanAMD(),
>>>> though they don't interfere. That's because the
implementation of
>>>> minInvocationsInclusiveScanAMD() will do funny things with the
exec
>>>> mask, possibly overwriting foo, if the condition is
non-uniform. Or
>>>> consider the following:
>>>>
>>>> do {
>>>>    bar = minInvocationsInclusiveScanAMD(...);
>>>>    // ...
>>>>    ... = bar; // last use of bar
>>>>   foo = ...;
>>>> } while (...);
>>>>
>>>> ... = foo;
>>>>
>>>> again, foo and the temporaries used to compute bar can't be
assigned
>>>> to the same register, even though their live ranges don't
intersect,
>>>> since minInvocationsInclusiveScanAMD() may overwrite the value
of foo
>>>> in a previous iteration if the loop exit condition isn't
uniform. How
>>>> can we express this in the backend? I don't know much about
the LLVM
>>>> infrastucture, so I'm not sure if it's relatively easy
or really hard.
>>>>
>>>> Thanks,
>>>>
>>>> Connor
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>

Matt Arsenault via llvm-dev

2017-Jun-13 23:33 UTC

head link

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com>
wrote:
> 
> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at
redhat.com> wrote:
>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>> cc some people who have worked on this.
>>>> 
>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>> Hi all,
>>>>> 
>>>>> I've been looking into how to implement the more
advanced Shader Model
>>>>> 6 reduction operations in radv (and obviously most of the
work would
>>>>> be useful for radeonsi too). They're explained in the
spec for
>>>>> GL_AMD_shader_ballot at
>>>>>
https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>> but I'll summarize them here. There are two types of
operations:
>>>>> reductions that always return a uniform value, and prefix
scan
>>>>> operations. The reductions can be implemented in terms of
the prefix
>>>>> scan (although in practice I don't think we want to
implement them in
>>>>> exactly the same way), and the concerns are mostly the
same, so I'll
>>>>> focus on the prefix scan operations for now. Given an
operation `op'
>>>>> and an input value `a' (that's really a SIMD array
with one value per
>>>>> invocation, even though it's a scalar value in LLVM),
the prefix scan
>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in
invocation 1, a[0]
>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The
prefix scan will also
>>>>> work for non-uniform control flow: it simply skips inactive
>>>>> invocations.
>>>>> 
>>>>> On the LLVM side, I think that we have most of the
AMD-specific
>>>>> low-level shuffle intrinsics implemented that you need to
do this, but
>>>>> I can think of a few concerns/questions. First of all, to
implement
>>>>> the prefix scan, we'll need to do a code sequence that
looks like
>>>>> this, modified from
>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
(replace
>>>>> v_foo_f32 with the appropriate operation):
>>>>> 
>>>>> ; v0 is the input register
>>>>> v_mov_b32 v1, v0
>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3
>>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction
4
>>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction
5
>>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa //
Instruction 6
>>>>> v_nop // Add two independent instructions to avoid a data
hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc //
Instruction 7
>>>>> 
>>>>> The problem is that the way these instructions use the DPP
word isn't
>>>>> currently expressible in LLVM. We have the
llvm.amdgcn.mov_dpp
>>>>> intrinsic, but it isn't enough. For example, take the
first
>>>>> instruction:
>>>>> 
>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>> 
>>>>> What it's doing is shifting v0 right by one within each
row and adding
>>>>> it to v1. v1 stays the same in the first lane of each row,
however.
>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as
something like
>>>>> this, in LLVM-like pseduocode:
>>>>> 
>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>> %result = foo %tmp, %input
>>>>> 
>>>>> but this is incorrect. If I'm reading the source
correctly, this will
>>>>> make %tmp garbage in lane 0 (since it just turns into a
normal move
>>>>> with the dpp modifier, and no restrictions on the
destination). We
>>>>> could set bound_ctrl to 0 to work around this, since it
will make %tmp
>>>>> 0 in lane 0, but that won't work with operations whose
identity is
>>>>> non-0 like min and max. What we need is something like:
>>>>> 
>>> 
>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl
=0)
>>> or %input (bound_ctrl = 1)?
>> 
>> Oh, maybe it is... for that to happen the underlying move would need
>> to have the source and destination constrained to be the same. I
>> couldn't see that constraint anywhere I looked, but I'm not an
expert,
>> so I may have overlooked it. In any case, that behavior still isn't
>> what we want if we want to implement the prefix scan operations
>> efficiently.
>> 
> 
> Ok, I see what you are saying now.  I think the best option here is to
> document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
> copy its src operand to dst when bound_ctrl = 1 and it reads from an
invalid
> thread, and then when bound_ctrl=1, lower the intrinsic to a special tied
version
> of V_MOV_B32_dpp where the src and dst are the same register.
> 
> -Tom
This came up before that there’s no way to represent the unmodified input
register for the inactive lanes. I think the conclusion was that a new intrinsic
is needed to represent this case but I don’t think there was a consensus on what
it should look like.

-Matt

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170613/caeac9d4/attachment.html>

Tom Stellard via llvm-dev

2017-Jun-14 01:13 UTC

head link

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

On 06/13/2017 07:33 PM, Matt Arsenault wrote:> 
>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com
<mailto:tstellar at redhat.com>> wrote:
>>
>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at
redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>> cc some people who have worked on this.
>>>>>
>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I've been looking into how to implement the more
advanced Shader Model
>>>>>> 6 reduction operations in radv (and obviously most of
the work would
>>>>>> be useful for radeonsi too). They're explained in
the spec for
>>>>>> GL_AMD_shader_ballot at
>>>>>>
https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>> but I'll summarize them here. There are two types
of operations:
>>>>>> reductions that always return a uniform value, and
prefix scan
>>>>>> operations. The reductions can be implemented in terms
of the prefix
>>>>>> scan (although in practice I don't think we want to
implement them in
>>>>>> exactly the same way), and the concerns are mostly the
same, so I'll
>>>>>> focus on the prefix scan operations for now. Given an
operation `op'
>>>>>> and an input value `a' (that's really a SIMD
array with one value per
>>>>>> invocation, even though it's a scalar value in
LLVM), the prefix scan
>>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in
invocation 1, a[0]
>>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The
prefix scan will also
>>>>>> work for non-uniform control flow: it simply skips
inactive
>>>>>> invocations.
>>>>>>
>>>>>> On the LLVM side, I think that we have most of the
AMD-specific
>>>>>> low-level shuffle intrinsics implemented that you need
to do this, but
>>>>>> I can think of a few concerns/questions. First of all,
to implement
>>>>>> the prefix scan, we'll need to do a code sequence
that looks like
>>>>>> this, modified from
>>>>>>
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>
>>>>>> ; v0 is the input register
>>>>>> v_mov_b32 v1, v0
>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3
>>>>>> v_nop // Add two independent instructions to avoid a
data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe //
Instruction 4
>>>>>> v_nop // Add two independent instructions to avoid a
data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc //
Instruction 5
>>>>>> v_nop // Add two independent instructions to avoid a
data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa //
Instruction 6
>>>>>> v_nop // Add two independent instructions to avoid a
data hazard
>>>>>> v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc //
Instruction 7
>>>>>>
>>>>>> The problem is that the way these instructions use the
DPP word isn't
>>>>>> currently expressible in LLVM. We have the
llvm.amdgcn.mov_dpp
>>>>>> intrinsic, but it isn't enough. For example, take
the first
>>>>>> instruction:
>>>>>>
>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>
>>>>>> What it's doing is shifting v0 right by one within
each row and adding
>>>>>> it to v1. v1 stays the same in the first lane of each
row, however.
>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as
something like
>>>>>> this, in LLVM-like pseduocode:
>>>>>>
>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>> %result = foo %tmp, %input
>>>>>>
>>>>>> but this is incorrect. If I'm reading the source
correctly, this will
>>>>>> make %tmp garbage in lane 0 (since it just turns into a
normal move
>>>>>> with the dpp modifier, and no restrictions on the
destination). We
>>>>>> could set bound_ctrl to 0 to work around this, since it
will make %tmp
>>>>>> 0 in lane 0, but that won't work with operations
whose identity is
>>>>>> non-0 like min and max. What we need is something like:
>>>>>>
>>>>
>>>> Why is %tmp garbage?  I thought the two options were 0
(bound_ctrl =0)
>>>> or %input (bound_ctrl = 1)?
>>>
>>> Oh, maybe it is... for that to happen the underlying move would
need
>>> to have the source and destination constrained to be the same. I
>>> couldn't see that constraint anywhere I looked, but I'm not
an expert,
>>> so I may have overlooked it. In any case, that behavior still
isn't
>>> what we want if we want to implement the prefix scan operations
>>> efficiently.
>>>
>>
>> Ok, I see what you are saying now.  I think the best option here is to
>> document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
>> copy its src operand to dst when bound_ctrl = 1 and it reads from an
invalid
>> thread, and then when bound_ctrl=1, lower the intrinsic to a special
tied version
>> of V_MOV_B32_dpp where the src and dst are the same register.
>>
>> -Tom
> 
> This came up before that there’s no way to represent the unmodified input
register for the inactive lanes. I think the conclusion was that a new intrinsic
is needed to represent this case but I don’t think there was a consensus on what
it should look like.
> 
I think re-defining the existing intrinsic will be easier than adding a new
one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken
for the bound_ctrl=1 case, which isn't really ideal.

-Tom
> -Matt
>

Reasonably Related Threads

Search for more apparently analagous threads

llvm dev - Jun 2017 - Implementing cross-thread reduction in the AMDGPU backend

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Reasonably Related Threads