Tom Stellard via llvm-dev
2017-Jun-12 23:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
> cc some people who have worked on this.
>
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> Hi all,
>>
>> I've been looking into how to implement the more advanced Shader Model
>> 6 reduction operations in radv (and obviously most of the work would
>> be useful for radeonsi too). They're explained in the spec for
>> GL_AMD_shader_ballot at
>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>> but I'll summarize them here. There are two types of operations:
>> reductions that always return a uniform value, and prefix scan
>> operations. The reductions can be implemented in terms of the prefix
>> scan (although in practice I don't think we want to implement them in
>> exactly the same way), and the concerns are mostly the same, so I'll
>> focus on the prefix scan operations for now. Given an operation `op'
>> and an input value `a' (that's really a SIMD array with one value per
>> invocation, even though it's a scalar value in LLVM), the prefix scan
>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>> work for non-uniform control flow: it simply skips inactive
>> invocations.
>>
>> On the LLVM side, I think that we have most of the AMD-specific
>> low-level shuffle intrinsics implemented that you need to do this, but
>> I can think of a few concerns/questions. First of all, to implement
>> the prefix scan, we'll need a code sequence that looks like this,
>> modified from
>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>> v_foo_f32 with the appropriate operation):
>>
>> ; v0 is the input register
>> v_mov_b32 v1, v0
>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>
>> The problem is that the way these instructions use the DPP word isn't
>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>> intrinsic, but it isn't enough. For example, take the first
>> instruction:
>>
>> v_foo_f32 v1, v0, v1 row_shr:1
>>
>> What it's doing is shifting v0 right by one within each row and
>> combining it with v1. v1 stays the same in the first lane of each row,
>> however. With llvm.amdgcn.mov_dpp, we could try to express it as
>> something like this, in LLVM-like pseudocode:
>>
>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>> %result = foo %tmp, %input
>>
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the dpp modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max. What we need is something like:
>>

Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
or %input (bound_ctrl = 1)?

-Tom

>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>
>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>> then applies the DPP swizzling to the second argument and does 'foo'
>> to the second and third arguments. It would mean that we'd have a
>> separate intrinsic for every operation we care about, but I can't
>> think of a better way to express it. Is there a better way that
>> doesn't involve creating an intrinsic for each operation?
>>
>> Next, there's the fact that this code sequence only works when the
>> active lanes are densely packed, but we have to make this work even
>> when control flow is non-uniform. Essentially, we need to "skip over"
>> the inactive lanes by setting them to the identity, and then we need
>> to enable them in the exec mask when doing the reduction to make sure
>> they pass along the correct result. That is, to handle non-uniform
>> control flow, we need something like:
>>
>> invert EXEC
>> result = identity
>> set EXEC to ~0
>> <original code sequence>
>> restore original EXEC
>>
>> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
>> intrinsic that returns the first argument with inactive lanes set to
>> the second argument. We'd also need something like WQM to make all the
>> lanes active during the sequence. But that raises some hairy
>> requirements for register allocation. For example, in something like:
>>
>> foo = ...
>> if (...) {
>>     bar = minInvocationsInclusiveScanAMD(...)
>> } else {
>>     ... = foo;
>> }
>>
>> we have to make sure that foo isn't allocated to the same register as
>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>> even though they don't interfere. That's because the implementation of
>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>> consider the following:
>>
>> do {
>>     bar = minInvocationsInclusiveScanAMD(...);
>>     // ...
>>     ... = bar; // last use of bar
>>     foo = ...;
>> } while (...);
>>
>> ... = foo;
>>
>> again, foo and the temporaries used to compute bar can't be assigned
>> to the same register, even though their live ranges don't intersect,
>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>> in a previous iteration if the loop exit condition isn't uniform. How
>> can we express this in the backend? I don't know much about the LLVM
>> infrastructure, so I'm not sure if it's relatively easy or really hard.
>>
>> Thanks,
>>
>> Connor
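For reference, the scan semantics described above can be modeled with a
plain scalar loop. This is only an illustrative sketch: the name
inclusive_scan_model is made up, and the 64-lane wavefront and exec-mask
parameter are assumptions for illustration, not radv or LLVM code.

#include <stdint.h>

#define WAVE_SIZE 64

/* Model of an inclusive prefix scan under non-uniform control flow:
 * each active invocation receives the reduction of all active
 * invocations up to and including itself; inactive invocations
 * contribute nothing and receive nothing, as the extension spec
 * describes. */
static void inclusive_scan_model(const float *in, float *out,
                                 uint64_t exec, float identity,
                                 float (*op)(float, float))
{
    float acc = identity;
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (!((exec >> lane) & 1))
            continue;               /* inactive invocation: skipped */
        acc = op(acc, in[lane]);    /* fold this lane's input in */
        out[lane] = acc;            /* lane sees the running result */
    }
}

With op = fminf and identity = INFINITY this models a min scan in the
style of minInvocationsInclusiveScanAMD; the hardware sequence above
computes the same per-lane results in a fixed number of cross-lane steps
rather than a serial loop.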
Connor Abbott via llvm-dev
2017-Jun-13 00:03 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> [snip]
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the dpp modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max.
>
> Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
> or %input (bound_ctrl = 1)?

Oh, maybe it is... for that to happen, the underlying move would need
to have the source and destination constrained to be the same. I
couldn't see that constraint anywhere I looked, but I'm not an expert,
so I may have overlooked it. In any case, that behavior still isn't
what we want if we want to implement the prefix scan operations
efficiently.

Connor
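To make the two readings concrete, here is a rough per-lane model of
v_mov_b32 with row_shr:1. The 16-lane row size comes from the DPP
description in the GPUOpen article linked above; the function itself is
a hypothetical illustration, not how the backend lowers anything.

#include <stdbool.h>
#include <stdint.h>

#define WAVE_SIZE 64
#define ROW_SIZE  16

/* row_shr:1 makes each lane read from the lane one below it within its
 * 16-lane row, so lane 0 of each row has no valid source. What an
 * invalid read produces is exactly the question in this thread: zero
 * under the zeroing behavior (called bound_ctrl = 0 above), or the
 * prior contents of dst, which is only well-defined if dst is tied to
 * src. */
static void mov_dpp_row_shr1_model(const uint32_t *src, uint32_t *dst,
                                   bool zero_invalid)
{
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (lane % ROW_SIZE != 0)
            dst[lane] = src[lane - 1];  /* valid source in same row */
        else if (zero_invalid)
            dst[lane] = 0;              /* read-as-zero behavior */
        /* else: dst[lane] is left unwritten. A fresh %tmp in LLVM IR
         * has no prior contents, hence "garbage" in lane 0. */
    }
}

The zeroing case shows why bound_ctrl is fine for operations with
identity 0, such as add, but not for min and max, whose identities are
non-zero.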
Tom Stellard via llvm-dev
2017-Jun-13 00:23 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/12/2017 08:03 PM, Connor Abbott wrote:
> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>> [snip]
>> Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
>> or %input (bound_ctrl = 1)?
>
> Oh, maybe it is... for that to happen, the underlying move would need
> to have the source and destination constrained to be the same. I
> couldn't see that constraint anywhere I looked, but I'm not an expert,
> so I may have overlooked it. In any case, that behavior still isn't
> what we want if we want to implement the prefix scan operations
> efficiently.

Ok, I see what you are saying now. I think the best option here is to
document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
copy its src operand to dst when bound_ctrl = 1 and it reads from an
invalid thread, and then, when bound_ctrl = 1, lower the intrinsic to a
special tied version of V_MOV_B32_dpp where the src and dst are the
same register.

-Tom
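As a sketch of what the proposed tied lowering would mean per lane
(illustrative only: the tied V_MOV_B32_dpp variant is a proposal in this
thread, not an existing instruction definition, and the helper name is
made up):

#include <stdint.h>
#include <string.h>

#define WAVE_SIZE 64
#define ROW_SIZE  16

/* Tied model of v_mov_b32_dpp with row_shr:1 where src and dst are the
 * same register: DPP reads the pre-instruction values of all lanes,
 * and a lane whose source is invalid simply keeps its old value, i.e.
 * its own input. That is the "copy src to dst" behavior described
 * above. */
static void mov_dpp_row_shr1_tied_model(uint32_t *v /* src == dst */)
{
    uint32_t old[WAVE_SIZE];
    memcpy(old, v, sizeof old);      /* snapshot before any writes */
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (lane % ROW_SIZE != 0)
            v[lane] = old[lane - 1]; /* shifted value from the row */
        /* else: v[lane] is unchanged, the tied-operand effect */
    }
}

With this in place, the %tmp in Connor's original pseudocode would hold
%input in the first lane of each row rather than garbage, at the cost of
an extra copy whenever the source value is still live afterwards, since
the tied constraint forces %tmp and %input into the same register.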