Connor Abbott via llvm-dev
2017-Jun-12 21:58 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
Hi all,

I've been looking into how to implement the more advanced Shader Model 6 reduction operations in radv (and obviously most of the work would be useful for radeonsi too). They're explained in the spec for GL_AMD_shader_ballot at https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt, but I'll summarize them here. There are two types of operations: reductions that always return a uniform value, and prefix scan operations. The reductions can be implemented in terms of the prefix scan (although in practice I don't think we want to implement them in exactly the same way), and the concerns are mostly the same, so I'll focus on the prefix scan operations for now. Given an operation `op' and an input value `a' (that's really a SIMD array with one value per invocation, even though it's a scalar value in LLVM), the prefix scan returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0] `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan also works for non-uniform control flow: it simply skips inactive invocations.

On the LLVM side, I think we already have most of the AMD-specific low-level shuffle intrinsics you'd need to do this, but I can think of a few concerns/questions. First of all, to implement the prefix scan, we'll need to emit a code sequence that looks like this, modified from http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace v_foo_f32 with the appropriate operation):

; v0 is the input register
v_mov_b32 v1, v0
v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7

The problem is that the way these instructions use the DPP word isn't currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For example, take the first instruction:

v_foo_f32 v1, v0, v1 row_shr:1

What it's doing is shifting v0 right by one within each row and combining it with v1 via `foo'; v1 stays the same in the first lane of each row, however. With llvm.amdgcn.mov_dpp, we could try to express it as something like this, in LLVM-like pseudocode:

%tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
%result = foo %tmp, %input

but this is incorrect. If I'm reading the source correctly, this will make %tmp garbage in lane 0 (since it just turns into a normal move with the DPP modifier, and no restrictions on the destination). We could set bound_ctrl to 0 to work around this, since it will make %tmp 0 in lane 0, but that won't work with operations whose identity is non-0, like min and max. What we need is something like:

%result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1

where llvm.amdgcn.foo_dpp copies the first argument to the result, then applies the DPP swizzling to the second argument and does 'foo' to the second and third arguments. It would mean that we'd have a separate intrinsic for every operation we care about, but I can't think of a better way to express it.
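To make the intended semantics concrete, here is a rough scalar C++ model of what such a foo_dpp would compute for a row_shr:N control. This is only an illustrative sketch: the function name, the use of '+' as a stand-in for `foo', and the plain 64-element array standing in for a wavefront are not anything that exists today, and inactive lanes and the other DPP controls aren't modeled.

#include <array>

// Sketch of the proposed llvm.amdgcn.foo_dpp semantics for a row_shr:N DPP
// control. Hypothetical reference model only: '+' stands in for `foo', and
// the EXEC mask and other DPP fields (bank_mask, row_bcast, ...) are ignored.
constexpr int kWaveSize = 64;
constexpr int kRowSize = 16;

std::array<float, kWaveSize> foo_dpp_row_shr(
    const std::array<float, kWaveSize> &old_vals,  // first argument, copied to the result
    const std::array<float, kWaveSize> &src,       // second argument, DPP-swizzled
    const std::array<float, kWaveSize> &other,     // third argument
    int shift) {
  // Lanes whose shifted source falls outside their 16-lane row are not
  // written, so they keep the old value (the "v1 stays the same in the first
  // lane of each row" behaviour described above).
  std::array<float, kWaveSize> result = old_vals;
  for (int lane = 0; lane < kWaveSize; ++lane) {
    int src_lane = lane - shift;
    if (src_lane >= 0 && src_lane / kRowSize == lane / kRowSize)
      result[lane] = src[src_lane] + other[lane];  // 'foo' of swizzled src and third argument
  }
  return result;
}

With %result passed as both the first and third arguments and %input as the second, repeating this for shift = 1, 2, 3 reproduces the first three instructions of the assembly sequence above.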
Is there a better way that doesn't involve creating an intrinsic for each operation?

Next, there's the fact that this code sequence only works when the active lanes are densely packed, but we have to make this work even when control flow is non-uniform. Essentially, we need to "skip over" the inactive lanes by setting them to the identity, and then we need to enable them in the exec mask when doing the reduction to make sure they pass along the correct result. That is, to handle non-uniform control flow, we need something like:

invert EXEC
result = identity
set EXEC to ~0
<original code sequence>
restore original EXEC

I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes intrinsic that returns the first argument with inactive lanes set to the second argument. We'd also need something like WQM to make all the lanes active during the sequence. But that raises some hairy requirements for register allocation. For example, in something like:

foo = ...
if (...) {
    bar = minInvocationsInclusiveScanAMD(...)
} else {
    ... = foo;
}

we have to make sure that foo isn't allocated to the same register as one of the temporaries used inside minInvocationsInclusiveScanAMD(), even though they don't interfere. That's because the implementation of minInvocationsInclusiveScanAMD() will do funny things with the exec mask, possibly overwriting foo, if the condition is non-uniform. Or consider the following:

do {
    bar = minInvocationsInclusiveScanAMD(...);
    // ...
    ... = bar; // last use of bar
    foo = ...;
} while (...);

... = foo;

again, foo and the temporaries used to compute bar can't be assigned to the same register, even though their live ranges don't intersect, since minInvocationsInclusiveScanAMD() may overwrite the value of foo in a previous iteration if the loop exit condition isn't uniform. How can we express this in the backend? I don't know much about the LLVM infrastructure, so I'm not sure if it's relatively easy or really hard.

Thanks,

Connor
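For reference, the per-wave semantics that all of the above has to implement, an inclusive scan that simply skips inactive invocations, can be modelled in scalar C++ roughly as follows. This is a sketch only: '+' stands in for the reduction op, the exec parameter models the EXEC mask, and none of these names correspond to anything in LLVM or the extension.

#include <array>
#include <cstdint>

constexpr int kWaveSize = 64;

// Scalar reference model of an inclusive prefix scan over the active lanes
// of one 64-lane wavefront. Hypothetical sketch: '+' stands in for the
// reduction op and 0.0f for its identity; inactive lanes are skipped and
// their result slots are simply left untouched here.
std::array<float, kWaveSize> inclusive_scan_active(
    const std::array<float, kWaveSize> &input, uint64_t exec) {
  std::array<float, kWaveSize> result{};
  float accum = 0.0f;  // identity of the op
  for (int lane = 0; lane < kWaveSize; ++lane) {
    if (!(exec & (uint64_t(1) << lane)))
      continue;                    // inactive invocation: skipped entirely
    accum = accum + input[lane];   // fold in this lane's value
    result[lane] = accum;          // lane i sees the scan of active lanes 0..i
  }
  return result;
}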
Tom Stellard via llvm-dev
2017-Jun-12 23:15 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
cc some people who have worked on this.

On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
> [original message quoted in full above; snipped]
Tom Stellard via llvm-dev
2017-Jun-12 23:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
> cc some people who have worked on this.
>
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> [snip]
>>
>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>> %result = foo %tmp, %input
>>
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the DPP modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0, like min and max.

Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0) or %input (bound_ctrl = 1)?

-Tom

>> [rest of the original message snipped]
Nicolai Hähnle via llvm-dev
2017-Jun-13 06:23 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 12.06.2017 23:58, Connor Abbott wrote:
> [snip]
>
> Essentially, we need to "skip over" the inactive lanes by setting them
> to the identity, and then we need to enable them in the exec mask when
> doing the reduction to make sure they pass along the correct result.
> That is, to handle non-uniform control flow, we need something like:
>
> invert EXEC
> result = identity
> set EXEC to ~0
> <original code sequence>
> restore original EXEC

Yeah, this is going to be a pain, mostly in how it could interact with register spilling.

> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
> intrinsic that returns the first argument with inactive lanes set to
> the second argument. We'd also need something like WQM to make all the
> lanes active during the sequence. But that raises some hairy
> requirements for register allocation. For example, in something like:
>
> foo = ...
> if (...) {
>     bar = minInvocationsInclusiveScanAMD(...)
> } else {
>     ... = foo;
> }
>
> we have to make sure that foo isn't allocated to the same register as
> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
> even though they don't interfere.
>
> [snip]
>
> again, foo and the temporaries used to compute bar can't be assigned
> to the same register, even though their live ranges don't intersect,
> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
> in a previous iteration if the loop exit condition isn't uniform. How
> can we express this in the backend? I don't know much about the LLVM
> infrastructure, so I'm not sure if it's relatively easy or really hard.

The actual register allocation is probably comparatively harmless. The register allocator runs long after control flow structurization, which means that in your examples above, the register allocator actually sees that the lifetimes of the relevant variables overlap. Specifically, the basic-block structure in your first if-based example is actually:

foo = ...
cbranch merge
; fall-through

if:
bar = ...
; fall-through

merge:
cbranch endif
; fall-through

else:
... = foo
; fall-through

endif:

... and so foo and bar will not be assigned the same register.

Again, where it gets hairy is with spilling, because the spiller is hopelessly lost when it comes to understanding EXEC masks...

Cheers,
Nicolai
Connor Abbott via llvm-dev
2017-Jun-13 18:48 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On Mon, Jun 12, 2017 at 11:23 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 12.06.2017 23:58, Connor Abbott wrote:
>> [snip]
>
> The actual register allocation is probably comparatively harmless. The
> register allocator runs long after control flow structurization, which
> means that in your examples above, the register allocator actually sees
> that the lifetimes of the relevant variables overlap. Specifically, the
> basic-block structure in your first if-based example is actually:
>
> [snip]
>
> ... and so foo and bar will not be assigned the same register.

What about my second example? There, the expanded control flow should be the same as the original control flow, yet we still have the problem AFAICT. Also, does this mean that the registers will interfere even if they're only ever written with non-overlapping exec masks? That seems like a shame, since in the vast majority of cases you're going to add artificial extra register pressure and constrain the register allocator more than necessary...

> Again, where it gets hairy is with spilling, because the spiller is
> hopelessly lost when it comes to understanding EXEC masks...

Ok, I'll look into it. Thanks for the hints.