Sumner, Brian via llvm-dev
2017-Jun-15 13:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
I'm wondering about the focus on bound_ctrl. Any cleared bit in the row_mask
or bank_mask will also disable updating the result.

Brian

-----Original Message-----
From: Connor Abbott [mailto:cwabbott0 at gmail.com]
Sent: Wednesday, June 14, 2017 6:13 PM
To: tstellar at redhat.com
Cc: Matt Arsenault; llvm-dev at lists.llvm.org; Kolton, Sam; Sumner, Brian; Pykhtin, Valery
Subject: Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/14/2017 05:05 PM, Connor Abbott wrote:
>> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>>
>>>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com> wrote:
>>>>>
>>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>>> cc some people who have worked on this.
>>>>>>>>
>>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've been looking into how to implement the more advanced Shader
>>>>>>>>> Model 6 reduction operations in radv (and obviously most of the
>>>>>>>>> work would be useful for radeonsi too). They're explained in the
>>>>>>>>> spec for GL_AMD_shader_ballot at
>>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>>>>> but I'll summarize them here. There are two types of operations:
>>>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>>>> operations. The reductions can be implemented in terms of the
>>>>>>>>> prefix scan (although in practice I don't think we want to
>>>>>>>>> implement them in exactly the same way), and the concerns are
>>>>>>>>> mostly the same, so I'll focus on the prefix scan operations for
>>>>>>>>> now. Given an operation `op' and an input value `a' (that's
>>>>>>>>> really a SIMD array with one value per invocation, even though
>>>>>>>>> it's a scalar value in LLVM), the prefix scan returns a[0] in
>>>>>>>>> invocation 0, a[0] `op' a[1] in invocation 1, a[0] `op' a[1]
>>>>>>>>> `op' a[2] in invocation 2, etc. The prefix scan will also work
>>>>>>>>> for non-uniform control flow: it simply skips inactive
>>>>>>>>> invocations.
>>>>>>>>>
>>>>>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>>>>>> low-level shuffle intrinsics implemented that you need to do
>>>>>>>>> this, but I can think of a few concerns/questions.
>>>>>>>>>
>>>>>>>>> First of all, to implement the prefix scan, we'll need to do a
>>>>>>>>> code sequence that looks like this, modified from
>>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
>>>>>>>>> (replace v_foo_f32 with the appropriate operation):
>>>>>>>>>
>>>>>>>>> ; v0 is the input register
>>>>>>>>> v_mov_b32 v1, v0
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>>>>
>>>>>>>>> The problem is that the way these instructions use the DPP word
>>>>>>>>> isn't currently expressible in LLVM. We have the
>>>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For example,
>>>>>>>>> take the first instruction:
>>>>>>>>>
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>
>>>>>>>>> What it's doing is shifting v0 right by one within each row and
>>>>>>>>> adding it to v1. v1 stays the same in the first lane of each
>>>>>>>>> row, however. With llvm.amdgcn.mov_dpp, we could try to express
>>>>>>>>> it as something like this, in LLVM-like pseudocode:
>>>>>>>>>
>>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>>>>> %result = foo %tmp, %input
>>>>>>>>>
>>>>>>>>> but this is incorrect. If I'm reading the source correctly, this
>>>>>>>>> will make %tmp garbage in lane 0 (since it just turns into a
>>>>>>>>> normal move with the dpp modifier, and no restrictions on the
>>>>>>>>> destination). We could set bound_ctrl to 0 to work around this,
>>>>>>>>> since it will make %tmp 0 in lane 0, but that won't work with
>>>>>>>>> operations whose identity is non-0, like min and max. What we
>>>>>>>>> need is something like:
>>>>>>>>>
>>>>>>>
>>>>>>> Why is %tmp garbage? I thought the two options were 0
>>>>>>> (bound_ctrl = 0) or %input (bound_ctrl = 1)?
>>>>>>
>>>>>> Oh, maybe it is... for that to happen, the underlying move would
>>>>>> need to have the source and destination constrained to be the same.
>>>>>> I couldn't see that constraint anywhere I looked, but I'm not an
>>>>>> expert, so I may have overlooked it. In any case, that behavior
>>>>>> still isn't what we want if we want to implement the prefix scan
>>>>>> operations efficiently.
>>>>>>
>>>>>
>>>>> Ok, I see what you are saying now. I think the best option here is
>>>>> to document that the behavior of the llvm.amdgcn.mov.dpp intrinsic
>>>>> is to copy its src operand to dst when bound_ctrl = 1 and it reads
>>>>> from an invalid thread, and then, when bound_ctrl = 1, lower the
>>>>> intrinsic to a special tied version of V_MOV_B32_dpp where the src
>>>>> and dst are the same register.
>>>>>
>>>>> -Tom
>>>>
>>>> This came up before: there's no way to represent the unmodified
>>>> input register for the inactive lanes.
>>>> I think the conclusion was that a new intrinsic is needed to
>>>> represent this case, but I don't think there was a consensus on what
>>>> it should look like.
>>>>
>>>
>>> I think re-defining the existing intrinsic will be easier than adding
>>> a new one. Also, if we add a new one, llvm.amdgcn.mov.dpp will still
>>> be broken for the bound_ctrl=1 case, which isn't really ideal.
>>
>> You can't re-define the existing intrinsic. The problem is that a move
>> with DPP doesn't just write to its destination register - it modifies
>> it, i.e. it reads and then writes to it. You can't express that with
>> SSA, since you can only ever write to an SSA value once. I think what
>> you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
>> call it?), that takes the value to use if the read is invalid. That
>> is, something like:
>>
>> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>
>
> This is a more general form of what I'm suggesting. If you re-define
> (when I say re-define, I just mean document this behavior)
> llvm.amdgcn.mov.dpp such that when bound_ctrl = 0 the value of %src
> will be written to %dst for invalid lanes, then this will have the same
> effect as llvm.amdgcn.update.dpp when %old = %update.
>
> I can see the usefulness now of having this extra intrinsic, but I
> still think it would be good to fix llvm.amdgcn.mov.dpp. At the
> MachineIR level, I think you could handle the new intrinsic and
> llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same pseudo
> instructions, so fixing llvm.amdgcn.mov.dpp may not be so hard if you
> are adding the new intrinsic.

Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to
generate the best possible code sequence, and it's a strict superset of
llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm inclined to
not bother fixing it and just deprecate it instead.

>
>> would get turned into:
>>
>> v_mov_b32 %new, %old
>> v_mov_b32 %new, %update (dpp control)
>>
>> And hopefully coalescing will remove the first copy. Think of it as
>> being like lowering LLVM's three-address form for e.g. ADD to the x86
>> two-address form. If you wanted to get fancy, you could also recognize
>> the case where the old value is zero and turn that into a DPP move
>> with bound_ctrl = 1.
>>
>> Also, remember that the whole point of this was to be able to express
>> stuff like:
>>
>> v_min_f32 v1, v0, v1 (dpp control)
>>
>> where you take the minimum of v1 and the swizzled v0, except where you
>> would've read an invalid lane for v0, you read the old value for v1
>> instead. For operations like add and iadd where the identity is 0, you
>> can set bound_ctrl = 1, and then the optimizer can safely fold the
>> v_mov_b32 into the operation itself. That is, you'd do:
>>
>> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
>> %new = i32 add %swizzled, %old
>>
>> and after coalescing, register allocation, etc., the backend would
>> turn that into:
>>
>> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
>>
>> which is functionally equivalent to the version without the
>> bound_ctrl.
>>
>> Otherwise, for operations `op' where a `op' a == a, you can do
>> something like:
>>
>> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>> %new = f32 fmin %swizzled, %old
>>
>> and then if the optimizer is clever enough, it can also fold it into
>> one instruction, since the invalid lanes will get their old values
>> back. I think that covers all the cases we care about.
>> The question is whether it's better to take that route, or whether
>> it's better to just add intrinsics like llvm.amdgcn.add_dpp,
>> llvm.amdgcn.fmin_dpp, etc., so that:
>>
>
> I would only go the route of adding a dpp intrinsic for every operation
> if it gives you functionality that you can't get with
> llvm.amdgcn.update.dpp. The main reason for this is that you will lose
> the generic combines that LLVM has for add, min, etc.

Ok, good point. I think this is only going to be used in a few specific
scenarios, but I can see stuff like fusing add + mul into mad being
useful.

>
>
> -Tom
>
>> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
>>
>> turns into:
>>
>> v_mov_b32 %new, %old
>> v_min_f32 %new, %src0, %src1 (dpp_control)
>>
>> and then we can get what we want directly, without having to do much
>> optimization except for the coalescing that already exists. The
>> downside is that we'd have to add a lot more intrinsics (one each for
>> min, max, fmin, fmax, add, and iadd).
>>
>>>
>>> -Tom
>>>
>>>> -Matt
>>>>
>>>
>
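For reference, the inclusive prefix scan semantics described at the top of
the thread can be modeled in plain C. This is a minimal sketch under stated
assumptions - a 64-lane wavefront, the exec mask passed as a uint64_t, and
every name invented for illustration; it is not part of any proposed API:

    /* Reference model of the inclusive prefix scan: for each active lane
     * i, result[i] = a[j0] `op' a[j1] `op' ... `op' a[i], where
     * j0 < j1 < ... < i are the active lanes at or below i. Inactive
     * lanes are skipped and their result slots are left untouched. */
    #include <stdint.h>

    #define WAVE_SIZE 64

    typedef float (*scan_op)(float, float);

    static void prefix_scan(const float a[WAVE_SIZE],
                            float result[WAVE_SIZE],
                            uint64_t exec, scan_op op)
    {
        int have_acc = 0;
        float acc = 0.0f;
        for (int lane = 0; lane < WAVE_SIZE; ++lane) {
            if (!((exec >> lane) & 1))
                continue;               /* inactive invocation: skip it */
            acc = have_acc ? op(acc, a[lane]) : a[lane];
            have_acc = 1;
            result[lane] = acc;
        }
    }

A uniform reduction is then just the scan value of the last active lane
broadcast to all lanes, which is why the thread treats the two operations
as sharing mostly the same concerns.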
Connor Abbott via llvm-dev
2017-Jun-15 21:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On Thu, Jun 15, 2017 at 6:56 AM, Sumner, Brian <Brian.Sumner at amd.com> wrote:
> I'm wondering about the focus on bound_ctrl. Any cleared bit in the
> row_mask or bank_mask will also disable updating the result.
>
> Brian

True. Although, the same issue (not being able to represent the unmodified
result in the disabled lanes) happens if you clear a bit in row_mask or
bank_mask as well.

> [snip]
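To make the semantics being proposed for llvm.amdgcn.update.dpp concrete:
the intrinsic swizzles %update according to the DPP control, but keeps %old
in any lane whose read would be invalid. Below is a minimal C model of just
the row_shr case, offered as a sketch; the 16-lane DPP row width matches
GCN, but the function and parameter names are invented for illustration:

    /* Model of "%new = llvm.amdgcn.update.dpp %old, %src, row_shr:shift":
     * each lane reads src from `shift' lanes below it within its 16-lane
     * row; a lane whose source falls outside its row (an invalid read)
     * keeps its old value instead. */
    #define WAVE_SIZE 64
    #define ROW_SIZE  16

    static void update_dpp_row_shr(const float old[WAVE_SIZE],
                                   const float src[WAVE_SIZE],
                                   int shift, float out[WAVE_SIZE])
    {
        for (int lane = 0; lane < WAVE_SIZE; ++lane) {
            int row_start = (lane / ROW_SIZE) * ROW_SIZE;
            int src_lane = lane - shift;
            if (src_lane >= row_start)
                out[lane] = src[src_lane];  /* valid read within the row */
            else
                out[lane] = old[lane];      /* invalid read: keep old */
        }
    }

Feeding out[lane] into fmin(out[lane], old[lane]) then leaves the first
`shift' lanes of each row unchanged, since fmin(x, x) == x - exactly the
v_min_f32 v1, v0, v1 row_shr:N behavior discussed above, and the reason the
move can fold into the ALU op even for operations whose identity is non-0.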