thr3ads.net - llvm dev - [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics [May 2019]

Sander De Smalen via llvm-dev

2019-May-16 12:53 UTC

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hello again,

I've been meaning to follow up on this thread for the last couple of weeks,
my apologies for the delay.

To summarise the feedback on the proposal for vector.reduce.fadd/fmul:

There seems to be consensus to keep the explicit start value to better
accommodate chained reductions (as opposed to generating IR that performs the
reduction of the first element using an extract/fadd/insert pattern). An
important use-case for these reductions is to work inside vectorized loops,
where chaining happens through the reduction value's PHI node (i.e. the scalar
reduction value from one iteration will be the input to the next iteration).
This intrinsic would also naturally match reduction instructions of ARM SVE and
NEC SX-Aurora.
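
A minimal Python sketch of this chaining, under stated assumptions: Python
floats are doubles rather than f32, and the names ordered_fadd and chained_sum
are hypothetical, chosen only for illustration. The accumulator carried across
loop iterations plays the role of the reduction PHI, and each iteration passes
it in as the start value of the next ordered reduction.

```python
from typing import Sequence


def ordered_fadd(start: float, vec: Sequence[float]) -> float:
    """Model of a strict fadd reduction: ((start + v0) + v1) + ... in lane order."""
    acc = start
    for x in vec:
        acc += x
    return acc


def chained_sum(data: Sequence[float], vf: int = 4) -> float:
    """Sum `data` in chunks of `vf` lanes, chaining through the start value."""
    acc = 0.0  # initial value of the reduction PHI
    for i in range(0, len(data), vf):
        # The scalar result of one iteration is the start value of the next.
        acc = ordered_fadd(acc, data[i:i + vf])
    return acc


def scalar_sum(data: Sequence[float]) -> float:
    """The original scalar loop that the vectorized form must reproduce."""
    acc = 0.0
    for x in data:
        acc += x
    return acc
```

Because the start value is folded in before the first lane, the chunked form
performs exactly the same additions in the same order as the scalar loop, so
no separate extract/fadd/insert step is needed per iteration.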

For Option A (https://reviews.llvm.org/D60261), there is an argument that code
creating or operating on these intrinsics can treat ordered and unordered
reductions the same (in that they have the same arguments). Fast-math flags
determine whether or not the intrinsic needs to be evaluated in strict order.
Codegen for non-strict reductions should be able to fold away the
identity-value.
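
To make the ordered/unordered distinction concrete, here is a small Python
model. It rests on assumptions: Python floats are doubles rather than f32, the
pairwise tree is only one of many evaluation orders a reassociating lowering
might pick, and the function names are hypothetical.

```python
from typing import Sequence


def ordered_fadd(start: float, vec: Sequence[float]) -> float:
    """Strict evaluation: ((start + v0) + v1) + ... in lane order."""
    acc = start
    for x in vec:
        acc += x
    return acc


def unordered_fadd(start: float, vec: Sequence[float]) -> float:
    """One possible reassociated evaluation: pairwise tree, start added last."""
    vals = list(vec)
    if not vals:
        return start
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])  # odd lane count: carry the last lane up
        vals = paired
    return start + vals[0]


vec = [1e16, 1.0, -1e16, 1.0]
strict = ordered_fadd(0.0, vec)     # 1.0 in double precision
relaxed = unordered_fadd(0.0, vec)  # 0.0 in double precision
```

With these inputs the two orders disagree, which is why the intrinsic may only
be evaluated out of order when the fast-math flags permit it. In the relaxed
form, adding a start value of 0.0 at the end leaves any non-zero sum unchanged.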

For Option B (https://reviews.llvm.org/D60262), David made the argument that
making the reduction order explicit (as opposed to deducing it from fast-math
flags) would ensure the ordering is always as expected, even when the FMF on
the call sites are dropped for some reason.

Is it correct that I sensed a slight preference for Option A, i.e. renaming the
intrinsics and keeping the same signature, but dropping the special-cased
behaviour for the identity-value with non-strict reductions? Regarding David's
argument, I think that although the extra expressiveness would be nice to have,
LLVM normally depends on the FMF being propagated correctly to produce faster
code, so this should also be sufficient for reductions.

If we go for Option A, I suggest we drop the 'experimental' prefix from
experimental.vector.reduce.fadd/fmul to avoid having to add an awkward
'.v2' suffix to the new intrinsic. When we implement all the suggestions
from this proposal (possibly including the one mentioned below), I wouldn't
really know what other features we could add other than predication (which would
be covered by the LLVM-VP proposal and thus require another renaming), or
possibly adding 'constrained' variants which I assume would have
separate intrinsics. So we might as well drop the 'experimental' prefix.

Finally, do we want to remove the restriction that the result type must always
match the vector-element type? A wider result type would then allow the
reduction to be performed in the wider type.
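
As a rough illustration of what a wider result type buys for integer
reductions, the Python sketch below models an unsigned add reduction whose
accumulator wraps at a chosen bit width. The helper name reduce_add is
hypothetical, and the <64 x i8> input mirrors the overflow case raised in the
quoted discussion.

```python
from typing import Sequence


def reduce_add(vec: Sequence[int], result_bits: int) -> int:
    """Unsigned add reduction whose accumulator wraps at result_bits."""
    mask = (1 << result_bits) - 1
    acc = 0
    for x in vec:
        acc = (acc + x) & mask
    return acc


lanes = [255] * 64             # a <64 x i8> vector with every lane at its unsigned maximum
narrow = reduce_add(lanes, 8)  # result type matches the element type: wraps to 192
wide = reduce_add(lanes, 32)   # result type wider than the element type: 16320
```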

Thanks,

Sander

> On 10 Apr 2019, at 18:56, Amara Emerson <aemerson at apple.com> wrote:
>
> I’m fine with the direction this is going, but let’s keep renaming to a
> minimum. They’ve been experimental long enough now that we should be able
> to jump to a final form after all the feedback.
>
> Amara
>
>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>>> On 8 Apr 2019, at 11:37, Simon Moll <moll at cs.uni-saarland.de> wrote:
>>>
>>> Hi,
>>>
>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>> Proposed change:
>>>>>> ----------------------------
>>>>>> In this RFC I propose changing the intrinsics for
>>>>>> llvm.experimental.vector.reduce.fadd and
>>>>>> llvm.experimental.vector.reduce.fmul (see options A and B). I also
>>>>>> propose renaming the 'accumulator' operand to 'start value' because
>>>>>> for fmul this is the start value of the reduction, rather than a
>>>>>> value into which the fmul reduction is accumulated.
>>> Note that the LLVM-VP proposal also changes the way reductions are
>>> handled in IR (https://reviews.llvm.org/D57504). This could be an
>>> opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic
>>> to the "llvm.vp.*" namespace and we can fix the reduction semantics in
>>> the process.
>> Thanks for pointing that out, Simon. I think for now we should keep this
>> proposal separate from LLVM-VP as they serve different purposes and have
>> different scope. But yes, we can easily rename the intrinsics again when
>> the VP proposal lands.
>>
>>> Btw, if you are at EuroLLVM, there is a BoF at 2pm today on LLVM-VP.
>>>
>>>>>>
>>>>>> [Option A] Always using the start value operand in the reduction
>>>>>> (https://reviews.llvm.org/D60261)
>>>>>>
>>>>>>  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>
>>>>>> This means that if the start value is 'undef', the result will be
>>>>>> undef and all code creating such a reduction will need to ensure it
>>>>>> has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When
>>>>>> using 'fast' or 'reassoc' on the call it will be implemented using an
>>>>>> unordered reduction, otherwise it will be implemented with an ordered
>>>>>> reduction. Note that a new intrinsic is required to capture the new
>>>>>> semantics. In this proposal the intrinsic is prefixed with a 'v2' for
>>>>>> the time being, with the expectation this will be dropped when we
>>>>>> remove 'experimental' from the reduction intrinsics in the future.
>>>>>>
>>>>>> [Option B] Having separate ordered and unordered intrinsics
>>>>>> (https://reviews.llvm.org/D60262).
>>>>>>
>>>>>>  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>
>>>>>> This will mean that the behaviour is explicit from the intrinsic and
>>>>>> the use of 'fast' or 'reassoc' on the call has no effect on how that
>>>>>> intrinsic is lowered. The ordered reduction intrinsic will take a
>>>>>> scalar start-value operand, where the unordered reduction intrinsic
>>>>>> will only take a vector operand.
>>>>>>
>>>>>> Both options auto-upgrade the IR to use the new (version of the)
>>>>>> intrinsics. I'm personally slightly in favour of [Option B], because
>>>>>> it better aligns with the definition of the SelectionDAG nodes and is
>>>>>> more explicit in its semantics. We also avoid having to use an
>>>>>> artificial 'v2' like prefix to denote the new behaviour of the
>>>>>> intrinsic.
>>>>> Do we have any targets with instructions that can actually use the
>>>>> start value? TBH I'd be tempted to suggest we just make the initial
>>>>> extractelement/fadd/insertelement pattern a manual extra stage and
>>>>> avoid having that argument entirely.
>>>>>
>>> NEC SX-Aurora has reduction instructions that take in a start value in
>>> a scalar register. We are hoping to upstream the backend:
>>> http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
>> Great, I think combined with the argument for chaining of ordered
>> reductions (often inside vectorized loops) and two architectures (ARM
>> SVE and SX-Aurora) taking a scalar start register, this is enough of an
>> argument to keep the explicit operand for the ordered reductions.
>>
>>>>>
>>>>>> Further efforts:
>>>>>> ----------------------------
>>>>>> Here is a non-exhaustive list of items I think work towards making
>>>>>> the intrinsics non-experimental:
>>>>>>
>>>>>>   • Adding SelectionDAG legalization for the _STRICT reduction
>>>>>>     SDNodes. After some great work from Nikita in D58015, unordered
>>>>>>     reductions are now legalized/expanded in SelectionDAG, so if we
>>>>>>     add expansion in SelectionDAG for strict reductions this would
>>>>>>     make the ExpandReductionsPass redundant.
>>>>>>   • Better enforcing the constraints of the intrinsics (see
>>>>>>     https://reviews.llvm.org/D60260).
>>>>>>   • I think we'll also want to be able to overload the result
>>>>>>     operand based on the vector element type for the intrinsics
>>>>>>     having the constraint that the result type must match the vector
>>>>>>     element type, e.g. dropping the redundant 'i32' in:
>>>>>>       i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a)
>>>>>>       => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>>>     since i32 is implied by <4 x i32>. This would have the added
>>>>>>     benefit that LLVM would automatically check for the operands to
>>>>>>     match.
>>>>>>
>>>>> Won't this cause issues with overflow? Isn't the point of an add (or
>>>>> mul...) reduction of, say, <64 x i8> giving a larger (i32 or i64)
>>>>> result so we don't lose anything? I agree for bitop reductions it
>>>>> doesn't make sense though.
>>>>>
>>>> Sorry - I forgot to add: which raises the question - should we be
>>>> considering signed/unsigned add/mul and possibly saturation reductions?
>>>>

David Green via llvm-dev

2019-May-16 16:36 UTC

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hello.

Thanks for working on this.

Are we talking only about the floating point versions of these, or the integer
ones as well?

For the integer ones, there are a number of new MVE instructions that reduce
from, for example, i16s into an i32, or long versions that accumulate into
an i64.

For example, the VADDVA.U16 Rda, Qm instruction will accumulate into a 32-bit
register. The VADDLVA.U16 RdaLo, RdaHi, Qm instruction will accumulate into a
pair of 32-bit registers (so a 64-bit value).

Thanks,
Dave

P.S. There are two different people on this thread that are named "David
Greene"/"David Green". Sorry in advance for the confusion.
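
A rough Python model of the accumulate-into-a-wider-register behaviour
described above. It is a sketch under assumptions, not a definition of the MVE
instructions: the helper name accumulate_widening is hypothetical, the lanes
are treated as unsigned, and eight 16-bit lanes per vector is assumed.

```python
from typing import Sequence


def accumulate_widening(acc: int, lanes: Sequence[int],
                        lane_bits: int, acc_bits: int) -> int:
    """Add all unsigned lanes to `acc`, wrapping at the accumulator width."""
    lane_max = (1 << lane_bits) - 1
    if any(not 0 <= x <= lane_max for x in lanes):
        raise ValueError("lane value does not fit the unsigned lane type")
    return (acc + sum(lanes)) & ((1 << acc_bits) - 1)


lanes = [0xFFFF] * 8  # eight u16 lanes, all at their maximum
into_32 = accumulate_widening(10, lanes, lane_bits=16, acc_bits=32)  # 524290
into_16 = accumulate_widening(10, lanes, lane_bits=16, acc_bits=16)  # wraps to 2
```

The incoming accumulator plays the same role for these integer reductions as
the start value does for fadd/fmul, and the 16-bit case shows what is lost if
the result is forced to match the element type.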




Philip Reames via llvm-dev

2019-May-16 19:38 UTC

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

On 5/16/19 5:53 AM, Sander De Smalen via llvm-dev wrote:
> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of
> weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better
> accommodate chained reductions (as opposed to generating IR that performs
> the reduction of the first element using extract/fadd/insert pattern). An
> important use-case for these reductions is to work inside vectorized
> loops, where chaining happens through the reduction value's PHI node
> (i.e. the scalar reduction value from one iteration will be the input to
> the next iteration). This intrinsic would also naturally match reduction
> instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument that
> code creating or operating on these intrinsics can treat ordered and
> unordered reductions the same (in that they have the same arguments).
> Fast-math flags determine whether or not the intrinsic needs to be
> evaluated in strict order. Codegen for non-strict reductions should be
> able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the argument
> that making the reduction-order explicit (as opposed to deducing this
> from fast-math flags) would ensure the ordering is always as expected,
> even when FMF on the call sites are dropped for some reason.
>
> Is it correct that I sensed a slight preference for Option A? i.e.
> Renaming the intrinsics and keeping the same signature, but dropping the
> special-cased behaviour for the identity-value with non-strict
> reductions. For David's argument, I think that although the extra
> expressiveness would be nice to have, LLVM normally depends on the FMF
> being propagated correctly to produce faster code so this should also be
> sufficient for reductions.

From your summary, Option A sounds better to me, but I'll defer to those
actively involved in this area.

> If we go for Option A, I suggest we drop the 'experimental' prefix from
> experimental.vector.reduce.fadd/fmul to avoid having to add an awkward
> '.v2' suffix to the new intrinsic. When we implement all the suggestions
> from this proposal (possibly including the one mentioned below), I
> wouldn't really know what other features we could add other than
> predication (which would be covered by the LLVM-VP proposal and thus
> require another renaming), or possibly adding 'constrained' variants
> which I assume would have separate intrinsics. So we might as well drop
> the 'experimental' prefix.

I'd suggest separating the experimental renaming proposal.  You don't
need it right now, you can simply change the IR semantics and forward
serialize old bitcode if desired.  Removing the experimental prefix has
support obligations, and I would recommend doing that as a separate
proposal if you want to.

> Finally, do we want to remove the restriction that the result type must
> always match the vector-element type? A wider result type would then
> allow the reduction to be performed in the wider type.
>
> Thanks,
>
> Sander

David Greene via llvm-dev

2019-May-16 19:56 UTC

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

I am fine with either option, really.

I think we do want to remove the result type restriction.

I think I'm fine with removing experimental but do we need more time
with the new versions to be sure?

                           -David

Sander De Smalen <Sander.DeSmalen at arm.com> writes:
> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of
weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better
> accommodate chained reductions (as opposed to generating IR that
> performs the reduction of the first element using extract/fadd/insert
> pattern). An important use-case for these reductions is to work inside
> vectorized loops, where chaining happens through the reduction value's
> PHI node (i.e. the scalar reduction value from one iteration will be
> the input to the next iteration). This intrinsic would also naturally
> match reduction instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument
> that code creating or operating on these intrinsics can treat ordered
> and unordered reductions the same (in that they have the same
> arguments). Fast-math flags determine whether or not the intrinsic
> needs to be evaluated in strict order. Codegen for non-strict
> reductions should be able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the
> argument that making the reduction-order explicit (as opposed to
> deducing this from fast-math flags) would ensure the ordering is
> always as expected, even when FMF on the call sites are dropped for
> some reason. 
>
>
> Is it correct that I sensed a slight preference for Option A?
> i.e. Renaming the intrinsics and keeping the same signature, but
> dropping the special-cased behaviour for the identity-value with
> non-strict reductions. For David's argument, I think that although the
> extra expressiveness would be nice to have, LLVM normally depends on
> the FMF being propagated correctly to produce faster code so this
> should also be sufficient for reductions.
>
> If we go for Option A, I suggest we drop the 'experimental' prefix
> from experimental.vector.reduce.fadd/fmul to avoid having to add an
> awkward '.v2' suffix to the new intrinsic. When we implement all the
> suggestions from this proposal (possibly including the one mentioned
> below), I wouldn't really know what other features we could add other
> than predication (which would be covered by the LLVM-VP proposal and
> thus require another renaming), or possibly adding 'constrained'
> variants which I assume would have separate intrinsics. So we might as
> well drop the 'experimental' prefix.
>
> Finally, do we want to remove the restriction that the result type
> must always match the vector-element type? A wider result type would
> then allow the reduction to be performed in the wider type.
>
> Thanks,
>
> Sander
>
>> On 10 Apr 2019, at 18:56, Amara Emerson <aemerson at apple.com> wrote:
>> 
>> I’m fine with the direction this is going, but let’s keep renaming
>> to a minimum. They’ve been experimental long enough now that we
>> should be able to now jump to a final form after all the feedback.
>> 
>> Amara
>> 
>>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> 
>>>> 
>>>> On 8 Apr 2019, at 11:37, Simon Moll <moll at cs.uni-saarland.de> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>>> Proposed change:
>>>>>>> ----------------------------
>>>>>>> In this RFC I propose changing the intrinsics for
>>>>>>> llvm.experimental.vector.reduce.fadd and
>>>>>>> llvm.experimental.vector.reduce.fmul (see options A and B). I
>>>>>>> also propose renaming the 'accumulator' operand to 'start
>>>>>>> value' because for fmul this is the start value of the
>>>>>>> reduction, rather than a value to which the fmul reduction is
>>>>>>> accumulated into.
>>>> Note that the LLVM-VP proposal also changes the way reductions are
>>>> handled in IR (https://reviews.llvm.org/D57504). This could be an
>>>> opportunity to avoid the "v2" suffix issue: LLVM-VP moves the
>>>> intrinsic to the "llvm.vp.*" namespace and we can fix the
>>>> reduction semantics in the process.
>>> Thanks for pointing out Simon. I think for now we should keep this
>>> proposal separate from LLVM-VP as they serve different purposes and
>>> have different scope. But yes we can easily rename the intrinsics
>>> again when the VP proposal lands.
>>> 
>>>> 
>>>> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>>>> 
>>>>>>> 
>>>>>>> [Option A] Always using the start value operand in the
>>>>>>> reduction (https://reviews.llvm.org/D60261)
>>>>>>> 
>>>>>>>  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>> 
>>>>>>> This means that if the start value is 'undef', the result will
>>>>>>> be undef and all code creating such a reduction will need to
>>>>>>> ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0
>>>>>>> for fmul). When using 'fast' or ‘reassoc’ on the call it will
>>>>>>> be implemented using an unordered reduction, otherwise it will
>>>>>>> be implemented with an ordered reduction. Note that a new
>>>>>>> intrinsic is required to capture the new semantics. In this
>>>>>>> proposal the intrinsic is prefixed with a 'v2' for the time
>>>>>>> being, with the expectation this will be dropped when we remove
>>>>>>> 'experimental' from the reduction intrinsics in the future.
>>>>>>> 
>>>>>>> [Option B] Having separate ordered and unordered intrinsics
>>>>>>> (https://reviews.llvm.org/D60262).
>>>>>>> 
>>>>>>>  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>> 
>>>>>>> This will mean that the behaviour is explicit from the
>>>>>>> intrinsic and the use of 'fast' or ‘reassoc’ on the call has no
>>>>>>> effect on how that intrinsic is lowered. The ordered reduction
>>>>>>> intrinsic will take a scalar start-value operand, where the
>>>>>>> unordered reduction intrinsic will only take a vector operand.
>>>>>>> 
>>>>>>> Both options auto-upgrade the IR to use the new (version of
>>>>>>> the) intrinsics. I'm personally slightly in favour of [Option
>>>>>>> B], because it better aligns with the definition of the
>>>>>>> SelectionDAG nodes and is more explicit in its semantics. We
>>>>>>> also avoid having to use an artificial 'v2' like prefix to
>>>>>>> denote the new behaviour of the intrinsic.
>>>>>> Do we have any targets with instructions that can actually use
>>>>>> the start value? TBH I'd be tempted to suggest we just make the
>>>>>> initial extractelement/fadd/insertelement pattern a manual extra
>>>>>> stage and avoid having that argument entirely.
>>>>>> 
>>>> NEC SX-Aurora has reduction instructions that take in a start
>>>> value in a scalar register. We are hoping to upstream the backend:
>>>> http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
>>> Great, I think combined with the argument for chaining of ordered
>>> reductions (often inside vectorized loops) and two architectures
>>> (ARM SVE and SX-Aurora) taking a scalar start register, this is
>>> enough of an argument to keep the explicit operand for the ordered
>>> reductions.
>>> 
>>>>>> 
>>>>>>> Further efforts:
>>>>>>> ----------------------------
>>>>>>> Here a non-exhaustive list of items I think work towards making
>>>>>>> the intrinsics non-experimental:
>>>>>>> 
>>>>>>> 	• Adding SelectionDAG legalization for the _STRICT reduction
>>>>>>> SDNodes. After some great work from Nikita in D58015, unordered
>>>>>>> reductions are now legalized/expanded in SelectionDAG, so if we
>>>>>>> add expansion in SelectionDAG for strict reductions this would
>>>>>>> make the ExpandReductionsPass redundant.
>>>>>>> 	• Better enforcing the constraints of the intrinsics (see
>>>>>>> https://reviews.llvm.org/D60260 ).
>>>>>>> 
>>>>>>> 	• I think we'll also want to be able to overload the result
>>>>>>> operand based on the vector element type for the intrinsics
>>>>>>> having the constraint that the result type must match the
>>>>>>> vector element type. e.g. dropping the redundant 'i32' in:
>>>>>>>  i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a)
>>>>>>>  => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>>>> since i32 is implied by <4 x i32>. This would have the added
>>>>>>> benefit that LLVM would automatically check for the operands to
>>>>>>> match.
>>>>>>> 
>>>>>> Won't this cause issues with overflow? Isn't the point of an add
>>>>>> (or mul....) reduction of say, <64 x i8> giving a larger (i32 or
>>>>>> i64) result so we don't lose anything? I agree for bitop
>>>>>> reductions it doesn't make sense though.
>>>>>> 
>>>>> Sorry - I forgot to add: which asks the question - should we be
>>>>> considering signed/unsigned add/mul and possibly saturation
>>>>> reductions?
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> 
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> -- 
>>>> 
>>>> Simon Moll
>>>> Researcher / PhD Student
>>>> 
>>>> Compiler Design Lab (Prof. Hack)
>>>> Saarland University, Computer Science
>>>> Building E1.3, Room 4.31
>>>> 
>>>> Tel. +49 (0)681 302-57521 : 
>>>> moll at cs.uni-saarland.de
>>>> 
>>>> Fax. +49 (0)681 302-3065  : 
>>>> http://compilers.cs.uni-saarland.de/people/moll
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Sander De Smalen via llvm-dev

2019-May-17 15:36 UTC

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hi David,

Thanks for bringing that up! Integer ones were also suggested earlier in this
thread; I just thought it would be good to clear up the changes to the FP
intrinsics first.

Yes, similar to the suggestion to allow a wider result type for the fadd/fmul
reductions, we could define explicit intrinsics to perform the (integer
promotion and) reduction in a wider type. In the IR, this would mean adding
support for explicit sign- and zero-extending variants:

  i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %input)
      ; unchanged - reduce elements into i32 result

  i64 @llvm.experimental.vector.reduce.sadd.i64.v4i32(<4 x i32> %input)
      ; new intrinsic - sign extend input elements and reduce into i64 result

  i64 @llvm.experimental.vector.reduce.uadd.i64.v4i32(<4 x i32> %input)
      ; new intrinsic - zero extend input elements and reduce into i64 result

Both MVE and SVE have instructions that match this behaviour. By having these
explicit intrinsics, matching these would become more straightforward and we
would avoid the possibility of the zero/sign-extension being obfuscated by the
time it gets to CodeGen. LLVM can canonicalise to use the regular
'vector.reduce.add' intrinsic for sadd/uadd cases where the result and element
types are the same.
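
To make the intended semantics concrete, here is a minimal Python sketch, for
illustration only. It models each i32 lane as a 32-bit pattern. The function
names (reduce_add_i32, reduce_sadd_i64, reduce_uadd_i64) are hypothetical
stand-ins for the intrinsics above, not LLVM APIs, and the sadd/uadd variants
are only a proposal at this point.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sext32(bits):
    """Read a 32-bit lane pattern as a signed value (sign extension)."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def zext32(bits):
    """Read a 32-bit lane pattern as an unsigned value (zero extension)."""
    return bits & MASK32


def reduce_add_i32(lanes):
    """Models the unchanged reduce.add: the sum wraps modulo 2**32."""
    acc = 0
    for lane in lanes:
        acc = (acc + lane) & MASK32
    return acc


def reduce_sadd_i64(lanes):
    """Models the proposed sadd: sign-extend each lane, then sum in i64."""
    acc = 0
    for lane in lanes:
        acc = (acc + sext32(lane)) & MASK64
    return acc


def reduce_uadd_i64(lanes):
    """Models the proposed uadd: zero-extend each lane, then sum in i64."""
    acc = 0
    for lane in lanes:
        acc = (acc + zext32(lane)) & MASK64
    return acc


# Three lanes with all bits set (-1 signed, 2**32 - 1 unsigned) and one lane of 1.
lanes = [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000001]
print(hex(reduce_add_i32(lanes)))   # 0xfffffffe
print(hex(reduce_sadd_i64(lanes)))  # 0xfffffffffffffffe
print(hex(reduce_uadd_i64(lanes)))  # 0x2fffffffe
```

In this model the low 32 bits of both widened results equal the plain i32
result, which is why sadd/uadd with matching result and element types can
fall back to the regular add reduction.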

Thanks,

Sander
> On 16 May 2019, at 17:36, David Green <David.Green at arm.com> wrote:
> 
> Hello.
> 
> Thanks for working on this.
> 
> Are we talking only about the floating point versions of these, or the
> integer ones as well?
> 
> For the integer ones, there are a number of new MVE instructions that
> reduce from, for example i16's into an i32. Or long versions that accumulate
> into an i64.
> 
> For example the VADDVA.U16 Rda, Qm instruction will accumulate into a 32bit
> register.
> The VADDLVA.U16  RdaLo, RdaHi, Qm instruction will accumulate into a pair
> of 32bit registers (so a 64bit value).
> 
> Thanks,
> Dave
> 
> P.S. There are two different people on this thread that are named
> "David Greene"/"David Green". Sorry in advance for the confusion.
> 
> 
> 
