thr3ads.net - llvm dev - [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics [Apr 2019]

If this information is useful, please help other people find it:
Share via:

Sander De Smalen via llvm-dev

2019-Apr-04 13:11 UTC

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hi all,

While working on a patch to improve codegen for fadd reductions on AArch64, I
stumbled upon an outstanding issue with the experimental.vector.reduce.fadd/fmul
intrinsics where the accumulator argument is ignored when fast-math is set on
the intrinsic call. This behaviour is described in the LangRef
(https://www.llvm.org/docs/LangRef.html#id1905) and is mentioned in
https://bugs.llvm.org/show_bug.cgi?id=36734 and further discussed in D45336 and
D59356.

This means that for example:
    %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float
undef, <4 x float> %v)

does not result in %res being 'undef', but rather a reduction of <4 x
float> %v. The definition of these intrinsics are different from their
corresponding SelectionDAG nodes which explicitly split out a non-strict
VECREDUCE_FADD that explicitly does not take a start-value operand, and a
VECREDUCE_STRICT_FADD which does.

With the vector reduction intrinsics still experimental, I would like to propose
to change this behaviour. I would also like to take this opportunity to ask what
other changes are required, so that we can make an effort towards dropping the
'experimental' prefix at some point.


Proposed change:
----------------------------
In this RFC I propose changing the intrinsics for
llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul
(see options A and B). I also propose renaming the 'accumulator' operand
to 'start value' because for fmul this is the start value of the
reduction, rather than a value to which the fmul reduction is accumulated into.

[Option A] Always using the start value operand in the reduction
(https://reviews.llvm.org/D60261)

  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float
%start_value, <4 x float> %vec)

This means that if the start value is 'undef', the result will be undef
and all code creating such a reduction will need to ensure it has a sensible
start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or
‘reassoc’ on the call it will be implemented using an unordered reduction,
otherwise it will be implemented with an ordered reduction. Note that a new
intrinsic is required to capture the new semantics. In this proposal the
intrinsic is prefixed with a 'v2' for the time being, with the
expectation this will be dropped when we remove 'experimental' from the
reduction intrinsics in the future.

[Option B] Having separate ordered and unordered intrinsics
(https://reviews.llvm.org/D60262).

  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float
%start_value, <4 x float> %vec)
  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4
x float> %vec)

This will mean that the behaviour is explicit from the intrinsic and the use of
'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is
lowered. The ordered reduction intrinsic will take a scalar start-value operand,
where the unordered reduction intrinsic will only take a vector operand.

Both options auto-upgrade the IR to use the new (version of the) intrinsics.
I'm personally slightly in favour of [Option B], because it better aligns
with the definition of the SelectionDAG nodes and is more explicit in its
semantics. We also avoid having to use an artificial 'v2' like prefix to
denote the new behaviour of the intrinsic.


Further efforts:
----------------------------
Here a non-exhaustive list of items I think work towards making the intrinsics
non-experimental: 

  *   Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After
some great work from Nikita in D58015, unordered reductions are now
legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for
strict reductions this would make the ExpandReductionsPass redundant.
  *   Better enforcing the constraints of the intrinsics (see
https://reviews.llvm.org/D60260 ). 
  *   I think we'll also want to be able to overload the result operand
based on the vector element type for the intrinsics having the constraint that
the result type must match the vector element type. e.g. dropping the redundant
'i32' in:
  i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) =>
i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)

since i32 is implied by <4 x i32>. This would have the added benefit that
LLVM would automatically check for the operands to match. 

  *   Dropping the 'experimental' from the reduction intrinsics and
auto-upgrading to the new intrinsics once there is agreement on stability of the
definitions.

Please let me know if anything obvious is missing here.

Thanks,

Sander
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190404/e5ebfc18/attachment.html>

David Greene via llvm-dev

2019-Apr-04 15:44 UTC

head link

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Sander De Smalen via llvm-dev <llvm-dev at lists.llvm.org> writes:
> This means that for example:
>
>     %res = call fast float
@llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float>
%v)
>
>  
>
> does not result in %res being 'undef', but rather a reduction of
<4 x
> float> %v. The definition of these intrinsics are different from their
> corresponding SelectionDAG nodes which explicitly split out a
> non-strict VECREDUCE_FADD that explicitly does not take a start-value
> operand, and a VECREDUCE_STRICT_FADD which does.
This seems very strange to me.  What was the rationale for ignoring the
first argument?  What was the rationale for the first argument existing
at all?  Because that's how SVE reductions work?  The asymmetry with
llvm.experimental.vector.reduce.add is odd.>
> [Option B] Having separate ordered and unordered intrinsics
(https://reviews.llvm.org/D60262).
>
>  
>
>   declare float
@llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value,
<4 x float> %vec)
>
>   declare float
@llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float>
%vec)
>
>  
>
> This will mean that the behaviour is explicit from the intrinsic and
> the use of 'fast' or ‘reassoc’ on the call has no effect on how
that
> intrinsic is lowered. The ordered reduction intrinsic will take a
> scalar start-value operand, where the unordered reduction intrinsic
> will only take a vector operand.
This seems by far the better solution.  I'd much rather have things be
explicit in the IR than implicit via flags that might accidentally get
dropped.

Again, the asymmetry between these (one with a start value and one
without) seems strange and arbitrary.  Why do we need start values at
all?  Is it really difficult for isel to match s + vector.reduce(v)?

                       -David

Sander De Smalen via llvm-dev

2019-Apr-04 16:07 UTC

head link

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hi David,

The reason for the asymmetry and requiring an explicit start-value operand is to
be able to describe strict reductions that need to preserve the same
associativity of a scalarized reduction.

For example:
  %res = call float @llvm.experimental.vector.reduce.fadd(float %start, <4 x
float> <float %elt0, float %elt1, float %elt2, float %elt3>)

describes the following reduction:
  %res = (((%start + %elt0) + %elt1) + %elt2) + %elt3

Where:
  %tmp = call float @llvm.experimental.vector.reduce.fadd(<4 x float>
<float %elt0, float %elt1, float %elt2, float %elt3>)
  %res = add float %start, %tmp

Describes:
  %res = %start + (((%elt0 + %elt1) + %elt2) + %elt3)

Which is not the same, hence why the start operand is needed in the intrinsic
itself. For fast-math (specifically the 'reassoc' property) the compiler
is free to reassociate the expression, so the start/accumulator operand
isn't needed.

Cheers,

Sander


On 04/04/2019, 16:44, "David Greene" <dag at cray.com> wrote:

    Sander De Smalen via llvm-dev <llvm-dev at lists.llvm.org> writes:
    
    > This means that for example:
    >
    >     %res = call fast float
@llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float>
%v)
    >
    >  
    >
    > does not result in %res being 'undef', but rather a reduction
of <4 x
    > float> %v. The definition of these intrinsics are different from
their
    > corresponding SelectionDAG nodes which explicitly split out a
    > non-strict VECREDUCE_FADD that explicitly does not take a start-value
    > operand, and a VECREDUCE_STRICT_FADD which does.
    
    This seems very strange to me.  What was the rationale for ignoring the
    first argument?  What was the rationale for the first argument existing
    at all?  Because that's how SVE reductions work?  The asymmetry with
    llvm.experimental.vector.reduce.add is odd.
    >
    > [Option B] Having separate ordered and unordered intrinsics
(https://reviews.llvm.org/D60262).
    >
    >  
    >
    >   declare float
@llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value,
<4 x float> %vec)
    >
    >   declare float
@llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float>
%vec)
    >
    >  
    >
    > This will mean that the behaviour is explicit from the intrinsic and
    > the use of 'fast' or ‘reassoc’ on the call has no effect on how
that
    > intrinsic is lowered. The ordered reduction intrinsic will take a
    > scalar start-value operand, where the unordered reduction intrinsic
    > will only take a vector operand.
    
    This seems by far the better solution.  I'd much rather have things be
    explicit in the IR than implicit via flags that might accidentally get
    dropped.
    
    Again, the asymmetry between these (one with a start value and one
    without) seems strange and arbitrary.  Why do we need start values at
    all?  Is it really difficult for isel to match s + vector.reduce(v)?
    
                           -David

Robin Kruppe via llvm-dev

2019-Apr-04 21:26 UTC

head link

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hi Sander,

thank you for pushing these intrinsics forward. I am not familiar
enough with the details to say whether anything else might be missing
for them to become less experimental, but the changes you propose all
look like clear-cut improvements to me.

On Thu, 4 Apr 2019 at 15:11, Sander De Smalen via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
> Hi all,
>
>
>
> While working on a patch to improve codegen for fadd reductions on AArch64,
I stumbled upon an outstanding issue with the
experimental.vector.reduce.fadd/fmul intrinsics where the accumulator argument
is ignored when fast-math is set on the intrinsic call. This behaviour is
described in the LangRef (https://www.llvm.org/docs/LangRef.html#id1905) and is
mentioned in https://bugs.llvm.org/show_bug.cgi?id=36734 and further discussed
in D45336 and D59356.
>
>
>
> This means that for example:
>
>     %res = call fast float
@llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float>
%v)
>
>
>
> does not result in %res being 'undef', but rather a reduction of
<4 x float> %v. The definition of these intrinsics are different from
their corresponding SelectionDAG nodes which explicitly split out a non-strict
VECREDUCE_FADD that explicitly does not take a start-value operand, and a
VECREDUCE_STRICT_FADD which does.
>
>
>
> With the vector reduction intrinsics still experimental, I would like to
propose to change this behaviour. I would also like to take this opportunity to
ask what other changes are required, so that we can make an effort towards
dropping the 'experimental' prefix at some point.
Big +1 for removing this pitfall. It's a very counterintuitive design
IMO and is partially responsible for some unnecessarily complex code
in the Rust compiler, so I'd be happy to see it go.
>
>
>
> Proposed change:
>
> ----------------------------
>
> In this RFC I propose changing the intrinsics for
llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul
(see options A and B). I also propose renaming the 'accumulator' operand
to 'start value' because for fmul this is the start value of the
reduction, rather than a value to which the fmul reduction is accumulated into.
>
>
>
> [Option A] Always using the start value operand in the reduction
(https://reviews.llvm.org/D60261)
>
>
>
>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float
%start_value, <4 x float> %vec)
>
>
>
> This means that if the start value is 'undef', the result will be
undef and all code creating such a reduction will need to ensure it has a
sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using
'fast' or ‘reassoc’ on the call it will be implemented using an
unordered reduction, otherwise it will be implemented with an ordered reduction.
Note that a new intrinsic is required to capture the new semantics. In this
proposal the intrinsic is prefixed with a 'v2' for the time being, with
the expectation this will be dropped when we remove 'experimental' from
the reduction intrinsics in the future.
I like that this option will allow some code (e.g., frontends,
constant folding, certain instcombines) to treat unordered reductions
as essentially the same as ordered ones without having to go out its
way to cover two different intrinsics with slightly different argument
lists.

Cheers,
Robin

Simon Pilgrim via llvm-dev

2019-Apr-05 08:37 UTC

head link

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

On 04/04/2019 14:11, Sander De Smalen wrote:> Proposed change:
>
> ----------------------------
>
> In this RFC I propose changing the intrinsics for 
> llvm.experimental.vector.reduce.fadd and 
> llvm.experimental.vector.reduce.fmul (see options A and B). I also 
> propose renaming the 'accumulator' operand to 'start value'
because
> for fmul this is the start value of the reduction, rather than a value 
> to which the fmul reduction is accumulated into.
>
> [Option A] Always using the start value operand in the reduction 
> (https://reviews.llvm.org/D60261)
>
>   declare float 
> @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, 
> <4 x float> %vec)
>
> This means that if the start value is 'undef', the result will be 
> undef and all code creating such a reduction will need to ensure it 
> has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When 
> using 'fast' or ‘reassoc’ on the call it will be implemented using
an
> unordered reduction, otherwise it will be implemented with an ordered 
> reduction. Note that a new intrinsic is required to capture the new 
> semantics. In this proposal the intrinsic is prefixed with a 'v2'
for
> the time being, with the expectation this will be dropped when we 
> remove 'experimental' from the reduction intrinsics in the future.
>
> [Option B] Having separate ordered and unordered intrinsics 
> (https://reviews.llvm.org/D60262).
>
>   declare float 
> @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float 
> %start_value, <4 x float> %vec)
>
>   declare float 
> @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float>
> %vec)
>
> This will mean that the behaviour is explicit from the intrinsic and 
> the use of 'fast' or ‘reassoc’ on the call has no effect on how
that
> intrinsic is lowered. The ordered reduction intrinsic will take a 
> scalar start-value operand, where the unordered reduction intrinsic 
> will only take a vector operand.
>
> Both options auto-upgrade the IR to use the new (version of the) 
> intrinsics. I'm personally slightly in favour of [Option B], because 
> it better aligns with the definition of the SelectionDAG nodes and is 
> more explicit in its semantics. We also avoid having to use an 
> artificial 'v2' like prefix to denote the new behaviour of the
intrinsic.
>Do we have any targets with instructions that can actually use the start 
value? TBH I'd be tempted to suggest we just make the initial 
extractelement/fadd/insertelement pattern a manual extra stage and avoid 
having having that argument entirely.
> Further efforts:
>
> ----------------------------
>
> Here a non-exhaustive list of items I think work towards making the 
> intrinsics non-experimental: 
>
>   * Adding SelectionDAG legalization for the  _STRICT reduction
>     SDNodes. After some great work from Nikita in D58015, unordered
>     reductions are now legalized/expanded in SelectionDAG, so if we
>     add expansion in SelectionDAG for strict reductions this would
>     make the ExpandReductionsPass redundant.
>   * Better enforcing the constraints of the intrinsics (see
>     https://reviews.llvm.org/D60260 ). 
>   * I think we'll also want to be able to overload the result operand
>     based on the vector element type for the intrinsics having the
>     constraint that the result type must match the vector element
>     type. e.g. dropping the redundant 'i32' in:
>     i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a)
>     => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32>
%a)
>
> since i32 is implied by <4 x i32>. This would have the added benefit 
> that LLVM would automatically check for the operands to match. 
>Won't this cause issues with overflow? Isn't the point  of an add (or 
mul....) reduction of say, <64 x i8> giving a larger (i32 or i64) result 
so we don't lose anything? I agree for bitop reductions it doesn't make 
sense though.

Simon.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190405/fb6910a4/attachment.html>

Simon Pilgrim via llvm-dev

2019-Apr-05 08:47 UTC

head link

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:> On 04/04/2019 14:11, Sander De Smalen wrote:
>> Proposed change:
>>
>> ----------------------------
>>
>> In this RFC I propose changing the intrinsics for 
>> llvm.experimental.vector.reduce.fadd and 
>> llvm.experimental.vector.reduce.fmul (see options A and B). I also 
>> propose renaming the 'accumulator' operand to 'start
value' because
>> for fmul this is the start value of the reduction, rather than a 
>> value to which the fmul reduction is accumulated into.
>>
>> [Option A] Always using the start value operand in the reduction 
>> (https://reviews.llvm.org/D60261)
>>
>>   declare float 
>> @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 
>> %start_value, <4 x float> %vec)
>>
>> This means that if the start value is 'undef', the result will
be
>> undef and all code creating such a reduction will need to ensure it 
>> has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When 
>> using 'fast' or ‘reassoc’ on the call it will be implemented
using an
>> unordered reduction, otherwise it will be implemented with an ordered 
>> reduction. Note that a new intrinsic is required to capture the new 
>> semantics. In this proposal the intrinsic is prefixed with a
'v2' for
>> the time being, with the expectation this will be dropped when we 
>> remove 'experimental' from the reduction intrinsics in the
future.
>>
>> [Option B] Having separate ordered and unordered intrinsics 
>> (https://reviews.llvm.org/D60262).
>>
>>   declare float 
>> @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float 
>> %start_value, <4 x float> %vec)
>>
>>   declare float 
>> @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x
float>
>> %vec)
>>
>> This will mean that the behaviour is explicit from the intrinsic and 
>> the use of 'fast' or ‘reassoc’ on the call has no effect on how
that
>> intrinsic is lowered. The ordered reduction intrinsic will take a 
>> scalar start-value operand, where the unordered reduction intrinsic 
>> will only take a vector operand.
>>
>> Both options auto-upgrade the IR to use the new (version of the) 
>> intrinsics. I'm personally slightly in favour of [Option B],
because
>> it better aligns with the definition of the SelectionDAG nodes and is 
>> more explicit in its semantics. We also avoid having to use an 
>> artificial 'v2' like prefix to denote the new behaviour of the
intrinsic.
>>
> Do we have any targets with instructions that can actually use the 
> start value? TBH I'd be tempted to suggest we just make the initial 
> extractelement/fadd/insertelement pattern a manual extra stage and 
> avoid having having that argument entirely.
>
>> Further efforts:
>>
>> ----------------------------
>>
>> Here a non-exhaustive list of items I think work towards making the 
>> intrinsics non-experimental: 
>>
>>   * Adding SelectionDAG legalization for the  _STRICT reduction
>>     SDNodes. After some great work from Nikita in D58015, unordered
>>     reductions are now legalized/expanded in SelectionDAG, so if we
>>     add expansion in SelectionDAG for strict reductions this would
>>     make the ExpandReductionsPass redundant.
>>   * Better enforcing the constraints of the intrinsics (see
>>     https://reviews.llvm.org/D60260 ). 
>>   * I think we'll also want to be able to overload the result
operand
>>     based on the vector element type for the intrinsics having the
>>     constraint that the result type must match the vector element
>>     type. e.g. dropping the redundant 'i32' in:
>>     i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32>
%a)
>>     => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x
i32> %a)
>>
>> since i32 is implied by <4 x i32>. This would have the added
benefit
>> that LLVM would automatically check for the operands to match. 
>>
> Won't this cause issues with overflow? Isn't the point  of an add
(or
> mul....) reduction of say, <64 x i8> giving a larger (i32 or i64) 
> result so we don't lose anything? I agree for bitop reductions it 
> doesn't make sense though.
>Sorry - I forgot to add: which asks the question - should we be 
considering signed/unsigned add/mul and possibly saturation reductions?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190405/cb9e00ac/attachment.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Apr 2019 - [RFC] Changes to llvm.experimental.vector.reduce intrinsics

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Possibly Parallel Threads