+cc Simon, who's also interested in reductions for the any_true, all_true predicate vectors.

On 31 January 2017 at 20:19, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi Amara,
>
> We also had some discussions on the SVE side of reductions on the main
> SVE thread, but this description is much more detailed than we had
> before.
>
> I don't want to discuss specifically about SVE, as the spec is not out
> yet, but I think we can cover a lot of ground until very close to SVE
> and do the final step when we get there.

The goal of this proposal is to agree on a new representation, so we are looking at more than SVE and improving things for LLVM as a whole.

> On 31 January 2017 at 17:27, Amara Emerson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> 1) As our vector lengths are unknown at compile time (and also not a
>> power of 2), we cannot generate the same reduction IR pattern as other
>> targets. We need a direct IR representation of reductions in order to
>> vectorize them. SVE also contains strictly ordered FP reductions,
>> which cannot be done using a tree reduction.
>
> Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have ordered reductions, I can't comment on that aspect. However, fast-math vectorization is a case where you *don't* want ordered reductions, as the relaxed fp-contract means that the conventional tree-reduction algorithm preserves the required semantics.

>> 2) That we can use reduction intrinsics to implement our proposed SVE
>> "test" instruction, intended to check whether a property of a
>> predicate, {first,last,any,all}x{true,false}, holds. Without any
>> ability to express reductions as intrinsics, we need a separate
>> instruction for the same reasons as described for regular reductions.
>
> We should leave this for later, as this is very SVE-specific.

As I stated earlier, this is going beyond SVE. I hope to achieve a consensus between other targets as well, even if some don't yet have efficient ways to handle it.

>> 3) Other, non-vectorizer, users of LLVM may want to generate vector
>> code themselves. Front-ends which want to do this currently must deal
>> with the difficulties of generating the tree-shuffle pattern to ensure
>> that they're matched to efficient instructions later.
>
> This is a minor point, since bloated IR is "efficient" if it collapses
> into a small number of instructions, for example dozens of shuffles
> into a ZIP.

The hassle of generating reductions may well be at most a minor motivator, but my point still stands. If a front-end wants the target to be able to generate the best code for a reduction idiom, it must generate a lot of IR for many-element vectors. You still have to pay the price in bloated IR; see the tests changed as part of the AArch64 NEON patch.

> The argument that the intrinsic is harder to destroy through
> optimisation passes is the same as other cases of stiff rich semantics
> vs. generic pattern matching, so orthogonal to this issue.
>
>> We propose to introduce the following reduction intrinsics as a
>> starting point:
>> int_vector_reduce_add(vector_src)
>
> Is this a C intrinsic? Shouldn't an IR builtin be something like:
>
> @llvm.vector.reduce.add(...) ?

You're right, in IR assembly it would appear like that. The ones I proposed were the tablegen syntax of the intrinsics definitions, so in practice they would look like @llvm.vector.reduce.[operation] in IR asm.

>> These intrinsics do not do any type promotion of the scalar result.
>> Architectures like SVE, which can do type promotion and reduction in a
>> single instruction, can pattern-match the promote->reduce sequence.
>
> Yup.
>
>> ...
>> int_vector_reduce_fmax(vector_src, i32 NoNaNs)
>
> A large list, and probably doesn't even cover all SVE can do, let
> alone other reductions.
>
> Why not simplify this into something like:
>
> %sum = fadd <N x float> %a, %b
> %red = @llvm.reduce(%sum, float %acc)
> or
> %fast_red = @llvm.reduce(%sum)

Because the semantics of an operation would not depend solely on the operand value types and the operation, but on a chain of computations forming the operands. If the input operand is a phi, you then have to do potentially inter-block analysis. If it's a function parameter or simply a load from memory, then you're pretty much stuck and you can't resolve the semantics. During the dev meeting, a reductions proposal where the operation to be performed was a kind of opcode was discussed, and rejected by the community. I don't believe having many intrinsics would be a problem.

> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator instructions like add, sub, mul. The vector versions of those are not horizontal operations; they instead produce vector results.

>> We have multiple options for expressing vector predication in reductions:
>> 1. The first is to simply add a predicate operand to the intrinsics,
>> and require that targets without predication explicitly pattern-match
>> for an all-true predicate in order to select hardware instructions.
>> 2. The second option is to mask out inactive lanes in the input vector
>> by using a select between the input and a vector splat of the
>> reduction's identity value, e.g. 0.0 for fp-add.
>>
>> We believe option 2 will be sufficient for predication-capable
>> architectures, while keeping the intrinsics definitions simple and
>> minimal. If there are targets for which the identity value for a
>> reduction is different, then we could use an IR constant to express
>> this in a generic way.
>
> I agree. I haven't followed Elena's work on the similar concept for
> Intel, but I vaguely remember we reached a similar conclusion for
> AVX512.
>
> cheers,
> --renato
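To make option 2 concrete, here is a minimal sketch of the select-based masking in IR, using the @llvm.vector.reduce.* naming discussed above (the exact name and signature are illustrative, not fixed by this proposal):

  ; Inactive lanes are replaced with the reduction's identity value
  ; (0.0 for fp-add, per the proposal), so they cannot affect the result.
  %masked = select <4 x i1> %mask, <4 x float> %v, <4 x float> zeroinitializer
  %sum = call float @llvm.vector.reduce.fadd(<4 x float> %masked)

A target without predication simply sees an ordinary unpredicated reduction, while a predicating target can fold the select into the mask operand of its native instruction.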
+Elena, as she has done a lot of work for AVX512, which has similar concepts.

On 31 January 2017 at 23:16, Amara Emerson <amara.emerson at gmail.com> wrote:
>> Not SVE specific, for example fast-math.
>
> Can you explain what you mean here? Other targets may well have
> ordered reductions, I can't comment on that aspect. However, fast-math
> vectorization is a case where you *don't* want ordered reductions, as
> the relaxed fp-contract means that the conventional tree-reduction
> algorithm preserves the required semantics.

That's my point. Fast-math can change the target's semantics regarding reductions, independently of scalable vectors, so it's worth discussing in a more general case, i.e. in this thread. Sorry for being terse.

> The hassle of generating reductions may well be at most a minor
> motivator, but my point still stands. If a front-end wants the target
> to be able to generate the best code for a reduction idiom, it must
> generate a lot of IR for many-element vectors. You still have to pay
> the price in bloated IR; see the tests changed as part of the AArch64
> NEON patch.

This is a completely orthogonal discussion, as I stated here:

>> The argument that the intrinsic is harder to destroy through
>> optimisation passes is the same as other cases of stiff rich
>> semantics vs. generic pattern matching, so orthogonal to this issue.

One that we have had multiple times, and the usual consensus is: if it can be represented in plain IR, it must. Adding multiple semantics for the same concept, especially stiff ones like builtins, adds complexity to the optimiser. Regardless of the merits in this case, builtins should only be introduced IFF there is no other way. So first we should discuss adding it to IR with generic concepts, just like we did for scatter/gather and strided access.

>> Why not simplify this into something like:
>>
>> %sum = fadd <N x float> %a, %b
>> %red = @llvm.reduce(%sum, float %acc)
>> or
>> %fast_red = @llvm.reduce(%sum)
>
> Because the semantics of an operation would not depend solely on the
> operand value types and the operation, but on a chain of computations
> forming the operands. If the input operand is a phi, you then have to
> do potentially inter-block analysis. If it's a function parameter or
> simply a load from memory, then you're pretty much stuck and you can't
> resolve the semantics.

I think you have just described the pattern-matching algorithm, meaning it's possible to write that in a sequence of IR instructions, thus using add+reduce should work. Same with pointer types and other reduction operations. If the argument comes from a function parameter that is a non-strict pointer to memory, then all bets are off anyway and the front-end wouldn't be able to generate anything more specific, unless you're using SIMD intrinsics, in which case this point is moot.

> During the dev meeting, a reductions proposal where the operation to
> be performed was a kind of opcode was discussed, and rejected by the
> community.

Well, that was certainly a smaller group than the list. Design decisions should not be taken off list, so we must have this discussion on the list again, I'm afraid.

> I don't believe having many intrinsics would be a problem.

This is against every decision I remember. Saying it out loud in a meeting is one thing; writing them down, implementing them, and having to bear the maintenance costs is another entirely. That's why the consensus has to happen on the list.

>> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?
>
> For the same reasons that we don't re-use the other binary operator
> instructions like add, sub, mul. The vector versions of those are not
> horizontal operations; they instead produce vector results.

Sorry, I meant min/max + reduce, just like above:

%sum = fadd <N x float> %a, %b
%min = @llvm.minnum(<N x float> %sum)
%red = @llvm.reduce(%min, float %acc)

cheers,
--renato
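For reference, the "plain IR" form both sides are referring to is the log2(N)-step shuffle tree that the vectorizer emits today. For a <4 x float> sum it looks roughly like this:

  ; Round 1: fold the high half onto the low half.
  %rdx1 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %sum1 = fadd <4 x float> %v, %rdx1
  ; Round 2: fold the remaining pair.
  %rdx2 = shufflevector <4 x float> %sum1, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %sum2 = fadd <4 x float> %sum1, %rdx2
  %res = extractelement <4 x float> %sum2, i32 0

The sequence grows with the element count, and for scalable vectors the element count is unknown at compile time, which is why SVE cannot emit this pattern at all.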
Demikhovsky, Elena via llvm-dev
2017-Feb-01 09:38 UTC
[llvm-dev] RFC: Generic IR reductions
> One that we have had multiple times, and the usual consensus is: if it
> can be represented in plain IR, it must. Adding multiple semantics for
> the same concept, especially stiff ones like builtins, adds complexity
> to the optimiser.
> Regardless of the merits in this case, builtins should only be
> introduced IFF there is no other way. So first we should discuss
> adding it to IR with generic concepts, just like we did for
> scatter/gather and strided access.

I suppose we can let the target "decide". The target may decide to use an intrinsic, since it will be converted later to one instruction, or leave plain IR, letting the optimizer work smoothly. Actually, we do the same for gather/scatter: the vectorizer does not build intrinsics if the target does not support them.

> adds complexity to the optimizer

The optimizer should not have to deal with this kind of intrinsics, at least. The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false", and in that case we'll stay with plain IR. But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

> if it can be represented in plain IR, it must

The IR is plain and the ISAs are complex, so we miss optimizations at the end. The IR should be rich in order to serve complex ISAs. As for the intrinsics set, we probably need "ordered" and "plain" modes for FP reductions. X86 does not have the "ordered" one today, but might be interested in a loop-tail operation in FP fast mode.

- Elena
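To make the "ordered"/"plain" distinction concrete: without fast-math, a source-level accumulation implies strict left-to-right association, which in IR is a serial chain rather than a tree. A sketch, assuming a fixed lane count of 4 for illustration:

  ; Strictly ordered fadd reduction: each fadd depends on the previous
  ; one, so reassociating into a tree would change the FP semantics.
  %e0 = extractelement <4 x float> %v, i32 0
  %a0 = fadd float %acc, %e0
  %e1 = extractelement <4 x float> %v, i32 1
  %a1 = fadd float %a0, %e1
  %e2 = extractelement <4 x float> %v, i32 2
  %a2 = fadd float %a1, %e2
  %e3 = extractelement <4 x float> %v, i32 3
  %a3 = fadd float %a2, %e3

An "ordered" intrinsic would carry exactly these semantics in a single call. A target with a strictly ordered reduction instruction (such as SVE's FADDA) can select it directly, while a target without one can answer "false" to the profitability query and keep the plain form.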
On 1 February 2017 at 08:27, Renato Golin <renato.golin at linaro.org> wrote:
> Sorry, I meant min/max + reduce, just like above.
>
> %sum = fadd <N x float> %a, %b
> %min = @llvm.minnum(<N x float> %sum)
> %red = @llvm.reduce(%min, float %acc)

No, this is wrong. I actually meant overriding the max/min intrinsics to take vectors instead of two scalar operands.

The semantics of those intrinsics is to take a number of parameters and return the max/min of their own type. A vector is just a different packing of parameters, all of which have the same type, so there's no semantic difference other than the number of arguments. However, when they're lowered, they'll end up as a short sequence of instructions (if supported), in the same way.

You'd only emit a max/min in vectorisation if the target supports it and the cost is low. For instance, in AArch64 it would only emit it if SVE support is enabled.

Makes sense?

cheers,
--renato
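In other words, the suggestion is a hypothetical vector overload of the existing two-operand intrinsic, along the lines of:

  ; Hypothetical overload: reduce a vector to its minimum element.
  ; The in-tree @llvm.minnum today takes two operands of the same type.
  %min = call float @llvm.minnum.v4f32(<4 x float> %v)

Whether such an overload or a separate reduction intrinsic is the cleaner design is exactly the point under discussion in this thread.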
+CC Chandler and Philip.

On 1 February 2017 at 08:27, Renato Golin <renato.golin at linaro.org> wrote:
> Well, that was certainly a smaller group than the list. Design
> decisions should not be taken off list, so we must have this
> discussion on the list again, I'm afraid.
>
>> I don't believe having many intrinsics would be a problem.
>
> This is against every decision I remember. Saying it out loud in a
> meeting is one thing; writing them down, implementing them, and having
> to bear the maintenance costs is another entirely.

That's fair. To bring some context: during Elena's BoF, Chandler and Philip made some arguments against the idea of general-purpose single intrinsics that take an opcode parameter, which went along the lines of: the design would lower the bar for introducing new idioms, to the extent that each one would not have to undergo as much scrutiny as separate additions would. I specifically recall someone, and I think it was Chandler but correct me if I'm wrong, saying that many intrinsics were not in themselves a problem. Their points were also broader than reductions, and covered the general approach of handling groups of idioms.

Chandler, Philip: I hope I haven't mischaracterized your thoughts.

Amara