Hi all,

During the Nov 2016 dev meeting, we had a hackers' lab session where we discussed some issues around loop idiom recognition, IR representation, and cost modelling. I took an action to write up an RFC about introducing reduction intrinsics to LLVM to represent horizontal operations across vectors.

Vector reductions have been discussed before, notably here:

  http://lists.llvm.org/pipermail/llvm-dev/2015-November/092379.html

That discussion was in the context of recognizing patterns for more complex cases like SAD and dot product, and centered around adding a new vector reduce instruction or annotating the reduction PHI to allow pattern recognition later at codegen. The motivations I'm giving here are different.

Current status
=============

Today, scalar loop reductions are recognized in IR through the LoopUtils isReductionPHI utility, which creates a recurrence descriptor after doing some analysis. The descriptor tells us what kind of reduction it is (e.g. integer add, integer or, fp add, fp mul, min/max, etc.). The vectorizer generates vector versions of these as a final reducing step after the vector loop body, doing a tree reduction across all the lanes; a sketch of this pattern is shown at the end of the next section. This approach has multiple disadvantages: the pattern is complex and difficult for other passes in LLVM and for front-ends to generate, and the shuffle-based reduction then has to be matched again in the backend.

For ARM's Scalable Vector Extension (SVE), there are more fundamental reasons that require us to move away from this approach:

1) As our vector lengths are unknown at compile time (and also not necessarily a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize these loops. SVE also provides strictly-ordered FP reductions, which cannot be expressed as a tree reduction.

2) Reduction intrinsics let us implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false}, holds. Without any ability to express reductions as intrinsics, we would need a separate instruction, for the same reasons as described for regular reductions.

3) Other, non-vectorizer, users of LLVM may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that it is matched to efficient instructions later.

Proposal
=======

We propose to introduce the following reduction intrinsics as a starting point:

  int_vector_reduce_add(vector_src)
  int_vector_reduce_fadd(vector_src)
  int_vector_reduce_and(vector_src)
  int_vector_reduce_or(vector_src)
  int_vector_reduce_eor(vector_src)
  int_vector_reduce_mul(vector_src)

These intrinsics do not do any type promotion of the scalar result. Architectures like SVE, which can do type promotion and reduction in a single instruction, can pattern match the promote->reduce sequence.

For min/max recurrence types, we have:

  int_vector_reduce_smax(vector_src)
  int_vector_reduce_smin(vector_src)
  int_vector_reduce_umax(vector_src)
  int_vector_reduce_umin(vector_src)
  int_vector_reduce_fmin(vector_src, i32 NoNaNs)
  int_vector_reduce_fmax(vector_src, i32 NoNaNs)

These reduction operations can be mapped from the recurrence kinds defined in LoopUtils; however, any front-end or pass in LLVM may use them.
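For illustration, here is an i32 add reduction over a <4 x i32> vector in both forms. The first is the log2 shuffle tree the vectorizer emits today; the second uses the proposed intrinsic, written here in IR assembly naming rather than tablegen syntax, with the type-overload mangling omitted for brevity:

  ; Current form: a log2 tree of shuffles and adds.
  %rdx.shuf = shufflevector <4 x i32> %vec, <4 x i32> undef,
                            <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = add <4 x i32> %vec, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef,
                             <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
  %sum = extractelement <4 x i32> %bin.rdx2, i32 0

  ; Proposed form: a single call, independent of the vector length.
  %red = call i32 @llvm.vector.reduce.add(<4 x i32> %vec)

Note that the log2 tree only works because the number of lanes is a known power of 2, which is exactly what SVE cannot guarantee.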
Predication
==========

We have multiple options for expressing vector predication in reductions:

1. Add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.

2. Mask out inactive lanes in the input vector by using a select between the input and a vector splat of the reduction's identity value, e.g. 0.0 for fp-add (see the sketch after the next section).

We believe option 2 will be sufficient for predication-capable architectures, while keeping the intrinsic definitions simple and minimal. If there are targets for which the identity value for a reduction is different, we could use an IR constant to express this in a generic way.

All the intrinsics above take a single vector operand and reduce it to a scalar of the same type as the vector element.

Ordered reductions
==================

SVE introduces the capability to execute reductions while maintaining sequential order; as a result, it can vectorize floating point addition reductions without requiring fast-math to be enabled. This uses a different approach from the fast reductions: instead of doing a partial reduction in the vector loop body and a final reduce-to-scalar in the loop exit block, we reduce to a scalar on every iteration of the vector loop. This requires an extra scalar accumulator operand to be passed into the reduction intrinsic. We propose a generic intrinsic to handle this as well:

  int_vector_reduce_ordered_fadd([float|double] %acc, <vec [float|double]> %input)

We can use the same select-on-input predication scheme with these as we do with the normal reductions, as sketched below.
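As a rough sketch (again in IR assembly naming with type mangling omitted, and assuming the input vector %vec and predicate %pred are computed earlier in the loop), a predicated, in-order fp-add reduction in a vector loop body would look something like:

  vector.body:
    %acc = phi float [ 0.0, %entry ], [ %acc.next, %vector.body ]
    ; ... compute the input vector %vec and the predicate %pred ...
    ; Option 2 predication: inactive lanes are replaced by the identity
    ; value for fp-add (0.0), so they cannot affect the result.
    %masked = select <4 x i1> %pred, <4 x float> %vec,
                     <4 x float> zeroinitializer
    ; The scalar accumulator is threaded through every vector iteration,
    ; preserving the sequential FP semantics.
    %acc.next = call float @llvm.vector.reduce.ordered.fadd(float %acc, <4 x float> %masked)
    ; ... loop latch ...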
Prototype patches
================

I've developed some initial patches as a proof of concept that should allow targets to opt in to using these new reductions with fairly minimal effort. The patches enable the use of the generic reduction intrinsics for the AArch64 target and, as far as I can tell, don't affect codegen in any way. The AArch64 patch cleans up a lot of the old code which was written specifically to match various forms of the log2 shuffle pattern, and as a result the tests are significantly reduced in size. The target-independent changes introduce a new transitional API, documented below, and move some duplicated code into VectorUtils (to generate shuffle patterns for targets which don't opt in). Some basic legalization is implemented, but only enough to support the cases required by AArch64; there may be some things I've missed that are exposed by other targets.

Transitional API
===============

To support both the old-style tree reduction patterns and the new intrinsics for clients, we can use a generalization of the TTI call we introduce in our SVE patches. We could provide something like:

  bool TargetTransformInfo::useReductionIntrinsic(unsigned Opcode,
                                                  Type *Ty, bool IsMaxOp,
                                                  bool IsSigned, bool NoNaN)

In the prototype patch, I've created two separate ways to generate reductions from an IR generator's perspective, depending on whether the client has a reduction descriptor (obtained from a LoopUtils analysis of the loop's PHIs). If we want to generate from a descriptor:

  Value *llvm::createTargetReduction(IRBuilder<> &Builder,
                                     const TargetTransformInfo *TTI,
                                     RecurrenceDescriptor &Desc, Value *Src,
                                     bool InOrder, bool NoNaN)

Or, if no descriptor is given, an alternative API call:

  Value *llvm::createTargetReduction(IRBuilder<> &Builder,
                                     const TargetTransformInfo *TTI,
                                     unsigned Opcode, Value *Src)

By default, these two interfaces would generate a tree reduction for targets which do not opt in to the new intrinsic representation.

What does everyone think of this? Do these new forms seem reasonable?

Amara

-------------- next part --------------
A non-text attachment was scrubbed...
Name: reductions-enable-aarch64.diff
Type: application/octet-stream
Size: 79807 bytes
Desc: reductions-enable-aarch64.diff
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/dff98280/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reductions-ir.diff
Type: application/octet-stream
Size: 33375 bytes
Desc: reductions-ir.diff
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/dff98280/attachment-0003.obj>
Hi Amara,

We also had some discussions on the SVE side of reductions on the main SVE thread, but this description is much more detailed than what we had before.

I don't want to discuss SVE specifically, as the spec is not out yet, but I think we can cover a lot of ground, get very close to SVE, and do the final step when we get there.

On 31 January 2017 at 17:27, Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 1) As our vector lengths are unknown at compile time (and also not necessarily a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize these loops. SVE also provides strictly-ordered FP reductions, which cannot be expressed as a tree reduction.

Not SVE specific, for example fast-math.

> 2) Reduction intrinsics let us implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false}, holds. Without any ability to express reductions as intrinsics, we would need a separate instruction, for the same reasons as described for regular reductions.

We should leave this for later, as this is very SVE-specific.

> 3) Other, non-vectorizer, users of LLVM may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that it is matched to efficient instructions later.

This is a minor point, since bloated IR is "efficient" if it collapses into a small number of instructions, for example dozens of shuffles into a ZIP.

The argument that the intrinsic is harder to destroy through optimisation passes is the same as in other cases of rich semantics vs. generic pattern matching, so orthogonal to this issue.

> We propose to introduce the following reduction intrinsics as a starting point:
>   int_vector_reduce_add(vector_src)

Is this a C intrinsic? Shouldn't an IR builtin be something like:

  @llvm.vector.reduce.add(...) ?

> These intrinsics do not do any type promotion of the scalar result. Architectures like SVE, which can do type promotion and reduction in a single instruction, can pattern match the promote->reduce sequence.

Yup.

> ...
>   int_vector_reduce_fmax(vector_src, i32 NoNaNs)

A large list, and it probably doesn't even cover all SVE can do, let alone other reductions. Why not simplify this into something like:

  %sum = fadd <N x float> %a, %b
  %red = call float @llvm.reduce(%sum, float %acc)

or

  %fast_red = call float @llvm.reduce(%sum)

For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

> We have multiple options for expressing vector predication in reductions:
> 1. Add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.
> 2. Mask out inactive lanes in the input vector by using a select between the input and a vector splat of the reduction's identity value, e.g. 0.0 for fp-add.
>
> We believe option 2 will be sufficient for predication-capable architectures, while keeping the intrinsic definitions simple and minimal. If there are targets for which the identity value for a reduction is different, we could use an IR constant to express this in a generic way.

I agree. I haven't followed Elena's work on the similar concept for Intel, but I vaguely remember we reached a similar conclusion for AVX512.

cheers,
--renato
+cc Simon, who's also interested in reductions for the any_true, all_true predicate vectors.

On 31 January 2017 at 20:19, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi Amara,
>
> We also had some discussions on the SVE side of reductions on the main SVE thread, but this description is much more detailed than what we had before.
>
> I don't want to discuss SVE specifically, as the spec is not out yet, but I think we can cover a lot of ground, get very close to SVE, and do the final step when we get there.

The goal of this proposal is to agree on a new representation, so we are looking at more than SVE and improving things for LLVM as a whole.

> On 31 January 2017 at 17:27, Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 1) As our vector lengths are unknown at compile time (and also not necessarily a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize these loops. SVE also provides strictly-ordered FP reductions, which cannot be expressed as a tree reduction.
>
> Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have ordered reductions; I can't comment on that aspect. However, fast-math vectorization is a case where you *don't* want ordered reductions, as the relaxed fp-contract rules mean that the conventional tree reduction algorithm preserves the required semantics.

>> 2) Reduction intrinsics let us implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false}, holds. Without any ability to express reductions as intrinsics, we would need a separate instruction, for the same reasons as described for regular reductions.
>
> We should leave this for later, as this is very SVE-specific.

As I stated earlier, this goes beyond SVE. I hope to achieve a consensus with other targets as well, even if some don't yet have efficient ways to handle it.

>> 3) Other, non-vectorizer, users of LLVM may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that it is matched to efficient instructions later.
>
> This is a minor point, since bloated IR is "efficient" if it collapses into a small number of instructions, for example dozens of shuffles into a ZIP.

The hassle of generating reductions may well be at most a minor motivator, but my point still stands. If a front-end wants the target to be able to generate the best code for a reduction idiom, it must generate a lot of IR for many-element vectors. You still have to pay the price in bloated IR; see the tests changed as part of the AArch64 NEON patch.

> The argument that the intrinsic is harder to destroy through optimisation passes is the same as in other cases of rich semantics vs. generic pattern matching, so orthogonal to this issue.

>> We propose to introduce the following reduction intrinsics as a starting point:
>>   int_vector_reduce_add(vector_src)
>
> Is this a C intrinsic? Shouldn't an IR builtin be something like:
>
>   @llvm.vector.reduce.add(...) ?

You're right, in IR assembly it would appear like that. The forms I proposed are the tablegen syntax of the intrinsic definitions, so in practice they would look like @llvm.vector.reduce.[operation] in IR asm.

>> These intrinsics do not do any type promotion of the scalar result. Architectures like SVE, which can do type promotion and reduction in a single instruction, can pattern match the promote->reduce sequence.
>
> Yup.
>
>> ...
>>   int_vector_reduce_fmax(vector_src, i32 NoNaNs)
>
> A large list, and it probably doesn't even cover all SVE can do, let alone other reductions. Why not simplify this into something like:
>
>   %sum = fadd <N x float> %a, %b
>   %red = call float @llvm.reduce(%sum, float %acc)
>
> or
>
>   %fast_red = call float @llvm.reduce(%sum)

Because the semantics of the operation would then depend not solely on the operand value types and the operation itself, but on the chain of computations forming the operands. If the input operand is a phi, you have to do potentially inter-block analysis. If it's a function parameter, or simply a load from memory, then you're pretty much stuck and can't resolve the semantics. During the dev meeting, a reductions proposal where the operation to be performed was a kind of opcode operand was discussed, and rejected by the community. I don't believe having many intrinsics would be a problem.
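For example (a hypothetical sketch of the generic form), if the reduction input is just a load, there is no feeding computation at all from which to recover the reduction kind:

  ; %v comes straight from memory: a generic @llvm.reduce has no way to
  ; tell whether this should be an add, a multiply, a min, a max, etc.
  %v = load <4 x float>, <4 x float>* %p
  %red = call float @llvm.reduce(<4 x float> %v)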
> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator instructions like add, sub, and mul: the vector versions of those are not horizontal operations; they produce vector results instead.
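To illustrate (the reduce call below again uses the proposed IR assembly naming, with the i32 NoNaNs flag from the proposal and type mangling omitted):

  ; Element-wise min: produces another vector, one result per lane.
  %m = call <4 x float> @llvm.minnum.v4f32(<4 x float> %a, <4 x float> %b)
  ; Horizontal min: collapses all lanes of a single vector to a scalar.
  %min = call float @llvm.vector.reduce.fmin(<4 x float> %v, i32 1)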