Demikhovsky, Elena via llvm-dev
2017-Feb-01 09:38 UTC
[llvm-dev] RFC: Generic IR reductions
> One that we have had multiple times and the usual consensus is: if it can be represented in plain IR, it must. Adding multiple semantics for the same concept, especially stiff ones like builtins, adds complexity to the optimiser.
> Regardless of the merits in this case, builtins should only be introduced IFF there is no other way. So first we should discuss adding it to IR with generic concepts, just like we did for scatter/gather and strided access.

I suppose we can let the target "decide". A target may decide to use an intrinsic, since it will be converted later to a single instruction, or to leave plain IR, letting the optimizer work smoothly. Actually, we do the same for gather/scatter: the vectorizer does not build intrinsics if the target does not support them.

> adds complexity to the optimizer

The optimizer should not deal with intrinsics, at least not with this kind of intrinsic. The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false", and in that case we stay with plain IR. But SVE will answer "true", and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

> if it can be represented in plain IR, it must

The IR is plain and the ISAs are complex, so we miss optimizations in the end. The IR should be rich in order to serve complex ISAs. As far as the intrinsics set goes, we probably need an "ordered" and a "plain" mode for FP reductions. X86 does not have the "ordered" one today, but might be interested in it for a loop-tail operation in FP fast mode.
- Elena

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 10:27
To: Amara Emerson <amara.emerson at gmail.com>
Cc: Amara Emerson <Amara.Emerson at arm.com>; llvm-dev at lists.llvm.org; nd <nd at arm.com>; Simon Pilgrim <llvm-dev at redking.me.uk>; Demikhovsky, Elena <elena.demikhovsky at intel.com>
Subject: Re: [llvm-dev] RFC: Generic IR reductions

+Elena, as she has done a lot of work for AVX512, which has similar concepts.

On 31 January 2017 at 23:16, Amara Emerson <amara.emerson at gmail.com> wrote:
>> Not SVE specific, for example fast-math.
>
> Can you explain what you mean here? Other targets may well have
> ordered reductions, I can't comment on that aspect. However, fast-math
> vectorization is a case where you *don't* want ordered reductions, as
> the relaxed fp-contract means that the conventional tree reduction
> algorithm preserves the required semantics.

That's my point. Fast-math can change the target's semantics regarding reductions, independently of scalable vectors, so it's worth discussing the more general case, i.e. in this thread. Sorry for being terse.

> The hassle of generating reductions may well be at most a minor
> motivator, but my point still stands. If a front-end wants the target
> to be able to generate the best code for a reduction idiom, it must
> generate a lot of IR for many-element vectors. You still have to pay
> the price in bloated IR; see the tests changed as part of the AArch64
> NEON patch.

This is a completely orthogonal discussion, as I stated here:

>> The argument that the intrinsic is harder to destroy through
>> optimisation passes is the same as in other cases of stiff rich
>> semantics vs. generic pattern matching, so orthogonal to this issue.

One that we have had multiple times, and the usual consensus is: if it can be represented in plain IR, it must.
Adding multiple semantics for the same concept, especially stiff ones like builtins, adds complexity to the optimiser. Regardless of the merits in this case, builtins should only be introduced IFF there is no other way. So first we should discuss adding it to IR with generic concepts, just like we did for scatter/gather and strided access.

>> Why not simplify this into something like:
>>
>>   %sum = fadd <N x float> %a, %b
>>   %red = @llvm.reduce(%sum, float %acc)
>> or
>>   %fast_red = @llvm.reduce(%sum)
>
> Because the semantics of an operation would not depend solely on the
> operand value types and the operation, but on a chain of computations
> forming the operands. If the input operand is a phi, you then have to
> do potentially inter-block analysis. If it's a function parameter or
> simply a load from memory, then you're pretty much stuck and you can't
> resolve the semantics.

I think you have just described the pattern-matching algorithm, meaning it's possible to write that as a sequence of IR instructions, so using add+reduce should work. The same goes for pointer types and other reduction operations. If the argument comes from a function parameter that is a non-strict pointer to memory, then all bets are off anyway and the front-end wouldn't be able to generate anything more specific, unless you're using SIMD intrinsics, in which case this point is moot.

> During the dev meeting, a reductions proposal where the operation to
> be performed was a kind of opcode was discussed, and rejected by the
> community.

Well, that was certainly a smaller group than the list. Design decisions should not be taken off list, so we must have this discussion on the list again, I'm afraid.

> I don't believe having many intrinsics would be a problem.

This is against every decision I remember. Saying it out loud in a meeting is one thing; writing them down, implementing them, and having to bear the maintenance costs is another entirely.
That's why the consensus has to happen on the list.

>> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?
>
> For the same reasons that we don't re-use the other binary operator
> instructions like add, sub, mul. The vector versions of those are not
> horizontal operations; they instead produce vector results.

Sorry, I meant min/max + reduce, just like above:

  %sum = fadd <N x float> %a, %b
  %min = @llvm.minnum(<N x float> %sum)
  %red = @llvm.reduce(%min, float %acc)

cheers,
--renato

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
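The min/max + reduce composition sketched in the quoted message can be pinned down with plain scalar code. The C++ sketch below assumes one plausible reading of the hypothetical intrinsics (horizontal minimum, then fold into the accumulator); neither `@llvm.reduce` nor a horizontal `@llvm.minnum` of this shape is a landed LLVM API:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical scalar model of the sketched IR:
//   %sum = fadd <N x float> %a, %b        ; elementwise add
//   %min = @llvm.minnum(<N x float> %sum) ; horizontal minimum
//   %red = @llvm.reduce(%min, float %acc) ; fold into the accumulator
// (minnum's special NaN handling is ignored here for brevity.)
float minnumReduce(const std::vector<float> &a, const std::vector<float> &b,
                   float acc) {
    float m = acc;
    for (size_t i = 0; i < a.size(); ++i)
        m = std::min(m, a[i] + b[i]); // min over the elementwise sums
    return m;
}
```

For example, `minnumReduce({1.0f, 2.0f}, {3.0f, -5.0f}, 10.0f)` takes the minimum of 10, 4 and -3, yielding -3. Because min is associative and commutative, ordering does not matter for this reduction; the ordered/plain split is only an FP-add/mul concern.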
On 1 February 2017 at 09:38, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
> I suppose we can let the target "decide". A target may decide to use an intrinsic, since it will be converted later to a single instruction, or to leave plain IR, letting the optimizer work smoothly.
> Actually, we do the same for gather/scatter: the vectorizer does not build intrinsics if the target does not support them.

Absolutely, this is a target-specific decision.

For gather/scatter we had some intrinsics, but we also used plain IR for others, no? For strided access we used mostly IR, despite the original proposal being mostly intrinsics.

I think the balance is: what is the cost across the compiler of generating the intrinsics and keeping them syntactically correct and optimal? I'm not claiming I know, as I don't know AVX or SVE too well, but we need answers to those questions before we take any decision.

>> adds complexity to the optimizer
>
> The optimizer should not deal with intrinsics, at least not with this kind of intrinsic.

That's why it adds complexity. To most optimisers, intrinsics are function calls that could be replaced by a code block, a specific instruction, or a library call. This can make certain transformations opaque. We have had, in the past, examples in NEON code generation where better code was produced after inlining because new IR patterns were found; if we had used intrinsics, they wouldn't have been.

> The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false".
> In this case we'll stay with plain IR.
> But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

I'm not sure what you mean. The cost analysis is per instruction, which adds up to per block or function, be they intrinsics or legal IR instructions.
The implementations of the functions that calculate profitability already have to take non-intrinsics into account, so that shouldn't make any difference.

> The IR is plain and the ISAs are complex, so we miss optimizations in the end. The IR should be rich in order to serve complex ISAs.

Creating more intrinsics actually makes the IR *more* complex. What kind of optimisations would we miss?

If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with them, no? I mean, we already do this for almost everything else.

If the current IR can express the required semantics, then we should use plain IR. The alternative is to start forcing intrinsics for everything, which would make the IR really pointless.

> As far as the intrinsics set goes, we probably need an "ordered" and a "plain" mode for FP reductions.

The ordered and plain modes can be chosen via an additional parameter, the accumulator.

> X86 does not have the "ordered" one today, but might be interested in it for a loop-tail operation in FP fast mode.

Indeed; same for NEON and, I guess, most other SIMD engines.

cheers,
--renato
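The per-target hook being debated in these two messages could look roughly like the sketch below. All names here (ReductionKind, TargetInfo, IsProfitable) are illustrative stand-ins taken from Elena's IsProfitable(ORDERED_REDUCTION_FADD, VF) example, not actual LLVM TargetTransformInfo API:

```cpp
// Hypothetical sketch of the target-specific reduction decision.
enum class ReductionKind { OrderedFAdd, UnorderedFAdd };

struct TargetInfo {
    bool HasOrderedReductions; // e.g. true for SVE, false for X86
};

// The vectorizer would query this before choosing between emitting a
// reduction intrinsic and emitting plain IR (or staying scalar).
bool IsProfitable(const TargetInfo &TI, ReductionKind K, unsigned VF) {
    if (K == ReductionKind::OrderedFAdd)
        return TI.HasOrderedReductions; // no native support -> "false"
    // Unordered tree reductions pay off for any real vector factor.
    return VF > 1;
}
```

Under this scheme an X86-like target answers "false" for an ordered FP-add reduction and the vectorizer keeps plain IR, while an SVE-like target answers "true" and gets the intrinsic, which is exactly the split Elena describes.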
Demikhovsky, Elena via llvm-dev
2017-Feb-01 10:30 UTC
[llvm-dev] RFC: Generic IR reductions
> If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with them, no?

The back-end should be as robust as possible, I agree. The problem that I see is in adding another kind of complexity to the optimizer that works between the vectorizer and the back-end: it has to be able to recognize all the "obvious" patterns in order to preserve them.

>> The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false".
>> In this case we'll stay with plain IR.
>> But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.
>
> I'm not sure what you mean. The cost analysis is per instruction, which adds up to per block or function, be they intrinsics or legal IR instructions.

Currently, we look at a reduction phi, and if the FP mode requires an "ordered" reduction that is not supported, the whole loop remains scalar. If we left this decision to the cost model, it would provide a cost for scalarization, and in the end we might decide to scalarize the reduction operation inside the vector loop. Once the decision taken is to vectorize, inserting an intrinsic or generating plain IR should be a target decision.

- Elena

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 11:58
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Amara Emerson <amara.emerson at gmail.com>; Amara Emerson <Amara.Emerson at arm.com>; llvm-dev at lists.llvm.org; nd <nd at arm.com>; Simon Pilgrim <llvm-dev at redking.me.uk>
Subject: Re: [llvm-dev] RFC: Generic IR reductions

On 1 February 2017 at 09:38, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
> I suppose we can let the target "decide".