thr3ads.net - llvm dev - [llvm-dev] RFC: Generic IR reductions [Feb 2017]

If this information is useful, please help other people find it:
Share via:

Demikhovsky, Elena via llvm-dev

2017-Feb-01 13:06 UTC

[llvm-dev] RFC: Generic IR reductions

Constant propagation:

%sum = add <N x float> %a, %b
@llvm.reduce(ext <N x double>  %sum)

if %a and %b are vector of constants, the %sum also becomes a vector of
constants.
At this point you have @llvm.reduce(ext <N x double>  %sum) and don't
know what kind of reduction do you need.

-  Elena

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org] 
Sent: Wednesday, February 01, 2017 14:40
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Amara Emerson <amara.emerson at gmail.com>; Amara Emerson
<Amara.Emerson at arm.com>; llvm-dev at lists.llvm.org; nd <nd at
arm.com>; Simon Pilgrim <llvm-dev at redking.me.uk>
Subject: Re: [llvm-dev] RFC: Generic IR reductions

On 1 February 2017 at 11:59, Demikhovsky, Elena <elena.demikhovsky at
intel.com> wrote:>   > @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
>
> And if we don't have %b? We just want to sum all elements of %a? 
> Something like @llvm.reduce(ext <N x double> ( add <N x float>
%a,
> zeroinitializer))
Hum, that's a good point. My examples were actually wrong, as they
weren't related to simple reductions. Your zeroinit is the thing I was
looking for.

> Don't we have a problem with constant propagation in this approach?
I'm not sure. Can you expand this?

> I proposed a "generic" intrinsic approach on the BOF (Nov, 2016),
like
>     %scalar = @llvm.reduce(OPCODE, %vector_input) - OPCODE may be a string,
integer or metadata.
I wouldn't use metadata. Integer would be cumbersome and lead to eventual
ABI breakages, and "text" would be the same as:

  %scalar = @llvm.reduce.add(%vector)

which is the same thing Amara proposed.

I'm not saying it is wrong, I'm just worried that, by mandating the
encoding of the reduction into an intrinsic, we'll force the middle-end to
convert high-level code patterns to the intrinsic or the target will ignore it
completely.

There is a pattern already for reductions, and the back-ends already match it.
This should not change, unless there is a serious flaw in it
- for the targets that *already* support it. This is an orthogonal discussion.

SVE has more restrictions, for instance, one cannot know how many shuffles to do
because the vector size is unknown, so the current representation is
insufficient, in which case, we need the intrinsic.

But replace everything else with intrinsics just because one target can't
cope with it doesn't work.

On thing that does happen is that code optimisations expose patterns that would
otherwise not be apparent. This includes potential reduction or fusion patterns
and can lead to massively smaller code or even eliding the whole block. If you
convert a block to an intrinsic too early you may lose the ability to merge it
back again later, as we're doing today.

These are all hypothetical wrt SVE, but they did happen in NEON in the past and
were the reason why we only have a handful of NEON intrinsics. Everything else
are encoded with sequences of instructions.

cheers,
--renato
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Renato Golin via llvm-dev

2017-Feb-01 14:20 UTC

head link

[llvm-dev] RFC: Generic IR reductions

On 1 February 2017 at 13:06, Demikhovsky, Elena
<elena.demikhovsky at intel.com> wrote:> Constant propagation:
>
> %sum = add <N x float> %a, %b
> @llvm.reduce(ext <N x double>  %sum)
>
> if %a and %b are vector of constants, the %sum also becomes a vector of
constants.
> At this point you have @llvm.reduce(ext <N x double>  %sum) and
don't know what kind of reduction do you need.
Well, sum is an add node. If %a and %b have special semantics for the
target, than @reduce is meaningful and this means some type of
summation.

But the more I think of it, the less I think this could actually solve
the semantic issues around reductions... The zeroinit idiom would be
more of a hack than re-using similar concepts in IR.

So, let me take a step back, and assume that, for scalable vectors we
*have* to use all intrinsics. IR simply has no compatible idiom, and
unless we introduce some (like the stepvector), it won't work.

As I said before, adding intrinsics is better than new IR constructs,
so let's also assume this is the less costly way forward for now.

But having IR instructions is better than adding intrinsics, and I'm
not sure we want to completely replace what's there already, for
intrinsics.

Does AVX512 suffer from any cost in using the current extract/op
idiom? NEON has small vectors, so IR sequences end up being 1~4 ops,
which I don't consider a problem.

Also, the current idiom can cope with ordered/unordered reduction by
interleaving the operations or not:

  %sum0 = %vec[0] + %vec[1]
  %sum1 = %vec[2] + %vec[3]
  %sum = %sum0 + %sum1

or

  %sum0 = %vec[0] + %vec[1]
  %sum1 = %sum0 + %vec[2]
  %sum = %sum1 + %vec[3]

It may not cope with special semantics leading to use of
target-specific instructions, in which case we obviously need
intrinsics. It certainly can't cope with unknown vector sizes either.

cheers,
--renato

Amara Emerson via llvm-dev

2017-Feb-01 14:44 UTC

head link

[llvm-dev] RFC: Generic IR reductions

Her point is that the %sum value will no longer be an add node, but
simply a constant vector. There is no way to resolve the semantics and
have meaningful IR after a simple constant prop. This means that other
simple transformations will all have to have special case logic just
to handle reductions, for example LCSSA.
> On thing that does happen is that code optimisations expose patterns
> that would otherwise not be apparent. This includes potential
> reduction or fusion patterns and can lead to massively smaller code or
> even eliding the whole block. If you convert a block to an intrinsic
> too early you may lose the ability to merge it back again later, as
> we're doing today.
> These are all hypothetical wrt SVE, but they did happen in NEON in the
> past and were the reason why we only have a handful of NEON
> intrinsics. Everything else are encoded with sequences of
> instructions.
Can you give a specific example? Reductions are actually very robust
to optimizations breaking the idiom, which is why I was able to
replace the reductions with intrinsics in my patches and with simple
matching generate identical code to before. No other changes were
required in the mid-end.


On 1 February 2017 at 14:20, Renato Golin <renato.golin at linaro.org>
wrote:> On 1 February 2017 at 13:06, Demikhovsky, Elena
> <elena.demikhovsky at intel.com> wrote:
>> Constant propagation:
>>
>> %sum = add <N x float> %a, %b
>> @llvm.reduce(ext <N x double>  %sum)
>>
>> if %a and %b are vector of constants, the %sum also becomes a vector of
constants.
>> At this point you have @llvm.reduce(ext <N x double>  %sum) and
don't know what kind of reduction do you need.
>
> Well, sum is an add node. If %a and %b have special semantics for the
> target, than @reduce is meaningful and this means some type of
> summation.
>
> But the more I think of it, the less I think this could actually solve
> the semantic issues around reductions... The zeroinit idiom would be
> more of a hack than re-using similar concepts in IR.
>
> So, let me take a step back, and assume that, for scalable vectors we
> *have* to use all intrinsics. IR simply has no compatible idiom, and
> unless we introduce some (like the stepvector), it won't work.
>
> As I said before, adding intrinsics is better than new IR constructs,
> so let's also assume this is the less costly way forward for now.
>
> But having IR instructions is better than adding intrinsics, and I'm
> not sure we want to completely replace what's there already, for
> intrinsics.
>
> Does AVX512 suffer from any cost in using the current extract/op
> idiom? NEON has small vectors, so IR sequences end up being 1~4 ops,
> which I don't consider a problem.
>
> Also, the current idiom can cope with ordered/unordered reduction by
> interleaving the operations or not:
>
>   %sum0 = %vec[0] + %vec[1]
>   %sum1 = %vec[2] + %vec[3]
>   %sum = %sum0 + %sum1
>
> or
>
>   %sum0 = %vec[0] + %vec[1]
>   %sum1 = %sum0 + %vec[2]
>   %sum = %sum1 + %vec[3]
>
> It may not cope with special semantics leading to use of
> target-specific instructions, in which case we obviously need
> intrinsics. It certainly can't cope with unknown vector sizes either.
>
> cheers,
> --renato

llvm dev - Feb 2017 - RFC: Generic IR reductions

[llvm-dev] RFC: Generic IR reductions

[llvm-dev] RFC: Generic IR reductions

[llvm-dev] RFC: Generic IR reductions