Hi Roger,

That's a good example; it shows most of the moving parts involved here. In a nutshell, the difference, and what we would like to make explicit, is the vector loop trip count versus the scalar loop trip count. In your IR example, the loads/stores are predicated on a mask that is calculated from a splat induction variable, which is compared with the vector trip count. Illustrated with your example simplified, and with some pseudo-code: if we tail-fold and vectorize this scalar loop:

  for i = 0 to 10
    a[i] = b[i] + c[i];

the vector loop trip count is rounded up to 12, the next multiple of 4, and lanes are predicated on i < 10:

  for i = 0 to 12
    a[i:4] = b[i:4] + c[i:4], if i < 10;

what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:

  hwloop 10
    a[i:4] = b[i:4] + c[i:4]

This is implicit since instructions don't produce/consume a mask; it is generated and used under the hood by the "hwloop" construct. Your observation that the information in the IR is mostly there is correct, but rather than pattern matching and reconstructing this in the backend, we would like to make this explicit. In this example, the scalar iteration count 10 is the number of elements processed by this loop, which is what we want to pass on from the vectoriser to backend passes.

Hope this helps.
Cheers,
Sjoerd.

________________________________
From: Roger Ferrer Ibáñez <rofirrim at gmail.com>
Sent: 04 May 2020 21:22
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Eli Friedman <efriedma at quicinc.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sam Parker <Sam.Parker at arm.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,

That would be an excellent way of doing it, and it would also map very well to MVE, where we have a VCTP intrinsic/instruction that creates the mask/predicate (Vector Create Tail-Predicate). So I will go for this approach.
Such an intrinsic was actually also proposed in Sam's original RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html), but we hadn't implemented it yet. This intrinsic will probably look something like this:

  <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)

It produces a <N x i1> predicate based on its two arguments, the number of elements and the vector trip count, and it will be used by the masked load/store instructions in the vector body. I will start drafting an implementation of this and continue with it in D79100.

I'm curious about this, because this looks very similar to the code that -prefer-predicate-over-epilog is already emitting for the "outer mask" of a tail-folded loop. The following code

  void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
  #pragma clang loop vectorize(enable) interleave(disable)
    for (int i = 0; i < N; i++) {
      a[i] = b[i] + c[i];
    }
  }

compiled with clang --target=x86_64 -mavx512f -mllvm -prefer-predicate-over-epilog -emit-llvm -O2 emits the following IR:

  vector.body:                        ; preds = %vector.body, %for.body.preheader.new
    %index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
    %niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
    %broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
    %broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
    %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
    %4 = getelementptr inbounds i32, i32* %b, i64 %index
    %5 = icmp ule <16 x i64> %induction, %broadcast.splat
    ...
    %wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, <16 x i1> %5, <16 x i32> undef), !tbaa !2

I understand %5 is not the same as what your proposed llvm.loop.get.active.mask would compute, is that correct?
Can you elaborate on the difference here?

Thanks a lot,
Roger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200504/49cfe347/attachment.html>
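As a concrete model of the predication being discussed, here is a small Python sketch (all names are mine, and VF = 4 follows Sjoerd's pseudo-code rather than the <16 x i64> IR above). Lane j at base index i is active iff i + j < n; for n >= 1 this is the same condition as the IR's `icmp ule %induction, %broadcast.splat`, assuming %broadcast.splat is a splat of the back-edge-taken count n - 1.

```python
VF = 4  # vector factor; Sjoerd's pseudo-code uses 4 lanes

def active_lane_mask(i, n):
    # Lane j is active iff i + j < n -- equivalently (i + j) ule (n - 1),
    # the splat-induction compare in the IR above.
    return [i + j < n for j in range(VF)]

# Tail-folded loop over n = 10 elements; the vector trip count is
# rounded up to 12, so three masks are produced.
n = 10
masks = [active_lane_mask(i, n) for i in range(0, 12, VF)]
```

The first two iterations get an all-true mask; the last is [True, True, False, False], switching off the two tail lanes, so exactly 10 lanes execute in total.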
Hi Sjoerd,

Thanks a lot for the clarification. Makes sense.

Kind regards,
Roger Ferrer Ibáñez

On Tue, 5 May 2020 at 0:06, Sjoerd Meijer <Sjoerd.Meijer at arm.com> wrote:
> [...]
On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:

> what we would like to generate is a vector loop with implicit predication,
> which works by setting up the number of elements processed by the loop:
>
>   hwloop 10
>     a[i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

  %bval = call <vscale x 4 x double> @llvm.vp.load(..., /* %evl */ 10)
  %cval = call <vscale x 4 x double> @llvm.vp.load(..., /* %evl */ 10)
  %sum = fadd <vscale x 4 x double> %bval, %cval
  store [..]

I see three issues with the llvm.set.loop.elements approach:

1) It is conceptually broken: as others have pointed out, optimizations can move the intrinsic around, since the intrinsic doesn't have any dependencies that would naturally keep it in place.

2) The whole proposed set of intrinsics is vendor specific: this causes fragmentation, and I don't see why we would want to emit vendor-specific intrinsics in a generic optimization pass. Soon, we would see reports à la "your optimization caused regressions for MVE - add a check that the transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics when compiling for MVE..". I doubt that you would tolerate it if that intrinsic were somehow removed in performance-critical code, which would then remain scalar as a result.. so, I do not see the "beauty of the approach".

3) We need a reliable solution to properly support vector ISAs such as the RISC-V V extension and SX-Aurora, and also MVE.. I don't see that reliability in this proposal.
If, for whatever reason, the above does not work and seems too far away from your proposal, here is another idea to make more explicit hwloops work with the VP intrinsics - in a way that does not break with optimizations:

  vector.preheader:
    %evl = call i32 @llvm.hwloop.set.elements(%n)

  vector.body:
    %lastevl = phi i32 [%evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %lastevl)
    call @llvm.vp.store(Bptr, %aval, ..., %lastevl)
    %next.evl = call i32 @llvm.hwloop.decrement(%lastevl)

Note that the way VP intrinsics are designed, it is not possible to break this code by hoisting the VP calls out of the loop: passing "%evl >= the operation's vector size" constitutes UB (see https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use attributes to do the same for sinking (e.g. don't move VP across hwloop.decrement).

- Simon
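For comparison, here is a small Python model of the explicit-vector-length scheme Simon describes (helper names are mine; in real IR the vp.* intrinsics carry %evl as an operand): instead of a per-lane mask, each iteration is told how many elements it may touch, so the tail needs no predicate at all.

```python
VF = 4  # vector factor

def vp_add(a, b, c, base, evl):
    # Models vp.load / fadd / vp.store with an explicit vector length:
    # only lanes 0..evl-1 are read and written.
    for j in range(min(evl, VF)):
        a[base + j] = b[base + j] + c[base + j]

def hwloop_add(a, b, c, n):
    base, remaining = 0, n
    while remaining > 0:
        evl = min(remaining, VF)  # models hwloop set/decrement of elements
        vp_add(a, b, c, base, evl)
        base += VF
        remaining -= evl

# Run the n = 10 example from the thread: last iteration gets evl = 2.
n = 10
b = list(range(n)); c = [2 * x for x in b]; a = [0] * n
hwloop_add(a, b, c, n)
```

Passing an evl larger than the remaining elements would be UB for the real intrinsics, which is what makes hoisting the calls out of the loop invalid.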
Hi,

I abandoned that approach and followed Eli's suggestion, see somewhere earlier in this thread, and emit an intrinsic that represents/calculates the active mask. I've just uploaded a new revision for D79100 that implements this.

Cheers.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

> [...]
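Tying the thread together, here is a Python sketch of the approach Sjoerd says he implemented for D79100: an active-mask intrinsic feeding masked loads/stores. The helper names are mine, not the final intrinsic names; the point is only that a tail-folded masked vector loop computes exactly what the scalar loop does.

```python
VF = 4  # vector factor

def get_active_lane_mask(base, n):
    # Lane j is active iff base + j < n.
    return [base + j < n for j in range(VF)]

def masked_load(src, base, mask, passthru=0):
    # Inactive lanes must not touch memory; they yield a pass-through value.
    return [src[base + j] if mask[j] else passthru for j in range(VF)]

def masked_store(dst, base, vals, mask):
    for j in range(VF):
        if mask[j]:
            dst[base + j] = vals[j]

def tail_folded_add(a, b, c, n):
    base = 0
    while base < n:
        mask = get_active_lane_mask(base, n)
        bv = masked_load(b, base, mask)
        cv = masked_load(c, base, mask)
        masked_store(a, base, [x + y for x, y in zip(bv, cv)], mask)
        base += VF

# The n = 10 example from the thread: the final mask disables two lanes.
n = 10
b = list(range(n)); c = [1] * n; a = [0] * n
tail_folded_add(a, b, c, n)
```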