> The harm comes if the intrinsic ends up with the wrong value, or attached
> to the wrong loop.

The intrinsic is marked as IntrNoDuplicate, so I wasn't worried about it
ending up somewhere else. Also, it is a property of a specific loop, a
tail-folded vector loop, that I think holds even after the loop is
transformed. I.e., unrolling a vector loop is probably not what you want,
but even if you do, the element count would remain the same. But yes, I
agree that a future wacky optimisation on vector loops could invalidate
this; you could then skip that case, but then you lose out on it. So, I
really like this:

> If the problem is specifically figuring out the underlying element count
> given a predicate, maybe we could attack it from that angle? For example,
> introduce a special intrinsic for deriving the mask (sort of like the SVE
> whilelo).

That would be an excellent way of doing it and it would also map very well
to MVE too, where we have a VCTP intrinsic/instruction that creates the
mask/predicate (Vector Create Tail-Predicate). So I will go for this
approach. Such an intrinsic was actually also proposed in Sam's original
RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html),
but we hadn't implemented it yet. This intrinsic will probably look
something like this:

  <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)

It produces a <N x i1> predicate based on its two arguments, the number of
elements and the vector trip count, and it will be used by the predicated
masked load/store instructions in the vector body. I will start drafting an
implementation for this and continue with this in D79100.

Thanks,
Sjoerd.

________________________________
From: Eli Friedman <efriedma at quicinc.com>
Sent: 01 May 2020 21:11
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: RE: [llvm-dev] LV: predication

From: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Sent: Friday, May 1, 2020 11:54 AM
To: Eli Friedman <efriedma at quicinc.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] LV: predication

Hi Eli,

> The problem with your proposal, as written, is that the vectorizer is
> producing the intrinsic. Because we don’t impose any ordering on
> optimizations before codegen, every optimization pass in LLVM would have
> to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes
> any change. This is completely impractical because the intrinsic isn’t
> related to anything optimizations would normally look for: it’s a random
> intrinsic in the middle of nowhere.

I do see that point. But is that also not the beauty of it? It just sits in
the preheader; if it gets removed, then so be it. And if it is not
recognised, then also no harm done?

The harm comes if the intrinsic ends up with the wrong value, or attached
to the wrong loop.

> Probably the simplest path to get this working is to derive the number of
> elements in the backend (in HardwareLoops, or your tail predication pass).
> You should be able to figure it out from the masks used in the
> llvm.masked.load/store instructions in the loop.

This is what we are currently doing, and it works excellently for simpler
cases. For the more complicated cases that we now want to handle as well,
the pattern matching just becomes a bit too horrible, and it is fragile
too. All we need is the information that the vectoriser already has, and a
way to pass it on somehow.

As I am really keen to simplify our backend pass, would there be another
way to pass this information on?
If emitting an intrinsic is a blocker, could this be done with a loop
annotation?

If the problem is specifically figuring out the underlying element count
given a predicate, maybe we could attack it from that angle? For example,
introduce a special intrinsic for deriving the mask (sort of like the SVE
whilelo).

-Eli
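For concreteness, here is a sketch of the kind of tail-folded vector body
the vectoriser might emit once such an intrinsic exists, for a simple i32
add loop at VF=4. The intrinsic name, its type mangling, and the reading of
its operands as the current vector index plus the scalar element count are
taken from the proposal above and are assumptions, not a final definition:

  vector.body:
    %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    ; hypothetical intrinsic: lane j is active iff %index + j < %N,
    ; where %N is the scalar element count
    %active = call <4 x i1> @llvm.loop.get.active.mask.v4i1.i32(i32 %index, i32 %N)
    %b.addr = getelementptr inbounds i32, i32* %b, i32 %index
    %b.vec = bitcast i32* %b.addr to <4 x i32>*
    %wide.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %b.vec, i32 4, <4 x i1> %active, <4 x i32> undef)
    %c.addr = getelementptr inbounds i32, i32* %c, i32 %index
    %c.vec = bitcast i32* %c.addr to <4 x i32>*
    %wide.c = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %c.vec, i32 4, <4 x i1> %active, <4 x i32> undef)
    %sum = add nsw <4 x i32> %wide.c, %wide.b
    %a.addr = getelementptr inbounds i32, i32* %a, i32 %index
    %a.vec = bitcast i32* %a.addr to <4 x i32>*
    call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %sum, <4 x i32>* %a.vec, i32 4, <4 x i1> %active)
    %index.next = add nuw i32 %index, 4
    ; %n.vec is the element count rounded up to a multiple of 4
    %done = icmp eq i32 %index.next, %n.vec
    br i1 %done, label %exit, label %vector.body

A backend tail-predication pass would then only have to look at the
operands of the mask intrinsic to recover the element count, instead of
reconstructing it from the splat/step/compare chain.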
On 5/4/20 3:04 AM, Sjoerd Meijer via llvm-dev wrote:
> > The harm comes if the intrinsic ends up with the wrong value, or
> > attached to the wrong loop.
>
> The intrinsic is marked as IntrNoDuplicate, so I wasn't worried about it
> ending up somewhere else. Also, it is a property of a specific loop, a
> tail-folded vector loop, that I think holds even after the loop is
> transformed. I.e., unrolling a vector loop is probably not what you want,
> but even if you do, the element count would remain the same. But yes, I
> agree that a future wacky optimisation on vector loops could invalidate
> this; you could then skip that case, but then you lose out on it. So, I
> really like this:

This approach really doesn't work. Not unless you're willing to impose
legality restrictions on optimization passes to preserve the information.

It's helpful to think of the optimizer as being adversarial. The question
is not "will the optimizer break this?"; it's "can a malicious optimizer
break this?". Unless you can reason from the spec (LangRef) that the answer
is no, then the answer is yes.

In your particular example, consider what might happen if loop fission runs
on your vectorized loop, and we then recognize that iterations N through M
of the first loop (after fission) were nops and split it into two loops
over narrower ranges. You'd have real trouble matching your intrinsic to
anything meaningful in the backend, and getting it wrong would be a
correctness bug.
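To make the fission concern concrete, here is a rough sketch (hypothetical
transform, illustrative IR only; @llvm.set.loop.elements.i32 is the name
from the original proposal) of how the association between the preheader
intrinsic and "its" loop could silently break:

  ; Before: one tail-folded vector loop; the intrinsic in the preheader
  ; records that this loop processes %N elements.
  vector.ph:
    call void @llvm.set.loop.elements.i32(i32 %N)
    br label %vector.body

  ; After a fission-style transform that splits the iteration space into
  ; [0, %M) and [%M, %N): there are now two vector loops, and neither of
  ; them processes %N elements. The intrinsic still sits in front of the
  ; first loop, so a backend pass that trusts it would compute the wrong
  ; predicate for that loop, which is a correctness bug rather than just a
  ; missed optimisation.
  vector.ph:
    call void @llvm.set.loop.elements.i32(i32 %N)
    br label %vector.body.a              ; processes elements [0, %M)
    ...
  vector.body.b:                         ; processes elements [%M, %N)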
Hi Sjoerd,

> That would be an excellent way of doing it and it would also map very well
> to MVE too, where we have a VCTP intrinsic/instruction that creates the
> mask/predicate (Vector Create Tail-Predicate). So I will go for this
> approach. Such an intrinsic was actually also proposed in Sam's original
> RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html),
> but we hadn't implemented it yet. This intrinsic will probably look
> something like this:
>
> <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)
>
> It produces a <N x i1> predicate based on its two arguments, the number of
> elements and the vector trip count, and it will be used by the predicated
> masked load/store instructions in the vector body. I will start drafting
> an implementation for this and continue with this in D79100.

I'm curious about this, because this looks to me very similar to the code
that -prefer-predicate-over-epilog is already emitting for the "outer mask"
of a tail-folded loop. The following code

  void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
  #pragma clang loop vectorize(enable) interleave(disable)
    for (int i = 0; i < N; i++) {
      a[i] = b[i] + c[i];
    }
  }

compiled with

  clang --target=x86_64 -mavx512f -mllvm -prefer-predicate-over-epilog -emit-llvm -O2

emits the following IR:

  vector.body:                         ; preds = %vector.body, %for.body.preheader.new
    %index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
    %niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
    %broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
    %broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
    %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
    %4 = getelementptr inbounds i32, i32* %b, i64 %index
    %5 = icmp ule <16 x i64> %induction, %broadcast.splat
    ...
    %wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, <16 x i1> %5, <16 x i32> undef), !tbaa !2

I understand %5 is not the same as what your proposed
llvm.loop.get.active.mask would compute, is that correct? Can you elaborate
on the difference here?

Thanks a lot,
Roger
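As a side note on reading the dump: %broadcast.splat (presumably a splat of
the scalar loop's backedge-taken count) and %induction are what make %5 a
per-lane bound check, i.e. lane j of %5 is true exactly when %index + j
still corresponds to a real scalar iteration:

  ; <%index, %index+1, ..., %index+15>; the 'or' adds the lane numbers
  ; 0..15 to the splat of %index (which is a multiple of 16)
  %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, ..., i64 15>
  ; lane j is true iff %index + j <= backedge-taken count, i.e. iff i < N
  %5 = icmp ule <16 x i64> %induction, %broadcast.splat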
Hi Roger,

That's a good example that shows most of the moving parts involved here. In
a nutshell, the difference, and what we would like to make explicit, is the
vector trip count versus the scalar loop trip count. In your IR example,
the loads/stores are predicated on a mask that is calculated from a splat
of the induction variable plus a step vector, which is compared against a
splat of the scalar loop's backedge-taken count.

Illustrated with your example simplified, and with some pseudo-code: if we
tail-fold and vectorize this scalar loop:

  for i = 0 to 10
    a[i] = b[i] + c[i];

the upper bound of the vector loop is rounded up to 12, the next multiple
of 4, and lanes are predicated on i < 10:

  for i = 0 to 12
    a[i:4] = b[i:4] + c[i:4], if i < 10;

What we would like to generate is a vector loop with implicit predication,
which works by setting up the number of elements processed by the loop:

  hwloop 10
    a[i:4] = b[i:4] + c[i:4]

This is implicit since instructions don't produce/consume a mask, but it is
generated and used under the hood by the "hwloop" construct.

Your observation that the information in the IR is mostly there is correct,
but rather than pattern matching and reconstructing this in the backend, we
would like to make this explicit. In this example, the scalar iteration
count 10 is the number of elements processed by this loop, which is what we
want to pass on from the vectoriser to the backend passes.

Hope this helps.

Cheers,
Sjoerd.
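Putting the two forms side by side makes the difference concrete: in the
dump above the element count is only implicit in the comparison against a
splat of the backedge-taken count, while the proposed intrinsic would carry
it as an explicit operand that a backend pass can read off directly (the
name, mangling, and operand order below are assumptions based on the
proposal, not a final definition):

  ; today: the element count is buried in a splat + step-vector + compare chain
  %5 = icmp ule <16 x i64> %induction, %broadcast.splat

  ; proposed: the scalar element count %N is an explicit operand, so a backend
  ; tail-predication pass can set up the "hwloop 10"-style construct without
  ; pattern matching the chain above
  %active = call <16 x i1> @llvm.loop.get.active.mask.v16i1.i64(i64 %index, i64 %N)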