thr3ads.net - llvm dev - [llvm-dev] LV: predication [May 2020]

Sjoerd Meijer via llvm-dev

2020-May-19 15:22 UTC

[llvm-dev] LV: predication

Invitation accepted, I am happy to help out with reviews, like I did with the
previous VP patches.

And of course agreed that things should be well defined, and that we
shouldn't paint ourselves into a corner, but I don't think that this is
the case. And it's not that I am in a rush, but I don't think this
change needs to be predicated on a big change landing first, like the LV
switching to VP intrinsics.

> The difference is that in the VP version there is an explicit dependence of
> every vector operation in the loop to the set.num.elements intrinsic. This
> dependence is obscured in the hwloop proposals (more on that below).

This discussion is getting complicated, because I think we are now discussing 3
topics at the same time: predication, hardware loops, and a new set of
intrinsics, the VP intrinsics.
For the change that kicked off this thread, i.e. 1 new intrinsic to get the
active lanes, I think we can eliminate the hardware loops from this story. For
us, that is just the context of this, and so I think we can just focus on
predication. And if we only talk about predication, I think this new intrinsic
can nicely coexist with the VP intrinsics.

And please note again I am not proposing a set.num.elements intrinsic. Well, I
first kind of did, but again, abandoned that approach after push back. Correct
me if I am wrong, but there's no difference in your example whether all
instructions consume some predicate or only masked loads/stores:

  vector.preheader:
    %init.evl = call i32 @llvm.hwloop.set.elements(%n)
  vector.body:
    %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

There is no difference in that the problem remains: we have a random intrinsic
sitting in the preheader describing a loop property that needs to be maintained.

So, eliminating hardware loops and the intrinsic that defines the number of
elements produced, I am proposing

  vector.body:
    %mask = llvm.get.active.lane.mask(%IV, %BTC)
     .. = @llvm.masked.load(.., %mask)

where IV is the induction variable, and BTC the backedge-taken count.
This completely piggybacks on everything that is already there in the
vectoriser, and nothing is fundamentally changed here. Now, this seems very
generic, and doesn't seem to conflict with the VP intrinsics.
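
To make the proposal above concrete, here is a small Python model of the mask
computation and of a masked load. It is only a sketch: the function names are
made up for illustration, and the rule "lane i is active when IV + i <= BTC" is
an assumption based on the description in this mail; the exact comparison is
whatever D79100 ends up specifying.

```python
def get_active_lane_mask(iv, btc, vf):
    """Assumed semantics: lane i is active while scalar iteration iv + i still runs."""
    return [iv + lane <= btc for lane in range(vf)]


def masked_load(memory, base, mask, passthru=0):
    """Load only the active lanes; inactive lanes never touch memory."""
    return [memory[base + lane] if active else passthru
            for lane, active in enumerate(mask)]


n, vf = 10, 4
btc = n - 1                      # backedge-taken count of the scalar loop
a = list(range(100, 100 + n))    # exactly n elements, no padding

for iv in range(0, n, vf):
    mask = get_active_lane_mask(iv, btc, vf)
    print(iv, mask, masked_load(a, iv, mask))
```

In the last vector iteration (iv = 8) only the first two lanes are active, so
the load does not read past the end of the 10-element array.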

Cheers,
Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 15:07
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/19/20 12:38 PM, Sjoerd Meijer wrote:
> Hi Simon,
>
> Thanks for reposting the example, and looking at it more carefully, I think it
> is very similar to my first proposal. This was met with some resistance here
> because it dumps loop information in the vector preheader. Doing it this early
> (we want to emit this in the vectoriser) puts a restriction on (future)
> optimisations that transform vector loops to honour/update/support this
> intrinsic and loop information. In D79100, it is an integral part of the vector
> body and has some semantics (I will update it today), and thus doesn't have
> these disadvantages.

The difference is that in the VP version there is an explicit dependence of
every vector operation in the loop to the set.num.elements intrinsic. This
dependence is obscured in the hwloop proposals (more on that below).
I understand that you are looking to get hwloops working quickly somehow, but
any proposal should be designed in a forward-looking way or we could get stuck
in a place it's hard to get out of. I am looking forward to seeing the
semantics for this spelled out.

> Also, the vectoriser isn't using the VP intrinsics yet, so using them is a
> bridge too far for me at this point. But we should definitely re-evaluate at
> some point if we should use or transition to them in our backend passes.

I'd very much like to see LV use VP intrinsics. I invite everybody to
collaborate on VP to make it functional and useful quickly! Specifically, i am
hoping we can collaborate on masked reduction intrinsics and implement them in
the VP namespace. There is also the VP expansion pass on Phabricator right now
(D78203 - it says 'work-in-progress' in the summary, which probably was
a mistake: this is the real thing).

>> Are all vector instructions in the hwloop implicitly predicated or only the
>> masked load/store ops?
>
> In a nutshell, when a vector loop with (explicitly) predicated masked
> loads/stores hits the backend, we translate the generic intrinsic get.active.mask
> to a target-specific one. All predication remains explicit, and this remains the
> case. Only at the end, we use this intrinsic to instruction select a specific
> variant of the hardware loop with some implicit predication.

I do not see an answer to my question here. If the vectorized loop, prepared for
hwloop, looks like this:

    %m = get.active.mask(..)
    %v = masked.load ... %m
    %r = sdiv %x, %y

Will the `sdiv` execute with implicit hwloop predication?
For the semantics of the intrinsic, it makes no difference at which point you
lower it, only how.

- Simon

> Cheers,
> Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 09:56
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,

On 5/18/20 3:43 PM, Sjoerd Meijer wrote:
>> You have similar problems with https://reviews.llvm.org/D79100
>
> The new revision of D79100 (https://reviews.llvm.org/D79100) solves your
> comment 1), and I don't think your comments 2) and 3) apply as there are no
> vendor-specific intrinsics involved at all here. Just to quickly discuss the
> optimisation pipeline: D79100 is a small
> extension for the vectoriser, and nothing here is related to hardware loops or
> target-specific constructs. The vectoriser tail-folds the loop and creates
> masked loads/stores; this is existing functionality, and nothing has changed
> here. The generic hardware loop codegen pass inserts hardware loop intrinsics.
> Very late in the pipeline, e.g. in the PPC and ARM backends, this is picked up
> and turned into an actual hardware loop, in our case possibly predicated, or
> it is reverted.

Thanks for explaining it (possibly once again). I wasn't aware that this will
also be used for PPC. Point 3) still stands.

>> What will you do if there are no masked intrinsics in the hwloop body?
>
> Nothing. I.e., it can become a hardware loop, but not one with implicit
> predication.

Are all vector instructions in the hwloop implicitly predicated or only the
masked load/store ops? If not, then the issue is that the predicate parameter of
masked load/store basically affects the semantics of all other vector ops in the
loop that do not have an explicit mask parameter:

    %v = masked.load ... %m ; explicit predication - okay
    %r = sdiv %x, %y        ; implicit predication by %m for hwloops,
                            ; unpredicated otherwise
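
The concern about the unmasked sdiv can be illustrated with a small Python
model. This is a sketch with made-up names, not LLVM code: it assumes the tail
lanes of the divisor hold zeros, and it uses Python's ZeroDivisionError as a
stand-in for the undefined behaviour that an LLVM sdiv by zero has.

```python
def sdiv_lanes(x, y, mask=None):
    """Lane-wise signed division; mask=None means every lane executes."""
    result = []
    for lane, (a, b) in enumerate(zip(x, y)):
        if mask is not None and not mask[lane]:
            result.append(None)        # inactive lane: no division happens
        else:
            result.append(int(a / b))  # truncates toward zero; raises if b == 0
    return result


x = [8, -9, 5, 5]
y = [2, 2, 0, 0]                    # tail lanes hold a zero divisor
mask = [True, True, False, False]   # only the first two lanes are in bounds

print(sdiv_lanes(x, y, mask))       # predicated: tail lanes are skipped

try:
    sdiv_lanes(x, y)                # unpredicated: tail lanes divide by zero
except ZeroDivisionError:
    print("unpredicated sdiv faulted in a tail lane")
```

Whether the division is safe therefore depends on whether the mask applies to
it, which is exactly what the IR above leaves implicit.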

>> And i am curious why couldn't you use the %evl parameter of VP
>> intrinsics to get the tail predication you are interested in?
>
> In D79100 (https://reviews.llvm.org/D79100), the intrinsic get.active.mask
> makes the backedge-taken count of the scalar loop explicit. I will look again,
> but I don't think the VP intrinsics were able to provide this. But to be
> honest, I have no preference at all what this intrinsic is; it is not relevant,
> as long as we can make this explicit.

VP intrinsics explicitly make every vector instruction in the loop dependent on
the '%evl'. You would have:

    %v = vp.load ... %evl
    %r = vp.sdiv %x, %y, %evl   ; explicitly predicated by the scalar loop trip count

My previous mail had an example on how %evl could be tied to the scalar trip
count. Re-posting that here:

vector.preheader:
  %init.evl = call i32 @llvm.hwloop.set.elements(%n)

vector.body:
  %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
  %aval = call @llvm.vp.load(Aptr, .., %evl)
  call @llvm.vp.store(Bptr, %aval, ..., %evl)
  %next.evl = call i32 @llvm.hwloop.decrement(%evl)

- Simon

Cheers.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 14:11
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/18/20 2:53 PM, Sjoerd Meijer wrote:
> Hi,
> I abandoned that approach and followed Eli's suggestion, see somewhere
> earlier in this thread, and emit an intrinsic that represents/calculates the
> active mask. I've just uploaded a new revision for D79100 that implements
> this.
> Cheers.

You have similar problems with https://reviews.llvm.org/D79100

Since there are no masked operations, except for load/store.. how are LLVM
optimizations supposed to know that they must not hoist/sink operations with
side-effects out of the hwloop? These operations have an implicit dependence on
the iteration variable.

What will you do if there are no masked intrinsics in the hwloop body? This can
happen once you generate vector code beyond trivial loops or have a vector IR
generator other than LV.

And i am curious why couldn't you use the %evl parameter of VP intrinsics to
get the tail predication you are interested in?

- Simon

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:
> what we would like to generate is a vector loop with implicit predication, which
> works by setting up the number of elements processed by the loop:
>
> hwloop 10
>   [i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

   %bval = call <vscale x 4 x double> @vp.load(..., /* %evl */ 10)
   %cval = call <vscale x 4 x double> @vp.load(..., /* %evl */ 10)
   %sum = fadd <vscale x 4 x double> %bval, %cval
   store [..]

I see three issues with the llvm.set.loop.elements approach:
1) It is conceptually broken: as others have pointed out, optimizations can move
the intrinsic around since the intrinsic doesn't have any dependencies that
would naturally keep it in place.
2) The whole proposed set of intrinsics is vendor specific: this causes
fragmentation and i don't see why we would want to emit vendor-specific
intrinsics in a generic optimization pass. Soon, we would see reports a la
"your optimization caused regressions for MVE - add a check that the
transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics
when compiling for MVE..". I doubt that you would tolerate it if that
intrinsic were somehow removed in performance-critical code that would then
remain scalar as a result.. so, i do not see the "beauty of the approach".
3) We need a reliable solution to properly support vector ISAs such as the
RISC-V V extension and SX-Aurora and also MVE.. i don't see that reliability in
this proposal.

If, for whatever reason, the above does not work and seems too far away from your
proposal, here is another idea to make more explicit hwloops work with the VP
intrinsics - in a way that does not break with optimizations:

vector.preheader:
  %init.evl = call i32 @llvm.hwloop.set.elements(%n)

vector.body:
  %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
  %aval = call @llvm.vp.load(Aptr, .., %evl)
  call @llvm.vp.store(Bptr, %aval, ..., %evl)
  %next.evl = call i32 @llvm.hwloop.decrement(%evl)

Note that the way VP intrinsics are designed, it is not possible to break this
code by hoisting the VP calls out of the loop: passing "%evl >= the
operation's vector size" constitutes UB (see
https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use
attributes to do the same for sinking (e.g. don't move VP across
hwloop.decrement).
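
The preheader/decrement scheme in this message can be modelled in a few lines of
Python. This is only a sketch under assumptions: llvm.hwloop.set.elements and
llvm.hwloop.decrement are hypothetical intrinsics from this discussion, and the
model assumes that the loop tracks the number of remaining elements and that
each VP operation handles at most VF of them per iteration.

```python
def vp_copy(src, n, vf):
    """Copy n elements in chunks of at most vf lanes, driven by an EVL value."""
    dst = [None] * n
    remaining = n                  # stands in for hwloop.set.elements(%n)
    base = 0
    evls = []
    while remaining > 0:
        evl = min(remaining, vf)   # explicit vector length for this iteration
        # vp.load followed by vp.store: only the first evl lanes are touched
        dst[base:base + evl] = src[base:base + evl]
        evls.append(evl)
        base += vf
        remaining -= evl           # stands in for hwloop.decrement(%evl)
    return dst, evls


src = [float(i) for i in range(10)]
copied, evls = vp_copy(src, n=10, vf=4)
print(evls)   # lanes processed per vector iteration
```

Every vector operation in the loop body takes the same evl value as an
operand, which is the explicit data dependence argued for above.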

- Simon


Simon Moll via llvm-dev

2020-May-20 08:52 UTC

[llvm-dev] LV: predication

On 5/19/20 5:22 PM, Sjoerd Meijer wrote:
> Invitation accepted, I am happy to help out with reviews, like I did with the
> previous VP patches.

That's great!

> And of course agreed that things should be well defined, and that we
> shouldn't paint ourselves into a corner, but I don't think that this is
> the case. And it's not that I am in a rush, but I don't think this
> change needs to be predicated on a big change landing first, like the LV
> switching to VP intrinsics.
>
>> The difference is that in the VP version there is an explicit dependence of
>> every vector operation in the loop to the set.num.elements intrinsic. This
>> dependence is obscured in the hwloop proposals (more on that below).
>
> This discussion is getting complicated, because I think we are now discussing 3
> topics at the same time: predication, hardware loops, and a new set of
> intrinsics, the VP intrinsics.

Ok. My question (the example at the end) was asking whether hwloops imply
predication (and by that i mean logically - if the hwloop implies that a SIMD
instruction may not execute for all lanes in the tail, then that is predication
as well).

> For the change that kicked off this thread, i.e. 1 new intrinsic to get the
> active lanes, I think we can eliminate the hardware loops from this story. For
> us, that is just the context of this, and so I think we can just focus on
> predication. And if we only talk about predication, I think this new intrinsic
> can nicely coexist with the VP intrinsics.
>
> And please note again I am not proposing a set.num.elements intrinsic. Well, I
> first kind of did, but again, abandoned that approach after push back. Correct
> me if I am wrong, but there's no difference in your example whether all
> instructions consume some predicate or only masked loads/stores:

Yes, and that is the point: it's about making the SIMD instructions
dependent on the mask .. and all of them.

>   vector.preheader:
>     %init.evl = call i32 @llvm.hwloop.set.elements(%n)
>   vector.body:
>     %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
>     %aval = call @llvm.vp.load(Aptr, .., %evl)
>     call @llvm.vp.store(Bptr, %aval, ..., %evl)
>     %next.evl = call i32 @llvm.hwloop.decrement(%evl)
>
> There is no difference in that the problem remains: we have a random intrinsic
> sitting in the preheader describing a loop property that needs to be maintained.

The difference is that the intrinsic is connected to every SIMD instruction in
the vector loop through data flow. It does not just sit there.. in fact it does
not matter where it is placed as long as those def-use edges are visible to the
hwloop transformation.

> So, eliminating hardware loops and the intrinsic that defines the number of
> elements produced, I am proposing
>
>   vector.body:
>     %mask = llvm.get.active.lane.mask(%IV, %BTC)
>      .. = @llvm.masked.load(.., %mask)
>
> where IV is the induction variable, and BTC the backedge-taken count.
> This completely piggybacks on everything that is already there in the
> vectoriser, and nothing is fundamentally changed here. Now, this seems very
> generic, and doesn't seem to conflict with the VP intrinsics.

I see it the other way round: Right now you seem to have an implicit dependence
from syntactically unmasked SIMD instructions (e.g. a regular SIMD sdiv) to the
predicate of nearby masked intrinsics (masked.load) - that's on shaky
grounds semantically. VP intrinsics already define clean semantics for tail
predication - so why not piggyback on that? You should define the hwloop support
in a way that will not just peacefully coexist with VP but leverage it
eventually. I'll continue in that direction in the review.

One specific request (since i got your attention now ;-) ): we need a (generic)
IR primitive to express %lane_id < %n for scalable vector types to expand VP
intrinsics for targets with SVE support but no tail predication.

> Cheers,
> Sjoerd.

- Simon



Sjoerd Meijer via llvm-dev

2020-May-20 11:15 UTC

[llvm-dev] LV: predication

Hello,

About this, I am essentially just echoing what others said on the list:

> The difference is that the intrinsic is connected to every SIMD instruction
> in the vector loop through data flow. It does not just sit there.. in fact it
> does not matter where it is placed as long as those def-use edges are visible to
> the hwloop transformation.

Yes, it is well connected with use-def chains, but the intrinsic defines a loop
property. If we had a transformation that, for example, peels off one
vector iteration from that loop/vector body, the loop no longer processes %N
elements but, for example, %N - 4 data elements. With hwloop.set.elements(%N)
still sitting in the preheader, it could communicate the wrong information to
other passes or the backend. Thus, this puts a maintenance burden on passes to
support that intrinsic, which is not what we want. The feedback was that we need
to communicate this information in a different way; there are different ways to
do this.
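
The peeling scenario can be sketched in Python. This is an illustrative model
with made-up names: it assumes a vectorisation factor of 4, a count recorded
once before the loop (standing in for hwloop.set.elements(%N)), and a
per-iteration mask computed as "IV + lane <= BTC".

```python
def lane_masks(start_iv, btc, vf):
    """Per-iteration masks for a vector loop whose first IV is start_iv."""
    return [[iv + lane <= btc for lane in range(vf)]
            for iv in range(start_iv, btc + 1, vf)]


n, vf = 10, 4
btc = n - 1
preheader_elements = n                   # recorded once, before the loop

full = lane_masks(0, btc, vf)            # original vector loop
peeled = lane_masks(vf, btc, vf)         # after peeling one vector iteration

active_in_full = sum(sum(mask) for mask in full)
active_in_peeled = sum(sum(mask) for mask in peeled)

print(active_in_full, active_in_peeled, preheader_elements)
```

After peeling, the masks computed inside the loop body still describe exactly
the lanes that run, while the count recorded in the preheader no longer matches
the loop unless the peeling transformation knows to update it.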

Now, returning to hardware loops.

> Ok. My questions (the example at the end) was asking whether hwloops imply
> predication (and by that i mean logically - if the hwloop implies that a SIMD
> instruction may not execute for all lanes in the tail then that is predication
> as well).

We should probably define what we mean by hardware loops, i.e., where in the
pipeline. In the target-independent CodeGen pass HardwareLoops, hardware loops
are supported with a few intrinsics to mark a loop as a hardware loop. This does
not imply any predication. That is, these hardware loop intrinsics do not
influence in any way predication or any masking of lanes, thus they do not imply
certain forms of hwloops with or without predication. But there can be masked
loads/stores inside these hardware loop bodies; they are generated by the
vectoriser. Please note that I am not trying to be pedantic here, but am just
describing the current situation to get clarity on what we are discussing,
because what the problem is was becoming a bit unclear to me.
Now, things do change in the ARM backend, because in MVE we have 2 forms of
hardware loops, let's say a normal one, and one with implicit predication.
And to support this, we transform explicit predication into implicit
predication, but of course only when it is okay to do this. With this in mind,
returning to the example:
> I do not see an answer to my question here. If the vectorized loop,
> prepared for hwloop, looks like this:
>
>    %m = get.active.mask(..)
>    %v = masked.load ... %m
>    %r = sdiv %x, %y
>
> Will the `sdiv` execute with implicit hwloop predication?
The short answer is "no". There are no hardware loops here at this
point, and thus we also don't distinguish between different hwloop forms.
Here we use, let's say, the vectoriser's way of masking/predication: only
the loads/stores are masked. Your previous remark, also quoted below, is that VP
intrinsics provide clean semantics, and I fully agree with that.
> I see it the other way round: Right now you seem to have an implicit
> dependence from syntactically unmasked SIMD instructions (eg a regular SIMD
> sdiv) to the predicate of nearby masked intrinsics (masked.load) - that's on
> shaky grounds semantically. VP intrinsics already define a clean semantics for
> tail predication - so why not piggyback on that?
The @llvm.get.active.lane.mask intrinsic is unrelated, but works exactly the
same as the @num.elements intrinsic, i.e. it is well connected, as you said, with
def-use chains, feeding the relevant instructions, in this case the masked
loads/stores. You're unhappy that currently the vector instructions
don't have explicit masks/predication, but that is the current state of the
art.  Again, agreed that VP intrinsics are semantically clean, and we will
definitely use them when we can.

Cheers,
Sjoerd.


________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 20 May 2020 09:52
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/19/20 5:22 PM, Sjoerd Meijer wrote:
Invitation accepted, I am happy to help out with reviews, like I did with the
previous VP patches.
That's great!

And of course agreed that things should be well defined, and that we
shouldn't paint ourselves in a corner, but I don't think that this is
the case. And it's not that I am in a rush, but I don't think this
change needs to be predicated on a big change landing first like the LV
switching to VP intrinsics.
> The difference is that in the VP version there is an explicit dependence of
every vector operation in the loop to the set.num.elements intrinsic. This
dependence is obscured in the hwloop proposals (more on that below).
This discussion is getting complicated, because I think we are discussing 3
topics at the same time now: predication, hardware loops, and a new set of
intrinsics, the VP intrinsics.
Ok. My question (the example at the end) was asking whether hwloops imply
predication (and by that i mean logically - if the hwloop implies that a SIMD
instruction may not execute for all lanes in the tail then that is predication
as well).
For the change that kicked off this thread, i.e. 1 new intrinsic to get the
active lanes, I think we can eliminate the hardware loops from this story. For
us, that is just the context of this, and so I think we can just focus on
predication. And if we only talk about predication, I think this new intrinsic
can nicely coexist with the VP intrinsics.

And please note again I am not proposing a set.num.elements intrinsic. Well, I
first kind of did, but again, abandoned that approach after push back. Correct
me if I am wrong, but there's no difference in your example whether all
instructions consume some predicate or only masked loads/stores:
Yes, and that is the point: it's about making the SIMD instructions
dependent on the mask .. and all of them.

  vector.preheader:
    %init.evl = call i32 @llvm.hwloop.set.elements(%n)
  vector.body:
    %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

No difference in that the problem remains that we have a random intrinsic
sitting in the preheader describing a loop property that needs to be maintained.
The difference is that the intrinsic is connected to every SIMD instruction in
the vector loop through data flow. It does not just sit there.. in fact it does
not matter where it is placed as long as those def-use edges are visible to the
hwloop transformation.

So, eliminating hardware loops and the intrinsic that defines the number of
elements produced, I am proposing

  vector.body:
    %mask = @llvm.get.active.lane.mask(%IV, %BTC)
     .. = @llvm.masked.load(.., %mask)

where IV is the induction step, and BTC the backedge taken count.
This completely piggybacks on everything that is already there in the
vectoriser, and nothing is fundamentally changed here. Now, this seems very
generic, and doesn't seem to bite the VP intrinsics.
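
To make the proposed mask concrete, here is a small Python model. It assumes that lane i is active when IV + i <= BTC; that comparison, the vector width of 4 and the passthru value are assumptions of this sketch and are not taken from the patch itself.

```python
def get_active_lane_mask(iv, btc, vf=4):
    """Model: lane i is active when iv + i <= btc (btc = backedge-taken count)."""
    return [iv + lane <= btc for lane in range(vf)]


def masked_load(memory, base, mask, passthru=0):
    """Model of a masked load: inactive lanes never touch memory."""
    return [memory[base + i] if active else passthru
            for i, active in enumerate(mask)]


data = list(range(100, 110))   # 10 elements, so 3 vector iterations at VF=4
btc = len(data) - 1            # backedge-taken count of the scalar loop
for iv in range(0, len(data), 4):
    mask = get_active_lane_mask(iv, btc)
    print(iv, mask, masked_load(data, iv, mask))
```

Only the last iteration (iv = 8) has inactive lanes; the first two iterations get an all-true mask.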
I see it the other way round: Right now you seem to have an implicit dependence
from syntactically unmasked SIMD instructions (eg a regular SIMD sdiv) to the
predicate of nearby masked intrinsics (masked.load) - that's on shaky
grounds semantically. VP intrinsics already define a clean semantics for tail
predication - so why not piggyback on that? You should define the hwloop support
in a way that will not just peacefully coexist with VP but leverage it
eventually. I'll continue in that direction in the review.

One specific request (since i got your attention now ;-) ): we need a (generic)
IR primitive to express %lane_id < %n for scalable vector types to expand VP
intrinsics for targets with SVE support but no tail predication.
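
The requested comparison can be modelled in a few lines of Python. This is only a sketch: `evl_to_mask` and `num_lanes` are hypothetical names, and `num_lanes` stands in for a scalable vector length that is only known at run time.

```python
def evl_to_mask(evl, num_lanes):
    """Lane i is active when i < evl (the '%lane_id < %n' comparison)."""
    return [lane_id < evl for lane_id in range(num_lanes)]


# num_lanes stands in for vscale * 4, which the compiler does not know.
for vscale in (1, 2):
    print(vscale, evl_to_mask(3, num_lanes=4 * vscale))
```

The same function body works for any lane count, which is the property a generic primitive for scalable types would need.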

Cheers,
Sjoerd.
- Simon

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 15:07
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/19/20 12:38 PM, Sjoerd Meijer wrote:
Hi Simon,

Thanks for reposting the example. Looking at it more carefully, I think it
is very similar to my first proposal. This was met with some resistance here
because it dumps loop information in the vector preheader. Doing it this early
(we want to emit this in the vectoriser) puts a restriction on (future)
optimisations that transform vector loops: they have to honour/update/support
this intrinsic and loop information. In D79100, it is an integral part of the
vector body and has some semantics (I will update it today), and thus doesn't
have these disadvantages.
The difference is that in the VP version there is an explicit dependence of
every vector operation in the loop to the set.num.elements intrinsic. This
dependence is obscured in the hwloop proposals (more on that below).
I understand that you are looking to get hwloops working quickly somehow - but
any proposal should be designed in a forward-looking way or we could get stuck
in a place that is hard to get out of. I am looking forward to seeing the
semantics for this spelled out.

Also, the vectoriser isn't using the VP intrinsics yet, so using them is a
bridge too far for me at this point. But we should definitely re-evaluate at
some point if we should use or transition to them in our backend passes.

I'd very much like to see LV use VP intrinsics. I invite everybody to
collaborate on VP to make it functional and useful quickly! Specifically, i am
hoping we can collaborate on masked reduction intrinsics and implement them in
the VP namespace. There is also the VP expansion pass on Phabricator right now
(D78203 - it says 'work-in-progress' in the summary, which probably was
a mistake: this is the real thing).
> Are all vector instructions in the hwloop implicitly predicated or only the
masked load/store ops?
In a nutshell, when a vector loop with (explicitly) predicated masked
loads/stores hits the backend, we translate the generic intrinsic get.active.mask
to a target specific one. All predication remains explicit, and this remains the
case. Only at the end, we use this intrinsic to instruction select a specific
variant of the hardware loop with some implicit predication.
I do not see an answer to my question here. If the vectorized loop, prepared for
hwloop, looks like this:

    %m = get.active.mask(..)
    %v = masked.load ... %m
    %r = sdiv %x, %y

Will the `sdiv` execute with implicit hwloop predication?
For the semantics of the intrinsic, it makes no difference at which point you
lower it, only how.

- Simon


Cheers,
Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 09:56
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,

On 5/18/20 3:43 PM, Sjoerd Meijer wrote:
> You have similar problems with https://reviews.llvm.org/D79100
The new revision D79100 solves your
comment 1), and I don't think your comments 2) and 3) apply as there are no
vendor specific intrinsics involved at all here. Just to quickly discuss the
optimisation pipeline, D79100 is a small
extension for the vectoriser, and nothing here is related to hardware-loops or
target specific constructs. The vectoriser tail-folds the loop, and creates
masked load/stores; so existing functionality, and nothing has changed here. The
generic hardware loop codegen pass inserts hardware loop intrinsics. Very late
in the pipeline, e.g. in the PPC and ARM backends, this is picked up and turned
into an actual hardware loop, in our case possibly predicated, or it is reverted.
Thanks for explaining it (possibly once again). I wasn't aware that this will
also be used for PPC. Point 3) still stands.
> What will you do if there are no masked intrinsics in the hwloop body?
Nothing. I.e., it can become a hardware loop, but not one with implicit
predication.
Are all vector instructions in the hwloop implicitly predicated or only the
masked load/store ops? If not, then the issue is that the predicate parameter of
masked load/store basically affects the semantics of all other vector ops in the
loop that do not have an explicit mask parameter:

    %v = masked.load ... %m ; explicit predication - okay
    %r = sdiv %x, %y        ; implicit predication by %m for hwloops -
                            ; unpredicated otherwise
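
The following Python model sketches why it matters whether the `sdiv` is predicated. It is an illustration under stated assumptions: inactive lanes are assumed to hold 0 (a made-up passthru value), and Python's ZeroDivisionError stands in for the undefined behaviour of an integer division by zero.

```python
def trunc_div(x, y):
    """Signed division rounding toward zero, like LLVM's sdiv."""
    q = abs(x) // abs(y)
    return q if (x < 0) == (y < 0) else -q


def sdiv_unpredicated(xs, ys):
    """Every lane divides, including lanes past the end of the data."""
    return [trunc_div(x, y) for x, y in zip(xs, ys)]


def sdiv_predicated(xs, ys, mask, inactive=0):
    """Only active lanes divide; inactive lanes yield a placeholder value."""
    return [trunc_div(x, y) if m else inactive
            for x, y, m in zip(xs, ys, mask)]


# Tail iteration: two live lanes, two lanes past the end of the arrays.
mask = [True, True, False, False]
xs = [8, -9, 0, 0]
ys = [2, 2, 0, 0]

print(sdiv_predicated(xs, ys, mask))  # [4, -4, 0, 0]
try:
    sdiv_unpredicated(xs, ys)
    trapped = False
except ZeroDivisionError:
    trapped = True
print(trapped)  # True
```

The two functions agree on the live lanes and differ only in what happens on the lanes that the mask of the neighbouring load disabled.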

> And i am curious why couldn't you use the %evl parameter of VP
intrinsics to get the tail predication you are interested in?
In D79100, the intrinsic get.active.mask
makes the backedge taken count of the scalar loop explicit. I will look again,
but I don't think the VP intrinsics were able to provide this. But to be
honest, I have no preference at all about what this intrinsic is; that is not
relevant as long as we can make this explicit.
VP intrinsics explicitly make every vector instruction in the loop dependent on
the '%evl'. You would have:

    %v = vp.load ... %evl
    %r = vp.sdiv %x, %y, %evl   ; explicitly predicated by the scalar loop
                                ; trip count

My previous mail had an example on how %evl could be tied to the scalar trip
count. Re-posting that here:

vector.preheader:
  %init.evl = call i32 @llvm.hwloop.set.elements(%n)

vector.body:
  %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
  %aval = call @llvm.vp.load(Aptr, .., %evl)
  call @llvm.vp.store(Bptr, %aval, ..., %evl)
  %next.evl = call i32 @llvm.hwloop.decrement(%evl)
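
A minimal Python model of this loop may help to follow the data flow. The semantics given here to `hwloop.set.elements` and `hwloop.decrement`, the clamping of the active lanes to the vector width, and the vector width of 4 are all assumptions of the sketch, since the thread does not spell them out.

```python
def hwloop_set_elements(n):
    """Assumed model of llvm.hwloop.set.elements: elements left to process."""
    return n


def hwloop_decrement(evl, vf):
    """Assumed model of llvm.hwloop.decrement: one vector step consumed."""
    return max(evl - vf, 0)


def vp_load(memory, base, evl, vf):
    """Only the first min(evl, vf) lanes are active and read memory."""
    return [memory[base + i] for i in range(min(evl, vf))]


def vp_store(memory, base, values):
    """Write back exactly the lanes that were active in the load."""
    for i, value in enumerate(values):
        memory[base + i] = value


def copy_loop(a, vf=4):
    b = [None] * len(a)
    evl = hwloop_set_elements(len(a))    # vector.preheader
    base = 0
    while evl > 0:                       # vector.body
        vp_store(b, base, vp_load(a, base, evl, vf))
        evl = hwloop_decrement(evl, vf)
        base += vf
    return b


print(copy_loop(list(range(10))))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Every memory operation in the body takes `evl` as an operand, which is the explicit def-use dependence the example is meant to show.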


- Simon


Cheers.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 14:11
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/18/20 2:53 PM, Sjoerd Meijer wrote:
Hi,
I abandoned that approach and followed Eli's suggestion, see somewhere
earlier in this thread, and emit an intrinsic that represents/calculates the
active mask. I've just uploaded a new revision for D79100 that implements
this.
Cheers.
You have similar problems with https://reviews.llvm.org/D79100

Since there are no masked operations, except for load/store.. how are LLVM
optimizations supposed to know that they must not hoist/sink operations with
side-effects out of the hwloop? These operations have an implicit dependence on
the iteration variable.

What will you do if there are no masked intrinsics in the hwloop body? This can
happen once you generate vector code beyond trivial loops or have a vector IR
generator other than LV.

And i am curious why couldn't you use the %evl parameter of VP intrinsics to
get the tail predication you are interested in?

- Simon


________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma
at quicinc.com>; listmail at philipreames.com <listmail at
philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De
Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com
<hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:
what we would like to generate is a vector loop with implicit predication, which
works by setting up the number of elements processed by the loop:

hwloop 10
  [i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

   %bval = <vscale x 4 x double> call @vp.load(..., /* %evl */ 10)
   %cval = <vscale x 4 x double> call @vp.load(..., /* %evl */ 10)
   %sum = <vscale x 4 x double> fadd %bval, %cval
   store [..]

I see three issues with the llvm.set.loop.elements approach:
1) It is conceptually broken: as others have pointed out, optimization can move
the intrinsic around since the intrinsic doesn't have any dependencies that
would naturally keep it in place.
2) The whole proposed set of intrinsics is vendor specific: this causes
fragmentation and i don't see why we would want to emit vendor-specific
intrinsics in a generic optimization pass. Soon, we would see reports a la
"your optimization caused regressions for MVE - add a check that the
transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics
when compiling for MVE..". I doubt that you would tolerate it if that
intrinsic were somehow removed in performance-critical code that would then
remain scalar as a result.. so, i do not see the "beauty of the approach".
3) We need a reliable solution to properly support vector ISAs such as the
RISC-V V extension and SX-Aurora and also MVE.. i don't see that reliability in
this
proposal.

If, for whatever reason, the above does not work and seems too far away from
your proposal, here is another idea to make more explicit hwloops work with the
VP intrinsics - in a way that does not break with optimizations:

vector.preheader:
  %init.evl = call i32 @llvm.hwloop.set.elements(%n)

vector.body:
  %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
  %aval = call @llvm.vp.load(Aptr, .., %evl)
  call @llvm.vp.store(Bptr, %aval, ..., %evl)
  %next.evl = call i32 @llvm.hwloop.decrement(%evl)

Note that the way VP intrinsics are designed, it is not possible to break this
code by hoisting the VP calls out of the loop: passing "%evl >= the
operation's vector size" constitutes UB (see
https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use
attributes to do the same for sinking (eg don't move VP across
hwloop.decrement).

- Simon



