Hello,

We are working on predication for our vector extension (MVE). Since quite a few people are working on predication and different forms of it (e.g. SVE, RISC-V, NEC), I thought I would share what we would like to add to the loop vectoriser. Hopefully it's just a minor, non-intrusive change, but it could be interesting and useful for others, and feedback on this is of course welcome.

TL;DR: We would like the loop vectoriser to emit a new IR intrinsic for certain loops:

  void @llvm.set.loop.elements.i32(i32)

This represents the number of data elements processed by a vector loop. It will be emitted in the preheader block of the vector loop, after querying TTI to check that the backend understands this intrinsic and that it should be emitted for that loop. The vectoriser patch is available in D79100, and we pick this intrinsic up in the ARM backend in D79175.

Context: we are working on a form of predication that we call tail-predication: a vector hardware loop has an implicit form of predication that sets active/inactive lanes for the last iteration of the vector loop. Thus, the scalar epilogue loop (if there is one) is tail-folded and tail-predicated into the main vector body. To support this, we need to know the number of data elements processed by the loop, which is used in the set-up of a tail-predicated vector loop. This new intrinsic communicates that information from the vectoriser to the codegen passes where we further lower these loops. In our case, we essentially let @llvm.set.loop.elements.i32 take the trip count of the scalar loop as its argument, which represents the number of data elements processed. Thus, we let the vectoriser emit both the scalar and the vector loop trip count. Although it happens at a different stage in the optimisation pipeline, this is exactly what the generic HardwareLoops pass does to communicate its information to target-specific codegen passes: it emits a few intrinsics to mark a hardware loop.
To illustrate this and the new intrinsic, here is the flow and life of a tail-predicated vector loop, using some heavily edited/reduced examples. First, the vectoriser emits the number of elements processed, and the loads/stores are masked because tail-folding is applied:

  vector.ph:
    call void @llvm.set.loop.elements.i32(i32 %N)
    br label %vector.body

  vector.body:
    call <4 x i32> @llvm.masked.load
    call <4 x i32> @llvm.masked.load
    call void @llvm.masked.store
    br i1 %12, label %.*, label %vector.body

After the HardwareLoops pass this is transformed into the following, which adds the hardware-loop intrinsics:

  vector.ph:
    call void @llvm.set.loop.elements.i32(i32 %N)
    call void @llvm.set.loop.iterations.i32(i32 %5)
    br label %vector.body

  vector.body:
    call <4 x i32> @llvm.masked.load
    call <4 x i32> @llvm.masked.load
    call void @llvm.masked.store
    call i32 @llvm.loop.decrement.reg
    br i1 %12, label %.*, label %vector.body

We then pick this up in our tail-predication pass, remove the @llvm.set.loop.elements intrinsic, and add @vctp, which is our intrinsic that generates the mask of active/inactive lanes:

  vector.ph:
    call void @llvm.set.loop.iterations.i32(i32 %5)
    br label %vector.body

  vector.body:
    call <4 x i1> @llvm.arm.mve.vctp32
    call <4 x i32> @llvm.masked.load
    call <4 x i32> @llvm.masked.load
    call void @llvm.masked.store
    call i32 @llvm.loop.decrement.reg
    br i1 %12, label %.*, label %vector.body

This is then further lowered to a tail-predicated loop, or reverted to a 'normal' vector loop if some restrictions are not met.

Cheers,
Sjoerd.
The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.

Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail-predication pass). You should be able to figure it out from the masks used in the llvm.masked.load/store instructions in the loop.

-Eli
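[For readers following along: for a tail-folded loop, the mask the vectoriser emits typically compares a vector induction against a broadcast of the scalar loop's backedge-taken count, so the element count is recoverable in principle by matching that pattern. A reduced sketch, with illustrative names; %btc (the backedge-taken count, i.e. element count minus one) and %ptr are assumed to be defined elsewhere:

  ; Sketch of the mask a tail-folded vector loop typically carries (VF = 4).
  %splat.btc = insertelement <4 x i32> undef, i32 %btc, i32 0
  %broadcast.btc = shufflevector <4 x i32> %splat.btc, <4 x i32> undef, <4 x i32> zeroinitializer

  vector.body:
    %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %splat.index = insertelement <4 x i32> undef, i32 %index, i32 0
    %broadcast = shufflevector <4 x i32> %splat.index, <4 x i32> undef, <4 x i32> zeroinitializer
    %induction = add <4 x i32> %broadcast, <i32 0, i32 1, i32 2, i32 3>
    ; Lanes with %induction <= %btc are active; a backend pass can walk from
    ; this icmp back to the splat of %btc to recover the element count.
    %mask = icmp ule <4 x i32> %induction, %broadcast.btc
    %wide.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %ptr, i32 4, <4 x i1> %mask, <4 x i32> undef)
    %index.next = add i32 %index, 4

The fragility Sjoerd mentions below comes from the many variations of this pattern that the matcher has to recognise.]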
Hi Eli,

> The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.

I do see that point. But is that not also the beauty of it? It just sits in the preheader; if it gets removed, then so be it. And if it is not recognised, then also no harm done?

> Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it out from the masks used in the llvm.masked.load/store instructions in the loop.

This is what we are currently doing, and it works well for the simpler cases. For the more complicated cases that we now want to handle as well, the pattern matching just becomes a bit too horrible, and it is fragile too. All we need is the information that the vectoriser already has, passed on somehow. As I am really keen to simplify our backend pass, would there be another way to pass this information on? If emitting an intrinsic is a blocker, could this be done with a loop annotation?

Cheers,
Sjoerd.
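[For readers following along: the loop-annotation alternative would use !llvm.loop metadata, which hangs off the loop's latch branch. A sketch, where the metadata string "llvm.loop.tailfolded" is invented purely for illustration; note that, unlike an intrinsic operand, loop metadata cannot reference a runtime SSA value such as %N, so the annotation could only mark the loop, not carry the element count itself:

  ; Hypothetical annotation marking a tail-folded vector loop.
  vector.body:
    ...
    br i1 %cmp, label %exit, label %vector.body, !llvm.loop !0

  !0 = distinct !{!0, !1}
  !1 = !{!"llvm.loop.tailfolded"}

The backend would still have to recompute the element count, but it would no longer need to guess which loops are tail-folded.]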
On Fri, 1 May 2020 at 19:30, Eli Friedman via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.

I agree. Requiring a loose intrinsic to have meaning in the CFG is a non-starter; that's why we have loop annotations. To me, this looks like what MLIR has in the Loop or Affine dialects. It would be great to have that in LLVM as well, for the cases where we know it from the front-end (for example, when lowering from MLIR), but it has to be a loop annotation of some form. A pass that generates such annotations could run just before the loop optimisations, or the annotation could come from the front-end, which would then have to hope it doesn't get removed by some pass in between. Fortran is bound to have more semantically rich loop information, and they'll use MLIR. It would be interesting to know how that will be done, and you could get the ground work done beforehand by working with them to carry the annotations in the right way.

cheers,
--renato