thr3ads.net - llvm dev - [llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors) [Nov 2020]

Simon Moll via llvm-dev

2020-Nov-06 10:07 UTC

[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

On 11/6/20 8:49 AM, Roger Ferrer Ibáñez wrote:
> Hi Sjoerd,
>
>> Trying to remember how everything fits together here, but could
>> get.active.lane.mask not create the %mask of the VP intrinsics? Or in other
>> words, in the vectoriser, who's producing the %mask and %evl that is
>> consumed by the VP intrinsics?
>
> I'm not sure what would be the best way here. I am thinking of the Loop
> Vectorizer. I imagine at some point we can teach LV to emit VPred for the
> widening. VPred IR needs two additional operands, as you mentioned, %evl and
> %mask.
>
> One option is to make %evl the max-vector-length of the type being operated
> on and %mask (that is, the "outer block mask" in this context) be
> get.active.lane.mask. This maps well to SVE and MVE, but not so well to VE and
> RISC-V (I don't think it is incorrect, but it is not an efficient thing to
> do). Perhaps VE and RISC-V can work in this scenario if at some point they
> replace the %evl with something like the "%n - %base" operands of
> get.active.lane.mask, and %mask (the outer block mask) is replaced with a
> splat of "i1 1".

Basically, we would extend TTI to let the targets choose how to use the %mask
and %evl operands in the VP intrinsics. So, an 'fadd' would turn into an
'llvm.vp.fadd' for all predicating targets. However, whether
get.active.lane.mask() is used for %mask or whether tail predication is done
with a (splat i1 1) for the mask and setting %evl would be target dependent.

> Another option here is to make "%n - %base" be the %evl (or at least an
> operand of some target hook, because "computing" the %evl is
> target-specific; targets without evl could compute the identity here) and
> %mask (the outer block mask) be a splat of "i1 1". This maps well to VE and
> RISC-V but makes life harder for AVX-512, SVE and MVE (in general, any target
> where TargetTransformInfo::hasActiveVectorLength returns false). Those targets
> could replace the %evl with the max-vector-length of the operated type and
> then use get.active.lane.mask(0, %evl) as the outer block mask. My
> understanding is that Simon used this approach in
> https://reviews.llvm.org/D78203 but in a more general setting that would be
> independent of what the Loop Vectorizer does.

For VE, we set %evl = min(max_vector_width, %n - %base); that's the same
idiom that the non-LLVM NEC compilers emit for tail predication.
Basically, the LV flow could look something like this:


  ; Call the target hook to let the target select %mask and %evl params for the
  ; loop header
  %evl, %mask <- IRBuilder.createIterationPredicate(%i, %n, TTI)

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;   %mask = icmp ult (%i + (seq <8 x i32> 0,1,2,...)), %n
  ;   %evl = i32 8

  ; Configure the Vector Predication builder to use those
  VPBuilder
      .setExplicitVectorLength(%evl)
      .setMask(%mask);

  ; Start building vector-predicated instructions
  VPBuilder.createFadd(%x, %y)    ; --> call @llvm.vp.fadd(%x, %y, %mask, %evl)
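
A minimal Python sketch of the idea above, not LLVM code: `create_iteration_predicate` is a hypothetical stand-in for the target hook, and the two strategy names are made up for illustration. It only models the point that an all-true mask with a short %evl and a full %evl with a lane mask enable the same lanes, assuming %i < %n.

```python
from typing import List, Tuple


def get_active_lane_mask(base: int, n: int, lanes: int) -> List[bool]:
    """Model of get.active.lane.mask: lane j is on iff base + j < n."""
    return [base + j < n for j in range(lanes)]


def create_iteration_predicate(i: int, n: int, lanes: int,
                               strategy: str) -> Tuple[int, List[bool]]:
    """Hypothetical target hook: returns (%evl, %mask) for the iteration at i."""
    if strategy == "evl":
        # RISC-V V / VE style: predicate through %evl, all-true mask.
        return min(lanes, n - i), [True] * lanes
    if strategy == "mask":
        # SVE / MVE / AVX style: full %evl, predicate through the mask.
        return lanes, get_active_lane_mask(i, n, lanes)
    raise ValueError(f"unknown strategy: {strategy}")


def active_lanes(evl: int, mask: List[bool]) -> List[bool]:
    """A VP operation acts on lane j only if mask[j] is set and j < evl."""
    return [m and j < evl for j, m in enumerate(mask)]
```

For a trip count of 19 with 8 lanes, both strategies enable exactly the first three lanes in the final iteration (i = 16).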


> It looks to me like the second option makes more effective use of vpred, and
> D78203 shows that we can always soften vpred into a shape that is reasonable
> for lowering in targets without active vector length.

The whole point about VP is to make sure there is one set of vector-predicated
instructions/intrinsics that everybody is using while giving people the freedom
to use these as it fits their targets. We can then concentrate on optimizing VP
intrinsic code and all targets benefit.

- Simon

*: VE's packed mode (512 x 32bit elements) is a use case for a non-trivial
setting of %mask and %evl at the same time (%evl for packs of two 32bit
elements, i.e. %evl must be even for 32bit lanes; %mask for masking out lanes
inside packs).
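
One way to read the footnote, as a rough Python sketch only: `packed_predicate` is a hypothetical helper, and the rounding rule is an assumption drawn from the "must be even" remark rather than from VE documentation. %evl is rounded up to a whole number of two-lane packs, and %mask switches off the odd leftover lane.

```python
from typing import List, Tuple


def packed_predicate(remaining: int, max_lanes: int = 512) -> Tuple[int, List[bool]]:
    """Return (%evl, %mask) for `remaining` 32-bit elements in packed mode.

    Assumption: %evl counts 32-bit lanes and must cover whole packs of two,
    so it is rounded up to the next even number; the mask then disables
    any lane at or beyond the number of live elements.
    """
    live = min(remaining, max_lanes)
    evl = live + (live % 2)                      # round up to an even lane count
    mask = [j < live for j in range(max_lanes)]  # turn off the leftover lane
    return evl, mask
```

With five elements left, this gives an %evl of 6 and a mask whose sixth lane is off, so both operands are non-trivial at once.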



> Thoughts?
>
> Kind regards,
> --
> Roger Ferrer Ibáñez


Renato Golin via llvm-dev

2020-Nov-06 10:16 UTC

On Fri, 6 Nov 2020 at 10:07, Simon Moll <Simon.Moll at emea.nec.com> wrote:
> The whole point about VP is to make sure there is one set of
> vector-predicated instructions/intrinsics that everybody is using while
> giving people the freedom to use these as it fits their targets. We can
> then concentrate on optimizing VP intrinsic code and all targets benefit.
>
Agreed!

Roger Ferrer Ibáñez via llvm-dev

2020-Nov-06 10:26 UTC

Hi Simon
>
> Looks to me the second option makes a more effective use of vpred and
> D78203 shows that we can always soften vpred into a shape that is
> reasonable for lowering in targets without active vector length.
>
> The whole point about VP is to make sure there is one set of
> vector-predicated instructions/intrinsics that everybody is using while
> giving people the freedom to use these as it fits their targets. We can
> then concentrate on optimizing VP intrinsic code and all targets benefit.
>
This is even better than I imagined, then. Thanks for the examples and
clarification.

Kind regards,
-- 
Roger Ferrer Ibáñez

Sjoerd Meijer via llvm-dev

2020-Nov-06 11:39 UTC

Hello Simon,

Thanks for your replies, very useful.  And yes, thanks for the example and
making the target differences clear:

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;  %mask = icmp (%i + (seq <8 x i32> 0,1,2,.,)), %n,
  ;  %evl = i32 8

Unless I'm missing something, the AVX example is semantically the same as
get.active.lane.mask:

   %m[i] = icmp ult (%base + i), %n

with 8 lanes (i = 0..7).

I mention this to see whether we can have a single interface for generating
the mask (which is what I was perhaps expecting). If you just want an all-true
mask for VE, and if we can merge AVX with the other two, we would just have:

  ; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()

I am not sure why MVE (or AVX) would need the vscale(). But if it does, I am
wondering if it could be something like:

  ; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = call @llvm.vscale(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale(... ,..)

Cheers,
Sjoerd.


Simon Moll via llvm-dev

2020-Nov-06 15:37 UTC

On 11/6/20 12:39 PM, Sjoerd Meijer wrote:
> Hello Simon,
>
> Thanks for your replies, very useful.  And yes, thanks for the example and
> making the target differences clear:
>
>   ; Some examples:
>   ; RISC-V V & VE(*):
>   ;   %mask = (splat i1 1)
>   ;   %evl = min(256, %n - %i)
>   ; MVE/SVE :
>   ;   %mask = get.active.lane.mask(%i, %n)
>   ;   %evl = call @llvm.vscale()
>   ; AVX:
>   ;   %mask = icmp ult (%i + (seq <8 x i32> 0,1,2,...)), %n
>   ;   %evl = i32 8
>
> Unless I'm missing something, the AVX example is semantically the same as
> get.active.lane.mask:
>
>    %m[i] = icmp ult (%base + i), %n
>
> with 8 lanes (i = 0..7).

Correct (llvm.get.active.lane.mask.v8i1.i32).

> I mention this to see whether we can have a single interface for generating
> the mask (which is what I was perhaps expecting). If you just want an all-true
> mask for VE, and if we can merge AVX with the other two, we would just have:
>
>   ; RISC-V V & VE(*):
>   ;   %mask = get.active.lane.mask(%i, %i)
>   ;   %evl = min(256, %n - %i)
>   ; MVE/SVE/AVX :
>   ;   %mask = get.active.lane.mask(%i, %n)
>   ;   %evl = call @llvm.vscale()

For VE, we want to do as much predication as possible through %evl and as little
as possible with %mask. This has performance implications on VE and RISC-V: VE
does not generate a mask from %evl; instead, %evl is mapped directly to
hardware, and passing the all-true mask is free.
So for VE, the %evl does all the predication and there is no reason to have
anything other than a (splat i1 1) %mask here.

On SVE/MVE you may want to use get.active.lane.mask instead, and on RISC-V V,
AFAIU, the %evl parameter will have to be computed by some RISC-V specific
`setvl` intrinsic. Both of these are okay because VP gives you that flexibility.
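
As an illustration of predicating through %evl alone, here is a small Python model of a strip-mined loop. It is a sketch under stated assumptions: `evl_strip_mined_add` is a hypothetical name, the 256-lane maximum is taken from the examples in this thread, and a real RISC-V `setvl` may pick a different length than the plain minimum used here.

```python
from typing import List, Tuple


def evl_strip_mined_add(a: List[float], b: List[float],
                        max_vl: int = 256) -> Tuple[List[float], List[int]]:
    """Add two arrays chunk by chunk, handling the tail through %evl only.

    The mask is implicitly all-true: every lane below evl is processed,
    and no lane at or above evl is touched.
    """
    n = len(a)
    out = [0.0] * n
    evls = []                        # the %evl chosen in each iteration
    i = 0
    while i < n:
        evl = min(max_vl, n - i)     # %evl = min(256, %n - %i)
        for j in range(evl):         # models one vector-predicated fadd
            out[i + j] = a[i + j] + b[i + j]
        evls.append(evl)
        i += evl
    return out, evls
```

For 600 elements the loop runs three times with %evl values of 256, 256 and 88, so no separate remainder loop or lane mask is needed.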


> I am not sure why MVE (or AVX) would need the vscale(). But if it does, I am
> wondering if it could be something like:
>
>   ; RISC-V V & VE(*):
>   ;   %mask = get.active.lane.mask(%i, %i)
>   ;   %evl = call @llvm.vscale(256, %n - %i)
>   ; MVE/SVE/AVX :
>   ;   %mask = get.active.lane.mask(%i, %n)
>   ;   %evl = call @llvm.vscale(... ,..)

The vscale is only necessary with scalable types, e.g. you can inactivate the
%evl parameter like so:

  llvm.vp.fadd nxv4f128(%x, %y, %mask, (@llvm.vscale() * 4))

The VPIntrinsic class upstream already has the functionality to check whether
the %evl parameter is inactivated in this way
(VPIntrinsic::canIgnoreVectorLengthParam()).
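
The condition can be modelled numerically in Python. This is only a sketch of the idea: the helper names are hypothetical, and the upstream check works on the IR operand rather than on runtime values, which this model does not reproduce.

```python
from typing import List


def evl_covers_all_lanes(evl: int, min_elements: int, vscale: int = 1) -> bool:
    """True when %evl spans every lane of the vector type.

    For a scalable type <vscale x N x T> the lane count is vscale * N;
    for a fixed-width type, use vscale = 1.
    """
    return evl >= vscale * min_elements


def effective_lanes(mask: List[bool], evl: int) -> List[bool]:
    """Lanes a VP operation acts on: mask[j] set and j < evl."""
    return [m and j < evl for j, m in enumerate(mask)]
```

When `evl_covers_all_lanes` holds, `effective_lanes(mask, evl)` equals `mask`, so the operation is predicated by %mask alone and %evl carries no information.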


> Cheers,
> Sjoerd.

- Simon

