thr3ads.net - llvm dev - [llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors) [Nov 2020]

Sjoerd Meijer via llvm-dev

2020-Nov-05 19:16 UTC

[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

> For RISC-V V and VE being explicit about %evl is important for performance &
> correctness and that is what VP does. The get.active.lane.mask intrinsic is
> used as a hint for the MVE, SVE backends to use hardware tail-predication (the
> backends reverse engineer that hint by pattern matching for
> get.active.lane.mask in the mask parameter of "some" masked intrinsics). IMHO,
> it's more of a hot fix to get some tail-predication working quickly with the
> existing infrastructure. It is still useful by itself, e.g. the
> ExpandVPIntrinsic pass uses it to expand the %evl parameter in VP intrinsics
> for scalable vector types.

So I don't think that makes it a hot fix, but agreed with the general
picture here.

> VE uses VP-style SDNodes in the isel layer (upstream patch on Phabricator to
> follow soon-ish). We simply translate both VP and regular SIMD SDNodes into
> these custom SDNodes as an intermediate layer. Even the VE machine
> instructions still have an explicit %evl operand. We have a machine function
> pass that inserts code to re-configure the VL register in-between vector
> instructions that have a different %evl value (we had a poster on that at the
> LLVM US DevMtg '19). This isel strategy has been working well for us.
>
> The goal is to teach LV, VPlan to emit VP intrinsics with a convenient builder
> class (VPBuilder in the reference patch).

Trying to remember how everything fits together here, but could
get.active.lane.mask not create the %mask of the VP intrinsics? Or in other
words, in the vectoriser, who's producing the %mask and %evl that is
consumed by the VP intrinsics?

Cheers,
Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 05 November 2020 11:07
To: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Sjoerd Meijer
<Sjoerd.Meijer at arm.com>
Cc: Renato Golin <rengolin at gmail.com>; Vineet Kumar
<vineet.kumar at bsc.es>; LLVM Dev <llvm-dev at lists.llvm.org>;
ROGER FERRER IBANEZ <roger.ferrer at bsc.es>; Arai, Masaki
<arai.masaki at jp.fujitsu.com>
Subject: Re: [llvm-dev] Loop-vectorizer prototype for the EPI Project based on
the RISC-V Vector Extension (Scalable vectors)

Hi all,

On 11/5/20 10:32 AM, Roger Ferrer Ibáñez wrote:
> Hi Sjoerd,
>
> thanks for pointing us to this intrinsic.
>
> I see it returns a mask/predicate type. My understanding is that VPred
> intrinsics have both a vector length operand and a mask operand. It looks to
> me that a "popcount" of get.active.lane.mask would correspond to the
> vector length operand. Then the additional "control flow" mask of
> predicated code would correspond to the mask operand.
>
> My interpretation was that get.active.lane.mask allowed targets that do not
> have a concept of vector length (such as SVE or MVE) to represent it as a
> mask. For those targets, the vector length operand can be given a value that
> means "use the whole register" and then only the mask operand is relevant to
> them.
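
To make the "popcount" observation concrete, here is a small sketch in Python that models a predicate as a list of booleans. The helper names are made up for this illustration and are not LLVM APIs. Per the LangRef, lane i of get.active.lane.mask(%base, %n) is active when %base + i < %n, so (assuming %base < %n) the number of active lanes is min(VF, %n - %base).

```python
def get_active_lane_mask(base, n, vf):
    """Model of llvm.get.active.lane.mask: lane i is on iff base + i < n."""
    return [base + lane < n for lane in range(vf)]


def popcount(mask):
    """Number of active lanes in a mask."""
    return sum(mask)


# Last iteration of a 10-element loop with 4 lanes: only two lanes remain.
mask = get_active_lane_mask(base=8, n=10, vf=4)
evl = popcount(mask)
```

In this model `evl` is the value that a target with a vector length register would want as its %evl operand for the same iteration.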
For RISC-V V and VE being explicit about %evl is important for performance &
correctness and that is what VP does. The get.active.lane.mask intrinsic is used
as a hint for the MVE, SVE backends to use hardware tail-predication (the
backends reverse engineer that hint by pattern matching for get.active.lane.mask
in the mask parameter of "some" masked intrinsics). IMHO, it's
more of a hot fix to get some tail-predication working quickly with the existing
infrastructure. It is still useful by itself, eg the ExpandVPIntrinsic pass uses
it to expand the %evl parameter in VP intrinsics for scalable vector types.

> But maybe my interpretation is wrong.
>
> @Simon: what is VE going to do here?

VE uses VP-style SDNodes in the isel layer (upstream patch on Phabricator to
follow soon-ish). We simply translate both VP and regular SIMD SDNodes into
these custom SDNodes as an intermediate layer. Even the VE machine instructions
still have an explicit %evl operand. We have a machine function pass that
inserts code to re-configure the VL register in-between vector instructions that
have a different %evl value (we had a poster on that at the LLVM US DevMtg
'19). This isel strategy has been working well for us.
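
The VL re-configuration step can be pictured roughly as follows. This is only a sketch for straight-line code, not the VE backend's actual pass: the instruction encoding as (opcode, evl) pairs and the "set_vl" pseudo-instruction are assumptions made for the example, and a real pass would also have to deal with control flow.

```python
def insert_vl_config(instructions):
    """Insert a ("set_vl", evl) pseudo-instruction wherever %evl changes.

    `instructions` is a list of (opcode, evl) pairs in program order.
    """
    result = []
    current_vl = None  # contents of the VL register are unknown on entry
    for opcode, evl in instructions:
        if evl != current_vl:
            result.append(("set_vl", evl))
            current_vl = evl
        result.append((opcode, evl))
    return result
```

Consecutive instructions that share the same %evl get a single "set_vl" in front of the first one, which is the point of keeping %evl explicit until this late stage.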

The goal is to teach LV, VPlan to emit VP intrinsics with a convenient builder
class (VPBuilder in the reference patch).

- Simon


Kind regards,

On Thu, 5 Nov 2020 at 10:00, Sjoerd Meijer via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
Fold the epilog loop into the vector body.

  *   This is done by setting the vector length in each iteration. This induces
a predicate/mask over all the vector instructions of the loop (any other
predicates/masks in the vector body are needed for control flow).

That's what we do for Arm MVE using the get.active.lane.mask intrinsic (*),
which is emitted in the vectoriser. It generates a predicate that is used by
the masked loads/stores. That's the current state of the art; long term, that
should indeed be using the VP intrinsics. Just wanted to point you at
get.active.lane.mask, because it would also be nice to get confirmation that
this works not only for fixed vectors but also for scalable vectors, which I
think should be the case...

(*) https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics
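
As a rough model of folding the epilog into the vector body with such a predicate (plain Python lists stand in for vectors, and the function name is invented for this sketch), every iteration processes a full chunk of lanes and the mask switches off the lanes that run past the end:

```python
def masked_add(a, b, vf):
    """Add two equal-length lists in chunks of `vf` lanes, with no scalar tail."""
    n = len(a)
    out = [None] * n
    for base in range(0, n, vf):
        # Same rule as get.active.lane.mask(base, n): on iff base + lane < n.
        mask = [base + lane < n for lane in range(vf)]
        for lane, active in enumerate(mask):
            if active:  # masked load, add, masked store
                out[base + lane] = a[base + lane] + b[base + lane]
    return out
```

With five elements and four lanes the second iteration runs with only one active lane, so no separate scalar loop is needed.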

Cheers,
Sjoerd.
________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Vineet Kumar
via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 05 November 2020 01:36
To: Renato Golin <rengolin at gmail.com>
Cc: LLVM Dev <llvm-dev at lists.llvm.org>; ROGER FERRER IBANEZ
<roger.ferrer at bsc.es>; Arai, Masaki <arai.masaki at jp.fujitsu.com>
Subject: Re: [llvm-dev] Loop-vectorizer prototype for the EPI Project based on
the RISC-V Vector Extension (Scalable vectors)


Hi Renato,

Thanks a lot for your comments!

(more inline.)


Thanks and Regards,

Vineet


On 2020-11-02 5:43 p.m., Renato Golin wrote:
Hi Vineet,

Thanks for sharing! I haven't looked at the code yet, just read the README
file you have and it has already answered a lot of questions that I initially
had. Some general comments...

I'm very happy to see that Simon's predication changes were useful to
your work. It's a nice validation of their work and hopefully will help SVE,
too.
Simon's vector predication ideas fit really nicely with our approach to
predicated vectorization, especially the support for the EVL parameter. We look
forward to more discussions around it.

Your main approach to strip-mine + fuse tail loop is what I was going to propose
for now. It matches well with the bite-sized approach VPlan has and could build
on existing vector formats. For example, you always try to strip-mine (for
scalable and non-scalable) and then only for scalable, you try to fuse the
scalar loops, which would improve the solution and give RVV/SVE an edge over
the other extensions on the same hardware.
While our implemented approach with tail folding and predication is guided by
the research interests of the EPI project, I agree that for a more general
implementation your proposed approach for now makes more sense before moving on
to better predication support and exploring other approaches.

There were also in the past proposals to vectorise the tail loop, which could be
a similar step. For example, in case the main vector body is 8-way or 16-way,
the tail loop would be 7-way or 15-way, which is horribly inefficient. The idea
was to further vectorise the 7-way as 4+2+1 ways, same for 15. If those loops
are then unrolled, you end up with a nice scaling-down pattern. On scalable
vectors, this becomes a noop.
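
The 4+2+1 split is just the remainder written as a sum of decreasing powers of two. A minimal sketch (the function name is made up for illustration):

```python
def tail_widths(remainder):
    """Split a tail of `remainder` iterations into power-of-two vector widths."""
    widths = []
    width = 1
    while width * 2 <= remainder:  # largest power of two that still fits
        width *= 2
    while remainder > 0:
        if width <= remainder:
            widths.append(width)
            remainder -= width
        width //= 2
    return widths
```

For example, a tail of 7 iterations yields the widths 4, 2 and 1, and a tail of 15 yields 8, 4, 2 and 1.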

There is a separate thread on the vectorisation cost model [1] which talks
about some of the challenges there. I think we need to include scalable vectors
in consideration when thinking about it.
Agreed. It would be very useful to think about a scalable vectors aware
cost-model right from the beginning now that there is effort already underway to
integrate it into VPlan. There was also a discussion around it in the latest
SVE/SVE2 sync-up meeting and I think almost everyone was in agreement.

The NEON vs RISCV register shadowing is interesting. It is true we mostly
ignored 64-bit vectors in the vectoriser, but LLVM can still generate them with
the (SLP) region vectoriser. IIRC, support for that kind of aliasing is not
trivial (and why GCC's description of NEON registers sucked for so long),
but the motivation of register pressure inside hot loops is indeed important.
I'm adding Arai Masaki in CC as this is something he was working on.

Thanks for adding Arai! I will be happy to pick their brain on the topic.

One specific place where we have to deal with it is when computing a feasible
max VF. I am currently experimenting with an approach that has the user specify
(via a command line flag) a vector register width multiplier, i.e. the factor
by which the operating vector register width is a multiple of the minimum
vector register width. Based on that, we estimate the highest VF that won't
spill registers (this relies on TTI for information about the number of
registers in relation to register width). This is definitely not a generic
solution and probably not elegant either, but personally it serves as a
starting point to think about the broader issue.
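
One way to picture such an estimate is sketched below. Everything here is an assumption made for illustration and not the code in the EPI repository: the function, its parameters, and the rule of halving the multiplier until the live vector values fit are all hypothetical.

```python
def estimate_max_vf(min_reg_bits, widest_type_bits, num_regs, live_values, multiplier):
    """Return (k, multiplier) for a VF of `vscale x k` that should not spill.

    Grouping `multiplier` registers into one wider register leaves
    num_regs // multiplier usable groups; shrink the multiplier until the
    loop's live vector values fit into the available groups.
    """
    while multiplier > 1 and live_values > num_regs // multiplier:
        multiplier //= 2
    k = (min_reg_bits * multiplier) // widest_type_bits
    return k, multiplier
```

With 32 registers, six live values and a requested multiplier of 8, only four register groups would be available, so this sketch falls back to a multiplier of 4 before computing k.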

Otherwise, I think working with the current folks on VPlan and scalable
extensions will be a good way to upstreaming all the ideas you guys had in your
work.
That's the plan!

Thanks!
--renato

[1] http://lists.llvm.org/pipermail/llvm-dev/2020-October/146236.html



On Mon, 2 Nov 2020 at 15:52, Vineet Kumar via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

Hi all,

At the Barcelona Supercomputing Center, we have been working on an end-to-end
vectorizer using scalable vectors for the RISC-V Vector extension in the context
of the EPI Project (https://www.european-processor-initiative.eu/accelerator/).
We earlier shared a demo of our prototype implementation
(https://repo.hca.bsc.es/epic/z/9eYRIF, see below) with the folks involved with
LLVM SVE/SVE2 development. Since there was interest in looking at the source
code during the discussions in the subsequent LLVM SVE/SVE2 sync-up meetings, we
are also publishing a public copy of our repository.

It is available at https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi and will sync
with our ongoing development on a weekly basis. Note that this is very much a
work in progress and the code in this repository is for reference purposes only.
Please see the README
(https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi/-/blob/EPI/README.md)
in the repo for details on our approach, design decisions, and limitations.

We welcome any questions and feedback.


Thanks and Regards,
Vineet Kumar - vineet.kumar at bsc.es
Barcelona Supercomputing Center - Centro Nacional de Supercomputación




On 2020-07-29 3:10 a.m., Vineet Kumar wrote:

Hi all,

Following up on the discussion in the last meeting about auto-
vectorization for RISC-V Vector extension (scalable vectors) at the
Barcelona Supercomputing Center, here are some additional details.

We have a working prototype for end-to-end compilation targeting the
RISC-V Vector extension. The auto-vectorizer supports two strategies to
generate LLVM IR using scalable vectors:

1) Generate a vector loop using VF (vscale x k) = whole vector register
width, followed by a scalar tail loop.

2) Generate only a vector loop with active vector length controlled by
the RISC-V `vsetvli` instruction and using Vector Predicated intrinsics
(https://reviews.llvm.org/D57504). (Of course, intrinsics come with
their own limitations but we feel it serves as a good proof of concept
for our use case.) We also extend the VPlan to generate VPInstructions
that are expanded using predicated intrinsics.

We also considered a third hybrid approach of having a vector loop with
VF = whole register width, followed by a vector tail loop using
predicated intrinsics. For now though, based on project requirements,
we favoured the second approach.
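
The difference between the first two strategies can be sketched by looking only at how the trip count is consumed. The functions below are illustrative and use min(remaining, VLMAX) as a simplified stand-in for what `vsetvli` returns; the RVV specification leaves the hardware some freedom in choosing the vector length, which this sketch ignores.

```python
def scalar_epilog_schedule(n, vf):
    """Strategy 1: full-width vector iterations, then a scalar tail loop."""
    vector_iterations = n // vf
    scalar_iterations = n % vf
    return vector_iterations, scalar_iterations


def evl_schedule(n, vlmax):
    """Strategy 2: one vector loop; each iteration requests the remaining count."""
    lengths = []
    i = 0
    while i < n:
        vl = min(n - i, vlmax)  # simplified stand-in for vsetvli
        lengths.append(vl)
        i += vl
    return lengths
```

For 10 elements and 4 lanes, the first strategy runs two vector iterations plus two scalar ones, while the second runs three vector iterations with lengths 4, 4 and 2.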

We have also taken care to not break any fixed-vector implementation.
All the scalable vector IR gen is guarded by conditions set by TTI.

For shuffles, the most used case is broadcast which is supported by the
current semantics of `shufflevector` instruction. For other cases like
reverse, concat, etc., we have defined our own intrinsics.

Current limitations:
The cost model for scalable vectors doesn't do much other than always
deciding to vectorize with VF based on TargetWidestType/SmallestType.
We also do not support interleaving yet.

Demo:
The current implementation is very much in alpha and eventually, once
it's more polished and thoroughly verified, we will put out patches on
Phabricator. Till then, we have set up a Compiler Explorer server
against our development branch to showcase the generated code.

You can see and experiment with the generated LLVM IR and VPlan for a
set of examples, with predicated vector loop
(`-mprefer-predicate-over-epilog`) at https://repo.hca.bsc.es/epic/z/JB4ZoJ
and with a scalar epilog (`-mno-prefer-predicate-over-epilog`) at
https://repo.hca.bsc.es/epic/z/0WoDGt.
Note that you can remove the `-emit-llvm` option to see the generated
RISC-V assembly.

We welcome any questions and feedback.

Thanks and Regards,
Vineet Kumar - vineet.kumar at bsc.es
Barcelona Supercomputing Center - Centro Nacional de Supercomputación





WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain information which
is privileged, confidential, proprietary, or exempt from disclosure under
applicable law. If you are not the intended recipient or the person responsible
for delivering the message to the intended recipient, you are strictly
prohibited from disclosing, distributing, copying, or in any way using this
message. If you have received this communication in error, please notify the
sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



--
Roger Ferrer Ibáñez

Roger Ferrer Ibáñez via llvm-dev

2020-Nov-06 07:49 UTC

[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

Hi Sjoerd,

> Trying to remember how everything fits together here, but could
> get.active.lane.mask not create the %mask of the VP intrinsics? Or in other
> words, in the vectoriser, who's producing the %mask and %evl that is
> consumed by the VP intrinsics?

I'm not sure what would be the best way here. I think about the
LoopVectorizer. I imagine at some point we can teach LV to emit VPred for the
widening. VPred IR needs two additional operands, as you mentioned, %evl
and %mask.

One option is to make %evl the max-vector-length of the type being operated on
and %mask (that is the "outer block mask" in this context) be
get.active.lane.mask. This maps well to SVE and MVE, but not so well to VE and
RISC-V (I don't think it is incorrect but it is not an efficient thing to
do).  Perhaps VE and RISC-V can work in this scenario if at some point they
replace the %evl with something like the "%n - %base" operands of
get.active.lane.mask, and %mask (the outer block mask) is replaced with a
splat of "i1 1".

Another option here is to make "%n - %base" be the %evl (or at least an
operand of some target hook because "computing" the %evl is
target-specific; targets without evl could compute the identity here) and
%mask (the outer block mask) be a splat of "i1 1". This maps well to VE and
RISC-V but makes life harder for AVX-512, SVE and MVE (in general any
target where TargetTransformInfo::hasActiveVectorLength returns false).
Those targets could replace the %evl with the max-vector-length of the
operated type and then use get.active.lane.mask(0, %evl) as the outer block
mask. My understanding is that Simon used this approach in
https://reviews.llvm.org/D78203 but in a more general setting, that would
be independent of what Loop Vectorizer does.

Looks to me the second option makes a more effective use of vpred and
D78203 shows that we can always soften vpred into a shape that is
reasonable for lowering in targets without active vector length.
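
A small model of the two options may help to see why either one can be turned into the other. It is a sketch under the assumption that %base < %n, with masks as boolean lists and function names invented for the example; the last helper encodes the VP rule that a lane takes part only if its mask bit is set and its index is below %evl.

```python
def option_mask_only(base, n, vf):
    """Option 1: %evl = full vector length, %mask = get.active.lane.mask(base, n)."""
    evl = vf
    mask = [base + lane < n for lane in range(vf)]
    return evl, mask


def option_evl_only(base, n, vf):
    """Option 2: %evl = min(vf, n - base), %mask = splat(i1 1)."""
    evl = min(vf, n - base)
    mask = [True] * vf
    return evl, mask


def enabled_lanes(evl, mask):
    """A VP operation acts on lane i only if mask[i] is set and i < evl."""
    return [active and lane < evl for lane, active in enumerate(mask)]
```

Both options enable exactly the same lanes in every iteration, so the choice is about which form is cheaper for a given target rather than about semantics.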

Thoughts?

Kind regards,
-- 
Roger Ferrer Ibáñez

Simon Moll via llvm-dev

2020-Nov-06 10:07 UTC

[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

On 11/6/20 8:49 AM, Roger Ferrer Ibáñez wrote:
> Hi Sjoerd,
>
>> Trying to remember how everything fits together here, but could
>> get.active.lane.mask not create the %mask of the VP intrinsics? Or in other
>> words, in the vectoriser, who's producing the %mask and %evl that is
>> consumed by the VP intrinsics?
>
> I'm not sure what would be the best way here. I think about the Loop
> Vectorizer. I imagine at some point we can teach LV to emit VPred for the
> widening. VPred IR needs two additional operands, as you mentioned, %evl and
> %mask.
>
> One option is to make %evl the max-vector-length of the type being operated on
> and %mask (that is the "outer block mask" in this context) be
> get.active.lane.mask. This maps well to SVE and MVE, but not so well to VE and
> RISC-V (I don't think it is incorrect but it is not an efficient thing to
> do).  Perhaps VE and RISC-V can work in this scenario if at some point they
> replace the %evl with something like the "%n - %base" operands of
> get.active.lane.mask, and %mask (the outer block mask) is replaced with a
> splat of "i1 1".

Basically, we would extend TTI to let the targets choose how to use the %mask
and %evl operands in the VP intrinsics. So, an 'fadd' would turn into an
'llvm.vp.fadd' for all predicating targets. However, whether
get.active.lane.mask() is used for %mask or whether tail predication is done
with a (splat i1 1) for the mask and setting %evl would be target dependent.

> Another option here is to make "%n - %base" be the %evl (or at least an
> operand of some target hook because "computing" the %evl is
> target-specific; targets without evl could compute the identity here) and
> %mask (the outer block mask) be a splat of "i1 1". This maps well to VE and
> RISC-V but makes life harder for AVX-512, SVE and MVE (in general any target
> where TargetTransformInfo::hasActiveVectorLength returns false). Those
> targets could replace the %evl with the max-vector-length of the operated
> type and then use get.active.lane.mask(0, %evl) as the outer block mask. My
> understanding is that Simon used this approach in
> https://reviews.llvm.org/D78203 but in a more general setting, that would be
> independent of what Loop Vectorizer does.

For VE, we set %evl = min(max_vector_width, %n - %base) .. that's the same
idiom that the non-LLVM NEC compilers are emitting for tail predication.
Basically, the LV flow could look something like this:


  ; Call the target hook to let the target select %mask and %evl params for
  ; the loop header
  %evl, %mask <- IRBuilder.createIterationPredicate(%i, %n, TTI)

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE:
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;   %mask = icmp (%i + (seq <8 x i32> 0,1,2,...)), %n
  ;   %evl = i32 8

  ; Configure the Vector Predication builder to use those
  VPBuilder
      .setExplicitVectorLength(%evl)
      .setMask(%mask);

  ; Start building vector-predicated instructions
  VPBuilder.createFadd(%x, %y)    ; --> call @llvm.vp.fadd(%x, %y, %mask, %evl)
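
The flow above can be mimicked in a few lines of Python. This is a toy model, not LLVM code: the hook, the target names and the builder class are hypothetical stand-ins, only two of the example cases are covered, the lane counts are taken from the example comments, and disabled lanes are represented as None.

```python
def create_iteration_predicate(i, n, target):
    """Hypothetical hook: pick (evl, mask) for the iteration starting at index i."""
    if target in ("riscv-v", "ve"):
        width = 256  # lane count used in the example above
        return min(width, n - i), [True] * width
    if target == "avx":
        width = 8
        return width, [i + lane < n for lane in range(width)]
    raise ValueError("unknown target: " + target)


class VPBuilder:
    """Toy builder that remembers %evl and %mask for every operation it emits."""

    def __init__(self):
        self.evl = None
        self.mask = None

    def set_explicit_vector_length(self, evl):
        self.evl = evl
        return self  # allow chaining, as in the pseudocode

    def set_mask(self, mask):
        self.mask = mask
        return self

    def create_fadd(self, x, y):
        # Disabled lanes produce None, standing in for an undefined result.
        return [
            a + b if self.mask[lane] and lane < self.evl else None
            for lane, (a, b) in enumerate(zip(x, y))
        ]
```

The code that emits the fadd never needs to know which of the two operands carries the tail predicate; only the hook is target specific.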


> Looks to me the second option makes a more effective use of vpred and D78203
> shows that we can always soften vpred into a shape that is reasonable for
> lowering in targets without active vector length.

The whole point about VP is to make sure there is one set of vector-predicated
instructions/intrinsics that everybody is using while giving people the freedom
to use these as it fits their targets. We can then concentrate on optimizing VP
intrinsic code and all targets benefit.

- Simon

*: VE's packed mode (512 x 32bit elements) is a use case for a non-trivial
setting of %mask and %evl at the same time (%evl for packs of two 32bit elements
(ie %evl must be even for 32bit lanes), %mask for masking out inside packages).
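
One possible reading of the packed-mode case, sketched in Python: %evl is rounded up to a whole number of two-element packs and %mask disables the unused half of the last pack. The function is purely illustrative and ignores the upper bound on the vector length; it is an assumption about how the two parameters would be combined, not a description of the VE backend.

```python
def packed_mode_predicate(remaining):
    """Return (evl, mask) covering `remaining` 32-bit elements in packed mode.

    %evl is rounded up to a whole number of two-element packs, so it is
    always even; %mask switches off the unused half of the last pack.
    """
    evl = remaining + (remaining % 2)
    mask = [lane < remaining for lane in range(evl)]
    return evl, mask
```

With five remaining elements this gives an %evl of 6 and a mask whose last lane is off.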



> Thoughts?
>
> Kind regards,
> --
> Roger Ferrer Ibáñez
