thr3ads.net - llvm dev - [llvm-dev] Tail-Loop Folding/Predication [Jul 2019]

Sjoerd Meijer via llvm-dev

2019-Jul-15 14:45 UTC

[llvm-dev] Tail-Loop Folding/Predication

I am looking for feedback on adding support for a new loop pragma to Clang/LLVM.
With "#pragma tail_predicate" the idea would be to indicate that a loop
epilogue/tail can, or should be, folded into the main loop. I see two use
cases for this pragma.

First, this could be interesting for the vectorizer. It currently supports tail
folding by masking all loop instructions/blocks, but does this only when
optimising for size is enabled. This pragma could override the
cost-model/opt-level.
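
To make the first use case concrete, here is a minimal sketch in plain C (not
actual vectorizer output) of the two loop shapes under discussion. The LANES
constant and both function names are made up for illustration, and the inner
per-lane loop only stands in for a single vector instruction.

```c
#include <stddef.h>

#define LANES 4  /* hypothetical vector width, in elements */

/* Shape 1: vector main loop followed by a scalar tail (epilogue) loop. */
void add_one_with_tail(int *dst, const int *src, size_t n) {
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)        /* full vectors only */
        for (size_t l = 0; l < LANES; ++l)    /* stands in for one vector op */
            dst[i + l] = src[i + l] + 1;
    for (; i < n; ++i)                        /* leftover n % LANES elements */
        dst[i] = src[i] + 1;
}

/* Shape 2: tail folded into the main loop; every lane is guarded by a mask. */
void add_one_folded(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; ++l)
            if (i + l < n)                    /* lane predicate */
                dst[i + l] = src[i + l] + 1;
}
```

Both functions write the same results; the folded form has a single loop body
at the cost of evaluating a predicate for every lane in every iteration.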

Second use case would be the Armv8.1-M MVE vector extension, which supports
tail-predicated hardware loops. This version of hardware loops sets the vector
lanes to be masked, and is thus a nice optimisation that avoids generating a
tail loop when the number of elements processed is not a multiple of the vector
length.

For this use case, the tail predicate pragma could be a good user-experience
improvement, as it would, for example, allow this more compact form without
any predicated intrinsics:

  #pragma tail_predicate
  do {
    VLD(..);   // some vector load intrinsic
    VST(..);   // some vector store intrinsic
    ..
  } while (N);

which can then be transformed and predication made explicit through data
dependencies like so:

  do {
    mask = vctp(N);   // intrinsic that generates the mask of active lanes
    VLD(.., mask);
    VST(.., mask);
    ..
  } while (N);

A vector loop in this form can easily be picked up by the new hardware loop
pass, and the corresponding tail-predicated hardware loop can be generated.
This is only a small example, but we think the benefit could be substantial
for more complicated examples.
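
The transformed loop above can be emulated in portable C to show what the mask
does. This is only a sketch under stated assumptions: vctp_emul, LANES and
copy_predicated are hypothetical names, the per-lane loop stands in for one
predicated vector load/store pair, and N is assumed to count the remaining
elements and to be decremented by the vector length on each iteration.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* hypothetical: four 32-bit lanes per vector */

/* Emulates a vctp-style intrinsic: bit l is set when lane l is active,
   i.e. when at least l + 1 elements remain. */
static unsigned vctp_emul(size_t remaining) {
    size_t active = remaining < LANES ? remaining : LANES;
    return (1u << active) - 1u;
}

/* Copies n elements using only "predicated" loads and stores. */
void copy_predicated(int32_t *dst, const int32_t *src, size_t n) {
    while (n > 0) {
        unsigned mask = vctp_emul(n);         /* mask = vctp(N) */
        for (size_t l = 0; l < LANES; ++l)    /* VLD(.., mask); VST(.., mask) */
            if (mask & (1u << l))
                dst[l] = src[l];
        size_t step = n < LANES ? n : LANES;  /* elements actually handled */
        dst += step;
        src += step;
        n -= step;                            /* N counts remaining elements */
    }
}
```

On the final iteration the mask disables the lanes past the end of the data,
so no separate tail loop is needed and nothing is read or written out of
bounds.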

I have uploaded a patch for the initial Clang plumbing exercise here:
https://reviews.llvm.org/D64744

Cheers,
Sjoerd.

Scott Manley via llvm-dev

2019-Jul-15 15:42 UTC

By "folded into the main loop", do you actually mean replace the main loop?
So, in effect, running the entire loop under predicate so there is only one
loop body?

If so, I think that will be a useful pragma in general, but in my opinion,
the name is not appropriate since it won't have anything to do with the
tail other than how this is accomplished at the moment. Is your thinking
that the front end would generate the mask calculation, or are you just
leveraging the existing fold tail by masking and removing the original
vectorized loop body?

I think the proper implementation should really be to generate the
predicated instructions in the first place (I'd like to also see actual
predicates on the instructions instead of selects, but that is another
thread), so I think #pragma loop vectorize(enable) predicated(enable) (or
something like that) seems a better choice. This would also allow you to
disable loops run under predicate if the cost model in LLVM (or downstream)
in the future thinks it's best to generate this type of loop and performance
numbers suggest otherwise.
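
For illustration, a loop annotated along these lines might look as follows.
The spelling suggested above was only a proposal at this point; the sketch
assumes the spelling vectorize_predicate(enable) under "#pragma clang loop",
which to my knowledge is what recent Clang releases accept, so check it
against the documentation for your compiler version. The function name scale
is made up for the example.

```c
#include <stddef.h>

/* Scales n floats in place. The pragmas are hints only; a compiler other
   than Clang skips them and compiles the plain scalar loop. */
void scale(float *data, size_t n, float factor) {
#if defined(__clang__)
#pragma clang loop vectorize(enable)
/* Assumed spelling of the predication hint; verify for your Clang version. */
#pragma clang loop vectorize_predicate(enable)
#endif
    for (size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```

Using disable instead of enable would express the opposite request, which is
the case described above where a cost model prefers a predicated loop but
measurements suggest otherwise.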


Sjoerd Meijer via llvm-dev

2019-Jul-16 09:02 UTC

Hi Scott,

Yes, I meant exactly that:

> So, in effect, running the entire loop under predicate so there is only
> one loop body?

We see having one loop body as the easiest and also best way to detect and
support these so-called "tail-predicated hardware loops". This solves
the pass ordering problem of vectorizing and creating hardware loops, and
keeps the loop in an easier-to-analyse form.

> ..., or are you just leveraging the existing fold tail by masking and
> removing the original vectorized loop body?

It looks like we can leverage the vectoriser as it is already capable of
folding the tail and predicating the instructions, but we just need to steer
the decision making with e.g. this new pragma. For example, this decision
making and transformation happens in
LoopVectorizationLegality::canFoldTailByMasking().

My suggestion of one generic pragma for both cases (the vectorizer, and
tail-predicated hardware loops) was probably wrong. My reason for suggesting
one generic pragma, however, was that I thought people would be a lot less
interested in an ARM MVE specific pragma; I thought I would increase my
chances with a generic one, but that didn't seem to work. ;-) I actually
agree that a generic one like this:

    #pragma tail_predicate

doesn't seem a good fit for the vectorizer, which should indeed be something
like this:

   #pragma loop vectorize(enable) predicated(enable)

but this one doesn't seem a great fit for my example, i.e. I see that it
would probably also work for my use case, but a vectorize pragma is probably
not a great fit here. So that suggests two pragmas would be best.

It looks like people like the extra "predicated(enable)" hint as part of the
vectorize pragma, so I will implement that first. This also allows me to
continue prototyping the actual transformation (rewriting unpredicated
intrinsics to predicated ones), which might perhaps be a bit of an odd one,
but I still think that could be convenient for users. I can always propose
another pragma once I have more experience with my transformation.

Thanks,
Sjoerd.

