thr3ads.net - llvm dev - [llvm-dev] Vectorizing multiple exit loops [Sep 2019]

If this information is useful, please help other people find it:
Share via:

Philip Reames via llvm-dev

2019-Sep-09 17:53 UTC

[llvm-dev] Vectorizing multiple exit loops

I've recently mentioned in a few places that I'm interested in enhancing
the loop vectorizer to handle multiple exit loops, and have been asked 
to share plans.  This email is intended to a) share my current thinking 
and b) help spark discussion among interested parties.  I do need to 
warn that my near term plans for this have been delayed; I got pulled 
into an internal project instead.

*Background*

At the moment, our loop vectorizer does not handle any loop with more 
than one exit, or where that sole exit is not from the latch block.  
Interestingly, it does handle internal predication within a single loop 
iteration.  This results in a slightly odd behavior where a loop which 
can be written with either a continue or a break can exhibit wildly 
different performance depending on that choice.  It does hint at a 
possible direction for implementation though, and implies that most of 
the cost modeling pieces are already in place.

The use cases I'm looking at basically fall into two buckets:

for (int i = 0; i < N; i++) {
   if (expr(a[i])) break;
   ... other vectorizable stuff ...
}

for (int i = 0; i < N; i++) {
   if (expr(i)) break;
   ... other vectorizable stuff ...
}

The former are the actually "interesting" examples.  The later are
cases
where we missed eliminating some check we could have, but 
not-vectorizing creates an unfortunate performance cliff.

*Framing*

We have three important sub-cases to handle.

First, there are all the cases where we could have handled the multiple 
exit loop, but chose not to for implementation reasons. A great example is:

for (int i = 0; i < N; i++) {
   if (i > M) break;
   a[i] = i;
}

In this case, SCEV can happily compute the actual exit bound of the 
loop, and we could use the existing vector-body/scalar slow-path 
structure.  The only change needed would be to exit the vector body 
earlier.  (There are some details here, but it's almost as easy as I'm 
describing if my POC patch isn't missing something major.)

There's a couple other cases like this.  I suspect we can get decent 
mileage out of just generalizing the existing code.


Second, there are the cases where we actually have to handle iterations 
within the vector-body with predication (or speculation).  The good news 
is that there's already support in code for predication, we just need to 
add another source of potential predication.  The bad news is that's a 
fairly major change.

Our challenge will be finding a runtime check for dereferenceability.  
Consider the following example:
for (int i = 0; i < N; i++)
   if (a[i] == 0) return false;
return true;

If 'a' is an alloca, or statically sized global, we can form a runtime 
check which ensures 'i' is in bounds and a speculative load will not
fault.

Here's a nice example we could handle with this approach.

// Assume a and b are both statically sized.
for (int i = 0; i < N; i++) {
   t = b[i];
   if (t > M) throw();
   sum += a[t];
}

(This is a classic a[b[i]] pattern, but with range checks added.)

This is broadly applicable enough to be useful, and in practice covers 
my use cases, but I'm hoping others have ideas for extensions here.

Before resorting to that though, we have the potential to rely more on 
speculation safety reasoning.  I have a patch out for review currently 
(D66688 [LoopVectorize] Leverage speculation safety to avoid 
masked.loads <https://reviews.llvm.org/D66688>) which fell out of some 
prototyping in this area; it benefits existing predication logic, so I 
separated it.

The other major challenge here is profitability.  Consider a simple loop 
such as:

// assume a[0:N] is known dereferenceable
for (int i = 0; i < N; i++)
   if (a[i] == 0) return false;
return true;

If N is large, and the array is non-zero, then this is profitable to 
vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.

Figuring out when to vectorize vs not for cases like this will require 
some thought.  I don't really have a good answer for this yet, other 
than when the profile on the early exit tells us it's rarely taken.


Third, for both of the former cases, we need to be able to compute exit 
values along the early exit edges.  We can get a lot of mileage out of 
simply skipping loops with exit values (i.e. lcssa phis), as our loop 
exit value rewriting will tend to eliminate them.  However, we will 
eventually want to handle this case, which will require generating some 
interesting complicated code down the exit path to figure out which 
iteration actually exit.

I see two general options here:

1) Use the vector-body/scalar-body idiom of today, and have the vector 
body exit with IV = I when any iteration in [I, I+VF-1] would exit.  
(i.e. roll back)

2) Insert dedicated exit blocks which recompute exit conditions to 
determine exact exit value, and then let the vector body run all 
iterations in VF which contain the exiting iteration.  (Assumes 
predicated stores, and that the exit blocks skip the scalar loop)

I currently don't have a reason to strongly prefer one over the other.  
(2) is conceptually cleaner and the one which keeps coming up in 
conversation, but (1) may be simpler to actually implement.


*Current Plans*

At the current moment, I'm reasonable sure that I'm going to get the 
resources to at least tackle some of the cases where we bail out 
unnecessarily.  This will be a huge practical improvement in vectorizing 
robustness, at (I hope) relatively little implementation cost.

I'm also going to continue playing around with enhancements to our 
current dereferenceability logic.  I see that as being a key building 
block to make any predication based approach practical.

I'm not sure I'm going to get to the predication support.  I'd like
to,
but am not sure my resource constraints allow it.  I'll also mention 
that I'm not at all sure how all of this might fit in with the VPLAN 
work.  I'd really welcome feedback on that; is what I'm proposing at all
consistent with others plans?


Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190909/1da080f1/attachment.html>

Renato Golin via llvm-dev

2019-Sep-17 10:17 UTC

head link

[llvm-dev] Vectorizing multiple exit loops

Hi Philip,

Apologies for leaving this thread linger so long. It was in my
back-burner but Alex's weekly remind me to reply (thanks again,
Alex!). Starting from the end...

On Mon, 9 Sep 2019 at 18:53, Philip Reames via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> Current Plans
>
> At the current moment, I'm reasonable sure that I'm going to get
the resources to at least tackle some of the cases where we bail out
unnecessarily.  This will be a huge practical improvement in vectorizing
robustness, at (I hope) relatively little implementation cost.
This makes a lot of sense to me. It doesn't sound it will need a huge
refactoring to the current code and it can be done in a piece-wise way
so that progress is consistent and non-regressable.

It would be good to know if any of the test-suite benchmarks will be
affected, or if we could insert a new one there, just to keep track of
things.
> I'm also going to continue playing around with enhancements to our
current dereferenceability logic.  I see that as being a key building block to
make any predication based approach practical.
This is *always* good work. There has been some work in the past to
common up LV with VPLan analysis stages (separating classes, moving
code), and I think if we keep working on the common infrastructure,
we'll benefit more than just LV or VPlan.

> I'm not sure I'm going to get to the predication support.  I'd
like to, but am not sure my resource constraints allow it.  I'll also
mention that I'm not at all sure how all of this might fit in with the VPLAN
work.  I'd really welcome feedback on that; is what I'm proposing at all
consistent with others plans?
As I said, the analysis work can play well with VPlan, and if it
doesn't, we can make it work by following the same cleanups that has
been done in the past. These are complementary parts of the work and
it's great you can spend some time on some of it.


> The use cases I'm looking at basically fall into two buckets:
>
> for (int i = 0; i < N; i++) {
>   if (expr(a[i])) break;
>   ... other vectorizable stuff ...
> }
>
> for (int i = 0; i < N; i++) {
>   if (expr(i)) break;
>   ... other vectorizable stuff ...
> }
The safety analysis is different, but the vectorizer would have to
behave similarly on both cases. Specilative loads, boundary
conditions, run time checks (unless the bounds are compile time
constants?).

Doing the latter first would make the second almost *only* about
reference analysis.

> First, there are all the cases where we could have handled the multiple
exit loop, but chose not to for implementation reasons.  A great example is:
>
> for (int i = 0; i < N; i++) {
>   if (i > M) break;
>   a[i] = i;
> }
>
> In this case, SCEV can happily compute the actual exit bound of the loop,
and we could use the existing vector-body/scalar slow-path structure.  The only
change needed would be to exit the vector body earlier.  (There are some details
here, but it's almost as easy as I'm describing if my POC patch
isn't missing something major.)
If M < N, then this could potentially be converted to iterate over M
and make the loop unconditional, no? I know this is a silly case, but
if a more complex loop simplifies to this one by other optimisations,
this would be a very easy change to make.

We could also look at it from a loop-splitting point of view, for
example the common pattern:

for (i: M) {
  if (a == 0)
    startup(i);
  else
    remainder(i);
}

Or more complicated examples, like matrix operations that are "easy"
in the bulk of it, but complicated (non-vectorizeable) at the edges.
This would probably be something for the VPlan side of things, but
it's good to keep in mind that a simple example can get very
interesting. :)

> // Assume a and b are both statically sized.
> for (int i = 0; i < N; i++) {
>   t = b[i];
>   if (t > M) throw();
>   sum += a[t];
> }
>
> (This is a classic a[b[i]] pattern, but with range checks added.)
This looks like a clear pattern for predicated semantics. :)

In the absence of that, if the condition is simple, we can try
splitting or tail loops, but only predication would give us an extra
mile on preventing speculative loads to crash.

> // assume a[0:N] is known dereferenceable
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) return false;
> return true;
>
> If N is large, and the array is non-zero, then this is profitable to
vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.
>
> Figuring out when to vectorize vs not for cases like this will require some
thought.  I don't really have a good answer for this yet, other than when
the profile on the early exit tells us it's rarely taken.
We can already bail on small constant N, and generally, code that
assumes it's small, usually use it as a constant (like RGB or xyz
handling). In those cases, it may also be profitable to do strided
vectorisation (which we do).

> Third, for both of the former cases, we need to be able to compute exit
values along the early exit edges.  We can get a lot of mileage out of simply
skipping loops with exit values (i.e. lcssa phis), as our loop exit value
rewriting will tend to eliminate them.  However, we will eventually want to
handle this case, which will require generating some interesting complicated
code down the exit path to figure out which iteration actually exit.
I think getting a real case in the test-suite and make that work would
make a lot of people happy. :)

> 1) Use the vector-body/scalar-body idiom of today, and have the vector body
exit with IV = I when any iteration in [I, I+VF-1] would exit.  (i.e. roll back)
>
> 2) Insert dedicated exit blocks which recompute exit conditions to
determine exact exit value, and then let the vector body run all iterations in
VF which contain the exiting iteration.  (Assumes predicated stores, and that
the exit blocks skip the scalar loop)
The second option sounds easier to reason with, but it could also
generate more clutter for other passes to get lost in. I don't have a
strong opinion either, but I would like to limit the damage of the
current vectoriser (to other passes, or itself), if we are to move to
VPlan any time soon.

Thanks for working on this, I think a lot of good stuff will come out of it! :)

cheers,
--renato

Philip Reames via llvm-dev

2019-Sep-18 23:19 UTC

head link

[llvm-dev] Vectorizing multiple exit loops

On 9/17/2019 3:17 AM, Renato Golin wrote:> Hi Philip,
>
> Apologies for leaving this thread linger so long. It was in my
> back-burner but Alex's weekly remind me to reply (thanks again,
> Alex!). Starting from the end...
>
> On Mon, 9 Sep 2019 at 18:53, Philip Reames via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Current Plans
>>
>> At the current moment, I'm reasonable sure that I'm going to
get the resources to at least tackle some of the cases where we bail out
unnecessarily.  This will be a huge practical improvement in vectorizing
robustness, at (I hope) relatively little implementation cost.
> This makes a lot of sense to me. It doesn't sound it will need a huge
> refactoring to the current code and it can be done in a piece-wise way
> so that progress is consistent and non-regressable.
>
> It would be good to know if any of the test-suite benchmarks will be
> affected, or if we could insert a new one there, just to keep track of
> things.
>
>> I'm also going to continue playing around with enhancements to our
current dereferenceability logic.  I see that as being a key building block to
make any predication based approach practical.
> This is *always* good work. There has been some work in the past to
> common up LV with VPLan analysis stages (separating classes, moving
> code), and I think if we keep working on the common infrastructure,
> we'll benefit more than just LV or VPlan.
>
>
>> I'm not sure I'm going to get to the predication support. 
I'd like to, but am not sure my resource constraints allow it.  I'll
also mention that I'm not at all sure how all of this might fit in with the
VPLAN work.  I'd really welcome feedback on that; is what I'm proposing
at all consistent with others plans?
> As I said, the analysis work can play well with VPlan, and if it
> doesn't, we can make it work by following the same cleanups that has
> been done in the past. These are complementary parts of the work and
> it's great you can spend some time on some of it.
>
>
>
>> The use cases I'm looking at basically fall into two buckets:
>>
>> for (int i = 0; i < N; i++) {
>>    if (expr(a[i])) break;
>>    ... other vectorizable stuff ...
>> }
>>
>> for (int i = 0; i < N; i++) {
>>    if (expr(i)) break;
>>    ... other vectorizable stuff ...
>> }
> The safety analysis is different, but the vectorizer would have to
> behave similarly on both cases. Specilative loads, boundary
> conditions, run time checks (unless the bounds are compile time
> constants?).
>
> Doing the latter first would make the second almost *only* about
> reference analysis.
>
>
>> First, there are all the cases where we could have handled the multiple
exit loop, but chose not to for implementation reasons.  A great example is:
>>
>> for (int i = 0; i < N; i++) {
>>    if (i > M) break;
>>    a[i] = i;
>> }
>>
>> In this case, SCEV can happily compute the actual exit bound of the
loop, and we could use the existing vector-body/scalar slow-path structure.  The
only change needed would be to exit the vector body earlier.  (There are some
details here, but it's almost as easy as I'm describing if my POC patch
isn't missing something major.)
> If M < N, then this could potentially be converted to iterate over M
> and make the loop unconditional, no? I know this is a silly case, but
> if a more complex loop simplifies to this one by other optimisations,
> this would be a very easy change to make.If M < N (provably), SCEV would return an exit count for the loop which 
reflects this.  If not, we'd get umin(M,N).  We can still generate the 
vector body at the cost of inserting the umin computation above the loop 
body.  Both cases can be handled by running the vector loop up to the 
minimum trip count (well, one less to handle mid-loop exits and side 
effects).  The M < N case is falls out of the more general
one.>
> We could also look at it from a loop-splitting point of view, for
> example the common pattern:
>
> for (i: M) {
>    if (a == 0)
>      startup(i);
>    else
>      remainder(i);
> }Iteration spaces splitting is a separate optimization.  IRCE implements 
a form of this.>
> Or more complicated examples, like matrix operations that are
"easy"
> in the bulk of it, but complicated (non-vectorizeable) at the edges.
> This would probably be something for the VPlan side of things, but
> it's good to keep in mind that a simple example can get very
> interesting. :)I don't know of any plans to incorporate iteration space splitting 
w/VPLan.  I don't have any plans to go that far; if I had such test 
cases - I don't - I'd want to start with separately factored transforms 
if we could.  Doing everything within one transform is undesirable. 
:)>
>
>> // Assume a and b are both statically sized.
>> for (int i = 0; i < N; i++) {
>>    t = b[i];
>>    if (t > M) throw();
>>    sum += a[t];
>> }
>>
>> (This is a classic a[b[i]] pattern, but with range checks added.)
> This looks like a clear pattern for predicated semantics. :)
>
> In the absence of that, if the condition is simple, we can try
> splitting or tail loops, but only predication would give us an extra
> mile on preventing speculative loads to crash.
>
>
>> // assume a[0:N] is known dereferenceable
>> for (int i = 0; i < N; i++)
>>    if (a[i] == 0) return false;
>> return true;
>>
>> If N is large, and the array is non-zero, then this is profitable to
vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.
>>
>> Figuring out when to vectorize vs not for cases like this will require
some thought.  I don't really have a good answer for this yet, other than
when the profile on the early exit tells us it's rarely taken.
> We can already bail on small constant N, and generally, code that
> assumes it's small, usually use it as a constant (like RGB or xyz
> handling). In those cases, it may also be profitable to do strided
> vectorisation (which we do).
>
>
>> Third, for both of the former cases, we need to be able to compute exit
values along the early exit edges.  We can get a lot of mileage out of simply
skipping loops with exit values (i.e. lcssa phis), as our loop exit value
rewriting will tend to eliminate them.  However, we will eventually want to
handle this case, which will require generating some interesting complicated
code down the exit path to figure out which iteration actually exit.
> I think getting a real case in the test-suite and make that work would
> make a lot of people happy. :)I wish.  Unfortunately, my "real test case" is a java benchmark.   I 
doubt I'll be able to get it into the test suite. 
:(>
>
>> 1) Use the vector-body/scalar-body idiom of today, and have the vector
body exit with IV = I when any iteration in [I, I+VF-1] would exit.  (i.e. roll
back)
>>
>> 2) Insert dedicated exit blocks which recompute exit conditions to
determine exact exit value, and then let the vector body run all iterations in
VF which contain the exiting iteration.  (Assumes predicated stores, and that
the exit blocks skip the scalar loop)
> The second option sounds easier to reason with, but it could also
> generate more clutter for other passes to get lost in. I don't have a
> strong opinion either, but I would like to limit the damage of the
> current vectoriser (to other passes, or itself), if we are to move to
> VPlan any time soon.
>
> Thanks for working on this, I think a lot of good stuff will come out of
it! :)
>
> cheers,
> --renato

Philip Reames via llvm-dev

2021-Jan-11 20:30 UTC

head link

[llvm-dev] Vectorizing multiple exit loops

Responding to this old thread to let interested parties know there's 
been some progress on this (finally).

The first sub-item described below - multiple exit loops with computable 
trip counts - is in progress, and will likely be wrapped up in the not 
too distant future.  The first major change (4b33b2387) landed two weeks 
ago, two smaller changes are on review (D93725, and D93865), and there's 
likely only one major patch needed after that.

To my knowledge, there's been no progress on the second item and I'm not
anticipating any in the near future.

Philip

On 9/9/19 10:53 AM, Philip Reames wrote:>
> I've recently mentioned in a few places that I'm interested in 
> enhancing the loop vectorizer to handle multiple exit loops, and have 
> been asked to share plans.  This email is intended to a) share my 
> current thinking and b) help spark discussion among interested 
> parties.  I do need to warn that my near term plans for this have been 
> delayed; I got pulled into an internal project instead.
>
> *Background*
>
> At the moment, our loop vectorizer does not handle any loop with more 
> than one exit, or where that sole exit is not from the latch block.  
> Interestingly, it does handle internal predication within a single 
> loop iteration.  This results in a slightly odd behavior where a loop 
> which can be written with either a continue or a break can exhibit 
> wildly different performance depending on that choice.  It does hint 
> at a possible direction for implementation though, and implies that 
> most of the cost modeling pieces are already in place.
>
> The use cases I'm looking at basically fall into two buckets:
>
> for (int i = 0; i < N; i++) {
>   if (expr(a[i])) break;
>   ... other vectorizable stuff ...
> }
>
> for (int i = 0; i < N; i++) {
>   if (expr(i)) break;
>   ... other vectorizable stuff ...
> }
>
> The former are the actually "interesting" examples.  The later
are
> cases where we missed eliminating some check we could have, but 
> not-vectorizing creates an unfortunate performance cliff.
>
> *Framing*
>
> We have three important sub-cases to handle.
>
> First, there are all the cases where we could have handled the 
> multiple exit loop, but chose not to for implementation reasons.  A 
> great example is:
>
> for (int i = 0; i < N; i++) {
>   if (i > M) break;
>   a[i] = i;
> }
>
> In this case, SCEV can happily compute the actual exit bound of the 
> loop, and we could use the existing vector-body/scalar slow-path 
> structure.  The only change needed would be to exit the vector body 
> earlier.  (There are some details here, but it's almost as easy as
I'm
> describing if my POC patch isn't missing something major.)
>
> There's a couple other cases like this.  I suspect we can get decent 
> mileage out of just generalizing the existing code.
>
>
> Second, there are the cases where we actually have to handle 
> iterations within the vector-body with predication (or speculation).  
> The good news is that there's already support in code for predication, 
> we just need to add another source of potential predication.  The bad 
> news is that's a fairly major change.
>
> Our challenge will be finding a runtime check for dereferenceability.  
> Consider the following example:
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) return false;
> return true;
>
> If 'a' is an alloca, or statically sized global, we can form a
runtime
> check which ensures 'i' is in bounds and a speculative load will
not
> fault.
>
> Here's a nice example we could handle with this approach.
>
> // Assume a and b are both statically sized.
> for (int i = 0; i < N; i++) {
>   t = b[i];
>   if (t > M) throw();
>   sum += a[t];
> }
>
> (This is a classic a[b[i]] pattern, but with range checks added.)
>
> This is broadly applicable enough to be useful, and in practice covers 
> my use cases, but I'm hoping others have ideas for extensions here.
>
> Before resorting to that though, we have the potential to rely more on 
> speculation safety reasoning.  I have a patch out for review currently 
> (D66688 [LoopVectorize] Leverage speculation safety to avoid 
> masked.loads <https://reviews.llvm.org/D66688>) which fell out of
some
> prototyping in this area; it benefits existing predication logic, so I 
> separated it.
>
> The other major challenge here is profitability.  Consider a simple 
> loop such as:
>
> // assume a[0:N] is known dereferenceable
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) return false;
> return true;
>
> If N is large, and the array is non-zero, then this is profitable to 
> vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.
>
> Figuring out when to vectorize vs not for cases like this will require 
> some thought.  I don't really have a good answer for this yet, other 
> than when the profile on the early exit tells us it's rarely taken.
>
>
> Third, for both of the former cases, we need to be able to compute 
> exit values along the early exit edges.  We can get a lot of mileage 
> out of simply skipping loops with exit values (i.e. lcssa phis), as 
> our loop exit value rewriting will tend to eliminate them.  However, 
> we will eventually want to handle this case, which will require 
> generating some interesting complicated code down the exit path to 
> figure out which iteration actually exit.
>
> I see two general options here:
>
> 1) Use the vector-body/scalar-body idiom of today, and have the vector 
> body exit with IV = I when any iteration in [I, I+VF-1] would exit.  
> (i.e. roll back)
>
> 2) Insert dedicated exit blocks which recompute exit conditions to 
> determine exact exit value, and then let the vector body run all 
> iterations in VF which contain the exiting iteration.  (Assumes 
> predicated stores, and that the exit blocks skip the scalar loop)
>
> I currently don't have a reason to strongly prefer one over the 
> other.  (2) is conceptually cleaner and the one which keeps coming up 
> in conversation, but (1) may be simpler to actually implement.
>
>
> *Current Plans*
>
> At the current moment, I'm reasonable sure that I'm going to get
the
> resources to at least tackle some of the cases where we bail out 
> unnecessarily.  This will be a huge practical improvement in 
> vectorizing robustness, at (I hope) relatively little implementation 
> cost.
>
> I'm also going to continue playing around with enhancements to our 
> current dereferenceability logic.  I see that as being a key building 
> block to make any predication based approach practical.
>
> I'm not sure I'm going to get to the predication support.  I'd
like
> to, but am not sure my resource constraints allow it.  I'll also 
> mention that I'm not at all sure how all of this might fit in with the 
> VPLAN work.  I'd really welcome feedback on that; is what I'm 
> proposing at all consistent with others plans?
>
>
> Philip
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210111/a8b496c0/attachment.html>

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Sep 2019 - Vectorizing multiple exit loops

[llvm-dev] Vectorizing multiple exit loops

[llvm-dev] Vectorizing multiple exit loops

[llvm-dev] Vectorizing multiple exit loops

[llvm-dev] Vectorizing multiple exit loops

Possibly Parallel Threads