thr3ads.net - llvm dev - [llvm-dev] Vectorizing multiple exit loops [Jul 2021]

If this information is useful, please help other people find it:
Share via:

Philip Reames via llvm-dev

2021-Jan-11 20:30 UTC

[llvm-dev] Vectorizing multiple exit loops

Responding to this old thread to let interested parties know there's 
been some progress on this (finally).

The first sub-item described below - multiple exit loops with computable 
trip counts - is in progress, and will likely be wrapped up in the not 
too distant future.  The first major change (4b33b2387) landed two weeks 
ago, two smaller changes are on review (D93725, and D93865), and there's 
likely only one major patch needed after that.

To my knowledge, there's been no progress on the second item and I'm not
anticipating any in the near future.

Philip

On 9/9/19 10:53 AM, Philip Reames wrote:>
> I've recently mentioned in a few places that I'm interested in 
> enhancing the loop vectorizer to handle multiple exit loops, and have 
> been asked to share plans.  This email is intended to a) share my 
> current thinking and b) help spark discussion among interested 
> parties.  I do need to warn that my near term plans for this have been 
> delayed; I got pulled into an internal project instead.
>
> *Background*
>
> At the moment, our loop vectorizer does not handle any loop with more 
> than one exit, or where that sole exit is not from the latch block.  
> Interestingly, it does handle internal predication within a single 
> loop iteration.  This results in a slightly odd behavior where a loop 
> which can be written with either a continue or a break can exhibit 
> wildly different performance depending on that choice.  It does hint 
> at a possible direction for implementation though, and implies that 
> most of the cost modeling pieces are already in place.
>
> The use cases I'm looking at basically fall into two buckets:
>
> for (int i = 0; i < N; i++) {
>   if (expr(a[i])) break;
>   ... other vectorizable stuff ...
> }
>
> for (int i = 0; i < N; i++) {
>   if (expr(i)) break;
>   ... other vectorizable stuff ...
> }
>
> The former are the actually "interesting" examples.  The later
are
> cases where we missed eliminating some check we could have, but 
> not-vectorizing creates an unfortunate performance cliff.
>
> *Framing*
>
> We have three important sub-cases to handle.
>
> First, there are all the cases where we could have handled the 
> multiple exit loop, but chose not to for implementation reasons.  A 
> great example is:
>
> for (int i = 0; i < N; i++) {
>   if (i > M) break;
>   a[i] = i;
> }
>
> In this case, SCEV can happily compute the actual exit bound of the 
> loop, and we could use the existing vector-body/scalar slow-path 
> structure.  The only change needed would be to exit the vector body 
> earlier.  (There are some details here, but it's almost as easy as
I'm
> describing if my POC patch isn't missing something major.)
>
> There's a couple other cases like this.  I suspect we can get decent 
> mileage out of just generalizing the existing code.
>
>
> Second, there are the cases where we actually have to handle 
> iterations within the vector-body with predication (or speculation).  
> The good news is that there's already support in code for predication, 
> we just need to add another source of potential predication.  The bad 
> news is that's a fairly major change.
>
> Our challenge will be finding a runtime check for dereferenceability.  
> Consider the following example:
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) return false;
> return true;
>
> If 'a' is an alloca, or statically sized global, we can form a
runtime
> check which ensures 'i' is in bounds and a speculative load will
not
> fault.
>
> Here's a nice example we could handle with this approach.
>
> // Assume a and b are both statically sized.
> for (int i = 0; i < N; i++) {
>   t = b[i];
>   if (t > M) throw();
>   sum += a[t];
> }
>
> (This is a classic a[b[i]] pattern, but with range checks added.)
>
> This is broadly applicable enough to be useful, and in practice covers 
> my use cases, but I'm hoping others have ideas for extensions here.
>
> Before resorting to that though, we have the potential to rely more on 
> speculation safety reasoning.  I have a patch out for review currently 
> (D66688 [LoopVectorize] Leverage speculation safety to avoid 
> masked.loads <https://reviews.llvm.org/D66688>) which fell out of
some
> prototyping in this area; it benefits existing predication logic, so I 
> separated it.
>
> The other major challenge here is profitability.  Consider a simple 
> loop such as:
>
> // assume a[0:N] is known dereferenceable
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) return false;
> return true;
>
> If N is large, and the array is non-zero, then this is profitable to 
> vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.
>
> Figuring out when to vectorize vs not for cases like this will require 
> some thought.  I don't really have a good answer for this yet, other 
> than when the profile on the early exit tells us it's rarely taken.
>
>
> Third, for both of the former cases, we need to be able to compute 
> exit values along the early exit edges.  We can get a lot of mileage 
> out of simply skipping loops with exit values (i.e. lcssa phis), as 
> our loop exit value rewriting will tend to eliminate them.  However, 
> we will eventually want to handle this case, which will require 
> generating some interesting complicated code down the exit path to 
> figure out which iteration actually exit.
>
> I see two general options here:
>
> 1) Use the vector-body/scalar-body idiom of today, and have the vector 
> body exit with IV = I when any iteration in [I, I+VF-1] would exit.  
> (i.e. roll back)
>
> 2) Insert dedicated exit blocks which recompute exit conditions to 
> determine exact exit value, and then let the vector body run all 
> iterations in VF which contain the exiting iteration.  (Assumes 
> predicated stores, and that the exit blocks skip the scalar loop)
>
> I currently don't have a reason to strongly prefer one over the 
> other.  (2) is conceptually cleaner and the one which keeps coming up 
> in conversation, but (1) may be simpler to actually implement.
>
>
> *Current Plans*
>
> At the current moment, I'm reasonable sure that I'm going to get
the
> resources to at least tackle some of the cases where we bail out 
> unnecessarily.  This will be a huge practical improvement in 
> vectorizing robustness, at (I hope) relatively little implementation 
> cost.
>
> I'm also going to continue playing around with enhancements to our 
> current dereferenceability logic.  I see that as being a key building 
> block to make any predication based approach practical.
>
> I'm not sure I'm going to get to the predication support.  I'd
like
> to, but am not sure my resource constraints allow it.  I'll also 
> mention that I'm not at all sure how all of this might fit in with the 
> VPLAN work.  I'd really welcome feedback on that; is what I'm 
> proposing at all consistent with others plans?
>
>
> Philip
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210111/a8b496c0/attachment.html>

Philip Reames via llvm-dev

2021-Jul-12 15:06 UTC

head link

[llvm-dev] Vectorizing multiple exit loops

Small progress update.

Work on this largely stalled not long after I sent my last email due to 
a difficult to debug issue seen on PPC builders.  That was only recently 
resolved in e49d65f3.  As of now, the last patch for the analyzeable 
exit subset is now on review (https://reviews.llvm.org/D105817).

Once that goes in, I don't plan to take this any further at this time.  
This was a hobby project for me, and has taken much longer than I 
anticipated.  Unless I find someone to sponsor this work, I'll probably 
turn my personal hobby efforts towards easier and more immediately 
rewarding efforts.

Philip

On 1/11/21 12:30 PM, Philip Reames via llvm-dev wrote:>
> Responding to this old thread to let interested parties know there's 
> been some progress on this (finally).
>
> The first sub-item described below - multiple exit loops with 
> computable trip counts - is in progress, and will likely be wrapped up 
> in the not too distant future.  The first major change (4b33b2387) 
> landed two weeks ago, two smaller changes are on review (D93725, and 
> D93865), and there's likely only one major patch needed after that.
>
> To my knowledge, there's been no progress on the second item and
I'm
> not anticipating any in the near future.
>
> Philip
>
> On 9/9/19 10:53 AM, Philip Reames wrote:
>>
>> I've recently mentioned in a few places that I'm interested in 
>> enhancing the loop vectorizer to handle multiple exit loops, and have 
>> been asked to share plans.  This email is intended to a) share my 
>> current thinking and b) help spark discussion among interested 
>> parties.  I do need to warn that my near term plans for this have 
>> been delayed; I got pulled into an internal project instead.
>>
>> *Background*
>>
>> At the moment, our loop vectorizer does not handle any loop with more 
>> than one exit, or where that sole exit is not from the latch block.  
>> Interestingly, it does handle internal predication within a single 
>> loop iteration.  This results in a slightly odd behavior where a loop 
>> which can be written with either a continue or a break can exhibit 
>> wildly different performance depending on that choice.  It does hint 
>> at a possible direction for implementation though, and implies that 
>> most of the cost modeling pieces are already in place.
>>
>> The use cases I'm looking at basically fall into two buckets:
>>
>> for (int i = 0; i < N; i++) {
>>   if (expr(a[i])) break;
>>   ... other vectorizable stuff ...
>> }
>>
>> for (int i = 0; i < N; i++) {
>>   if (expr(i)) break;
>>   ... other vectorizable stuff ...
>> }
>>
>> The former are the actually "interesting" examples.  The
later are
>> cases where we missed eliminating some check we could have, but 
>> not-vectorizing creates an unfortunate performance cliff.
>>
>> *Framing*
>>
>> We have three important sub-cases to handle.
>>
>> First, there are all the cases where we could have handled the 
>> multiple exit loop, but chose not to for implementation reasons.  A 
>> great example is:
>>
>> for (int i = 0; i < N; i++) {
>>   if (i > M) break;
>>   a[i] = i;
>> }
>>
>> In this case, SCEV can happily compute the actual exit bound of the 
>> loop, and we could use the existing vector-body/scalar slow-path 
>> structure.  The only change needed would be to exit the vector body 
>> earlier.  (There are some details here, but it's almost as easy as 
>> I'm describing if my POC patch isn't missing something major.)
>>
>> There's a couple other cases like this.  I suspect we can get
decent
>> mileage out of just generalizing the existing code.
>>
>>
>> Second, there are the cases where we actually have to handle 
>> iterations within the vector-body with predication (or speculation).  
>> The good news is that there's already support in code for 
>> predication, we just need to add another source of potential 
>> predication.  The bad news is that's a fairly major change.
>>
>> Our challenge will be finding a runtime check for 
>> dereferenceability.  Consider the following example:
>> for (int i = 0; i < N; i++)
>>   if (a[i] == 0) return false;
>> return true;
>>
>> If 'a' is an alloca, or statically sized global, we can form a 
>> runtime check which ensures 'i' is in bounds and a speculative
load
>> will not fault.
>>
>> Here's a nice example we could handle with this approach.
>>
>> // Assume a and b are both statically sized.
>> for (int i = 0; i < N; i++) {
>>   t = b[i];
>>   if (t > M) throw();
>>   sum += a[t];
>> }
>>
>> (This is a classic a[b[i]] pattern, but with range checks added.)
>>
>> This is broadly applicable enough to be useful, and in practice 
>> covers my use cases, but I'm hoping others have ideas for
extensions
>> here.
>>
>> Before resorting to that though, we have the potential to rely more 
>> on speculation safety reasoning.  I have a patch out for review 
>> currently (D66688 [LoopVectorize] Leverage speculation safety to 
>> avoid masked.loads <https://reviews.llvm.org/D66688>) which fell
out
>> of some prototyping in this area; it benefits existing predication 
>> logic, so I separated it.
>>
>> The other major challenge here is profitability.  Consider a simple 
>> loop such as:
>>
>> // assume a[0:N] is known dereferenceable
>> for (int i = 0; i < N; i++)
>>   if (a[i] == 0) return false;
>> return true;
>>
>> If N is large, and the array is non-zero, then this is profitable to 
>> vectorize.  If a[0] == 0, then it isn't, regardless of the value of
N.
>>
>> Figuring out when to vectorize vs not for cases like this will 
>> require some thought.  I don't really have a good answer for this 
>> yet, other than when the profile on the early exit tells us it's 
>> rarely taken.
>>
>>
>> Third, for both of the former cases, we need to be able to compute 
>> exit values along the early exit edges.  We can get a lot of mileage 
>> out of simply skipping loops with exit values (i.e. lcssa phis), as 
>> our loop exit value rewriting will tend to eliminate them. However, 
>> we will eventually want to handle this case, which will require 
>> generating some interesting complicated code down the exit path to 
>> figure out which iteration actually exit.
>>
>> I see two general options here:
>>
>> 1) Use the vector-body/scalar-body idiom of today, and have the 
>> vector body exit with IV = I when any iteration in [I, I+VF-1] would 
>> exit.  (i.e. roll back)
>>
>> 2) Insert dedicated exit blocks which recompute exit conditions to 
>> determine exact exit value, and then let the vector body run all 
>> iterations in VF which contain the exiting iteration.  (Assumes 
>> predicated stores, and that the exit blocks skip the scalar loop)
>>
>> I currently don't have a reason to strongly prefer one over the 
>> other.  (2) is conceptually cleaner and the one which keeps coming up 
>> in conversation, but (1) may be simpler to actually implement.
>>
>>
>> *Current Plans*
>>
>> At the current moment, I'm reasonable sure that I'm going to
get the
>> resources to at least tackle some of the cases where we bail out 
>> unnecessarily.  This will be a huge practical improvement in 
>> vectorizing robustness, at (I hope) relatively little implementation 
>> cost.
>>
>> I'm also going to continue playing around with enhancements to our 
>> current dereferenceability logic.  I see that as being a key building 
>> block to make any predication based approach practical.
>>
>> I'm not sure I'm going to get to the predication support.
I'd like
>> to, but am not sure my resource constraints allow it.  I'll also 
>> mention that I'm not at all sure how all of this might fit in with 
>> the VPLAN work.  I'd really welcome feedback on that; is what
I'm
>> proposing at all consistent with others plans?
>>
>>
>> Philip
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210712/745bc2ae/attachment.html>

llvm dev - Jul 2021 - Vectorizing multiple exit loops

[llvm-dev] Vectorizing multiple exit loops

[llvm-dev] Vectorizing multiple exit loops