thr3ads.net - llvm dev - [llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan [Jan 2018]

If this information is useful, please help other people find it:
Share via:

Saito, Hideki via llvm-dev

2017-Dec-14 21:49 UTC

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

>Another might be to introduce changes under feature flags to ease the
revert/reintroduce/revert cycle.
This is essentially the first guard. We plan to have flags/settings to control
which types of outer loops are handled.
The new code path is initially exclusive to outer loop vectorization. If we
disable all types of outer loops
(and that's the initial default), LV continues to be good old innermost loop
vectorizer.
>I'd like to hear what plan is in place to ensure we don't
destabilize the vectorizer while working on this.
W.r.t. this project, this matters when we touch the code that is shared with
innermost loop vectorization
and outer loop vectorization. It's certainly good to ensure that such places
to have good test coverage
at the time of commit.

This also matters (and matters greatly) when we start guiding innermost loops
towards the new code path (when
we are ready). For this, another flag control would be there for everyone to try
before the flipping of the default could
land in the trunk (somewhat analogous to Chandler asking people to test new Pass
Manager prior to the switch).
We might be able to control which kind of innermost loops would take new code
path, but that's TBD.

Fuzz testing or not, I fully agree that good testing coverage of vectorizer is
desired.

Thanks,
Hideki

-----Original Message-----
From: Philip Reames [mailto:listmail at philipreames.com] 
Sent: Thursday, December 14, 2017 1:10 PM
To: Saito, Hideki <hideki.saito at intel.com>; llvm-dev at lists.llvm.org;
renato.golin at linaro.org; aemerson at apple.com; mkuper at google.com;
mssimpso at codeaurora.org; Simon Moll <moll at cs.uni-saarland.de>;
hfinkel at anl.gov
Subject: Re: [llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization
Implementation Plan

One area that needs a bit of attention before work on this proceeds much further
is testing.  The introduction of VPlan appears to have introduced a couple of
bugs and exposed a couple of others.  Most of those are now fixed, but the
process did point out a lack of test coverage around the changes which is
concerning.

I'd like to hear what plan is in place to ensure we don't destabilize
the vectorizer while working on this.

One thing we could consider is leveraging the new IR fuzzer to help find
assertion failures either before submission or shortly there after. Another
might be to introduce changes under feature flags to ease the
revert/reintroduce/revert cycle.

Philip


On 12/05/2017 05:09 PM, Saito, Hideki via llvm-dev
wrote:> Proposal for Outer Loop Vectorization Implementation Plan 
> ============================================>
> ====> Goal:
> ====> Extending Loop Vectorizer (LV) such that it can handle outer
loops, via VPlan infrastructure enhancements.
> Understand the trade-offs in trying to make concurrent progress with 
> moving remaining inner loop vectorization functionality to VPlan 
> infrastructure
>   
> ==========> Background:
> ==========> This is related to VPlan infrastructure project we started a
while back, a project to extend the (inner loop vectorization focused) Loop
Vectorizer to support outer loop vectorization.
> VPlan is the vectorization planner that records the decisions and candidate
directions to pursue in order to drive cost modeling and vector code generation.
When it is fully integrated into LV (i.e., at the end of this big project),
VPlan will use a Hierarchical-CFG (HCFG) and transform it starting from the
abstraction of the input IR to reflect current vectorization decisions being
made. The HCFG eventually becomes the abstraction of the output IR, and the
vector code generation is driven by this abstract representation.
>   
> This is a follow up of the previous RFCs and LLVM Developer Conference
presentations:
>            
> http://lists.llvm.org/pipermail/llvm-dev/2016-September/105057.html 
> (RFC: Extending LV to vectorize outerloops)
>            
> http://lists.llvm.org/pipermail/llvm-dev/2017-February/110159.html 
> (RFC: Introducing VPlan to model the vectorized code and drive its 
> transformation)
>            https://www.youtube.com/watch?v=XXAvdUwO7kQ (Extending 
> LoopVectorizer: OpenMP4.5 SIMD and Outer Loop Auto-Vectorization)
>            https://www.youtube.com/watch?v=IqzJRs6tb7Y (Introducing 
> VPlan to the LoopVectorizer)
>            https://www.youtube.com/watch?v=BjBSJFzYDVk (Vectorizing 
> Loops with VPlan - Current State and Next Steps) Please also refer to the
separately sent out status update
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119522.html.
> .
>   
> So far, two big patches related to VPlan implementation have been
committed.
> https://reviews.llvm.org/D28975 by Gil Rapaport. (Introducing VPlan to 
> model the vectorized code and drive its transformation)
>       Has been broken down to a series of smaller patches and went in. 
> The last (re)commit of the series is
>       https://reviews.llvm.org/rL311849
> https://reviews.llvm.org/D38676 by Gil Rapaport. (Modeling masking in 
> VPlan, introducing VPInstructions)
>       This is also broken down to a series of smaller patches to facilitate
the review.
>       Committed as https://reviews.llvm.org/rL318645
>   
> With the first patch, we introduced the concept of VPlan to LV and started
explicitly recording decisions like interleave memory access optimization and
serialization.
> With the second patch, we also introduced the concept of VPInstruction (to
model newly introduced use-def) and started explicitly modeling masking and mask
manipulations.
>   
> So far, work in this area has been mostly restricted to refactoring of Loop
Vectorizer's existing functionality with several remaining areas of LV that
still need to be refactored into the new VPlan based representation. The
following is a non-exhaustive list of areas that need to be worked on and gives
an idea of the amount of remaining work.
> * Predication
> * Cost model
> * Remainder Loop
> * Runtime Guards
> * External Users
> * Reduction Epilog
> * Interleave Grouping
> * Sink Scalar Operands
>   
> While it is important to keep refactoring the above mentioned items such
that entire LV becomes fully VPlan based (which completes the transition into
VPlan infrastructure), we are aware that there is also a need to make progress
in outer loop vectorization. Function vectorization via VecClone pass
(https://reviews.llvm.org/D22792) also depends on the progress of outer loop
vectorization.
>   
> =============> Proposed Plan:
> =============> Instead of waiting for most of the above mentioned
refactoring to finish before working on outer loop vectorization, we believe it
serves the best interest of the community if we start making concurrent progress
on continued refactoring effort and outer loop vectorization effort.
> To implement outer loop vectorization in incremental steps, we propose
adding new code paths that would be exercised initially for only outer loop
vectorization cases. This would ensure that the changes are NFC for innermost
loop vectorization cases until we (slowly) transition them to also go through
these new code paths.
>   
> Given that LV never had the ability to vectorize outer level loops, letting
outer loop vectorization take slightly different code paths, compared to
innermost loop vectorization, won't cause any issues in innermost loop
vectorization. The new paths are what inner loop vectorization will eventually
go through (i.e., post-refactoring), after the kinks are worked out so that
doing so will not regress any of the existing functionality and performance.
Having the new code paths exposed early would encourage more community
participating in maturing the new code paths, makes it clearer what still needs
to be moved over to VPlan infrastructure, and hopefully also encourage more
community participation in the remaining refactoring effort.
>   
> =====================> Proposed Patch Series:
> =====================> There will be a lot of new development before
declaring "feature complete". As such, we propose to break it into
five or more series of patches ---- starting from vectorizing trivial outer
loops. Along the way, we expect to include NFC patches to make LV more modular,
in order to keep LoopVectorize.cpp file size from growing too much.
>   
>    Patch Series #1
>    ---------------
>      Trivial outer loop vectorization
>      * Inner loops are trivially lock-step among all vector elements
>      * No other branches inside the outer loop (to avoid touching 
> divergence analysis in this series)
>      This should require
>      1) inner loop uniformity check (SCEV based invariance will do) --- the
first small patch https://reviews.llvm.org/D40874 is a subset of this.
>      2) VPlan construction
>      3) Vector code generation (w/ unmodified uniform inner control 
> flow) from VPlan
>   
>      for (i=0;i<N;i++){ // vectorize here
>        // no branches here
>        for (j=0;j<loop_invariant_value; j++){
>          // no branches here
>        }
>        // no branches here
>      }
>   
>    Patch Series #2
>    ---------------
>      Simple outer loop vectorization
>      * Inner loops are still trivially lock-step among all vector 
> elements
>      * Non-loop branches are blindly assumed as divergent
>      This should require
>      1) Patch Series #1
>      2) Predication analysis and masking on VPlan
>      3) Vector code generation support for masking
>   
>      for (i=0;i<N;i++){ // vectorize here
>        // branches may be here
>        for (j=0;j<loop_invariant_value; j++){ // unconditional w.r.t. 
> i-loop
>          // branches may be here
>        }
>        // branches may be here
>      }
>   
>    Patch Series #3
>    ---------------
>      Better simple outer loop vectorization
>      * Inner loops are still trivially lock-step among all vector 
> elements
>      * Non-loop branches may be uniform/non-uniform
>      This should require
>      1) Patch Series #1, #2
>      2) We first use SCEV invariance for uniformity analysis.
>      3) Vector code generation support for uniform forward branches
>      4) We are working with Simon Moll (U-Saarland) to improve uniformity
analysis (https://www.youtube.com/watch?v=svMEphbFukw&feature=youtu.be).
>         Not a hard dependency, but improved analysis is good for
performance.
>   
>      Same kind of loop as Patch Series #2, but uniform branches are
optimized (i.e., less masking).
>   
>    Patch Series #4
>    ---------------
>      Less simple outer loop vectorization
>      * Inner loops may not be lock-step among all vector elements, but 
> single-entry/single-exit
>      This should require
>      1) Patch Series #1, #2, #3
>      2) VPlan to VPlan transformation to make inner loop lock-step. After
the transform, the VPlan should look like the ones #3 can handle.
>         We'll look into porting code from U-Saarland RV project, which
performs this transformation in IR to IR manner.
>   
>      for (i=0;i<N;i++){ // vectorize here
>        // branches may be here
>        if (cond(i)) {
>          // branches may be here
>          while (cond1(i)){
>            // branches may be here
>          }
>          // branches may be here
>        }
>        // branch may be here
>      }
>   
>    Patch Series #5
>    ---------------
>      Outer loop vectorization "Feature Complete"
>      * Inner loops must be redusible, allows multiple exits.
>      This should require
>      1) Patch Series #1, #2, #3, #4
>      2) VPlan to VPlan transformation to make inner loop lock-step.
>      3) Vector code generation support for inner loop with multiple 
> exits
>   
>      for (i=0;i<N;i++){ // vectorize here
>        // branches may be here
>        if (cond(1)) {
>          // branches may be here
>          while (cond1(i)){
>            .
>            if (cond2) break;
>            .
>            if (cond3) break;
>            .
>          }
>          // branches may be here
>        }
>        // branch may be here
>      }
>   
> ===========================> Future Work beyond this RFC:
> ===========================>    * Outer loop auto-vectorization legality
analysis
>    * Outer loop auto-vectorization cost model
>    * Cost model to compute VF/UF for explicit outer loop vectorization 
> (when not specified)
>    * (lower in priority) Irrducible inner loop handling (VPlan to 
> VPlan transform to make inner loops reducible. After the transform, 
> the VPlan should look like the ones #5 can handle.)
>   
> ==========> Trade-Offs:
> ==========> The new code path also has the ability to perform innermost
loop vectorization (as a trivial subset of outer loop vectorization if we want
to exercise that path for such purposes). It's good to be able to compare,
side-by-side, what existing code path does and what new path does so that we can
incrementally build up confidence/experience levels on the new code path. For
the time being, however, that can be seen as extra maintenance needs. Instead of
trying to do this on the trunk, we can do the same thing on the side, using
github, until "outer loop vectorization feature" becomes mature
enough. Amount of code review required, to get into the trunk, most likely would
be comparable.
>   
> =======> Summary:
> =======> Outer loop vectorization feature implementation plan is
presented. We
> think concurrent progress has a good enough chance of expediting overall
transition of LV into the new infrastructure while benefiting from delivering
new functionality to those in need sooner, with wider community participation in
the overall development.
>   
> Thanks,
> Hideki Saito
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Philip Reames via llvm-dev

2018-Jan-15 21:52 UTC

head link

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

To revive the discussion around vectorizer testing, here's a quick 
sample of a few of the issues hit recently in the loop vectorizer.  I 
want to be careful to say that I am not stating these are the result of 
any recent work, just that they're issues that have been triaged down to 
the loop vectorizer doing something incorrect or questionable from a 
performance perspective.

https://bugs.llvm.org/show_bug.cgi?id=35282
https://bugs.llvm.org/show_bug.cgi?id=35687
https://bugs.llvm.org/show_bug.cgi?id=35734
https://bugs.llvm.org/show_bug.cgi?id=35773
https://reviews.llvm.org/D41939

I also see another 10 or so correctness related ones in a a simple 
search: 
https://bugs.llvm.org/buglist.cgi?quicksearch=LoopVectorize&list_id=131629

The loop vectorizer is currently a serious pain point in our regression 
testing.  As the rest of the optimizer changes, we are finding that 
existing bugs in the vectorizer are being exposed with disturbing 
frequency and that some new regressions are landing as well.

Philip

On 12/14/2017 01:49 PM, Saito, Hideki wrote:>> Another might be to introduce changes under feature flags to ease the
revert/reintroduce/revert cycle.
> This is essentially the first guard. We plan to have flags/settings to
control which types of outer loops are handled.
> The new code path is initially exclusive to outer loop vectorization. If we
disable all types of outer loops
> (and that's the initial default), LV continues to be good old innermost
loop vectorizer.
>
>> I'd like to hear what plan is in place to ensure we don't
destabilize the vectorizer while working on this.
> W.r.t. this project, this matters when we touch the code that is shared
with innermost loop vectorization
> and outer loop vectorization. It's certainly good to ensure that such
places to have good test coverage
> at the time of commit.
>
> This also matters (and matters greatly) when we start guiding innermost
loops towards the new code path (when
> we are ready). For this, another flag control would be there for everyone
to try before the flipping of the default could
> land in the trunk (somewhat analogous to Chandler asking people to test new
Pass Manager prior to the switch).
> We might be able to control which kind of innermost loops would take new
code path, but that's TBD.
>
> Fuzz testing or not, I fully agree that good testing coverage of vectorizer
is desired.
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Philip Reames [mailto:listmail at philipreames.com]
> Sent: Thursday, December 14, 2017 1:10 PM
> To: Saito, Hideki <hideki.saito at intel.com>; llvm-dev at
lists.llvm.org; renato.golin at linaro.org; aemerson at apple.com; mkuper at
google.com; mssimpso at codeaurora.org; Simon Moll <moll at
cs.uni-saarland.de>; hfinkel at anl.gov
> Subject: Re: [llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop
Vectorization Implementation Plan
>
> One area that needs a bit of attention before work on this proceeds much
further is testing.  The introduction of VPlan appears to have introduced a
couple of bugs and exposed a couple of others.  Most of those are now fixed, but
the process did point out a lack of test coverage around the changes which is
concerning.
>
> I'd like to hear what plan is in place to ensure we don't
destabilize the vectorizer while working on this.
>
> One thing we could consider is leveraging the new IR fuzzer to help find
assertion failures either before submission or shortly there after. Another
might be to introduce changes under feature flags to ease the
revert/reintroduce/revert cycle.
>
> Philip
>
>
> On 12/05/2017 05:09 PM, Saito, Hideki via llvm-dev wrote:
>> Proposal for Outer Loop Vectorization Implementation Plan
>> ============================================>>
>> ====>> Goal:
>> ====>> Extending Loop Vectorizer (LV) such that it can handle
outer loops, via VPlan infrastructure enhancements.
>> Understand the trade-offs in trying to make concurrent progress with
>> moving remaining inner loop vectorization functionality to VPlan
>> infrastructure
>>    
>> ==========>> Background:
>> ==========>> This is related to VPlan infrastructure project we
started a while back, a project to extend the (inner loop vectorization focused)
Loop Vectorizer to support outer loop vectorization.
>> VPlan is the vectorization planner that records the decisions and
candidate directions to pursue in order to drive cost modeling and vector code
generation. When it is fully integrated into LV (i.e., at the end of this big
project), VPlan will use a Hierarchical-CFG (HCFG) and transform it starting
from the abstraction of the input IR to reflect current vectorization decisions
being made. The HCFG eventually becomes the abstraction of the output IR, and
the vector code generation is driven by this abstract representation.
>>    
>> This is a follow up of the previous RFCs and LLVM Developer Conference
presentations:
>>             
>> http://lists.llvm.org/pipermail/llvm-dev/2016-September/105057.html
>> (RFC: Extending LV to vectorize outerloops)
>>             
>> http://lists.llvm.org/pipermail/llvm-dev/2017-February/110159.html
>> (RFC: Introducing VPlan to model the vectorized code and drive its
>> transformation)
>>             https://www.youtube.com/watch?v=XXAvdUwO7kQ (Extending
>> LoopVectorizer: OpenMP4.5 SIMD and Outer Loop Auto-Vectorization)
>>             https://www.youtube.com/watch?v=IqzJRs6tb7Y (Introducing
>> VPlan to the LoopVectorizer)
>>             https://www.youtube.com/watch?v=BjBSJFzYDVk (Vectorizing
>> Loops with VPlan - Current State and Next Steps) Please also refer to
the separately sent out status update
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119522.html.
>> .
>>    
>> So far, two big patches related to VPlan implementation have been
committed.
>> https://reviews.llvm.org/D28975 by Gil Rapaport. (Introducing VPlan to
>> model the vectorized code and drive its transformation)
>>        Has been broken down to a series of smaller patches and went in.
>> The last (re)commit of the series is
>>        https://reviews.llvm.org/rL311849
>> https://reviews.llvm.org/D38676 by Gil Rapaport. (Modeling masking in
>> VPlan, introducing VPInstructions)
>>        This is also broken down to a series of smaller patches to
facilitate the review.
>>        Committed as https://reviews.llvm.org/rL318645
>>    
>> With the first patch, we introduced the concept of VPlan to LV and
started explicitly recording decisions like interleave memory access
optimization and serialization.
>> With the second patch, we also introduced the concept of VPInstruction
(to model newly introduced use-def) and started explicitly modeling masking and
mask manipulations.
>>    
>> So far, work in this area has been mostly restricted to refactoring of
Loop Vectorizer's existing functionality with several remaining areas of LV
that still need to be refactored into the new VPlan based representation. The
following is a non-exhaustive list of areas that need to be worked on and gives
an idea of the amount of remaining work.
>> * Predication
>> * Cost model
>> * Remainder Loop
>> * Runtime Guards
>> * External Users
>> * Reduction Epilog
>> * Interleave Grouping
>> * Sink Scalar Operands
>>    
>> While it is important to keep refactoring the above mentioned items
such that entire LV becomes fully VPlan based (which completes the transition
into VPlan infrastructure), we are aware that there is also a need to make
progress in outer loop vectorization. Function vectorization via VecClone pass
(https://reviews.llvm.org/D22792) also depends on the progress of outer loop
vectorization.
>>    
>> =============>> Proposed Plan:
>> =============>> Instead of waiting for most of the above
mentioned refactoring to finish before working on outer loop vectorization, we
believe it serves the best interest of the community if we start making
concurrent progress on continued refactoring effort and outer loop vectorization
effort.
>> To implement outer loop vectorization in incremental steps, we propose
adding new code paths that would be exercised initially for only outer loop
vectorization cases. This would ensure that the changes are NFC for innermost
loop vectorization cases until we (slowly) transition them to also go through
these new code paths.
>>    
>> Given that LV never had the ability to vectorize outer level loops,
letting outer loop vectorization take slightly different code paths, compared to
innermost loop vectorization, won't cause any issues in innermost loop
vectorization. The new paths are what inner loop vectorization will eventually
go through (i.e., post-refactoring), after the kinks are worked out so that
doing so will not regress any of the existing functionality and performance.
Having the new code paths exposed early would encourage more community
participating in maturing the new code paths, makes it clearer what still needs
to be moved over to VPlan infrastructure, and hopefully also encourage more
community participation in the remaining refactoring effort.
>>    
>> =====================>> Proposed Patch Series:
>> =====================>> There will be a lot of new development
before declaring "feature complete". As such, we propose to break it
into five or more series of patches ---- starting from vectorizing trivial outer
loops. Along the way, we expect to include NFC patches to make LV more modular,
in order to keep LoopVectorize.cpp file size from growing too much.
>>    
>>     Patch Series #1
>>     ---------------
>>       Trivial outer loop vectorization
>>       * Inner loops are trivially lock-step among all vector elements
>>       * No other branches inside the outer loop (to avoid touching
>> divergence analysis in this series)
>>       This should require
>>       1) inner loop uniformity check (SCEV based invariance will do)
--- the first small patch https://reviews.llvm.org/D40874 is a subset of this.
>>       2) VPlan construction
>>       3) Vector code generation (w/ unmodified uniform inner control
>> flow) from VPlan
>>    
>>       for (i=0;i<N;i++){ // vectorize here
>>         // no branches here
>>         for (j=0;j<loop_invariant_value; j++){
>>           // no branches here
>>         }
>>         // no branches here
>>       }
>>    
>>     Patch Series #2
>>     ---------------
>>       Simple outer loop vectorization
>>       * Inner loops are still trivially lock-step among all vector
>> elements
>>       * Non-loop branches are blindly assumed as divergent
>>       This should require
>>       1) Patch Series #1
>>       2) Predication analysis and masking on VPlan
>>       3) Vector code generation support for masking
>>    
>>       for (i=0;i<N;i++){ // vectorize here
>>         // branches may be here
>>         for (j=0;j<loop_invariant_value; j++){ // unconditional
w.r.t.
>> i-loop
>>           // branches may be here
>>         }
>>         // branches may be here
>>       }
>>    
>>     Patch Series #3
>>     ---------------
>>       Better simple outer loop vectorization
>>       * Inner loops are still trivially lock-step among all vector
>> elements
>>       * Non-loop branches may be uniform/non-uniform
>>       This should require
>>       1) Patch Series #1, #2
>>       2) We first use SCEV invariance for uniformity analysis.
>>       3) Vector code generation support for uniform forward branches
>>       4) We are working with Simon Moll (U-Saarland) to improve
uniformity analysis
(https://www.youtube.com/watch?v=svMEphbFukw&feature=youtu.be).
>>          Not a hard dependency, but improved analysis is good for
performance.
>>    
>>       Same kind of loop as Patch Series #2, but uniform branches are
optimized (i.e., less masking).
>>    
>>     Patch Series #4
>>     ---------------
>>       Less simple outer loop vectorization
>>       * Inner loops may not be lock-step among all vector elements, but
>> single-entry/single-exit
>>       This should require
>>       1) Patch Series #1, #2, #3
>>       2) VPlan to VPlan transformation to make inner loop lock-step.
After the transform, the VPlan should look like the ones #3 can handle.
>>          We'll look into porting code from U-Saarland RV project,
which performs this transformation in IR to IR manner.
>>    
>>       for (i=0;i<N;i++){ // vectorize here
>>         // branches may be here
>>         if (cond(i)) {
>>           // branches may be here
>>           while (cond1(i)){
>>             // branches may be here
>>           }
>>           // branches may be here
>>         }
>>         // branch may be here
>>       }
>>    
>>     Patch Series #5
>>     ---------------
>>       Outer loop vectorization "Feature Complete"
>>       * Inner loops must be redusible, allows multiple exits.
>>       This should require
>>       1) Patch Series #1, #2, #3, #4
>>       2) VPlan to VPlan transformation to make inner loop lock-step.
>>       3) Vector code generation support for inner loop with multiple
>> exits
>>    
>>       for (i=0;i<N;i++){ // vectorize here
>>         // branches may be here
>>         if (cond(1)) {
>>           // branches may be here
>>           while (cond1(i)){
>>             .
>>             if (cond2) break;
>>             .
>>             if (cond3) break;
>>             .
>>           }
>>           // branches may be here
>>         }
>>         // branch may be here
>>       }
>>    
>> ===========================>> Future Work beyond this RFC:
>> ===========================>>     * Outer loop auto-vectorization
legality analysis
>>     * Outer loop auto-vectorization cost model
>>     * Cost model to compute VF/UF for explicit outer loop vectorization
>> (when not specified)
>>     * (lower in priority) Irrducible inner loop handling (VPlan to
>> VPlan transform to make inner loops reducible. After the transform,
>> the VPlan should look like the ones #5 can handle.)
>>    
>> ==========>> Trade-Offs:
>> ==========>> The new code path also has the ability to perform
innermost loop vectorization (as a trivial subset of outer loop vectorization if
we want to exercise that path for such purposes). It's good to be able to
compare, side-by-side, what existing code path does and what new path does so
that we can incrementally build up confidence/experience levels on the new code
path. For the time being, however, that can be seen as extra maintenance needs.
Instead of trying to do this on the trunk, we can do the same thing on the side,
using github, until "outer loop vectorization feature" becomes mature
enough. Amount of code review required, to get into the trunk, most likely would
be comparable.
>>    
>> =======>> Summary:
>> =======>> Outer loop vectorization feature implementation plan is
presented. We
>> think concurrent progress has a good enough chance of expediting
overall transition of LV into the new infrastructure while benefiting from
delivering new functionality to those in need sooner, with wider community
participation in the overall development.
>>    
>> Thanks,
>> Hideki Saito
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Hal Finkel via llvm-dev

2018-Jan-16 01:08 UTC

head link

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

On 01/15/2018 03:52 PM, Philip Reames wrote:> To revive the discussion around vectorizer testing, here's a quick
> sample of a few of the issues hit recently in the loop vectorizer.  I
> want to be careful to say that I am not stating these are the result
> of any recent work, just that they're issues that have been triaged
> down to the loop vectorizer doing something incorrect or questionable
> from a performance perspective.
>
> https://bugs.llvm.org/show_bug.cgi?id=35282
> https://bugs.llvm.org/show_bug.cgi?id=35687
> https://bugs.llvm.org/show_bug.cgi?id=35734
> https://bugs.llvm.org/show_bug.cgi?id=35773
> https://reviews.llvm.org/D41939
>
> I also see another 10 or so correctness related ones in a a simple
> search:
>
https://bugs.llvm.org/buglist.cgi?quicksearch=LoopVectorize&list_id=131629
>
> The loop vectorizer is currently a serious pain point in our
> regression testing.  As the rest of the optimizer changes, we are
> finding that existing bugs in the vectorizer are being exposed with
> disturbing frequency and that some new regressions are landing as well.
I certainly understand what you're saying, but, as you point out, many
of these are existing bugs that are being exposed by other changes (and
they're seemingly all over the map). My general feeling is that the more
limited the applicability of a particular transform the buggier it will
tend to be. The work here to allow vectorization of an every-wider set
of inputs will really help to expose, and thus help us eliminate, bugs.
As such, one of the largest benefits of adding the
function-vectorization work (https://reviews.llvm.org/D22792), and
outer-loop vectorization capabilities, will be making it easier to throw
essentially-arbitrary inputs at the vectorizer (and have it do
something), and thus, hit it more effectively with automated testing.

Maybe we can do a better job, even with the current capabilities, of
automatically generating data-parallel loops with reductions of various
kinds? I'm thinking about automated testing because, AFAIK, the
vectorizer is already run through almost all of the relevant benchmarks
and test suites, and even if we add a few more, we probably need to
increase the test coverage by a lot more than that.

Do you have ideas about how we can have better testing in this area
otherwise?

 -Hal
>
> Philip
>
> On 12/14/2017 01:49 PM, Saito, Hideki wrote:
>>> Another might be to introduce changes under feature flags to ease
>>> the revert/reintroduce/revert cycle.
>> This is essentially the first guard. We plan to have flags/settings
>> to control which types of outer loops are handled.
>> The new code path is initially exclusive to outer loop vectorization.
>> If we disable all types of outer loops
>> (and that's the initial default), LV continues to be good old
>> innermost loop vectorizer.
>>
>>> I'd like to hear what plan is in place to ensure we don't
>>> destabilize the vectorizer while working on this.
>> W.r.t. this project, this matters when we touch the code that is
>> shared with innermost loop vectorization
>> and outer loop vectorization. It's certainly good to ensure that
such
>> places to have good test coverage
>> at the time of commit.
>>
>> This also matters (and matters greatly) when we start guiding
>> innermost loops towards the new code path (when
>> we are ready). For this, another flag control would be there for
>> everyone to try before the flipping of the default could
>> land in the trunk (somewhat analogous to Chandler asking people to
>> test new Pass Manager prior to the switch).
>> We might be able to control which kind of innermost loops would take
>> new code path, but that's TBD.
>>
>> Fuzz testing or not, I fully agree that good testing coverage of
>> vectorizer is desired.
>>
>> Thanks,
>> Hideki
>>
>> -----Original Message-----
>> From: Philip Reames [mailto:listmail at philipreames.com]
>> Sent: Thursday, December 14, 2017 1:10 PM
>> To: Saito, Hideki <hideki.saito at intel.com>; llvm-dev at
lists.llvm.org;
>> renato.golin at linaro.org; aemerson at apple.com; mkuper at
google.com;
>> mssimpso at codeaurora.org; Simon Moll <moll at
cs.uni-saarland.de>;
>> hfinkel at anl.gov
>> Subject: Re: [llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop
>> Vectorization Implementation Plan
>>
>> One area that needs a bit of attention before work on this proceeds
>> much further is testing.  The introduction of VPlan appears to have
>> introduced a couple of bugs and exposed a couple of others.  Most of
>> those are now fixed, but the process did point out a lack of test
>> coverage around the changes which is concerning.
>>
>> I'd like to hear what plan is in place to ensure we don't
destabilize
>> the vectorizer while working on this.
>>
>> One thing we could consider is leveraging the new IR fuzzer to help
>> find assertion failures either before submission or shortly there
>> after. Another might be to introduce changes under feature flags to
>> ease the revert/reintroduce/revert cycle.
>>
>> Philip
>>
>>
>> On 12/05/2017 05:09 PM, Saito, Hideki via llvm-dev wrote:
>>> Proposal for Outer Loop Vectorization Implementation Plan
>>> ============================================>>>
>>> ====>>> Goal:
>>> ====>>> Extending Loop Vectorizer (LV) such that it can
handle outer loops,
>>> via VPlan infrastructure enhancements.
>>> Understand the trade-offs in trying to make concurrent progress
with
>>> moving remaining inner loop vectorization functionality to VPlan
>>> infrastructure
>>>    ==========>>> Background:
>>> ==========>>> This is related to VPlan infrastructure
project we started a while
>>> back, a project to extend the (inner loop vectorization focused)
>>> Loop Vectorizer to support outer loop vectorization.
>>> VPlan is the vectorization planner that records the decisions and
>>> candidate directions to pursue in order to drive cost modeling and
>>> vector code generation. When it is fully integrated into LV (i.e.,
>>> at the end of this big project), VPlan will use a Hierarchical-CFG
>>> (HCFG) and transform it starting from the abstraction of the input
>>> IR to reflect current vectorization decisions being made. The HCFG
>>> eventually becomes the abstraction of the output IR, and the vector
>>> code generation is driven by this abstract representation.
>>>    This is a follow up of the previous RFCs and LLVM Developer
>>> Conference presentations:
>>>            
>>> http://lists.llvm.org/pipermail/llvm-dev/2016-September/105057.html
>>> (RFC: Extending LV to vectorize outerloops)
>>>            
>>> http://lists.llvm.org/pipermail/llvm-dev/2017-February/110159.html
>>> (RFC: Introducing VPlan to model the vectorized code and drive its
>>> transformation)
>>>             https://www.youtube.com/watch?v=XXAvdUwO7kQ (Extending
>>> LoopVectorizer: OpenMP4.5 SIMD and Outer Loop Auto-Vectorization)
>>>             https://www.youtube.com/watch?v=IqzJRs6tb7Y
(Introducing
>>> VPlan to the LoopVectorizer)
>>>             https://www.youtube.com/watch?v=BjBSJFzYDVk
(Vectorizing
>>> Loops with VPlan - Current State and Next Steps) Please also refer
>>> to the separately sent out status update
>>> http://lists.llvm.org/pipermail/llvm-dev/2017-December/119522.html.
>>> .
>>>    So far, two big patches related to VPlan implementation have
been
>>> committed.
>>> https://reviews.llvm.org/D28975 by Gil Rapaport. (Introducing VPlan
to
>>> model the vectorized code and drive its transformation)
>>>        Has been broken down to a series of smaller patches and went
in.
>>> The last (re)commit of the series is
>>>        https://reviews.llvm.org/rL311849
>>> https://reviews.llvm.org/D38676 by Gil Rapaport. (Modeling masking
in
>>> VPlan, introducing VPInstructions)
>>>        This is also broken down to a series of smaller patches to
>>> facilitate the review.
>>>        Committed as https://reviews.llvm.org/rL318645
>>>    With the first patch, we introduced the concept of VPlan to LV
>>> and started explicitly recording decisions like interleave memory
>>> access optimization and serialization.
>>> With the second patch, we also introduced the concept of
>>> VPInstruction (to model newly introduced use-def) and started
>>> explicitly modeling masking and mask manipulations.
>>>    So far, work in this area has been mostly restricted to
>>> refactoring of Loop Vectorizer's existing functionality with
several
>>> remaining areas of LV that still need to be refactored into the new
>>> VPlan based representation. The following is a non-exhaustive list
>>> of areas that need to be worked on and gives an idea of the amount
>>> of remaining work.
>>> * Predication
>>> * Cost model
>>> * Remainder Loop
>>> * Runtime Guards
>>> * External Users
>>> * Reduction Epilog
>>> * Interleave Grouping
>>> * Sink Scalar Operands
>>>    While it is important to keep refactoring the above mentioned
>>> items such that entire LV becomes fully VPlan based (which
completes
>>> the transition into VPlan infrastructure), we are aware that there
>>> is also a need to make progress in outer loop vectorization.
>>> Function vectorization via VecClone pass
>>> (https://reviews.llvm.org/D22792) also depends on the progress of
>>> outer loop vectorization.
>>>    =============>>> Proposed Plan:
>>> =============>>> Instead of waiting for most of the above
mentioned refactoring to
>>> finish before working on outer loop vectorization, we believe it
>>> serves the best interest of the community if we start making
>>> concurrent progress on continued refactoring effort and outer loop
>>> vectorization effort.
>>> To implement outer loop vectorization in incremental steps, we
>>> propose adding new code paths that would be exercised initially for
>>> only outer loop vectorization cases. This would ensure that the
>>> changes are NFC for innermost loop vectorization cases until we
>>> (slowly) transition them to also go through these new code paths.
>>>    Given that LV never had the ability to vectorize outer level
>>> loops, letting outer loop vectorization take slightly different
code
>>> paths, compared to innermost loop vectorization, won't cause
any
>>> issues in innermost loop vectorization. The new paths are what
inner
>>> loop vectorization will eventually go through (i.e.,
>>> post-refactoring), after the kinks are worked out so that doing so
>>> will not regress any of the existing functionality and performance.
>>> Having the new code paths exposed early would encourage more
>>> community participating in maturing the new code paths, makes it
>>> clearer what still needs to be moved over to VPlan infrastructure,
>>> and hopefully also encourage more community participation in the
>>> remaining refactoring effort.
>>>    =====================>>> Proposed Patch Series:
>>> =====================>>> There will be a lot of new
development before declaring "feature
>>> complete". As such, we propose to break it into five or more
series
>>> of patches ---- starting from vectorizing trivial outer loops.
Along
>>> the way, we expect to include NFC patches to make LV more modular,
>>> in order to keep LoopVectorize.cpp file size from growing too much.
>>>        Patch Series #1
>>>     ---------------
>>>       Trivial outer loop vectorization
>>>       * Inner loops are trivially lock-step among all vector
elements
>>>       * No other branches inside the outer loop (to avoid touching
>>> divergence analysis in this series)
>>>       This should require
>>>       1) inner loop uniformity check (SCEV based invariance will
do)
>>> --- the first small patch https://reviews.llvm.org/D40874 is a
>>> subset of this.
>>>       2) VPlan construction
>>>       3) Vector code generation (w/ unmodified uniform inner
control
>>> flow) from VPlan
>>>          for (i=0;i<N;i++){ // vectorize here
>>>         // no branches here
>>>         for (j=0;j<loop_invariant_value; j++){
>>>           // no branches here
>>>         }
>>>         // no branches here
>>>       }
>>>        Patch Series #2
>>>     ---------------
>>>       Simple outer loop vectorization
>>>       * Inner loops are still trivially lock-step among all vector
>>> elements
>>>       * Non-loop branches are blindly assumed as divergent
>>>       This should require
>>>       1) Patch Series #1
>>>       2) Predication analysis and masking on VPlan
>>>       3) Vector code generation support for masking
>>>          for (i=0;i<N;i++){ // vectorize here
>>>         // branches may be here
>>>         for (j=0;j<loop_invariant_value; j++){ // unconditional
w.r.t.
>>> i-loop
>>>           // branches may be here
>>>         }
>>>         // branches may be here
>>>       }
>>>        Patch Series #3
>>>     ---------------
>>>       Better simple outer loop vectorization
>>>       * Inner loops are still trivially lock-step among all vector
>>> elements
>>>       * Non-loop branches may be uniform/non-uniform
>>>       This should require
>>>       1) Patch Series #1, #2
>>>       2) We first use SCEV invariance for uniformity analysis.
>>>       3) Vector code generation support for uniform forward
branches
>>>       4) We are working with Simon Moll (U-Saarland) to improve
>>> uniformity analysis
>>> (https://www.youtube.com/watch?v=svMEphbFukw&feature=youtu.be).
>>>          Not a hard dependency, but improved analysis is good for
>>> performance.
>>>          Same kind of loop as Patch Series #2, but uniform branches
>>> are optimized (i.e., less masking).
>>>        Patch Series #4
>>>     ---------------
>>>       Less simple outer loop vectorization
>>>       * Inner loops may not be lock-step among all vector elements,
but
>>> single-entry/single-exit
>>>       This should require
>>>       1) Patch Series #1, #2, #3
>>>       2) VPlan to VPlan transformation to make inner loop
lock-step.
>>> After the transform, the VPlan should look like the ones #3 can
handle.
>>>          We'll look into porting code from U-Saarland RV
project,
>>> which performs this transformation in IR to IR manner.
>>>          for (i=0;i<N;i++){ // vectorize here
>>>         // branches may be here
>>>         if (cond(i)) {
>>>           // branches may be here
>>>           while (cond1(i)){
>>>             // branches may be here
>>>           }
>>>           // branches may be here
>>>         }
>>>         // branch may be here
>>>       }
>>>        Patch Series #5
>>>     ---------------
>>>       Outer loop vectorization "Feature Complete"
>>>       * Inner loops must be redusible, allows multiple exits.
>>>       This should require
>>>       1) Patch Series #1, #2, #3, #4
>>>       2) VPlan to VPlan transformation to make inner loop
lock-step.
>>>       3) Vector code generation support for inner loop with
multiple
>>> exits
>>>          for (i=0;i<N;i++){ // vectorize here
>>>         // branches may be here
>>>         if (cond(1)) {
>>>           // branches may be here
>>>           while (cond1(i)){
>>>             .
>>>             if (cond2) break;
>>>             .
>>>             if (cond3) break;
>>>             .
>>>           }
>>>           // branches may be here
>>>         }
>>>         // branch may be here
>>>       }
>>>    ===========================>>> Future Work beyond this
RFC:
>>> ===========================>>>     * Outer loop
auto-vectorization legality analysis
>>>     * Outer loop auto-vectorization cost model
>>>     * Cost model to compute VF/UF for explicit outer loop
vectorization
>>> (when not specified)
>>>     * (lower in priority) Irrducible inner loop handling (VPlan to
>>> VPlan transform to make inner loops reducible. After the transform,
>>> the VPlan should look like the ones #5 can handle.)
>>>    ==========>>> Trade-Offs:
>>> ==========>>> The new code path also has the ability to
perform innermost loop
>>> vectorization (as a trivial subset of outer loop vectorization if
we
>>> want to exercise that path for such purposes). It's good to be
able
>>> to compare, side-by-side, what existing code path does and what new
>>> path does so that we can incrementally build up
>>> confidence/experience levels on the new code path. For the time
>>> being, however, that can be seen as extra maintenance needs.
Instead
>>> of trying to do this on the trunk, we can do the same thing on the
>>> side, using github, until "outer loop vectorization
feature" becomes
>>> mature enough. Amount of code review required, to get into the
>>> trunk, most likely would be comparable.
>>>    =======>>> Summary:
>>> =======>>> Outer loop vectorization feature implementation
plan is presented. We
>>> think concurrent progress has a good enough chance of expediting
>>> overall transition of LV into the new infrastructure while
>>> benefiting from delivering new functionality to those in need
>>> sooner, with wider community participation in the overall
development.
>>>    Thanks,
>>> Hideki Saito
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Amara Emerson via llvm-dev

2018-Jan-16 02:51 UTC

head link

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

> On 15 Jan 2018, at 21:52, Philip Reames <listmail at
philipreames.com> wrote:
> 
> To revive the discussion around vectorizer testing, here's a quick
sample of a few of the issues hit recently in the loop vectorizer.  I want to be
careful to say that I am not stating these are the result of any recent work,
just that they're issues that have been triaged down to the loop vectorizer
doing something incorrect or questionable from a performance perspective.
> 
> https://bugs.llvm.org/show_bug.cgi?id=35282
> https://bugs.llvm.org/show_bug.cgi?id=35687
> https://bugs.llvm.org/show_bug.cgi?id=35734
> https://bugs.llvm.org/show_bug.cgi?id=35773
> https://reviews.llvm.org/D41939
> 
> I also see another 10 or so correctness related ones in a a simple search:
https://bugs.llvm.org/buglist.cgi?quicksearch=LoopVectorize&list_id=131629
> 
> The loop vectorizer is currently a serious pain point in our regression
testing.  As the rest of the optimizer changes, we are finding that existing
bugs in the vectorizer are being exposed with disturbing frequency and that some
new regressions are landing as well.
> 
> Philip
> 
> On 12/14/2017 01:49 PM, Saito, Hideki wrote:
>>> Another might be to introduce changes under feature flags to ease
the revert/reintroduce/revert cycle.
>> This is essentially the first guard. We plan to have flags/settings to
control which types of outer loops are handled.
>> The new code path is initially exclusive to outer loop vectorization.
If we disable all types of outer loops
>> (and that's the initial default), LV continues to be good old
innermost loop vectorizer.
>> 
>>> I'd like to hear what plan is in place to ensure we don't
destabilize the vectorizer while working on this.
>> W.r.t. this project, this matters when we touch the code that is shared
with innermost loop vectorization
>> and outer loop vectorization. It's certainly good to ensure that
such places to have good test coverage
>> at the time of commit.
>> 
>> This also matters (and matters greatly) when we start guiding innermost
loops towards the new code path (when
>> we are ready). For this, another flag control would be there for
everyone to try before the flipping of the default could
>> land in the trunk (somewhat analogous to Chandler asking people to test
new Pass Manager prior to the switch).
>> We might be able to control which kind of innermost loops would take
new code path, but that's TBD.
>> 
>> Fuzz testing or not, I fully agree that good testing coverage of
vectorizer is desired.
>> 
>> Thanks,
>> Hideki
>> 
>> -----Original Message-----
>> From: Philip Reames [mailto:listmail at philipreames.com]
>> Sent: Thursday, December 14, 2017 1:10 PM
>> To: Saito, Hideki <hideki.saito at intel.com>; llvm-dev at
lists.llvm.org; renato.golin at linaro.org; aemerson at apple.com; mkuper at
google.com; mssimpso at codeaurora.org; Simon Moll <moll at
cs.uni-saarland.de>; hfinkel at anl.gov
>> Subject: Re: [llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop
Vectorization Implementation Plan
>> 
>> One area that needs a bit of attention before work on this proceeds
much further is testing.  The introduction of VPlan appears to have introduced a
couple of bugs and exposed a couple of others.  Most of those are now fixed, but
the process did point out a lack of test coverage around the changes which is
concerning.
>> 
>> I'd like to hear what plan is in place to ensure we don't
destabilize the vectorizer while working on this.
>> 
>> One thing we could consider is leveraging the new IR fuzzer to help
find assertion failures either before submission or shortly there after. Another
might be to introduce changes under feature flags to ease the
revert/reintroduce/revert cycle.
>> 
>> Philip
>> 
>> 
>> On 12/05/2017 05:09 PM, Saito, Hideki via llvm-dev wrote:
>>> Proposal for Outer Loop Vectorization Implementation Plan
>>> ============================================>>> 
>>> ====>>> Goal:
>>> ====>>> Extending Loop Vectorizer (LV) such that it can
handle outer loops, via VPlan infrastructure enhancements.
>>> Understand the trade-offs in trying to make concurrent progress
with
>>> moving remaining inner loop vectorization functionality to VPlan
>>> infrastructure
>>>   ==========>>> Background:
>>> ==========>>> This is related to VPlan infrastructure
project we started a while back, a project to extend the (inner loop
vectorization focused) Loop Vectorizer to support outer loop vectorization.
>>> VPlan is the vectorization planner that records the decisions and
candidate directions to pursue in order to drive cost modeling and vector code
generation. When it is fully integrated into LV (i.e., at the end of this big
project), VPlan will use a Hierarchical-CFG (HCFG) and transform it starting
from the abstraction of the input IR to reflect current vectorization decisions
being made. The HCFG eventually becomes the abstraction of the output IR, and
the vector code generation is driven by this abstract representation.
>>>   This is a follow up of the previous RFCs and LLVM Developer
Conference presentations:
>>>           
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105057.html
>>> (RFC: Extending LV to vectorize outerloops)
>>>           
http://lists.llvm.org/pipermail/llvm-dev/2017-February/110159.html
>>> (RFC: Introducing VPlan to model the vectorized code and drive its
>>> transformation)
>>>            https://www.youtube.com/watch?v=XXAvdUwO7kQ (Extending
>>> LoopVectorizer: OpenMP4.5 SIMD and Outer Loop Auto-Vectorization)
>>>            https://www.youtube.com/watch?v=IqzJRs6tb7Y (Introducing
>>> VPlan to the LoopVectorizer)
>>>            https://www.youtube.com/watch?v=BjBSJFzYDVk (Vectorizing
>>> Loops with VPlan - Current State and Next Steps) Please also refer
to the separately sent out status update
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119522.html.
>>> .
>>>   So far, two big patches related to VPlan implementation have been
committed.
>>> https://reviews.llvm.org/D28975 by Gil Rapaport. (Introducing VPlan
to
>>> model the vectorized code and drive its transformation)
>>>       Has been broken down to a series of smaller patches and went
in.
>>> The last (re)commit of the series is
>>>       https://reviews.llvm.org/rL311849
>>> https://reviews.llvm.org/D38676 by Gil Rapaport. (Modeling masking
in
>>> VPlan, introducing VPInstructions)
>>>       This is also broken down to a series of smaller patches to
facilitate the review.
>>>       Committed as https://reviews.llvm.org/rL318645
>>>   With the first patch, we introduced the concept of VPlan to LV
and started explicitly recording decisions like interleave memory access
optimization and serialization.
>>> With the second patch, we also introduced the concept of
VPInstruction (to model newly introduced use-def) and started explicitly
modeling masking and mask manipulations.
>>>   So far, work in this area has been mostly restricted to
refactoring of Loop Vectorizer's existing functionality with several
remaining areas of LV that still need to be refactored into the new VPlan based
representation. The following is a non-exhaustive list of areas that need to be
worked on and gives an idea of the amount of remaining work.
>>> * Predication
>>> * Cost model
>>> * Remainder Loop
>>> * Runtime Guards
>>> * External Users
>>> * Reduction Epilog
>>> * Interleave Grouping
>>> * Sink Scalar Operands
>>>   While it is important to keep refactoring the above mentioned
items such that entire LV becomes fully VPlan based (which completes the
transition into VPlan infrastructure), we are aware that there is also a need to
make progress in outer loop vectorization. Function vectorization via VecClone
pass (https://reviews.llvm.org/D22792) also depends on the progress of outer
loop vectorization.
>>>   =============>>> Proposed Plan:
>>> =============>>> Instead of waiting for most of the above
mentioned refactoring to finish before working on outer loop vectorization, we
believe it serves the best interest of the community if we start making
concurrent progress on continued refactoring effort and outer loop vectorization
effort.
>>> To implement outer loop vectorization in incremental steps, we
propose adding new code paths that would be exercised initially for only outer
loop vectorization cases. This would ensure that the changes are NFC for
innermost loop vectorization cases until we (slowly) transition them to also go
through these new code paths.
>>>   Given that LV never had the ability to vectorize outer level
loops, letting outer loop vectorization take slightly different code paths,
compared to innermost loop vectorization, won't cause any issues in
innermost loop vectorization. The new paths are what inner loop vectorization
will eventually go through (i.e., post-refactoring), after the kinks are worked
out so that doing so will not regress any of the existing functionality and
performance. Having the new code paths exposed early would encourage more
community participating in maturing the new code paths, makes it clearer what
still needs to be moved over to VPlan infrastructure, and hopefully also
encourage more community participation in the remaining refactoring effort.
>>>   =====================>>> Proposed Patch Series:
>>> =====================>>> There will be a lot of new
development before declaring "feature complete". As such, we propose
to break it into five or more series of patches ---- starting from vectorizing
trivial outer loops. Along the way, we expect to include NFC patches to make LV
more modular, in order to keep LoopVectorize.cpp file size from growing too
much.
>>>       Patch Series #1
>>>    ---------------
>>>      Trivial outer loop vectorization
>>>      * Inner loops are trivially lock-step among all vector
elements
>>>      * No other branches inside the outer loop (to avoid touching
>>> divergence analysis in this series)
>>>      This should require
>>>      1) inner loop uniformity check (SCEV based invariance will do)
--- the first small patch https://reviews.llvm.org/D40874 is a subset of this.
>>>      2) VPlan construction
>>>      3) Vector code generation (w/ unmodified uniform inner control
>>> flow) from VPlan
>>>         for (i=0;i<N;i++){ // vectorize here
>>>        // no branches here
>>>        for (j=0;j<loop_invariant_value; j++){
>>>          // no branches here
>>>        }
>>>        // no branches here
>>>      }
>>>       Patch Series #2
>>>    ---------------
>>>      Simple outer loop vectorization
>>>      * Inner loops are still trivially lock-step among all vector
>>> elements
>>>      * Non-loop branches are blindly assumed as divergent
>>>      This should require
>>>      1) Patch Series #1
>>>      2) Predication analysis and masking on VPlan
>>>      3) Vector code generation support for masking
>>>         for (i=0;i<N;i++){ // vectorize here
>>>        // branches may be here
>>>        for (j=0;j<loop_invariant_value; j++){ // unconditional
w.r.t.
>>> i-loop
>>>          // branches may be here
>>>        }
>>>        // branches may be here
>>>      }
>>>       Patch Series #3
>>>    ---------------
>>>      Better simple outer loop vectorization
>>>      * Inner loops are still trivially lock-step among all vector
>>> elements
>>>      * Non-loop branches may be uniform/non-uniform
>>>      This should require
>>>      1) Patch Series #1, #2
>>>      2) We first use SCEV invariance for uniformity analysis.
>>>      3) Vector code generation support for uniform forward branches
>>>      4) We are working with Simon Moll (U-Saarland) to improve
uniformity analysis
(https://www.youtube.com/watch?v=svMEphbFukw&feature=youtu.be).
>>>         Not a hard dependency, but improved analysis is good for
performance.
>>>         Same kind of loop as Patch Series #2, but uniform branches
are optimized (i.e., less masking).
>>>       Patch Series #4
>>>    ---------------
>>>      Less simple outer loop vectorization
>>>      * Inner loops may not be lock-step among all vector elements,
but
>>> single-entry/single-exit
>>>      This should require
>>>      1) Patch Series #1, #2, #3
>>>      2) VPlan to VPlan transformation to make inner loop lock-step.
After the transform, the VPlan should look like the ones #3 can handle.
>>>         We'll look into porting code from U-Saarland RV
project, which performs this transformation in IR to IR manner.
>>>         for (i=0;i<N;i++){ // vectorize here
>>>        // branches may be here
>>>        if (cond(i)) {
>>>          // branches may be here
>>>          while (cond1(i)){
>>>            // branches may be here
>>>          }
>>>          // branches may be here
>>>        }
>>>        // branch may be here
>>>      }
>>>       Patch Series #5
>>>    ---------------
>>>      Outer loop vectorization "Feature Complete"
>>>      * Inner loops must be redusible, allows multiple exits.
>>>      This should require
>>>      1) Patch Series #1, #2, #3, #4
>>>      2) VPlan to VPlan transformation to make inner loop lock-step.
>>>      3) Vector code generation support for inner loop with multiple
>>> exits
>>>         for (i=0;i<N;i++){ // vectorize here
>>>        // branches may be here
>>>        if (cond(1)) {
>>>          // branches may be here
>>>          while (cond1(i)){
>>>            .
>>>            if (cond2) break;
>>>            .
>>>            if (cond3) break;
>>>            .
>>>          }
>>>          // branches may be here
>>>        }
>>>        // branch may be here
>>>      }
>>>   ===========================>>> Future Work beyond this
RFC:
>>> ===========================>>>    * Outer loop
auto-vectorization legality analysis
>>>    * Outer loop auto-vectorization cost model
>>>    * Cost model to compute VF/UF for explicit outer loop
vectorization
>>> (when not specified)
>>>    * (lower in priority) Irrducible inner loop handling (VPlan to
>>> VPlan transform to make inner loops reducible. After the transform,
>>> the VPlan should look like the ones #5 can handle.)
>>>   ==========>>> Trade-Offs:
>>> ==========>>> The new code path also has the ability to
perform innermost loop vectorization (as a trivial subset of outer loop
vectorization if we want to exercise that path for such purposes). It's good
to be able to compare, side-by-side, what existing code path does and what new
path does so that we can incrementally build up confidence/experience levels on
the new code path. For the time being, however, that can be seen as extra
maintenance needs. Instead of trying to do this on the trunk, we can do the same
thing on the side, using github, until "outer loop vectorization
feature" becomes mature enough. Amount of code review required, to get into
the trunk, most likely would be comparable.
>>>   =======>>> Summary:
>>> =======>>> Outer loop vectorization feature implementation
plan is presented. We
>>> think concurrent progress has a good enough chance of expediting
overall transition of LV into the new infrastructure while benefiting from
delivering new functionality to those in need sooner, with wider community
participation in the overall development.
>>>   Thanks,
>>> Hideki Saito
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
If you’re finding bugs that have been extant for a while, is there anything
special you’re doing that’s uncovering them? If so, is there a way we could
replicate that kind of IR input to the vectorizer to better test the corner
cases?

Amara

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Jan 2018 - [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

Possibly Parallel Threads