thr3ads.net - llvm dev - [llvm-dev] [Proposal][RFC] Epilog loop vectorization [Mar 2017]

If this information is useful, please help other people find it:
Share via:

Hal Finkel via llvm-dev

2017-Mar-14 19:44 UTC

[llvm-dev] [Proposal][RFC] Epilog loop vectorization

On 03/14/2017 02:37 PM, Michael Kuperstein wrote:>
>
> On Tue, Mar 14, 2017 at 11:40 AM, Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>
>     On 03/14/2017 11:58 AM, Michael Kuperstein wrote:
>>     I'm still not sure about this, for a few reasons:
>>
>>     1) I'd like to try to treat epilogue loops the same way
>>     regardless of whether the main loop was vectorized by hand or
>>     automatically. So if someone hand-wrote an avx-512 16-wide loop,
>>     with alias checks, and we decide it's profitable to vectorize
the
>>     epilogue loop by 4 and re-use the checks, it ought to be done the
>>     same way. I realize this may be a pipe-dream, though.
>
>     I agree that sounds ideal. Identifying the effective vectorization
>     factor of the hand-vectorized loop seems fragile. However, if
>     someone is hand vectorizing then it seems like a small price to
>     add a pragma to the scalar loop restricting the vectorization
>     factor (and/or specifying that it is safe to execute). As a
>     result, I'm not sure how much effort we should make here.
>
>
> I semi-agree with you, which is why I don't think it's a blocker.
:-)
>
>>
>>     2) I'm still somewhat worried about "tiny loops". As
I wrote
>>     before, we explicitly refuse to vectorize loops we know have a
>>     trip-count less than 16, because our profitability heuristic for
>>     such loops is probably bad. IIUC the only reason we don't bail
>>     due to the threshold is because we use the same loop for
"failed
>>     min iters check" and "failed alias check". So,
because it's
>>     reachable through the alias-check path, the max trip count
isn't
>>     actually known, even though the typical trip count is probably
>>     small.
>>     It's true that you currently don't try to vectorize the
epilogue
>>     if the original VF is below 16, but this is a somewhat different
>>     condition.
>
>     I think that the reason we have that limit, however, is that we
>     don't model the costs of the checks. Not that the cost model is
>     otherwise too inaccurate for low-trip-count loops. If we modeled
>     the costs of the checks, then I don't think this would be a
problem.
>
>
> I don't think it's just the alias checks. There's also the 
> min-iteration check,
Yes, we'd need to model the cost of all of the checks (trip count, 
aliasing, overflow, etc.).
> broadcasts that get hoisted out of the loop, etc.
Can you elaborate? Do you mean that we're hoisting broadcasts outside of 
the trip-count check (i.e. speculatively executing them) and these are 
actually expensive operations? Is this is a cost-model problem being 
exposed by SimplifyCFG (which is hoisting these things outside of the 
trip-count check)?

  -Hal
>>
>>     3) Technically speaking, constructing a new InnerLoopVectorizer
>>     to vectorize this one loop sounds weird. We already have a
>>     worklist in the vectorizer that's currently running.
>
>     I agree, although we do want to reuse the cost and legality
>     analysis (which I think is a worthwhile engineering decision
>     because that analysis involves SCEV, AA, and TTI, all of which can
>     get expensive). If we can do that and also just add the new loop
>     to the work queue, that certainly might be cleaner.
>
>      -Hal
>
>
>>
>>     I don't think (1) is a blocker, and (3) should be easy to fix,
>>     but I'm not sure whether the way this is going to handle (2) is
>>     sufficient.  If I'm the only one that this bothers, I won't
stand
>>     in the way, but I'd like to at least make sure we've fully
>>     considered this.
>>
>>     On Mar 14, 2017 06:00, "Nema, Ashutosh" <Ashutosh.Nema
at amd.com
>>     <mailto:Ashutosh.Nema at amd.com>> wrote:
>>
>>         Summarizing the discussion on the implementation approaches.
>>
>>         Discussed about two approaches, first running
>>         ‘InnerLoopVectorizer’ again on the epilog loop immediately
>>         after vectorizing the original loop within the same
>>         vectorization pass, the second approach where re-running
>>         vectorization pass and limiting vectorization factor of
>>         epilog loop by metadata.
>>
>>         <Approach-2>
>>
>>         Challenges with re-running the vectorizer pass:
>>
>>         1)Reusing alias check result:
>>
>>         When vectorizer pass runs again it finds the epilog loop as a
>>         new loop and it may generates alias check, this new alias
>>         check may overkill the gains of epilog vectorization.
>>
>>         We should use the already computed alias check result instead
>>         of re computing again.
>>
>>         2)Rerun the vectorizer and hoist the new alias check:
>>
>>         It’s not possible to hoist alias checks as its not fully
>>         redundant (not dominated by other checks), it’s not getting
>>         execute in all paths.
>>
>>         NOTE: We cannot prepone alias check as its expensive compared
>>         to other checks.
>>
>>         <Approach-1>
>>
>>         1)Current patch depends on the existing functionality of
>>         LoopVectorizer, it uses ‘InnerLoopVectorizer’ again to
>>         vectorize the epilog loop, as it happens in the same
>>         vectorization pass we have flexibility to reuse already
>>         computed alias result check & limit vectorization factor
for
>>         the epilog loop.
>>
>>         2)It does not generate the blocks for new block layout
>>         explicitly, rather it depends on
>>         ‘InnerLoopVectorizer::createEmptyLoop’ to generate new block
>>         layout. The new block layout get automatically generated by
>>         calling the ‘InnerLoopVectorizer:: vectorize’ again.
>>
>>         3)Block layout description with epilog loop vectorization is
>>         available at
>>
>>        
https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png
>>        
<https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png>
>>
>>         Approach-1 looks feasible, please comment if any objections.
>>
>>         Regards,
>>
>>         Ashutosh
>>
>>         *From:*Nema, Ashutosh
>>         *Sent:* Wednesday, March 1, 2017 10:42 AM
>>         *To:* 'Daniel Berlin' <dberlin at dberlin.org
>>         <mailto:dberlin at dberlin.org>>
>>         *Cc:* anemet at apple.com <mailto:anemet at apple.com>;
Hal Finkel
>>         <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>;
Zaks, Ayal
>>         <ayal.zaks at intel.com <mailto:ayal.zaks at
intel.com>>; Renato
>>         Golin <renato.golin at linaro.org
>>         <mailto:renato.golin at linaro.org>>; mkuper at
google.com
>>         <mailto:mkuper at google.com>; Mehdi Amini
>>         <mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>>;
>>         llvm-dev <llvm-dev at lists.llvm.org
>>         <mailto:llvm-dev at lists.llvm.org>>
>>         *Subject:* RE: [llvm-dev] [Proposal][RFC] Epilog loop
>>         vectorization
>>
>>         Sorry I misunderstood, gvn/newgvn/gvnhoist cannot help here
>>         as these checks are not dominated by all paths.
>>
>>         Regards,
>>
>>         Ashutosh
>>
>>         *From:*Daniel Berlin [mailto:dberlin at dberlin.org]
>>         *Sent:* Tuesday, February 28, 2017 6:58 PM
>>         *To:* Nema, Ashutosh <Ashutosh.Nema at amd.com
>>         <mailto:Ashutosh.Nema at amd.com>>
>>         *Cc:* anemet at apple.com <mailto:anemet at apple.com>;
Hal Finkel
>>         <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>;
Zaks, Ayal
>>         <ayal.zaks at intel.com <mailto:ayal.zaks at
intel.com>>; Renato
>>         Golin <renato.golin at linaro.org
>>         <mailto:renato.golin at linaro.org>>; mkuper at
google.com
>>         <mailto:mkuper at google.com>; Mehdi Amini
>>         <mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>>;
>>         llvm-dev <llvm-dev at lists.llvm.org
>>         <mailto:llvm-dev at lists.llvm.org>>
>>         *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop
>>         vectorization
>>
>>         Hoisting or removing?
>>         Neither pass does hoisting, you'd need gvnhoist for that.
>>
>>         Even then:
>>         Staring at your example, none of the checks are fully
>>         redundant  (IE they are not dominated by other checks) or
>>         execute on all paths.
>>
>>         Thus, hoisting them  would be purely speculative code motion,
>>         which none of our passes do.
>>
>>         If you would like these sets of checks to be removed, you
>>         would need to place them in a place that they execute
>>         unconditionally.
>>
>>         Otherwise, this is not a standard code hoisting/removal
>>         transform.
>>
>>         The only redundancy i can see here at all is the repeated
>>         getelementptr computation.
>>
>>         If you move it to the preheader, it will be eliminated.
>>
>>         Otherwise, none of the checks are redundant.
>>
>>
>>         What would you hope to happen in this case?
>>
>>         On Tue, Feb 28, 2017 at 5:09 AM, Nema, Ashutosh
>>         <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at
amd.com>> wrote:
>>
>>             I have tried running both gvn and newgvn but it did not
>>             helped in hoisting the alias checks:
>>
>>             Please check, maybe I have missed something.
>>
>>             <TestCase>
>>
>>             void foo (char *A, char *B, char *C, int len) {
>>
>>             int i = 0;
>>
>>             for (i=0 ; i< len; i++)
>>
>>             A[i] = B[i] + C[i];
>>
>>             }
>>
>>             <Command>
>>
>>             $ opt –O3 –gvn test.ll –o test.opt.ll
>>
>>             $ opt –O3 –newgvn test.ll –o test.opt.ll
>>
>>             “test.ll” is attached, it got already vectorized by the
>>             approach running vectorizer twice by annotate the
>>             remainder loop with metadata to limit the vectorization
>>             factor for epilog vector loop.
>>
>>             Regards,
>>
>>             Ashutosh
>>
>>             *From:*anemet at apple.com <mailto:anemet at
apple.com>
>>             [mailto:anemet at apple.com <mailto:anemet at
apple.com>]
>>             *Sent:* Tuesday, February 28, 2017 1:33 AM
>>             *To:* Hal Finkel <hfinkel at anl.gov <mailto:hfinkel
at anl.gov>>
>>             *Cc:* Daniel Berlin <dberlin at dberlin.org
>>             <mailto:dberlin at dberlin.org>>; Nema, Ashutosh
>>             <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at
amd.com>>;
>>             Zaks, Ayal <ayal.zaks at intel.com
>>             <mailto:ayal.zaks at intel.com>>; Renato Golin
>>             <renato.golin at linaro.org
>>             <mailto:renato.golin at linaro.org>>; mkuper at
google.com
>>             <mailto:mkuper at google.com>; Mehdi Amini
>>             <mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>>;
>>             llvm-dev <llvm-dev at lists.llvm.org
>>             <mailto:llvm-dev at lists.llvm.org>>
>>             *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop
>>             vectorization
>>
>>                 On Feb 27, 2017, at 12:01 PM, Hal Finkel
>>                 <hfinkel at anl.gov <mailto:hfinkel at
anl.gov>> wrote:
>>
>>                 On 02/27/2017 01:47 PM, Daniel Berlin wrote:
>>
>>                     On Mon, Feb 27, 2017 at 11:29 AM, Adam Nemet
>>                     <anemet at apple.com <mailto:anemet at
apple.com>> wrote:
>>
>>                             On Feb 27, 2017, at 10:11 AM, Hal Finkel
>>                             <hfinkel at anl.gov
>>                             <mailto:hfinkel at anl.gov>>
wrote:
>>
>>                             On 02/27/2017 11:47 AM, Adam Nemet wrote:
>>
>>                                     On Feb 27, 2017, at 9:39 AM,
>>                                     Daniel Berlin
>>                                     <dberlin at dberlin.org
>>                                     <mailto:dberlin at
dberlin.org>> wrote:
>>
>>                                     On Mon, Feb 27, 2017 at 9:29 AM,
>>                                     Adam Nemet <anemet at apple.com
>>                                     <mailto:anemet at
apple.com>> wrote:
>>
>>                                             On Feb 27, 2017, at 7:27
>>                                             AM, Hal Finkel
>>                                             <hfinkel at anl.gov
>>                                             <mailto:hfinkel at
anl.gov>>
>>                                             wrote:
>>
>>
>>                                             On 02/27/2017 06:29 AM,
>>                                             Nema, Ashutosh wrote:
>>
>>                                                 Thanks for looking
>>                                                 into this.
>>
>>                                                 1) Issues with re
>>                                                 running vectorizer:
>>
>>                                                 Vectorizer might
>>                                                 generate redundant
>>                                                 alias checks while
>>                                                 vectorizing epilog
loop.
>>
>>                                                 Redundant alias
>>                                                 checks are expensive,
>>                                                 we like to reuse the
>>                                                 results of already
>>                                                 computed alias checks.
>>
>>                                                 With metadata we can
>>                                                 limit the width of
>>                                                 epilog loop, but not
>>                                                 sure about reusing
>>                                                 alias check result.
>>
>>                                                 Any thoughts on
>>                                                 rerunning vectorizer
>>                                                 with reusing the
>>                                                 alias check result ?
>>
>>
>>                                             One way of looking at
>>                                             this is: Reusing the
>>                                             alias-check result is
>>                                             really just a conditional
>>                                             propagation problem; if
>>                                             we don't already have
an
>>                                             optimization that can
>>                                             combine these after the
>>                                             fact, then we should.
>>
>>                                         +Danny
>>
>>                                         Isn’t Extended SSA supposed
>>                                         to help with this?
>>
>>                                     Yes, it will solve this with no
>>                                     issue already.  GVN probably does
>>                                     already too.
>>
>>                                     even if if you have
>>
>>                                     if (a == b)
>>
>>                                     if (a == c)
>>
>>                                      if (a == d)
>>
>>                                      if (a == e)
>>
>>                                      if (a == g)
>>
>>                                     and  we can prove a ... g
>>                                     equivalent, newgvn will eliminate
>>                                     them all and set all the branches
>>                                     true.
>>
>>                                     If you need a simpler clean up
>>                                     pass, we could run it on
sub-graphs.
>>
>>                                 Yes we probably don’t want to run a
>>                                 full GVN after the “loop-scheduling”
>>                                 passes.
>>
>>
>>                             FWIW, we could, just without the
>>                             memory-dependence analysis enabled (i.e.
>>                             set the NoLoads constructor parameter to
>>                             true). GVN is pretty fast in that mode.
>>
>>                         OK.  Another data point is that I’ve seen
>>                         cases in the past where the alias checks
>>                         required for the loop passes could enable GVN
>>                         to remove redundant loads/stores. Currently
>>                         we can only pick these up with LTO when GVN
>>                         is rerun.
>>
>>                     This is just GVN brokenness, newgvn should not
>>                     have this problem.
>>
>>                     If it does, i'd love to see it.
>>
>>
>>                 I thought that the problem is that we just don't
run
>>                 GVN after that point in the pipeline.
>>
>>             Yeah, that is the problem but I think Danny misunderstood
>>             what I was trying to say.
>>
>>             This was a datapoint to possibly rerun GVN with
>>             memory-awareness.
>>
>>
>>                  -Hal
>>
>>                     (I'm working on the last few parts of turning
it
>>                     on by default, but it requires a new
>>                     getModRefInfo interface to be able to get the
>>                     last few testcases)
>>
>>                 -- 
>>
>>                 Hal Finkel
>>
>>                 Lead, Compiler Technology and Programming Languages
>>
>>                 Leadership Computing Facility
>>
>>                 Argonne National Laboratory
>>
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/daf47c89/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/jpeg
Size: 18444 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/daf47c89/attachment-0001.jpe>

Michael Kuperstein via llvm-dev

2017-Mar-14 19:56 UTC

head link

[llvm-dev] [Proposal][RFC] Epilog loop vectorization

>
>
> broadcasts that get hoisted out of the loop, etc.
>
>
> Can you elaborate? Do you mean that we're hoisting broadcasts outside
of
> the trip-count check (i.e. speculatively executing them) and these are
> actually expensive operations? Is this is a cost-model problem being
> exposed by SimplifyCFG (which is hoisting these things outside of the
> trip-count check)?
>
I'm pretty sure I've seen this happen. We're definitely hoisting
them out
of the loop (and relying on CGP and/or Loop Sinking to sink them back when
profitable), but I also remember seeing the broadcasts executed in cases
where the the iteration count was 0.

I'll see if I can come up with an example.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/0154a762/attachment.html>

llvm dev - Mar 2017 - [Proposal][RFC] Epilog loop vectorization

[llvm-dev] [Proposal][RFC] Epilog loop vectorization

[llvm-dev] [Proposal][RFC] Epilog loop vectorization