Hal Finkel via llvm-dev
2017-Mar-14 19:44 UTC
[llvm-dev] [Proposal][RFC] Epilog loop vectorization
On 03/14/2017 02:37 PM, Michael Kuperstein wrote:> > > On Tue, Mar 14, 2017 at 11:40 AM, Hal Finkel <hfinkel at anl.gov > <mailto:hfinkel at anl.gov>> wrote: > > > > On 03/14/2017 11:58 AM, Michael Kuperstein wrote: >> I'm still not sure about this, for a few reasons: >> >> 1) I'd like to try to treat epilogue loops the same way >> regardless of whether the main loop was vectorized by hand or >> automatically. So if someone hand-wrote an avx-512 16-wide loop, >> with alias checks, and we decide it's profitable to vectorize the >> epilogue loop by 4 and re-use the checks, it ought to be done the >> same way. I realize this may be a pipe-dream, though. > > I agree that sounds ideal. Identifying the effective vectorization > factor of the hand-vectorized loop seems fragile. However, if > someone is hand vectorizing then it seems like a small price to > add a pragma to the scalar loop restricting the vectorization > factor (and/or specifying that it is safe to execute). As a > result, I'm not sure how much effort we should make here. > > > I semi-agree with you, which is why I don't think it's a blocker. :-) > >> >> 2) I'm still somewhat worried about "tiny loops". As I wrote >> before, we explicitly refuse to vectorize loops we know have a >> trip-count less than 16, because our profitability heuristic for >> such loops is probably bad. IIUC the only reason we don't bail >> due to the threshold is because we use the same loop for "failed >> min iters check" and "failed alias check". So, because it's >> reachable through the alias-check path, the max trip count isn't >> actually known, even though the typical trip count is probably >> small. >> It's true that you currently don't try to vectorize the epilogue >> if the original VF is below 16, but this is a somewhat different >> condition. > > I think that the reason we have that limit, however, is that we > don't model the costs of the checks. Not that the cost model is > otherwise too inaccurate for low-trip-count loops. If we modeled > the costs of the checks, then I don't think this would be a problem. > > > I don't think it's just the alias checks. There's also the > min-iteration check,Yes, we'd need to model the cost of all of the checks (trip count, aliasing, overflow, etc.).> broadcasts that get hoisted out of the loop, etc.Can you elaborate? Do you mean that we're hoisting broadcasts outside of the trip-count check (i.e. speculatively executing them) and these are actually expensive operations? Is this is a cost-model problem being exposed by SimplifyCFG (which is hoisting these things outside of the trip-count check)? -Hal>> >> 3) Technically speaking, constructing a new InnerLoopVectorizer >> to vectorize this one loop sounds weird. We already have a >> worklist in the vectorizer that's currently running. > > I agree, although we do want to reuse the cost and legality > analysis (which I think is a worthwhile engineering decision > because that analysis involves SCEV, AA, and TTI, all of which can > get expensive). If we can do that and also just add the new loop > to the work queue, that certainly might be cleaner. > > -Hal > > >> >> I don't think (1) is a blocker, and (3) should be easy to fix, >> but I'm not sure whether the way this is going to handle (2) is >> sufficient. If I'm the only one that this bothers, I won't stand >> in the way, but I'd like to at least make sure we've fully >> considered this. >> >> On Mar 14, 2017 06:00, "Nema, Ashutosh" <Ashutosh.Nema at amd.com >> <mailto:Ashutosh.Nema at amd.com>> wrote: >> >> Summarizing the discussion on the implementation approaches. >> >> Discussed about two approaches, first running >> ‘InnerLoopVectorizer’ again on the epilog loop immediately >> after vectorizing the original loop within the same >> vectorization pass, the second approach where re-running >> vectorization pass and limiting vectorization factor of >> epilog loop by metadata. >> >> <Approach-2> >> >> Challenges with re-running the vectorizer pass: >> >> 1)Reusing alias check result: >> >> When vectorizer pass runs again it finds the epilog loop as a >> new loop and it may generates alias check, this new alias >> check may overkill the gains of epilog vectorization. >> >> We should use the already computed alias check result instead >> of re computing again. >> >> 2)Rerun the vectorizer and hoist the new alias check: >> >> It’s not possible to hoist alias checks as its not fully >> redundant (not dominated by other checks), it’s not getting >> execute in all paths. >> >> NOTE: We cannot prepone alias check as its expensive compared >> to other checks. >> >> <Approach-1> >> >> 1)Current patch depends on the existing functionality of >> LoopVectorizer, it uses ‘InnerLoopVectorizer’ again to >> vectorize the epilog loop, as it happens in the same >> vectorization pass we have flexibility to reuse already >> computed alias result check & limit vectorization factor for >> the epilog loop. >> >> 2)It does not generate the blocks for new block layout >> explicitly, rather it depends on >> ‘InnerLoopVectorizer::createEmptyLoop’ to generate new block >> layout. The new block layout get automatically generated by >> calling the ‘InnerLoopVectorizer:: vectorize’ again. >> >> 3)Block layout description with epilog loop vectorization is >> available at >> >> https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png >> <https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png> >> >> Approach-1 looks feasible, please comment if any objections. >> >> Regards, >> >> Ashutosh >> >> *From:*Nema, Ashutosh >> *Sent:* Wednesday, March 1, 2017 10:42 AM >> *To:* 'Daniel Berlin' <dberlin at dberlin.org >> <mailto:dberlin at dberlin.org>> >> *Cc:* anemet at apple.com <mailto:anemet at apple.com>; Hal Finkel >> <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>; Zaks, Ayal >> <ayal.zaks at intel.com <mailto:ayal.zaks at intel.com>>; Renato >> Golin <renato.golin at linaro.org >> <mailto:renato.golin at linaro.org>>; mkuper at google.com >> <mailto:mkuper at google.com>; Mehdi Amini >> <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>; >> llvm-dev <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> >> *Subject:* RE: [llvm-dev] [Proposal][RFC] Epilog loop >> vectorization >> >> Sorry I misunderstood, gvn/newgvn/gvnhoist cannot help here >> as these checks are not dominated by all paths. >> >> Regards, >> >> Ashutosh >> >> *From:*Daniel Berlin [mailto:dberlin at dberlin.org] >> *Sent:* Tuesday, February 28, 2017 6:58 PM >> *To:* Nema, Ashutosh <Ashutosh.Nema at amd.com >> <mailto:Ashutosh.Nema at amd.com>> >> *Cc:* anemet at apple.com <mailto:anemet at apple.com>; Hal Finkel >> <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>; Zaks, Ayal >> <ayal.zaks at intel.com <mailto:ayal.zaks at intel.com>>; Renato >> Golin <renato.golin at linaro.org >> <mailto:renato.golin at linaro.org>>; mkuper at google.com >> <mailto:mkuper at google.com>; Mehdi Amini >> <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>; >> llvm-dev <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> >> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop >> vectorization >> >> Hoisting or removing? >> Neither pass does hoisting, you'd need gvnhoist for that. >> >> Even then: >> Staring at your example, none of the checks are fully >> redundant (IE they are not dominated by other checks) or >> execute on all paths. >> >> Thus, hoisting them would be purely speculative code motion, >> which none of our passes do. >> >> If you would like these sets of checks to be removed, you >> would need to place them in a place that they execute >> unconditionally. >> >> Otherwise, this is not a standard code hoisting/removal >> transform. >> >> The only redundancy i can see here at all is the repeated >> getelementptr computation. >> >> If you move it to the preheader, it will be eliminated. >> >> Otherwise, none of the checks are redundant. >> >> >> What would you hope to happen in this case? >> >> On Tue, Feb 28, 2017 at 5:09 AM, Nema, Ashutosh >> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>> wrote: >> >> I have tried running both gvn and newgvn but it did not >> helped in hoisting the alias checks: >> >> Please check, maybe I have missed something. >> >> <TestCase> >> >> void foo (char *A, char *B, char *C, int len) { >> >> int i = 0; >> >> for (i=0 ; i< len; i++) >> >> A[i] = B[i] + C[i]; >> >> } >> >> <Command> >> >> $ opt –O3 –gvn test.ll –o test.opt.ll >> >> $ opt –O3 –newgvn test.ll –o test.opt.ll >> >> “test.ll” is attached, it got already vectorized by the >> approach running vectorizer twice by annotate the >> remainder loop with metadata to limit the vectorization >> factor for epilog vector loop. >> >> Regards, >> >> Ashutosh >> >> *From:*anemet at apple.com <mailto:anemet at apple.com> >> [mailto:anemet at apple.com <mailto:anemet at apple.com>] >> *Sent:* Tuesday, February 28, 2017 1:33 AM >> *To:* Hal Finkel <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> >> *Cc:* Daniel Berlin <dberlin at dberlin.org >> <mailto:dberlin at dberlin.org>>; Nema, Ashutosh >> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>>; >> Zaks, Ayal <ayal.zaks at intel.com >> <mailto:ayal.zaks at intel.com>>; Renato Golin >> <renato.golin at linaro.org >> <mailto:renato.golin at linaro.org>>; mkuper at google.com >> <mailto:mkuper at google.com>; Mehdi Amini >> <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>; >> llvm-dev <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> >> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop >> vectorization >> >> On Feb 27, 2017, at 12:01 PM, Hal Finkel >> <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> wrote: >> >> On 02/27/2017 01:47 PM, Daniel Berlin wrote: >> >> On Mon, Feb 27, 2017 at 11:29 AM, Adam Nemet >> <anemet at apple.com <mailto:anemet at apple.com>> wrote: >> >> On Feb 27, 2017, at 10:11 AM, Hal Finkel >> <hfinkel at anl.gov >> <mailto:hfinkel at anl.gov>> wrote: >> >> On 02/27/2017 11:47 AM, Adam Nemet wrote: >> >> On Feb 27, 2017, at 9:39 AM, >> Daniel Berlin >> <dberlin at dberlin.org >> <mailto:dberlin at dberlin.org>> wrote: >> >> On Mon, Feb 27, 2017 at 9:29 AM, >> Adam Nemet <anemet at apple.com >> <mailto:anemet at apple.com>> wrote: >> >> On Feb 27, 2017, at 7:27 >> AM, Hal Finkel >> <hfinkel at anl.gov >> <mailto:hfinkel at anl.gov>> >> wrote: >> >> >> On 02/27/2017 06:29 AM, >> Nema, Ashutosh wrote: >> >> Thanks for looking >> into this. >> >> 1) Issues with re >> running vectorizer: >> >> Vectorizer might >> generate redundant >> alias checks while >> vectorizing epilog loop. >> >> Redundant alias >> checks are expensive, >> we like to reuse the >> results of already >> computed alias checks. >> >> With metadata we can >> limit the width of >> epilog loop, but not >> sure about reusing >> alias check result. >> >> Any thoughts on >> rerunning vectorizer >> with reusing the >> alias check result ? >> >> >> One way of looking at >> this is: Reusing the >> alias-check result is >> really just a conditional >> propagation problem; if >> we don't already have an >> optimization that can >> combine these after the >> fact, then we should. >> >> +Danny >> >> Isn’t Extended SSA supposed >> to help with this? >> >> Yes, it will solve this with no >> issue already. GVN probably does >> already too. >> >> even if if you have >> >> if (a == b) >> >> if (a == c) >> >> if (a == d) >> >> if (a == e) >> >> if (a == g) >> >> and we can prove a ... g >> equivalent, newgvn will eliminate >> them all and set all the branches >> true. >> >> If you need a simpler clean up >> pass, we could run it on sub-graphs. >> >> Yes we probably don’t want to run a >> full GVN after the “loop-scheduling” >> passes. >> >> >> FWIW, we could, just without the >> memory-dependence analysis enabled (i.e. >> set the NoLoads constructor parameter to >> true). GVN is pretty fast in that mode. >> >> OK. Another data point is that I’ve seen >> cases in the past where the alias checks >> required for the loop passes could enable GVN >> to remove redundant loads/stores. Currently >> we can only pick these up with LTO when GVN >> is rerun. >> >> This is just GVN brokenness, newgvn should not >> have this problem. >> >> If it does, i'd love to see it. >> >> >> I thought that the problem is that we just don't run >> GVN after that point in the pipeline. >> >> Yeah, that is the problem but I think Danny misunderstood >> what I was trying to say. >> >> This was a datapoint to possibly rerun GVN with >> memory-awareness. >> >> >> -Hal >> >> (I'm working on the last few parts of turning it >> on by default, but it requires a new >> getModRefInfo interface to be able to get the >> last few testcases) >> >> -- >> >> Hal Finkel >> >> Lead, Compiler Technology and Programming Languages >> >> Leadership Computing Facility >> >> Argonne National Laboratory >> > > -- > Hal Finkel > Lead, Compiler Technology and Programming Languages > Leadership Computing Facility > Argonne National Laboratory > >-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/daf47c89/attachment-0001.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 18444 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/daf47c89/attachment-0001.jpe>
Michael Kuperstein via llvm-dev
2017-Mar-14 19:56 UTC
[llvm-dev] [Proposal][RFC] Epilog loop vectorization
> > > broadcasts that get hoisted out of the loop, etc. > > > Can you elaborate? Do you mean that we're hoisting broadcasts outside of > the trip-count check (i.e. speculatively executing them) and these are > actually expensive operations? Is this is a cost-model problem being > exposed by SimplifyCFG (which is hoisting these things outside of the > trip-count check)? >I'm pretty sure I've seen this happen. We're definitely hoisting them out of the loop (and relying on CGP and/or Loop Sinking to sink them back when profitable), but I also remember seeing the broadcasts executed in cases where the the iteration count was 0. I'll see if I can come up with an example. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/0154a762/attachment.html>