Sanjay Patel via llvm-dev
2017-Mar-22 18:34 UTC
[llvm-dev] Saving Compile Time in InstCombine
> To (hopefully) make it easier to answer this question, I've posted my
> (work-in-progress) patch which adds a known-bits cache to InstCombine.
> I rebased it yesterday, so it should be fairly easy to apply:
> https://reviews.llvm.org/D31239 - Seeing what this does to the performance
> of the benchmarks mentioned in this thread (among others) would certainly
> be interesting.

Thanks! I only have the one rough data point based on PR32037, but caching
does good things for compile time on that example.

Trunk r298514, compiled release on macOS, running on a 4GHz Haswell:

$ time ./opt -O2 row_common.bc -S -o /dev/null

user 0m0.302s
user 0m0.300s
user 0m0.296s
user 0m0.299s
user 0m0.296s

With your patch applied:

user 0m0.264s
user 0m0.269s
user 0m0.269s
user 0m0.268s
user 0m0.268s

So the time for all of -O2 has dropped to 89.6% of the baseline (a speedup
of about 11.5%). A profile shows the time spent in
InstCombine->computeKnownBits dropped from 58 ms to 15 ms (which lines up
with the overall time drop), so we're about 4x faster in ValueTracking
with the caching.

On Wed, Mar 22, 2017 at 7:36 AM, Hal Finkel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> On 03/20/2017 11:51 PM, Gerolf Hoflehner wrote:
>
> On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> Honestly, I'm not a huge fan of this change as-is. The set of transforms
> that were added behind ExpensiveCombines seems awfully strange, and many
> would not lead the reader to believe that they are expensive at all (the
> SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being
> the obviously expensive routines).
>
> The purpose of many of InstCombine's xforms is to canonicalize the IR to
> make life easier for downstream passes and analyses.
>
> As we get further along with compile-time improvements, one question we
> need to ask ourselves more frequently is about the effectiveness of
> optimizations/passes. For example - in this case - how can we make an
> educated assessment that running the combiner N times is a good
> cost/benefit investment of compute resources? The questions below are
> meant to figure out what technologies/instrumentation/etc. could help
> towards a more data-driven decision process when it comes to the
> effectiveness of optimizations. InstCombine might just be an
> inspirational use case to see what is possible in that direction.
>
> The combiner is invoked in full multiple times. But is it really
> necessary to run all of it for that purpose? After InstCombine has run
> once, is there a mapping from transformation -> combines? I suspect most
> transformations could invoke a subset of combines to re-canonicalize.
> Or, if there were a (cheap) verifier for canonical IR, it could invoke a
> specific canonicalization routine. Instrumenting the combiner and
> checking which patterns actually kick in (for different invocations)
> might give insight into how the combiner could be structured so that
> only a subset of patterns needs to be checked.
>
> InstCombine internally relies on knowing what transforms it may or may
> not perform. This is important: canonicalizations may duel endlessly if
> we get this wrong; the order of the combines is also important for
> exactly the same reason (SelectionDAG deals with this problem in a
> different way with its pattern-complexity field).
>
> Can you elaborate on this "duel endlessly" with specific examples? This
> is out of curiosity. There must be verifiers that check that this cannot
> happen, or an implementation strategy that guarantees it. GlobalISel
> will run into the same/similar question when it gets far enough to
> replace SD.
>
> Another concern with moving seemingly arbitrary combines under
> ExpensiveCombines is that it will make it that much harder to understand
> what is and is not canonical at a given point during the execution of
> the optimizer.
>
> I'd be much more interested in a patch which caches the results of
> frequently called ValueTracking functionality like ComputeKnownBits,
> ComputeSignBit, etc., which often don't change but are not intelligently
> reused. I imagine that the performance win might be quite comparable.
>
> Can you back this up with measurements? Caching schemes are tricky. Is
> there a way to evaluate whether the results of ComputeKnownBits etc. are
> actually effective, meaning the result is used and gives faster
> instructions? E.g., it might well be that only the first instance of
> InstCombine benefits from computing the bits.
>
> To (hopefully) make it easier to answer this question, I've posted my
> (work-in-progress) patch which adds a known-bits cache to InstCombine. I
> rebased it yesterday, so it should be fairly easy to apply:
> https://reviews.llvm.org/D31239 - Seeing what this does to the
> performance of the benchmarks mentioned in this thread (among others)
> would certainly be interesting.
>
> -Hal
>
> Such a patch would have the benefit of keeping the set of available
> transforms constant throughout the pipeline while bringing execution
> time down; I wouldn't be at all surprised if caching the ValueTracking
> functions resulted in a bigger time savings.
>
> On Fri, Mar 17, 2017 at 5:49 PM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>> On 03/17/2017 04:30 PM, Mehdi Amini via llvm-dev wrote:
>>
>> On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>>
>> One of the most time-consuming passes in the LLVM middle-end is
>> InstCombine (see e.g. [1]). It is a very powerful pass, capable of
>> doing all kinds of crazy stuff, and new patterns are constantly being
>> introduced there. The problem is that we often use it just as a
>> clean-up pass: it's scheduled 6 times in the current pass pipeline,
>> and each time it's invoked it checks all known patterns. That sounds
>> OK for O3, where we try to squeeze out as much performance as
>> possible, but it is too excessive for other opt levels. InstCombine
>> has an ExpensiveCombines parameter to address that - but I think it's
>> underused at the moment.
>>
>> Yes, "ExpensiveCombines" was added recently (4.0? 3.9?), but I believe
>> it has always been intended to be extended the way you're doing it. So
>> I support this effort :)
>>
>> +1
>>
>> Also, did your profiling reveal why the other combines are expensive?
>> Among other things, I'm curious whether the expensive ones tend to
>> spend a lot of time in ValueTracking (getting known bits and similar)?
>>
>> -Hal
>>
>> CC: David for the general direction on InstCombine, though.
>>
>> —
>> Mehdi
>>
>> Trying to find out which patterns are important and which are rare, I
>> profiled clang using CTMark and got the following coverage report:
>> <InstCombine_covreport.html>
>> (beware, the file is ~6MB).
>>
>> Guided by this profile, I moved some patterns under the "if
>> (ExpensiveCombines)" check, which, as expected, was neutral for
>> runtime performance but improved compile time. The testing results are
>> below (measured for Os).
>>
>> Performance Improvements - Compile Time (CTMark):
>>
>>   Benchmark                     Δ        Previous (s)  Current (s)  σ
>>   CTMark/sqlite3/sqlite3        -1.55%    6.8155        6.7102      0.0081
>>   CTMark/mafft/pairlocalalign   -1.05%    8.0407        7.9559      0.0193
>>   CTMark/ClamAV/clamscan        -1.02%   11.3893       11.2734      0.0081
>>   CTMark/lencod/lencod          -1.01%   12.8763       12.7461      0.0244
>>   CTMark/SPASS/SPASS            -1.01%   12.5048       12.3791      0.0340
>>
>> Performance Improvements - Compile Time (SPEC):
>>
>>   Benchmark                                            Δ        Previous (s)  Current (s)  σ
>>   External/SPEC/CINT2006/403.gcc/403.gcc               -1.64%   54.0801       53.1930      -
>>   External/SPEC/CINT2006/400.perlbench/400.perlbench   -1.25%   19.1481       18.9091      -
>>   External/SPEC/CINT2006/445.gobmk/445.gobmk           -1.01%   15.2819       15.1274      -
>>
>> Do such changes make sense? The patch doesn't change O3, but it does
>> change Os and can potentially change performance there (though I
>> didn't see any changes in my tests).
>>
>> The patch is attached for reference; if we decide to go for it, I'll
>> upload it to Phab:
>>
>> <0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch>
>>
>> Thanks,
>> Michael
>>
>> [1]: http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
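To make the gating idea concrete, here is a minimal, self-contained sketch
of a two-tier combiner. Instr, Rule, and Combiner are invented stand-ins -
the real InstCombiner dispatches through per-opcode visit methods, not rule
lists - but the branch on ExpensiveCombines is the essence of the proposal:

#include <functional>
#include <vector>

struct Instr { /* stand-in for llvm::Instruction; details omitted */ };
using Rule = std::function<bool(Instr &)>; // returns true if it rewrote I

struct Combiner {
  std::vector<Rule> CheapRules;     // pure syntactic pattern matches
  std::vector<Rule> ExpensiveRules; // e.g. rules that recompute known bits
  bool ExpensiveCombines;           // set at -O3, clear at -Os

  bool visit(Instr &I) {
    // Cheap patterns run on every invocation of the combiner.
    for (auto &R : CheapRules)
      if (R(I))
        return true;
    // Expensive patterns run only when the pipeline asked for them.
    if (ExpensiveCombines)
      for (auto &R : ExpensiveRules)
        if (R(I))
          return true;
    return false;
  }
};

Nothing is removed at -O3; the patch only changes which tier a handful of
rarely firing, analysis-heavy patterns live in.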
Hal Finkel via llvm-dev
2017-Mar-22 21:23 UTC
[llvm-dev] Saving Compile Time in InstCombine
On 03/22/2017 01:34 PM, Sanjay Patel wrote:
> [timing details snipped]
>
> So the time for all of -O2 has dropped to 89.6% of the baseline (a
> speedup of about 11.5%). A profile shows the time spent in
> InstCombine->computeKnownBits dropped from 58 ms to 15 ms (which lines
> up with the overall time drop), so we're about 4x faster in
> ValueTracking with the caching.

Yay :-) -- Unfortunately, I won't have time to work on this in the near
future, but if someone would like to pick this up and fix the nsw/nuw
invalidation issue (which is exposed by a few of the regression tests that
fail with the patch applied), that would be great.

 -Hal

[remainder of the quoted thread snipped]

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
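A rough sketch of the cache shape being discussed - simplified stand-ins,
not the actual code from D31239 - with the invalidation hook where the
nsw/nuw problem shows up:

#include <cstdint>
#include <unordered_map>

struct Value { /* stand-in for llvm::Value */ };
struct KnownBits { uint64_t Zero = 0, One = 0; };

class KnownBitsCache {
  std::unordered_map<const Value *, KnownBits> Cache;

public:
  // Return the cached bits for V, or compute and remember them. Compute
  // stands in for the full recursive ValueTracking walk.
  template <typename ComputeFn>
  KnownBits get(const Value *V, ComputeFn Compute) {
    auto It = Cache.find(V);
    if (It != Cache.end())
      return It->second; // reused across combiner iterations
    KnownBits KB = Compute(V);
    Cache.emplace(V, KB);
    return KB;
  }

  // Erase V's entry when V is rewritten, or when a flag such as nsw/nuw
  // is merely dropped from V while the instruction otherwise stays. A
  // real cache must also clear V's transitive users, whose cached bits
  // may have been derived from the old flags; that bookkeeping - the
  // invalidation issue mentioned above - is omitted here.
  void invalidate(const Value *V) { Cache.erase(V); }
};

The subtle part is exactly the second invalidation trigger: InstCombine
sometimes strips nsw/nuw from an instruction it otherwise leaves in place,
and any bits cached while those flags were present are then stale.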
Mikhail Zolotukhin via llvm-dev
2017-Mar-23 01:29 UTC
[llvm-dev] Saving Compile Time in InstCombine
In my testing, the results are not that impressive, but that's because I'm
now focusing on Os. For me, even completely disabling all KnownBits-related
patterns in InstCombine keeps the results very close to the noise level. In
my original patch I also had some extra patterns moved under
ExpensiveCombines - and that seems to make a difference too (without this
part, or without the KnownBits part, I get results below 1%, which are not
reported as regressions/improvements).

Personally, I think caching the results of KnownBits is a good idea, and it
should probably help O3 compile time (and obviously the cases from bug
reports, like PR32037). But I also think that the way we're currently doing
combining/canonicalization is unnecessarily complicated. Do we really need
to try canonicalizing all of these patterns? What happens if we don't? Has
anyone tried replacing some of the InstCombine invocations with
InstSimplify? Do we have a justification for the invocations we currently
have?

I realize that InstCombine doesn't usually do any harm if we don't care
about compile time, but that's only the case for O3 (to some extent), not
for other optimization levels. I think it's equally important to 1) make
our implementations faster, and 2) not perform unnecessary work when
possible. Adding caching for known bits makes InstCombine faster, but we
can gain even more if we don't invoke it when it's not needed.

Of course, we don't want to blindly disable patterns that just happen not
to be exercised in the (admittedly small) test runs that I made. But what
I'd like to do is make a deliberate decision about what's critical and
what's optional here - what can be disabled to save some compile time. I'd
be happy to carry out whatever experiments we need to run to figure this
out, and to take part in implementing any missing parts or necessary
refactoring.

My current plan is to experiment with replacing some InstCombine
invocations with InstSimplify and see what regresses as a result, but if
you know that's already been done by someone, or you have better ideas -
I'd be happy to hear them!

Thanks,
Michael

> On Mar 22, 2017, at 2:23 PM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> [full quote of the earlier thread snipped]
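One way to ground that deliberate decision is the instrumentation Gerolf
suggested up-thread: counting how often each pattern actually fires per
combiner invocation. A minimal sketch with invented names (no such
PatternStats facility exists in the tree):

#include <cstdio>
#include <map>
#include <string>

// Tally successful folds by pattern name so that rarely firing patterns
// can be identified and considered for gating or removal.
struct PatternStats {
  std::map<std::string, unsigned> Hits;

  void recordHit(const std::string &PatternName) { ++Hits[PatternName]; }

  void report(unsigned Invocation) const {
    std::printf("combiner invocation %u:\n", Invocation);
    for (const auto &Entry : Hits)
      std::printf("  %-45s %u\n", Entry.first.c_str(), Entry.second);
  }
};

Each fold routine would call recordHit with its name on success; patterns
with zero or near-zero counts across a corpus like CTMark become the
candidates for disabling or gating.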
Sean Silva via llvm-dev
2017-Mar-24 21:58 UTC
[llvm-dev] Saving Compile Time in InstCombine
An ALIVE-like matching automaton could algorithmically mitigate the cost of
having large numbers of matchers. I don't know if we're at the point where
we need to take that leap, but we should probably answer dannyb's comments
before doing that. Once we know what we want, we can then look for
solutions (possibly using new/different algorithms, such as what we are
doing for NewGVN). Honestly, I think that finding out what we want is going
to be as hard as, or harder than, solving it. As Gerolf mentions, there's
still quite a bit of analysis to be done to better understand InstCombine's
role in our current pipeline.

-- Sean Silva

On Mar 22, 2017 7:29 PM, "Mikhail Zolotukhin via llvm-dev"
<llvm-dev at lists.llvm.org> wrote:

> [full quote of the earlier thread snipped]
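A toy sketch of the first level of such an automaton - bucketing patterns
by root opcode so one lookup prunes most candidates. All names here are
illustrative, and a generated Alive-style automaton would also share
matching structure below the root:

#include <functional>
#include <unordered_map>
#include <vector>

enum class Opcode { Add, Sub, And, Or, Xor, ICmp };
struct Instr { Opcode Op; /* operands omitted */ };

using Rule = std::function<bool(Instr &)>;

class MatcherAutomaton {
  std::unordered_map<Opcode, std::vector<Rule>> ByRoot;

public:
  void addRule(Opcode Root, Rule R) { ByRoot[Root].push_back(std::move(R)); }

  bool run(Instr &I) {
    auto It = ByRoot.find(I.Op);
    if (It == ByRoot.end())
      return false;              // no pattern is rooted at this opcode
    for (auto &R : It->second)   // only rules that can possibly match
      if (R(I))
        return true;
    return false;
  }
};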