Mikulin, Dmitry via llvm-dev
2017-Apr-14 17:39 UTC
[llvm-dev] Saving Compile Time in InstCombine
> On Apr 13, 2017, at 7:43 PM, Davide Italiano <davide at freebsd.org> wrote:
>
> On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry <dmitry.mikulin at sony.com> wrote:
>> I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~30M. Here are my observations so far.
>>
>> Interestingly, caching produced a slight but measurable degradation in -O3 compile time.
>>
>> InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.
>
> It's because we run instcombine as we inline (see addFunctionSimplificationPasses()), IIRC. We don't quite do this at LTO time (FullLTO) because it's too expensive compile-time-wise. ThinLTO runs it.
>
>> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine's use of KnownBits. This may explain the limited gain, or even the slight degradation, if KnownBits are not re-computed as often as we thought.
>>
>> Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors:
>>
>>   ICmpInst            6.1%
>>   LoadInst            5.5%
>>   CallInst            2.1%
>>   GetElementPtrInst   2.1%
>>   StoreInst           1.6%
>>
>> Out of the 35% InstCombine time, about half is spent in the top five visitor routines.
>
> So walking the matchers seems to be expensive, judging from your preliminary analysis? At least, this is what you're saying.

Looks like it. Other than computeKnownBits, most other functions at the top of the profile for InstCombine are instruction visitors.

> Is this a run with debug info? i.e. are you passing -g to the per-TU pipeline? I'm inclined to think this is mostly an additive effect: adding matchers here and there doesn't really hurt small test cases, but we pay the debt over time (in particular for LTO). Side note: I noticed (and others did as well) that instcombine is way slower with `-g` on; one of the reasons could be that we walk much longer use lists, due to the dbg uses. Do you have numbers for instcombine run on IR with and without debug info?

I do have the numbers for the same app with and without debug info. The results above are for the no-debug version.

Total execution time of -O3 is 34% slower with debug info. The size of the debug IR is 162M vs 39M no-debug. Both profiles look relatively similar, with the exception of the bitcode writer and verifier taking a larger share in the -g case.

Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes a significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why. ComputeKnownBits takes about the same time, and other visitors have elevated times, I would guess due to the need to propagate debug info.

>> I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace: running "opt -debug -instcombine" on anything other than small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large-scale problems? If not, it’s possibly another item to add to the to-do list.
>
> You may consider adding statistics (those should be much more scalable), although they are more coarse.
>
> Thanks!
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
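For concreteness, here is a minimal sketch of the statistics route Davide suggests; the DEBUG_TYPE and counter name below are made up for the example and are not existing InstCombine counters:

    // Minimal sketch of LLVM's statistics machinery (llvm/ADT/Statistic.h).
    // DEBUG_TYPE must be defined before STATISTIC is used; the counter
    // name and description here are illustrative only.
    #include "llvm/ADT/Statistic.h"

    #define DEBUG_TYPE "instcombine-experiment"

    STATISTIC(NumDbgCallsSeen, "Number of llvm.dbg.* calls visited");

    // Incrementing a Statistic is a plain counter bump with no per-event
    // I/O; totals are printed once at exit when `opt -stats` is passed.
    static void noteDbgCall() { ++NumDbgCallsSeen; }

Unlike -debug tracing, the per-event cost is a counter increment rather than decoding and printing an instruction, so this scales to multi-megabyte bitcode inputs.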
Mikulin, Dmitry via llvm-dev
2017-Apr-14 21:19 UTC
[llvm-dev] Saving Compile Time in InstCombine
> Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes a significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why.

Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.
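A quick way to sanity-check that hypothesis on real IR is to count what fraction of the calls InstCombine visits are debug intrinsics; this is an illustrative diagnostic, not part of any proposed patch:

    // Count what fraction of call instructions are llvm.dbg.* intrinsics.
    #include "llvm/IR/InstIterator.h"
    #include "llvm/IR/IntrinsicInst.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    static void countDbgCalls(const Module &M, unsigned &DbgCalls,
                              unsigned &TotalCalls) {
      DbgCalls = TotalCalls = 0;
      for (const Function &F : M)
        for (const Instruction &I : instructions(F))
          if (isa<CallInst>(I)) {
            ++TotalCalls;
            if (isa<DbgInfoIntrinsic>(I)) // llvm.dbg.value, llvm.dbg.declare
              ++DbgCalls;
          }
    }

On -g builds of large C++ apps the debug intrinsics can easily dominate the call count, which would match the 5s-to-13s jump in CallInst visitor time reported above.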
Davide Italiano via llvm-dev
2017-Apr-14 21:23 UTC
[llvm-dev] Saving Compile Time in InstCombine
On Fri, Apr 14, 2017 at 2:19 PM, Mikulin, Dmitry <dmitry.mikulin at sony.com> wrote:
>
> Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.

I took a look and saw no immediate problems. I also discussed this with David Majnemer on IRC, who thinks we should just bail out early.

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
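For reference, the early bail-out could look roughly like this; a sketch assuming the shape of InstCombine's call visitor, not the actual patch:

    // Sketch: recognize debug intrinsics so the visitor can skip them
    // before the matcher cascade runs. DbgInfoIntrinsic covers
    // llvm.dbg.value, llvm.dbg.declare, etc.
    #include "llvm/IR/IntrinsicInst.h"
    using namespace llvm;

    static bool isDebugOnlyCall(const CallInst &CI) {
      return isa<DbgInfoIntrinsic>(CI);
    }

    // Inside the visitor, the check would run first:
    //   Instruction *InstCombiner::visitCallInst(CallInst &CI) {
    //     if (isDebugOnlyCall(CI))
    //       return nullptr; // nothing to combine, stop early
    //     ... existing matchers ...
    //   }

Since llvm.dbg.* calls carry only metadata and no combinable semantics, returning early keeps them out of the per-call matcher walk entirely.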
Reid Kleckner via llvm-dev
2017-Apr-15 15:38 UTC
[llvm-dev] Saving Compile Time in InstCombine
I had an idea that llvm.dbg.value should be variadic. I was staring at some program output, and I noticed that debug values tend to group together around inline call sites. It might be interesting to shorten the instruction stream by extending the dbg.value operand list to describe multiple variables and expressions.

On Fri, Apr 14, 2017 at 2:19 PM, Mikulin, Dmitry via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.
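Reid's clustering observation can be quantified with a small helper; this is purely illustrative and not an existing LLVM API:

    // Measure the longest run of consecutive llvm.dbg.* intrinsics: a
    // rough proxy for how many variables a single variadic dbg.value
    // could describe at once, and hence how much the instruction
    // stream could shrink.
    #include "llvm/IR/Function.h"
    #include "llvm/IR/IntrinsicInst.h"
    #include <algorithm>
    using namespace llvm;

    static unsigned longestDbgRun(const Function &F) {
      unsigned Longest = 0;
      for (const BasicBlock &BB : F) {
        unsigned Run = 0;
        for (const Instruction &I : BB) {
          if (isa<DbgInfoIntrinsic>(I))
            Longest = std::max(Longest, ++Run);
          else
            Run = 0; // a non-debug instruction breaks the run
        }
      }
      return Longest;
    }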