Mikulin, Dmitry via llvm-dev
2017-Apr-14 17:39 UTC
[llvm-dev] Saving Compile Time in InstCombine
> On Apr 13, 2017, at 7:43 PM, Davide Italiano <davide at freebsd.org> wrote:
>
> On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry <dmitry.mikulin at sony.com> wrote:
>> I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~30M. Here are my observations so far.
>>
>> Interestingly, caching produced a slight but measurable degradation in -O3 compile time.
>>
>> InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.
>
> It's because we run instcombine as we inline (see addFunctionSimplificationPasses()), IIRC. We don't quite do this at LTO time (FullLTO) because it's too expensive compile-time-wise. ThinLTO runs it.
>
>> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine's use of KnownBits. This may explain the limited gain, or even the slight degradation, if KnownBits are not re-computed as often as we thought.
>>
>> Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors:
>>
>>   ICmpInst            6.1%
>>   LoadInst            5.5%
>>   CallInst            2.1%
>>   GetElementPtrInst   2.1%
>>   StoreInst           1.6%
>>
>> Out of the 35% InstCombine time, about half is spent in the top five visitor routines.
>
> So walking the matchers seems to be expensive, judging from your preliminary analysis? At least, this is what you're saying.

Looks like it. Other than computeKnownBits, most other functions at the top of the profile for InstCombine are instruction visitors.

> Is this a run with debug info? i.e. are you passing -g to the per-TU pipeline? I'm inclined to think this is mostly an additive effect: adding matchers here and there doesn't really hurt small test cases, but we pay the debt over time (in particular for LTO). Side note: I noticed (and others did as well) that instcombine is way slower with `-g` on; one of the reasons could be that we walk much longer use lists, due to the dbg uses. Do you have numbers for instcombine run on IR with and without debug info?

I do have the numbers for the same app with and without debug info. The results above are for the no-debug version.

Total execution time of -O3 is 34% slower with debug info. The size of the debug IR is 162M vs 39M no-debug. Both profiles look relatively similar, with the exception of the bitcode writer and verifier taking a larger share in the -g case.

Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes a significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why. ComputeKnownBits takes about the same time, and other visitors have elevated times, I would guess due to the need to propagate debug info.

>> I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace: running "opt -debug -instcombine" on anything other than small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large-scale problems? If not, it’s possibly another item to add to the to-do list.
>
> You may consider adding statistics (those should be much more scalable), although they are more coarse.
>
> Thanks!
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
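For concreteness, here is a minimal sketch of the statistics route Davide suggests; the DEBUG_TYPE and counter name below are made up for the example and are not existing InstCombine counters:

    // Minimal sketch of LLVM's statistics machinery (llvm/ADT/Statistic.h).
    // DEBUG_TYPE must be defined before STATISTIC is used; the counter
    // name and description here are illustrative only.
    #include "llvm/ADT/Statistic.h"

    #define DEBUG_TYPE "instcombine-experiment"

    STATISTIC(NumDbgCallsSeen, "Number of llvm.dbg.* calls visited");

    // Incrementing a Statistic is a plain counter bump with no per-event
    // I/O; totals are printed once at exit when `opt -stats` is passed.
    static void noteDbgCall() { ++NumDbgCallsSeen; }

Unlike -debug tracing, the per-event cost is a counter increment rather than decoding and printing an instruction, so this scales to multi-megabyte bitcode inputs.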
Mikulin, Dmitry via llvm-dev
2017-Apr-14 21:19 UTC
[llvm-dev] Saving Compile Time in InstCombine
> Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes a significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why.

Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.
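A quick way to sanity-check that hypothesis on real IR is to count what fraction of the calls InstCombine visits are debug intrinsics; this is an illustrative diagnostic, not part of any proposed patch:

    // Count what fraction of call instructions are llvm.dbg.* intrinsics.
    #include "llvm/IR/InstIterator.h"
    #include "llvm/IR/IntrinsicInst.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    static void countDbgCalls(const Module &M, unsigned &DbgCalls,
                              unsigned &TotalCalls) {
      DbgCalls = TotalCalls = 0;
      for (const Function &F : M)
        for (const Instruction &I : instructions(F))
          if (isa<CallInst>(I)) {
            ++TotalCalls;
            if (isa<DbgInfoIntrinsic>(I)) // llvm.dbg.value, llvm.dbg.declare
              ++DbgCalls;
          }
    }

On -g builds of large C++ apps the debug intrinsics can easily dominate the call count, which would match the 5s-to-13s jump in CallInst visitor time reported above.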
Davide Italiano via llvm-dev
2017-Apr-14 21:23 UTC
[llvm-dev] Saving Compile Time in InstCombine
On Fri, Apr 14, 2017 at 2:19 PM, Mikulin, Dmitry <dmitry.mikulin at sony.com> wrote:
>
> Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.

I took a look and saw no immediate problems. I also discussed this with David Majnemer on IRC, who thinks we should just bail out early.

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
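For reference, the early bail-out could look roughly like this; a sketch assuming the shape of InstCombine's call visitor, not the actual patch:

    // Sketch: recognize debug intrinsics so the visitor can skip them
    // before the matcher cascade runs. DbgInfoIntrinsic covers
    // llvm.dbg.value, llvm.dbg.declare, etc.
    #include "llvm/IR/IntrinsicInst.h"
    using namespace llvm;

    static bool isDebugOnlyCall(const CallInst &CI) {
      return isa<DbgInfoIntrinsic>(CI);
    }

    // Inside the visitor, the check would run first:
    //   Instruction *InstCombiner::visitCallInst(CallInst &CI) {
    //     if (isDebugOnlyCall(CI))
    //       return nullptr; // nothing to combine, stop early
    //     ... existing matchers ...
    //   }

Since llvm.dbg.* calls carry only metadata and no combinable semantics, returning early keeps them out of the per-call matcher walk entirely.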
Reid Kleckner via llvm-dev
2017-Apr-15 15:38 UTC
[llvm-dev] Saving Compile Time in InstCombine
I had an idea that llvm.dbg.value should be variadic. I was staring at some program output, and I noticed that debug values tend to group together around inline call sites. It might be interesting to shorten the instruction stream by extending the dbg.value operand list to describe multiple variables and expressions.

On Fri, Apr 14, 2017 at 2:19 PM, Mikulin, Dmitry via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.
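Reid's clustering observation can be quantified with a small helper; this is purely illustrative and not an existing LLVM API:

    // Measure the longest run of consecutive llvm.dbg.* intrinsics: a
    // rough proxy for how many variables a single variadic dbg.value
    // could describe at once, and hence how much the instruction
    // stream could shrink.
    #include "llvm/IR/Function.h"
    #include "llvm/IR/IntrinsicInst.h"
    #include <algorithm>
    using namespace llvm;

    static unsigned longestDbgRun(const Function &F) {
      unsigned Longest = 0;
      for (const BasicBlock &BB : F) {
        unsigned Run = 0;
        for (const Instruction &I : BB) {
          if (isa<DbgInfoIntrinsic>(I))
            Longest = std::max(Longest, ++Run);
          else
            Run = 0; // a non-debug instruction breaks the run
        }
      }
      return Longest;
    }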