thr3ads.net - llvm dev - [llvm-dev] [LLVMdev] Path forward on profile guided inlining? [Dec 2015]

If this information is useful, please help other people find it:
Share via:

Xinliang David Li via llvm-dev

2015-Dec-11 01:03 UTC

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
<listmail at philipreames.com> wrote:>
>
> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>
>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>>
>>> Given I didn't get any response to my original query, I chose
not to
>>> invest
>>> time in this at the time.  I am unlikely to get time for this in
the near
>>> future.
>>>
>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>
>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>> llvm-dev at lists.llvm.org)
>>>
>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman at
google.com> wrote:
>>>>
>>>> Hi Philip,
>>>>
>>>>   Is there any update on this? I've been sending patches to
get rid of
>>>> the
>>>> callee hotness based inline hints from the frontend and move
the logic
>>>> to
>>>> the inliner. The next step is to use the callsite hotness
instead. I
>>>> also
>>>> want to focus on the infrastructure to enable this and what
I've been
>>>> experimenting with is similar to your two alternative
approaches:
>>>>
>>>>> Alternate Approaches:
>>>>> 1) We could just recompute BFI explicitly in the inliner
right before
>>>>> passing the result to ICA for the purposes of prototyping.
If this was
>>>>> off
>>>>> by default, this might be a reasonable scheme for
investigation.  This
>>>>> could
>>>>> likely never be enabled for real uses.
>>>>> 2) We could pre-compute BFI once per function within a
given SCC and
>>>>> then
>>>>> try to keep it up to date during inlining.  If we cached
the call
>>>>> frequencies for the initial call sites, we could adjust the
visit order
>>>>> to
>>>>> minimize the number of times we need to recompute a given
functions
>>>>> block
>>>>> frequencies.  (e.g. we can look at all the original call
sites within a
>>>>> function before looking at newly inlined ones)
>>>>>
>>>> My proposal is very similar (perhaps identical) to your option
2 above.
>>>> I
>>>> don't understand the part where you talk about adjusting
the visit order
>>>> to
>>>> minimize BFI computation.
>>>>
>>>> BFI computation:  BFI for a function is computed on demand and
cached.
>>>
>>> The problem here is that the old pass manager does not have a
place/way
>>> to
>>> cache the function level analysis.  You could do this within the
inliner
>>> itself, but the function passes run by the SCC pass manager will
>>> invalidate
>>> everything.  That's the problem I was trying to solve.
>>
>> Easwaran's proposal does not rely on the analysis pass framework of
>> the legacy pass manager -- that is the key idea. BFI info is designed
>> in a way such that it can be computed on-demand.
>>
>> BFI for the caller will be incrementally updated during inlining, but
>> after inlining is done for a node, the BFI info will be explicitly
>> invalidated for that node due to the function passes run later by SCC
>> pass manager.  This is expected.
>
> Just to make sure, you're aware this will cost at least one
re-computation
> per iteration in CGPassManager::runOnModule right? If that's expected
and
> acceptable, then yes, the approach you seem to be describing should be
> workable.
>
For each leaf node, the number of BFI computation is <= 1. For other
nodes, it is <=2 (one as caller, one as callee) -- in other words, it
is linear to the number of functions.

> Particularly as an intermediate step to parallelize work with the new pass
> manager, this sounds very much worthwhile.
>>
Yes, that is exactly the intention -- it can buy pass manager rework
more time while unlock lots of pending performance tuning work we plan
to do.

thanks,

David

>>
>>>> Update: When 'bar' gets inlined into 'foo', the
BFI for 'foo' is
>>>> updated.
>>>> Let OldBB in 'bar' gets cloned as NewBB in
'foo'.  NewBB's block
>>>> frequency
>>>> can be incrementally computed from OldBB's block frequency,
entry block
>>>> frequency of 'bar' and the frequency of the block
containing the 'foo'
>>>> ->
>>>> 'bar' callsite. Even when the new CGSCC level BFI
analysis is in place,
>>>> this
>>>> incremental update is useful to minimize computation.
>>>>
>>>> Invalidation:  Once inlining is completed in an SCC (at the end
of
>>>> runOnSCC), the entries for functions in that SCC are
invalidated since
>>>> other
>>>> passes run by the CGSCC pass manager (including those run by
the
>>>> function
>>>> pass manager run under CGSCC pass manager) might affect the
computed BFI
>>>> for
>>>> the functions in the SCC.
>>>
>>> This sounds like you're essentially planning to extend the old
pass
>>> manager
>>> with function level analysis support in the SCC pass manager.  I
advise
>>> against this given the fact that Chandler is actively working on
the new
>>> one
>>> and getting patches reviewed for the old one is
"challenging" at best.
>>> If
>>> you take this approach, make sure Chandler - who is pretty much the
only
>>> active reviewer in this area - is on-board with your approach
before you
>>> start.
>>
>> There is no plan to extend old pass manager for this capability.  The
>> proposed solution is intended to work with the new pass manager too,
>> so there is no conflict here. The BFI book-keeping by inliner can be
>> removed with pass manager change is in place.
>
> Thanks for the clarification.
>
>>
>> David
>>
>>
>>>> When the new PM infrastructure and a CGSCC based BFI analysis
is in
>>>> place,
>>>> the transition should be easy assuming it will provide
getBFI(Function
>>>> *)
>>>> and invalidateBFI(Function *) interfaces. BFI for a function is
computed
>>>> at
>>>> most twice in this approach. Thoughts?
>>>>
>>>>
>>>> Thanks,
>>>> Easwaran
>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>
>>>>
>>>
>

Xinliang David Li via llvm-dev

2015-Dec-11 01:06 UTC

head link

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

On Thu, Dec 10, 2015 at 5:03 PM, Xinliang David Li <davidxl at google.com>
wrote:> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
> <listmail at philipreames.com> wrote:
>>
>>
>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>>
>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>> <listmail at philipreames.com> wrote:
>>>>
>>>> Given I didn't get any response to my original query, I
chose not to
>>>> invest
>>>> time in this at the time.  I am unlikely to get time for this
in the near
>>>> future.
>>>>
>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>
>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>> llvm-dev at lists.llvm.org)
>>>>
>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman at
google.com> wrote:
>>>>>
>>>>> Hi Philip,
>>>>>
>>>>>   Is there any update on this? I've been sending
patches to get rid of
>>>>> the
>>>>> callee hotness based inline hints from the frontend and
move the logic
>>>>> to
>>>>> the inliner. The next step is to use the callsite hotness
instead. I
>>>>> also
>>>>> want to focus on the infrastructure to enable this and what
I've been
>>>>> experimenting with is similar to your two alternative
approaches:
>>>>>
>>>>>> Alternate Approaches:
>>>>>> 1) We could just recompute BFI explicitly in the
inliner right before
>>>>>> passing the result to ICA for the purposes of
prototyping. If this was
>>>>>> off
>>>>>> by default, this might be a reasonable scheme for
investigation.  This
>>>>>> could
>>>>>> likely never be enabled for real uses.
>>>>>> 2) We could pre-compute BFI once per function within a
given SCC and
>>>>>> then
>>>>>> try to keep it up to date during inlining.  If we
cached the call
>>>>>> frequencies for the initial call sites, we could adjust
the visit order
>>>>>> to
>>>>>> minimize the number of times we need to recompute a
given functions
>>>>>> block
>>>>>> frequencies.  (e.g. we can look at all the original
call sites within a
>>>>>> function before looking at newly inlined ones)
>>>>>>
>>>>> My proposal is very similar (perhaps identical) to your
option 2 above.
>>>>> I
>>>>> don't understand the part where you talk about
adjusting the visit order
>>>>> to
>>>>> minimize BFI computation.
>>>>>
>>>>> BFI computation:  BFI for a function is computed on demand
and cached.
>>>>
>>>> The problem here is that the old pass manager does not have a
place/way
>>>> to
>>>> cache the function level analysis.  You could do this within
the inliner
>>>> itself, but the function passes run by the SCC pass manager
will
>>>> invalidate
>>>> everything.  That's the problem I was trying to solve.
>>>
>>> Easwaran's proposal does not rely on the analysis pass
framework of
>>> the legacy pass manager -- that is the key idea. BFI info is
designed
>>> in a way such that it can be computed on-demand.
>>>
>>> BFI for the caller will be incrementally updated during inlining,
but
>>> after inlining is done for a node, the BFI info will be explicitly
>>> invalidated for that node due to the function passes run later by
SCC
>>> pass manager.  This is expected.
>>
>> Just to make sure, you're aware this will cost at least one
re-computation
>> per iteration in CGPassManager::runOnModule right? If that's
expected and
>> acceptable, then yes, the approach you seem to be describing should be
>> workable.
>>
>
> For each leaf node, the number of BFI computation is <= 1. For other
> nodes, it is <=2 (one as caller, one as callee) -- in other words, it
> is linear to the number of functions.
With PGO, the cost can be greatly reduced. For instance, if the
function entry count is zero (in practice, large percentage of
functions are like this), do not bother to do frequency analysis --
they will be optimized for size.

David
>
>
>> Particularly as an intermediate step to parallelize work with the new
pass
>> manager, this sounds very much worthwhile.
>>>
>
> Yes, that is exactly the intention -- it can buy pass manager rework
> more time while unlock lots of pending performance tuning work we plan
> to do.
>
> thanks,
>
> David
>
>
>>>
>>>>> Update: When 'bar' gets inlined into 'foo',
the BFI for 'foo' is
>>>>> updated.
>>>>> Let OldBB in 'bar' gets cloned as NewBB in
'foo'.  NewBB's block
>>>>> frequency
>>>>> can be incrementally computed from OldBB's block
frequency, entry block
>>>>> frequency of 'bar' and the frequency of the block
containing the 'foo'
>>>>> ->
>>>>> 'bar' callsite. Even when the new CGSCC level BFI
analysis is in place,
>>>>> this
>>>>> incremental update is useful to minimize computation.
>>>>>
>>>>> Invalidation:  Once inlining is completed in an SCC (at the
end of
>>>>> runOnSCC), the entries for functions in that SCC are
invalidated since
>>>>> other
>>>>> passes run by the CGSCC pass manager (including those run
by the
>>>>> function
>>>>> pass manager run under CGSCC pass manager) might affect the
computed BFI
>>>>> for
>>>>> the functions in the SCC.
>>>>
>>>> This sounds like you're essentially planning to extend the
old pass
>>>> manager
>>>> with function level analysis support in the SCC pass manager. 
I advise
>>>> against this given the fact that Chandler is actively working
on the new
>>>> one
>>>> and getting patches reviewed for the old one is
"challenging" at best.
>>>> If
>>>> you take this approach, make sure Chandler - who is pretty much
the only
>>>> active reviewer in this area - is on-board with your approach
before you
>>>> start.
>>>
>>> There is no plan to extend old pass manager for this capability. 
The
>>> proposed solution is intended to work with the new pass manager
too,
>>> so there is no conflict here. The BFI book-keeping by inliner can
be
>>> removed with pass manager change is in place.
>>
>> Thanks for the clarification.
>>
>>>
>>> David
>>>
>>>
>>>>> When the new PM infrastructure and a CGSCC based BFI
analysis is in
>>>>> place,
>>>>> the transition should be easy assuming it will provide
getBFI(Function
>>>>> *)
>>>>> and invalidateBFI(Function *) interfaces. BFI for a
function is computed
>>>>> at
>>>>> most twice in this approach. Thoughts?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Easwaran
>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>
>>>>>
>>>>
>>

Philip Reames via llvm-dev

2015-Dec-11 01:12 UTC

head link

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

On 12/10/2015 05:03 PM, Xinliang David Li wrote:> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
> <listmail at philipreames.com> wrote:
>>
>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>> <listmail at philipreames.com> wrote:
>>>> Given I didn't get any response to my original query, I
chose not to
>>>> invest
>>>> time in this at the time.  I am unlikely to get time for this
in the near
>>>> future.
>>>>
>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>
>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>> llvm-dev at lists.llvm.org)
>>>>
>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman at
google.com> wrote:
>>>>> Hi Philip,
>>>>>
>>>>>    Is there any update on this? I've been sending
patches to get rid of
>>>>> the
>>>>> callee hotness based inline hints from the frontend and
move the logic
>>>>> to
>>>>> the inliner. The next step is to use the callsite hotness
instead. I
>>>>> also
>>>>> want to focus on the infrastructure to enable this and what
I've been
>>>>> experimenting with is similar to your two alternative
approaches:
>>>>>
>>>>>> Alternate Approaches:
>>>>>> 1) We could just recompute BFI explicitly in the
inliner right before
>>>>>> passing the result to ICA for the purposes of
prototyping. If this was
>>>>>> off
>>>>>> by default, this might be a reasonable scheme for
investigation.  This
>>>>>> could
>>>>>> likely never be enabled for real uses.
>>>>>> 2) We could pre-compute BFI once per function within a
given SCC and
>>>>>> then
>>>>>> try to keep it up to date during inlining.  If we
cached the call
>>>>>> frequencies for the initial call sites, we could adjust
the visit order
>>>>>> to
>>>>>> minimize the number of times we need to recompute a
given functions
>>>>>> block
>>>>>> frequencies.  (e.g. we can look at all the original
call sites within a
>>>>>> function before looking at newly inlined ones)
>>>>>>
>>>>> My proposal is very similar (perhaps identical) to your
option 2 above.
>>>>> I
>>>>> don't understand the part where you talk about
adjusting the visit order
>>>>> to
>>>>> minimize BFI computation.
>>>>>
>>>>> BFI computation:  BFI for a function is computed on demand
and cached.
>>>> The problem here is that the old pass manager does not have a
place/way
>>>> to
>>>> cache the function level analysis.  You could do this within
the inliner
>>>> itself, but the function passes run by the SCC pass manager
will
>>>> invalidate
>>>> everything.  That's the problem I was trying to solve.
>>> Easwaran's proposal does not rely on the analysis pass
framework of
>>> the legacy pass manager -- that is the key idea. BFI info is
designed
>>> in a way such that it can be computed on-demand.
>>>
>>> BFI for the caller will be incrementally updated during inlining,
but
>>> after inlining is done for a node, the BFI info will be explicitly
>>> invalidated for that node due to the function passes run later by
SCC
>>> pass manager.  This is expected.
>> Just to make sure, you're aware this will cost at least one
re-computation
>> per iteration in CGPassManager::runOnModule right? If that's
expected and
>> acceptable, then yes, the approach you seem to be describing should be
>> workable.
>>
> For each leaf node, the number of BFI computation is <= 1. For other
> nodes, it is <=2 (one as caller, one as callee) -- in other words, it
> is linear to the number of functions.This is not true.

In CGPassManager::runOnModule, we will iterate on the *same* SCC 
multiple times if we're making progress (i.e. devirtualizing call sites) 
on each iteration.  The BFI will be invalidated on each iteration.  This 
is caped at 4 iterations, so it's still semi linear, but only because 
there's a cap.

Examples where this might kick in:
void bar() {};
func_ptr foo() { return &bar; }
void caller() {
   func_ptr f = foo();
   f();
}

We will visit caller twice so that we can inline both foo and bar.

I think this iteration is okay for what your proposing - i.e. off by 
default, migration step, etc - but it's important to understand the 
implications of the planned changes.

>
>
>> Particularly as an intermediate step to parallelize work with the new
pass
>> manager, this sounds very much worthwhile.
> Yes, that is exactly the intention -- it can buy pass manager rework
> more time while unlock lots of pending performance tuning work we plan
> to do.
>
> thanks,
>
> David
>
>
>>>>> Update: When 'bar' gets inlined into 'foo',
the BFI for 'foo' is
>>>>> updated.
>>>>> Let OldBB in 'bar' gets cloned as NewBB in
'foo'.  NewBB's block
>>>>> frequency
>>>>> can be incrementally computed from OldBB's block
frequency, entry block
>>>>> frequency of 'bar' and the frequency of the block
containing the 'foo'
>>>>> ->
>>>>> 'bar' callsite. Even when the new CGSCC level BFI
analysis is in place,
>>>>> this
>>>>> incremental update is useful to minimize computation.
>>>>>
>>>>> Invalidation:  Once inlining is completed in an SCC (at the
end of
>>>>> runOnSCC), the entries for functions in that SCC are
invalidated since
>>>>> other
>>>>> passes run by the CGSCC pass manager (including those run
by the
>>>>> function
>>>>> pass manager run under CGSCC pass manager) might affect the
computed BFI
>>>>> for
>>>>> the functions in the SCC.
>>>> This sounds like you're essentially planning to extend the
old pass
>>>> manager
>>>> with function level analysis support in the SCC pass manager. 
I advise
>>>> against this given the fact that Chandler is actively working
on the new
>>>> one
>>>> and getting patches reviewed for the old one is
"challenging" at best.
>>>> If
>>>> you take this approach, make sure Chandler - who is pretty much
the only
>>>> active reviewer in this area - is on-board with your approach
before you
>>>> start.
>>> There is no plan to extend old pass manager for this capability. 
The
>>> proposed solution is intended to work with the new pass manager
too,
>>> so there is no conflict here. The BFI book-keeping by inliner can
be
>>> removed with pass manager change is in place.
>> Thanks for the clarification.
>>
>>> David
>>>
>>>
>>>>> When the new PM infrastructure and a CGSCC based BFI
analysis is in
>>>>> place,
>>>>> the transition should be easy assuming it will provide
getBFI(Function
>>>>> *)
>>>>> and invalidateBFI(Function *) interfaces. BFI for a
function is computed
>>>>> at
>>>>> most twice in this approach. Thoughts?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Easwaran
>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>

Hal Finkel via llvm-dev

2015-Dec-11 01:22 UTC

head link

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

----- Original Message -----> From: "Philip Reames via llvm-dev" <llvm-dev at
lists.llvm.org>
> To: "Xinliang David Li" <davidxl at google.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, December 10, 2015 7:12:50 PM
> Subject: Re: [llvm-dev] [LLVMdev] Path forward on profile guided inlining?
> 
> 
> 
> On 12/10/2015 05:03 PM, Xinliang David Li wrote:
> > On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
> > <listmail at philipreames.com> wrote:
> >>
> >> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
> >>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
> >>> <listmail at philipreames.com> wrote:
> >>>> Given I didn't get any response to my original query,
I chose
> >>>> not to
> >>>> invest
> >>>> time in this at the time.  I am unlikely to get time for
this in
> >>>> the near
> >>>> future.
> >>>>
> >>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
> >>>>
> >>>> (Resending after  removing llvmdev at cs.uiuc.edu and
using
> >>>> llvm-dev at lists.llvm.org)
> >>>>
> >>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman
> >>>> <eraman at google.com> wrote:
> >>>>> Hi Philip,
> >>>>>
> >>>>>    Is there any update on this? I've been sending
patches to
> >>>>>    get rid of
> >>>>> the
> >>>>> callee hotness based inline hints from the frontend
and move
> >>>>> the logic
> >>>>> to
> >>>>> the inliner. The next step is to use the callsite
hotness
> >>>>> instead. I
> >>>>> also
> >>>>> want to focus on the infrastructure to enable this and
what
> >>>>> I've been
> >>>>> experimenting with is similar to your two alternative
> >>>>> approaches:
> >>>>>
> >>>>>> Alternate Approaches:
> >>>>>> 1) We could just recompute BFI explicitly in the
inliner right
> >>>>>> before
> >>>>>> passing the result to ICA for the purposes of
prototyping. If
> >>>>>> this was
> >>>>>> off
> >>>>>> by default, this might be a reasonable scheme for
> >>>>>> investigation.  This
> >>>>>> could
> >>>>>> likely never be enabled for real uses.
> >>>>>> 2) We could pre-compute BFI once per function
within a given
> >>>>>> SCC and
> >>>>>> then
> >>>>>> try to keep it up to date during inlining.  If we
cached the
> >>>>>> call
> >>>>>> frequencies for the initial call sites, we could
adjust the
> >>>>>> visit order
> >>>>>> to
> >>>>>> minimize the number of times we need to recompute
a given
> >>>>>> functions
> >>>>>> block
> >>>>>> frequencies.  (e.g. we can look at all the
original call sites
> >>>>>> within a
> >>>>>> function before looking at newly inlined ones)
> >>>>>>
> >>>>> My proposal is very similar (perhaps identical) to
your option
> >>>>> 2 above.
> >>>>> I
> >>>>> don't understand the part where you talk about
adjusting the
> >>>>> visit order
> >>>>> to
> >>>>> minimize BFI computation.
> >>>>>
> >>>>> BFI computation:  BFI for a function is computed on
demand and
> >>>>> cached.
> >>>> The problem here is that the old pass manager does not
have a
> >>>> place/way
> >>>> to
> >>>> cache the function level analysis.  You could do this
within the
> >>>> inliner
> >>>> itself, but the function passes run by the SCC pass
manager will
> >>>> invalidate
> >>>> everything.  That's the problem I was trying to solve.
> >>> Easwaran's proposal does not rely on the analysis pass
framework
> >>> of
> >>> the legacy pass manager -- that is the key idea. BFI info is
> >>> designed
> >>> in a way such that it can be computed on-demand.
> >>>
> >>> BFI for the caller will be incrementally updated during
inlining,
> >>> but
> >>> after inlining is done for a node, the BFI info will be
> >>> explicitly
> >>> invalidated for that node due to the function passes run later
by
> >>> SCC
> >>> pass manager.  This is expected.
> >> Just to make sure, you're aware this will cost at least one
> >> re-computation
> >> per iteration in CGPassManager::runOnModule right? If that's
> >> expected and
> >> acceptable, then yes, the approach you seem to be describing
> >> should be
> >> workable.
> >>
> > For each leaf node, the number of BFI computation is <= 1. For
> > other
> > nodes, it is <=2 (one as caller, one as callee) -- in other words,
> > it
> > is linear to the number of functions.
> This is not true.
> 
> In CGPassManager::runOnModule, we will iterate on the *same* SCC
> multiple times if we're making progress (i.e. devirtualizing call
> sites)
> on each iteration.  The BFI will be invalidated on each iteration.
>  This
> is caped at 4 iterations, so it's still semi linear, but only because
> there's a cap.
> 
> Examples where this might kick in:
> void bar() {};
> func_ptr foo() { return &bar; }
> void caller() {
>    func_ptr f = foo();
>    f();
> }
> 
> We will visit caller twice so that we can inline both foo and bar.
> 
> I think this iteration is okay for what your proposing - i.e. off by
> default, migration step, etc - but it's important to understand the
> implications of the planned changes.
> 
I think it will be interesting to see if this causes a net increase or decrease
in compile time. Turning off inlining in regions known (suspected) to be cold
should decrease compile time overall, and given that the CodeGen parts involved
for large functions are really pretty expensive, I think it is still an open
question how this will turn out.

 -Hal
> 
> >
> >
> >> Particularly as an intermediate step to parallelize work with the
> >> new pass
> >> manager, this sounds very much worthwhile.
> > Yes, that is exactly the intention -- it can buy pass manager
> > rework
> > more time while unlock lots of pending performance tuning work we
> > plan
> > to do.
> >
> > thanks,
> >
> > David
> >
> >
> >>>>> Update: When 'bar' gets inlined into
'foo', the BFI for 'foo'
> >>>>> is
> >>>>> updated.
> >>>>> Let OldBB in 'bar' gets cloned as NewBB in
'foo'.  NewBB's
> >>>>> block
> >>>>> frequency
> >>>>> can be incrementally computed from OldBB's block
frequency,
> >>>>> entry block
> >>>>> frequency of 'bar' and the frequency of the
block containing
> >>>>> the 'foo'
> >>>>> ->
> >>>>> 'bar' callsite. Even when the new CGSCC level
BFI analysis is
> >>>>> in place,
> >>>>> this
> >>>>> incremental update is useful to minimize computation.
> >>>>>
> >>>>> Invalidation:  Once inlining is completed in an SCC
(at the end
> >>>>> of
> >>>>> runOnSCC), the entries for functions in that SCC are
> >>>>> invalidated since
> >>>>> other
> >>>>> passes run by the CGSCC pass manager (including those
run by
> >>>>> the
> >>>>> function
> >>>>> pass manager run under CGSCC pass manager) might
affect the
> >>>>> computed BFI
> >>>>> for
> >>>>> the functions in the SCC.
> >>>> This sounds like you're essentially planning to extend
the old
> >>>> pass
> >>>> manager
> >>>> with function level analysis support in the SCC pass
manager.  I
> >>>> advise
> >>>> against this given the fact that Chandler is actively
working on
> >>>> the new
> >>>> one
> >>>> and getting patches reviewed for the old one is
"challenging" at
> >>>> best.
> >>>> If
> >>>> you take this approach, make sure Chandler - who is pretty
much
> >>>> the only
> >>>> active reviewer in this area - is on-board with your
approach
> >>>> before you
> >>>> start.
> >>> There is no plan to extend old pass manager for this
capability.
> >>>  The
> >>> proposed solution is intended to work with the new pass
manager
> >>> too,
> >>> so there is no conflict here. The BFI book-keeping by inliner
can
> >>> be
> >>> removed with pass manager change is in place.
> >> Thanks for the clarification.
> >>
> >>> David
> >>>
> >>>
> >>>>> When the new PM infrastructure and a CGSCC based BFI
analysis
> >>>>> is in
> >>>>> place,
> >>>>> the transition should be easy assuming it will provide
> >>>>> getBFI(Function
> >>>>> *)
> >>>>> and invalidateBFI(Function *) interfaces. BFI for a
function is
> >>>>> computed
> >>>>> at
> >>>>> most twice in this approach. Thoughts?
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Easwaran
> >>>>>
> >>>>>> _______________________________________________
> >>>>>> LLVM Developers mailing list
> >>>>>> LLVMdev at cs.uiuc.edu        
http://llvm.cs.uiuc.edu
> >>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >>>>>
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Xinliang David Li via llvm-dev

2015-Dec-11 01:28 UTC

head link

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

On Thu, Dec 10, 2015 at 5:12 PM, Philip Reames
<listmail at philipreames.com> wrote:>
>
> On 12/10/2015 05:03 PM, Xinliang David Li wrote:
>>
>> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>>
>>>
>>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>>>
>>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>>> <listmail at philipreames.com> wrote:
>>>>>
>>>>> Given I didn't get any response to my original query, I
chose not to
>>>>> invest
>>>>> time in this at the time.  I am unlikely to get time for
this in the
>>>>> near
>>>>> future.
>>>>>
>>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>>
>>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>>> llvm-dev at lists.llvm.org)
>>>>>
>>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman
at google.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>>    Is there any update on this? I've been sending
patches to get rid
>>>>>> of
>>>>>> the
>>>>>> callee hotness based inline hints from the frontend and
move the logic
>>>>>> to
>>>>>> the inliner. The next step is to use the callsite
hotness instead. I
>>>>>> also
>>>>>> want to focus on the infrastructure to enable this and
what I've been
>>>>>> experimenting with is similar to your two alternative
approaches:
>>>>>>
>>>>>>> Alternate Approaches:
>>>>>>> 1) We could just recompute BFI explicitly in the
inliner right before
>>>>>>> passing the result to ICA for the purposes of
prototyping. If this
>>>>>>> was
>>>>>>> off
>>>>>>> by default, this might be a reasonable scheme for
investigation.
>>>>>>> This
>>>>>>> could
>>>>>>> likely never be enabled for real uses.
>>>>>>> 2) We could pre-compute BFI once per function
within a given SCC and
>>>>>>> then
>>>>>>> try to keep it up to date during inlining.  If we
cached the call
>>>>>>> frequencies for the initial call sites, we could
adjust the visit
>>>>>>> order
>>>>>>> to
>>>>>>> minimize the number of times we need to recompute a
given functions
>>>>>>> block
>>>>>>> frequencies.  (e.g. we can look at all the original
call sites within
>>>>>>> a
>>>>>>> function before looking at newly inlined ones)
>>>>>>>
>>>>>> My proposal is very similar (perhaps identical) to your
option 2
>>>>>> above.
>>>>>> I
>>>>>> don't understand the part where you talk about
adjusting the visit
>>>>>> order
>>>>>> to
>>>>>> minimize BFI computation.
>>>>>>
>>>>>> BFI computation:  BFI for a function is computed on
demand and cached.
>>>>>
>>>>> The problem here is that the old pass manager does not have
a place/way
>>>>> to
>>>>> cache the function level analysis.  You could do this
within the
>>>>> inliner
>>>>> itself, but the function passes run by the SCC pass manager
will
>>>>> invalidate
>>>>> everything.  That's the problem I was trying to solve.
>>>>
>>>> Easwaran's proposal does not rely on the analysis pass
framework of
>>>> the legacy pass manager -- that is the key idea. BFI info is
designed
>>>> in a way such that it can be computed on-demand.
>>>>
>>>> BFI for the caller will be incrementally updated during
inlining, but
>>>> after inlining is done for a node, the BFI info will be
explicitly
>>>> invalidated for that node due to the function passes run later
by SCC
>>>> pass manager.  This is expected.
>>>
>>> Just to make sure, you're aware this will cost at least one
>>> re-computation
>>> per iteration in CGPassManager::runOnModule right? If that's
expected and
>>> acceptable, then yes, the approach you seem to be describing should
be
>>> workable.
>>>
>> For each leaf node, the number of BFI computation is <= 1. For other
>> nodes, it is <=2 (one as caller, one as callee) -- in other words,
it
>> is linear to the number of functions.
>
> This is not true.
>
> In CGPassManager::runOnModule, we will iterate on the *same* SCC multiple
> times if we're making progress (i.e. devirtualizing call sites) on each
> iteration.  The BFI will be invalidated on each iteration.  This is caped
at
> 4 iterations, so it's still semi linear, but only because there's a
cap.
>
> Examples where this might kick in:
> void bar() {};
> func_ptr foo() { return &bar; }
> void caller() {
>   func_ptr f = foo();
>   f();
> }
>
> We will visit caller twice so that we can inline both foo and bar.
>
Right. I would guess on average only a small portion of all SCCs can
trigger such iterative compilation.

David
> I think this iteration is okay for what your proposing - i.e. off by
> default, migration step, etc - but it's important to understand the
> implications of the planned changes.
>
>
>
>>
>>
>>> Particularly as an intermediate step to parallelize work with the
new
>>> pass
>>> manager, this sounds very much worthwhile.
>>
>> Yes, that is exactly the intention -- it can buy pass manager rework
>> more time while unlock lots of pending performance tuning work we plan
>> to do.
>>
>> thanks,
>>
>> David
>>
>>
>>>>>> Update: When 'bar' gets inlined into
'foo', the BFI for 'foo' is
>>>>>> updated.
>>>>>> Let OldBB in 'bar' gets cloned as NewBB in
'foo'.  NewBB's block
>>>>>> frequency
>>>>>> can be incrementally computed from OldBB's block
frequency, entry
>>>>>> block
>>>>>> frequency of 'bar' and the frequency of the
block containing the 'foo'
>>>>>> ->
>>>>>> 'bar' callsite. Even when the new CGSCC level
BFI analysis is in
>>>>>> place,
>>>>>> this
>>>>>> incremental update is useful to minimize computation.
>>>>>>
>>>>>> Invalidation:  Once inlining is completed in an SCC (at
the end of
>>>>>> runOnSCC), the entries for functions in that SCC are
invalidated since
>>>>>> other
>>>>>> passes run by the CGSCC pass manager (including those
run by the
>>>>>> function
>>>>>> pass manager run under CGSCC pass manager) might affect
the computed
>>>>>> BFI
>>>>>> for
>>>>>> the functions in the SCC.
>>>>>
>>>>> This sounds like you're essentially planning to extend
the old pass
>>>>> manager
>>>>> with function level analysis support in the SCC pass
manager.  I advise
>>>>> against this given the fact that Chandler is actively
working on the
>>>>> new
>>>>> one
>>>>> and getting patches reviewed for the old one is
"challenging" at best.
>>>>> If
>>>>> you take this approach, make sure Chandler - who is pretty
much the
>>>>> only
>>>>> active reviewer in this area - is on-board with your
approach before
>>>>> you
>>>>> start.
>>>>
>>>> There is no plan to extend old pass manager for this
capability.  The
>>>> proposed solution is intended to work with the new pass manager
too,
>>>> so there is no conflict here. The BFI book-keeping by inliner
can be
>>>> removed with pass manager change is in place.
>>>
>>> Thanks for the clarification.
>>>
>>>> David
>>>>
>>>>
>>>>>> When the new PM infrastructure and a CGSCC based BFI
analysis is in
>>>>>> place,
>>>>>> the transition should be easy assuming it will provide
getBFI(Function
>>>>>> *)
>>>>>> and invalidateBFI(Function *) interfaces. BFI for a
function is
>>>>>> computed
>>>>>> at
>>>>>> most twice in this approach. Thoughts?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Easwaran
>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu        
http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>
>>>>>>
>

Mehdi Amini via llvm-dev

2015-Dec-11 01:50 UTC

head link

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

> On Dec 10, 2015, at 5:12 PM, Philip Reames via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
> 
> On 12/10/2015 05:03 PM, Xinliang David Li wrote:
>> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>> 
>>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>>> <listmail at philipreames.com> wrote:
>>>>> Given I didn't get any response to my original query, I
chose not to
>>>>> invest
>>>>> time in this at the time.  I am unlikely to get time for
this in the near
>>>>> future.
>>>>> 
>>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>> 
>>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>>> llvm-dev at lists.llvm.org)
>>>>> 
>>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman
at google.com> wrote:
>>>>>> Hi Philip,
>>>>>> 
>>>>>>   Is there any update on this? I've been sending
patches to get rid of
>>>>>> the
>>>>>> callee hotness based inline hints from the frontend and
move the logic
>>>>>> to
>>>>>> the inliner. The next step is to use the callsite
hotness instead. I
>>>>>> also
>>>>>> want to focus on the infrastructure to enable this and
what I've been
>>>>>> experimenting with is similar to your two alternative
approaches:
>>>>>> 
>>>>>>> Alternate Approaches:
>>>>>>> 1) We could just recompute BFI explicitly in the
inliner right before
>>>>>>> passing the result to ICA for the purposes of
prototyping. If this was
>>>>>>> off
>>>>>>> by default, this might be a reasonable scheme for
investigation.  This
>>>>>>> could
>>>>>>> likely never be enabled for real uses.
>>>>>>> 2) We could pre-compute BFI once per function
within a given SCC and
>>>>>>> then
>>>>>>> try to keep it up to date during inlining.  If we
cached the call
>>>>>>> frequencies for the initial call sites, we could
adjust the visit order
>>>>>>> to
>>>>>>> minimize the number of times we need to recompute a
given functions
>>>>>>> block
>>>>>>> frequencies.  (e.g. we can look at all the original
call sites within a
>>>>>>> function before looking at newly inlined ones)
>>>>>>> 
>>>>>> My proposal is very similar (perhaps identical) to your
option 2 above.
>>>>>> I
>>>>>> don't understand the part where you talk about
adjusting the visit order
>>>>>> to
>>>>>> minimize BFI computation.
>>>>>> 
>>>>>> BFI computation:  BFI for a function is computed on
demand and cached.
>>>>> The problem here is that the old pass manager does not have
a place/way
>>>>> to
>>>>> cache the function level analysis.  You could do this
within the inliner
>>>>> itself, but the function passes run by the SCC pass manager
will
>>>>> invalidate
>>>>> everything.  That's the problem I was trying to solve.
>>>> Easwaran's proposal does not rely on the analysis pass
framework of
>>>> the legacy pass manager -- that is the key idea. BFI info is
designed
>>>> in a way such that it can be computed on-demand.
>>>> 
>>>> BFI for the caller will be incrementally updated during
inlining, but
>>>> after inlining is done for a node, the BFI info will be
explicitly
>>>> invalidated for that node due to the function passes run later
by SCC
>>>> pass manager.  This is expected.
>>> Just to make sure, you're aware this will cost at least one
re-computation
>>> per iteration in CGPassManager::runOnModule right? If that's
expected and
>>> acceptable, then yes, the approach you seem to be describing should
be
>>> workable.
>>> 
>> For each leaf node, the number of BFI computation is <= 1. For other
>> nodes, it is <=2 (one as caller, one as callee) -- in other words,
it
>> is linear to the number of functions.
> This is not true.
> 
> In CGPassManager::runOnModule, we will iterate on the *same* SCC multiple
times if we're making progress (i.e. devirtualizing call sites) on each
iteration.  The BFI will be invalidated on each iteration.  This is caped at 4
iterations, so it's still semi linear, but only because there's a cap.
> 
> Examples where this might kick in:
> void bar() {};
> func_ptr foo() { return &bar; }
> void caller() {
>  func_ptr f = foo();
>  f();
> }
> 
> We will visit caller twice so that we can inline both foo and bar.
I think you would need a far more complex example to trigger the revisit: GVN or
other need to actually remove a call, the inliner doing its job won’t trigger a
reiteration at this level.

— 
Mehdi


> 
> I think this iteration is okay for what your proposing - i.e. off by
default, migration step, etc - but it's important to understand the
implications of the planned changes.
> 
> 
>> 
>> 
>>> Particularly as an intermediate step to parallelize work with the
new pass
>>> manager, this sounds very much worthwhile.
>> Yes, that is exactly the intention -- it can buy pass manager rework
>> more time while unlock lots of pending performance tuning work we plan
>> to do.
>> 
>> thanks,
>> 
>> David
>> 
>> 
>>>>>> Update: When 'bar' gets inlined into
'foo', the BFI for 'foo' is
>>>>>> updated.
>>>>>> Let OldBB in 'bar' gets cloned as NewBB in
'foo'.  NewBB's block
>>>>>> frequency
>>>>>> can be incrementally computed from OldBB's block
frequency, entry block
>>>>>> frequency of 'bar' and the frequency of the
block containing the 'foo'
>>>>>> ->
>>>>>> 'bar' callsite. Even when the new CGSCC level
BFI analysis is in place,
>>>>>> this
>>>>>> incremental update is useful to minimize computation.
>>>>>> 
>>>>>> Invalidation:  Once inlining is completed in an SCC (at
the end of
>>>>>> runOnSCC), the entries for functions in that SCC are
invalidated since
>>>>>> other
>>>>>> passes run by the CGSCC pass manager (including those
run by the
>>>>>> function
>>>>>> pass manager run under CGSCC pass manager) might affect
the computed BFI
>>>>>> for
>>>>>> the functions in the SCC.
>>>>> This sounds like you're essentially planning to extend
the old pass
>>>>> manager
>>>>> with function level analysis support in the SCC pass
manager.  I advise
>>>>> against this given the fact that Chandler is actively
working on the new
>>>>> one
>>>>> and getting patches reviewed for the old one is
"challenging" at best.
>>>>> If
>>>>> you take this approach, make sure Chandler - who is pretty
much the only
>>>>> active reviewer in this area - is on-board with your
approach before you
>>>>> start.
>>>> There is no plan to extend old pass manager for this
capability.  The
>>>> proposed solution is intended to work with the new pass manager
too,
>>>> so there is no conflict here. The BFI book-keeping by inliner
can be
>>>> removed with pass manager change is in place.
>>> Thanks for the clarification.
>>> 
>>>> David
>>>> 
>>>> 
>>>>>> When the new PM infrastructure and a CGSCC based BFI
analysis is in
>>>>>> place,
>>>>>> the transition should be easy assuming it will provide
getBFI(Function
>>>>>> *)
>>>>>> and invalidateBFI(Function *) interfaces. BFI for a
function is computed
>>>>>> at
>>>>>> most twice in this approach. Thoughts?
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Easwaran
>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu        
http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

llvm dev - Dec 2015 - [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?