thr3ads.net - llvm dev - [llvm-dev] RFC: Synthetic function entry counts [Dec 2017]

If this information is useful, please help other people find it:
Share via:

Easwaran Raman via llvm-dev

2017-Dec-13 01:02 UTC

[llvm-dev] RFC: Synthetic function entry counts

Functions in LLVM IR have a function_entry_count metadata that is attached
in PGO compilation. By using the entry count together with the block
frequency info, the compiler computes the profile count of call
instructions based on which the hotness/coldness of callsites can be
determined. Experiments have shown that using a higher threshold for hot
callsites results in improved runtime performance of the generated code
without significant code size increases. We propose to generate synthetic
function counts for non-PGO compilation and use the counts for boosting hot
callsites during inlining.

Synthetic function entry counts of functions are initialized based on
properties of the function such as whether it is visible outside the
module, whether it has an inline keyword and so on. Then, the callgraph SCC
is traversed in reverse post-order. Counts of callsites are determined
based on the entry count and the block frequency of the callsite. The
callsite count gets added to the entry count of the callee. For targets of
indirect calls, we will use the !callees metadata to find the possible
targets and distribute the count equally among them. For functions in a
non-trivial SCC, the algorithm has to ensure that the counts are stable and
deterministic.

In ThinLTO mode, the function summary contains the list of call edges from
the function. We propose to add the relative block frequency on these
edges. During the thinlink phase, we propagate the function counts on the
entire call graph and update the function summary with the synthetic
counts. Additionally, we plan to use the computed counts to drive the
importing decisions.

Alternative approach

-----------------------------


An alternative to generating synthetic counts is to make block frequency
info an inter-procedural analysis. Such an analysis would allow comparing
BFI of callsites in two different functions. This has several downsides:

   -

   The inter-procedural BFI computation is likely to be more expensive in
   terms of compile-time.
   -

   Many function passes invalidate the BFI. This will require selective
   invalidation of function BFIs.
   -

   Inliner correctly updates function counts of a callee after a callsite
   is inlined. We can piggyback on this mechanism by using synthetic counts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171212/2a0fb507/attachment.html>

Philip Reames via llvm-dev

2017-Dec-14 21:03 UTC

head link

[llvm-dev] RFC: Synthetic function entry counts

This really sounds like it should be an analysis pass. Summarizing the 
results into the IR using the existing entry_county attributes sounds 
like a workaround for an invalidation problem more than anything else.

You acknowledge this in your alternatives discussion, but I don't follow 
why BFI invalidation at the function level has to invalidate the inter 
procedural result.  You'd certainly need to use BFI when *running* your 
analysis, but once computed the results should be stable across updates 
within the function.  You might need update logic for inlining, 
outlining, and mergefunc, but that should be it in terms of current LLVM 
transforms?

Philip


On 12/12/2017 05:02 PM, Easwaran Raman via llvm-dev
wrote:>
> Functions in LLVM IR have a function_entry_count metadata that is 
> attached in PGO compilation. By using the entry count together with 
> the block frequency info, the compiler computes the profile count of 
> call instructions based on which the hotness/coldness of callsites can 
> be determined. Experiments have shown that using a higher threshold 
> for hot callsites results in improved runtime performance of the 
> generated code without significant code size increases. We propose to 
> generate synthetic function counts for non-PGO compilation and use the 
> counts for boosting hot callsites during inlining.
>
>
> Synthetic function entry counts of functions are initialized based on 
> properties of the function such as whether it is visible outside the 
> module, whether it has an inline keyword and so on. Then, the 
> callgraph SCC is traversed in reverse post-order. Counts of callsites 
> are determined based on the entry count and the block frequency of the 
> callsite. The callsite count gets added to the entry count of the 
> callee. For targets of indirect calls, we will use the !callees 
> metadata to find the possible targets and distribute the count equally 
> among them. For functions in a non-trivial SCC, the algorithm has to 
> ensure that the counts are stable and deterministic.
>
>
> In ThinLTO mode, the function summary contains the list of call edges 
> from the function. We propose to add the relative block frequency on 
> these edges. During the thinlink phase, we propagate the function 
> counts on the entire call graph and update the function summary with 
> the synthetic counts. Additionally, we plan to use the computed counts 
> to drive the importing decisions.
>
>
> Alternative approach
>
> -----------------------------
>
> An alternative to generating synthetic counts is to make block 
> frequency info an inter-procedural analysis. Such an analysis would 
> allow comparing BFI of callsites in two different functions. This has 
> several downsides:
>
>  *
>
>     The inter-procedural BFI computation is likely to be more
>     expensive in terms of compile-time.
>
>  *
>
>     Many function passes invalidate the BFI. This will require
>     selective invalidation of function BFIs.
>
>  *
>
>     Inliner correctly updates function counts of a callee after a
>     callsite is inlined. We can piggyback on this mechanism by using
>     synthetic counts.
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171214/71f75415/attachment.html>

Xinliang David Li via llvm-dev

2017-Dec-14 21:44 UTC

head link

[llvm-dev] RFC: Synthetic function entry counts

On Thu, Dec 14, 2017 at 1:03 PM, Philip Reames <listmail at
philipreames.com>
wrote:
> This really sounds like it should be an analysis pass.  Summarizing the
> results into the IR using the existing entry_county attributes sounds like
> a workaround for an invalidation problem more than anything else.
>
I think it is not a workaround. The strength of the proposal is that it
fits really naturally with the PGO infrastructure -- all the existing
profile update mechanism (currently mainly inliner) can be reused.  Most of
the PGO related query interfaces can also be reused so that there is very
little change from the client side.

The inter-procedural frequency propagation also needs to happen in the
thinLink time for thinLTO which should not touch IR except summary data.

David

You acknowledge this in your alternatives discussion, but I don't
follow> why BFI invalidation at the function level has to invalidate the inter
> procedural result.  You'd certainly need to use BFI when *running* your
> analysis, but once computed the results should be stable across updates
> within the function.  You might need update logic for inlining, outlining,
> and mergefunc, but that should be it in terms of current LLVM transforms?
>
> Philip
>
> On 12/12/2017 05:02 PM, Easwaran Raman via llvm-dev wrote:
>
> Functions in LLVM IR have a function_entry_count metadata that is attached
> in PGO compilation. By using the entry count together with the block
> frequency info, the compiler computes the profile count of call
> instructions based on which the hotness/coldness of callsites can be
> determined. Experiments have shown that using a higher threshold for hot
> callsites results in improved runtime performance of the generated code
> without significant code size increases. We propose to generate synthetic
> function counts for non-PGO compilation and use the counts for boosting hot
> callsites during inlining.
>
> Synthetic function entry counts of functions are initialized based on
> properties of the function such as whether it is visible outside the
> module, whether it has an inline keyword and so on. Then, the callgraph SCC
> is traversed in reverse post-order. Counts of callsites are determined
> based on the entry count and the block frequency of the callsite. The
> callsite count gets added to the entry count of the callee. For targets of
> indirect calls, we will use the !callees metadata to find the possible
> targets and distribute the count equally among them. For functions in a
> non-trivial SCC, the algorithm has to ensure that the counts are stable and
> deterministic.
>
> In ThinLTO mode, the function summary contains the list of call edges from
> the function. We propose to add the relative block frequency on these
> edges. During the thinlink phase, we propagate the function counts on the
> entire call graph and update the function summary with the synthetic
> counts. Additionally, we plan to use the computed counts to drive the
> importing decisions.
>
> Alternative approach
>
> -----------------------------
>
> An alternative to generating synthetic counts is to make block frequency
> info an inter-procedural analysis. Such an analysis would allow comparing
> BFI of callsites in two different functions. This has several downsides:
>
>    -
>
>    The inter-procedural BFI computation is likely to be more expensive in
>    terms of compile-time.
>    -
>
>    Many function passes invalidate the BFI. This will require selective
>    invalidation of function BFIs.
>    -
>
>    Inliner correctly updates function counts of a callee after a callsite
>    is inlined. We can piggyback on this mechanism by using synthetic
counts.
>
>
>
> _______________________________________________
> LLVM Developers mailing listllvm-dev at
lists.llvm.orghttp://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171214/146ca160/attachment.html>

Sean Silva via llvm-dev

2017-Dec-15 08:22 UTC

head link

[llvm-dev] RFC: Synthetic function entry counts

IIUC, this proposal is just saying that we should infer a static profile
for entry counts just like we do for branch probabilities. In the case of
entry counts, we do not hide that information behind an analysis like BPI,
so currently just annotating synthetic PGO entry counts is a simple
solution that piggybacks on the PGO mechanism and Just Works.

If that is correct then this makes perfect sense to me.

It could be argued that we ought to refactor things so that the raw PGO
metadata is only ever accessed via a wrapper CGSCC analysis that falls back
to interprocedural analysis (i.e. static profile heuristics) when the entry
count metadata is missing, just like BPI does with static intraprocedural
analysis and branch weight metadata. However, we probably don't want to do
that while folks are still depending on the old PM in production since
CGSCC analyses don't exist there which would force us to maintain an old
and new way of doing it.

(Also, it sounds like you want to compute this with a top-down CGSCC
traversal, so it might not actually be computable incrementally as a bottom
up CGSCC analysis which is what CGSCC analyses currently do; an auxiliary
module analysis for the top-down part might work around this though)

Also, the need to run this logic (or similar logic) as a "ThinLTO
analysis"
suggests not wedding it too much with the intricacies of the IR-level pass
management (although admittedly we already do that with the inliner and
then ThinLTO has to approximate those inlining decisions, so it might not
be the end of the world to have some divergence).

-- Sean Silva

On Dec 12, 2017 5:02 PM, "Easwaran Raman via llvm-dev" <
llvm-dev at lists.llvm.org> wrote:
> Functions in LLVM IR have a function_entry_count metadata that is attached
> in PGO compilation. By using the entry count together with the block
> frequency info, the compiler computes the profile count of call
> instructions based on which the hotness/coldness of callsites can be
> determined. Experiments have shown that using a higher threshold for hot
> callsites results in improved runtime performance of the generated code
> without significant code size increases. We propose to generate synthetic
> function counts for non-PGO compilation and use the counts for boosting hot
> callsites during inlining.
>
> Synthetic function entry counts of functions are initialized based on
> properties of the function such as whether it is visible outside the
> module, whether it has an inline keyword and so on. Then, the callgraph SCC
> is traversed in reverse post-order. Counts of callsites are determined
> based on the entry count and the block frequency of the callsite. The
> callsite count gets added to the entry count of the callee. For targets of
> indirect calls, we will use the !callees metadata to find the possible
> targets and distribute the count equally among them. For functions in a
> non-trivial SCC, the algorithm has to ensure that the counts are stable and
> deterministic.
>
> In ThinLTO mode, the function summary contains the list of call edges from
> the function. We propose to add the relative block frequency on these
> edges. During the thinlink phase, we propagate the function counts on the
> entire call graph and update the function summary with the synthetic
> counts. Additionally, we plan to use the computed counts to drive the
> importing decisions.
>
> Alternative approach
>
> -----------------------------
>
>
> An alternative to generating synthetic counts is to make block frequency
> info an inter-procedural analysis. Such an analysis would allow comparing
> BFI of callsites in two different functions. This has several downsides:
>
>    -
>
>    The inter-procedural BFI computation is likely to be more expensive in
>    terms of compile-time.
>    -
>
>    Many function passes invalidate the BFI. This will require selective
>    invalidation of function BFIs.
>    -
>
>    Inliner correctly updates function counts of a callee after a callsite
>    is inlined. We can piggyback on this mechanism by using synthetic
counts.
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171215/70485d04/attachment.html>

Sean Silva via llvm-dev

2017-Dec-15 08:34 UTC

head link

[llvm-dev] RFC: Synthetic function entry counts

On Dec 12, 2017 5:02 PM, "Easwaran Raman via llvm-dev" <
llvm-dev at lists.llvm.org> wrote:

Functions in LLVM IR have a function_entry_count metadata that is attached
in PGO compilation. By using the entry count together with the block
frequency info, the compiler computes the profile count of call
instructions based on which the hotness/coldness of callsites can be
determined. Experiments have shown that using a higher threshold for hot
callsites results in improved runtime performance of the generated code
without significant code size increases. We propose to generate synthetic
function counts for non-PGO compilation and use the counts for boosting hot
callsites during inlining.

Synthetic function entry counts of functions are initialized based on
properties of the function such as whether it is visible outside the
module, whether it has an inline keyword and so on. Then, the callgraph SCC
is traversed in reverse post-order. Counts of callsites are determined
based on the entry count and the block frequency of the callsite. The
callsite count gets added to the entry count of the callee. For targets of
indirect calls, we will use the !callees metadata to find the possible
targets and distribute the count equally among them. For functions in a
non-trivial SCC, the algorithm has to ensure that the counts are stable and
deterministic.


Does this have any interaction with mayBeDerefined? E.g. different results
of static profile in different compilation units causing problems. I think
it should be okay since the transformations that look at the PGO data I
think are the ones responsible for being safe when !mayBeDerefined, but I
want to get a second opinion.

-- Sean Silva


In ThinLTO mode, the function summary contains the list of call edges from
the function. We propose to add the relative block frequency on these
edges. During the thinlink phase, we propagate the function counts on the
entire call graph and update the function summary with the synthetic
counts. Additionally, we plan to use the computed counts to drive the
importing decisions.

Alternative approach

-----------------------------


An alternative to generating synthetic counts is to make block frequency
info an inter-procedural analysis. Such an analysis would allow comparing
BFI of callsites in two different functions. This has several downsides:

   -

   The inter-procedural BFI computation is likely to be more expensive in
   terms of compile-time.
   -

   Many function passes invalidate the BFI. This will require selective
   invalidation of function BFIs.
   -

   Inliner correctly updates function counts of a callee after a callsite
   is inlined. We can piggyback on this mechanism by using synthetic counts.


_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171215/1dd0a5ac/attachment-0001.html>

Easwaran Raman via llvm-dev

2017-Dec-15 18:22 UTC

head link

[llvm-dev] RFC: Synthetic function entry counts

On Fri, Dec 15, 2017 at 12:22 AM, Sean Silva <chisophugis at gmail.com>
wrote:
> IIUC, this proposal is just saying that we should infer a static profile
> for entry counts just like we do for branch probabilities. In the case of
> entry counts, we do not hide that information behind an analysis like BPI,
> so currently just annotating synthetic PGO entry counts is a simple
> solution that piggybacks on the PGO mechanism and Just Works.
>
> If that is correct then this makes perfect sense to me.
>Yes, that is the proposal.
>
> It could be argued that we ought to refactor things so that the raw PGO
> metadata is only ever accessed via a wrapper CGSCC analysis that falls back
> to interprocedural analysis (i.e. static profile heuristics) when the entry
> count metadata is missing, just like BPI does with static intraprocedural
> analysis and branch weight metadata. However, we probably don't want to
do
> that while folks are still depending on the old PM in production since
> CGSCC analyses don't exist there which would force us to maintain an
old
> and new way of doing it.
>
> (Also, it sounds like you want to compute this with a top-down CGSCC
> traversal, so it might not actually be computable incrementally as a bottom
> up CGSCC analysis which is what CGSCC analyses currently do; an auxiliary
> module analysis for the top-down part might work around this though)
>
>Wrapping function entry counts behind an analysis might be doable but
requires more careful thought. Right now getEntryCount is called from
various passes: inliner, loop passes (unroll, loop sink), codegenprepare
(for function layout), machine block placement, module summary analysis ...
This might be worth attempting when we fully migrate to the new pass
manager but not now.

Also, the need to run this logic (or similar logic) as a "ThinLTO
analysis"> suggests not wedding it too much with the intricacies of the IR-level pass
> management (although admittedly we already do that with the inliner and
> then ThinLTO has to approximate those inlining decisions, so it might not
> be the end of the world to have some divergence).
>
> -- Sean Silva
>
> On Dec 12, 2017 5:02 PM, "Easwaran Raman via llvm-dev" <
> llvm-dev at lists.llvm.org> wrote:
>
>> Functions in LLVM IR have a function_entry_count metadata that is
>> attached in PGO compilation. By using the entry count together with the
>> block frequency info, the compiler computes the profile count of call
>> instructions based on which the hotness/coldness of callsites can be
>> determined. Experiments have shown that using a higher threshold for
hot
>> callsites results in improved runtime performance of the generated code
>> without significant code size increases. We propose to generate
synthetic
>> function counts for non-PGO compilation and use the counts for boosting
hot
>> callsites during inlining.
>>
>> Synthetic function entry counts of functions are initialized based on
>> properties of the function such as whether it is visible outside the
>> module, whether it has an inline keyword and so on. Then, the callgraph
SCC
>> is traversed in reverse post-order. Counts of callsites are determined
>> based on the entry count and the block frequency of the callsite. The
>> callsite count gets added to the entry count of the callee. For targets
of
>> indirect calls, we will use the !callees metadata to find the possible
>> targets and distribute the count equally among them. For functions in a
>> non-trivial SCC, the algorithm has to ensure that the counts are stable
and
>> deterministic.
>>
>> In ThinLTO mode, the function summary contains the list of call edges
>> from the function. We propose to add the relative block frequency on
these
>> edges. During the thinlink phase, we propagate the function counts on
the
>> entire call graph and update the function summary with the synthetic
>> counts. Additionally, we plan to use the computed counts to drive the
>> importing decisions.
>>
>> Alternative approach
>>
>> -----------------------------
>>
>>
>> An alternative to generating synthetic counts is to make block
frequency
>> info an inter-procedural analysis. Such an analysis would allow
comparing
>> BFI of callsites in two different functions. This has several
downsides:
>>
>>    -
>>
>>    The inter-procedural BFI computation is likely to be more expensive
>>    in terms of compile-time.
>>    -
>>
>>    Many function passes invalidate the BFI. This will require selective
>>    invalidation of function BFIs.
>>    -
>>
>>    Inliner correctly updates function counts of a callee after a
>>    callsite is inlined. We can piggyback on this mechanism by using
synthetic
>>    counts.
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171215/e8442db5/attachment.html>

Reasonably Related Threads

Search for more possibly parallel threads

llvm dev - Dec 2017 - RFC: Synthetic function entry counts

[llvm-dev] RFC: Synthetic function entry counts

[llvm-dev] RFC: Synthetic function entry counts

[llvm-dev] RFC: Synthetic function entry counts

[llvm-dev] RFC: Synthetic function entry counts

[llvm-dev] RFC: Synthetic function entry counts

[llvm-dev] RFC: Synthetic function entry counts

Reasonably Related Threads