thr3ads.net - llvm dev - [llvm-dev] RFC: PGO Late instrumentation for LLVM [Sep 2015]

Ivan Baev via llvm-dev

2015-Sep-02 19:10 UTC

[llvm-dev] RFC: PGO Late instrumentation for LLVM

> On Tue, Sep 1, 2015 at 7:21 PM, Ivan Baev via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> > Date: Tue, 1 Sep 2015 14:21:16 -0700
>> > From: Rong Xu via llvm-dev <llvm-dev at lists.llvm.org>
>> > Cc: llvm-dev <llvm-dev at lists.llvm.org>, David Li <davidxl at google.com>
>> > Subject: Re: [llvm-dev] RFC: PGO Late instrumentation for LLVM
>> >>>> *(2) Performance impact of context sensitivity*
>> >>>> LLVM does not use the profile information fully in the back-end
>> >>>> optimizations, for instance, inlining does not fully use the profile
>> >>>> counts -- it only marks hot/cold function attribute based on function
>> >>>> entry counts. To evaluate the impact of profile context sensitivity,
>> >>>> GCC is used in the experiment. Note that GCC PGO improves clang
>> >>>> performance a lot more than clang PGO.
>> >>>> First we summarize the methodology used in the experiment:
>> >>>> 0) build clang with GCC O2 without early inlining and measure clang's
>> >>>> performance. GCC early inlining (einline) is similar to pre-inline
>> >>>> used by late instrumentation.
>> >>>> 1) build clang with GCC O2 with early inlining and measure
>> >>>> performance. The performance difference of 1) and 0) is denoted as E
>> >>>> which measures the contribution of early inlining.
>> >>>> 2) build clang with GCC O2 + PGO without early inlining.
>> >>>> 3) build clang with GCC O2 + PGO with early inlining.
>> >>>> The performance difference of 3) and 2) is denoted as EC. It
>> >>>> constitutes roughly two parts a) early inlining contribution b) context
>> >>>> sensitive profiling enabled with early inlining.
>> >>>> The contribution of context sensitive profiling can be estimated by
>> >>>> EC - E above.
>> >>>> -------------------------------------------------------------------------------
>> >>>> Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
>> >>>> (0) base w/o einline          84.946             1.000           0.934
>> >>>> (1) base O2                   79.310             1.071           1.000
>> >>>> (2) profile-arcs w/o einline  63.518             1.337           1.249
>> >>>> (3) profile-arcs              48.364             1.756           1.640
>> >>>> We see the following:
>> >>>> 1) GCC PGO with early inlining improves clang performance by 64.0%
>> >>>> (v.s. base O2 w/ early inline).
>> >>>> 2) GCC PGO w/o early inlining improves clang performance by 33.7%
>> >>>> (v.s. base O2 w/o early inline).
>> >>>> 3) Early inlining performance contribution is about 7.1%.
>> >>>> 4) Profile context sensitivity contribution is estimated to be 22.2%
>> >>>> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
>> Rong,
>> Sorry for the late response. Just wanted to clarify my understanding of
>> data in (2) Performance impact of context sensitivity.
>> On clang as an application:
>> 3) Early inlining contribution is about 7.1%,
> This is the effect of pre-inlining without profile guidance.
>> 2) PGO w/o early inlining contribution is about 33.7%,
>> 4) so the additional combined effect of 2 and 3 is about 22.2%, correct?
> Not combined effect -- but remaining effect (by excluding 2 and 3)
>> In other words, just avoiding inlining small/simple callees and updating
>> their profile counts in the call graph by the main inliner - all through
>> the use of early inlining - improves clang performance by 22.2%.
> Not sure what you mean here. 22% is the estimate of the effect of CS
> profile due to clones of profile counters during instrumentation (through
> pre-inlining). Profile update with inlining always exist including in 2).

If we compare times for:
(2) profile-arcs w/o einline - 63.518 secs, v.s.
(3) profile-arcs - 48.364 secs,
we get about 31.3% improvement due to early inline with PGO.

If we compare times for:
(0) base w/o einline - 84.946, v.s.
(1) base O2 - 79.310.
we get about 7.1% improvement due to early inline without PGO.

What can we attribute the difference of 24.2% (31.3 - 7.1) to?
31.3% is the total contribution of early inline with PGO.
Is 24.2% the context-sensitivity part of it, meaning that the profile
counts in the call graph are more precise during the inlining process,
inlining decisions are better, etc.?
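
The arithmetic above can be reproduced from the wall times in the quoted table. The following is only a minimal sketch: the dictionary keys and the helper name improvement_pct are made up for illustration, and "improvement" is assumed to mean the ratio of the slower time to the faster time minus one, which is how the speedup columns in the table appear to be computed.

```python
# Wall-clock times in seconds, copied from the table quoted in this thread.
# The key names are hypothetical labels, not anything from GCC or LLVM.
times = {
    "base_no_einline": 84.946,  # config (0)
    "base_o2": 79.310,          # config (1)
    "pgo_no_einline": 63.518,   # config (2)
    "pgo": 48.364,              # config (3)
}


def improvement_pct(slow, fast):
    """Percent improvement of `fast` over `slow`: (slow / fast - 1) * 100."""
    return (slow / fast - 1.0) * 100.0


# E: early inlining without PGO, config (1) vs. config (0).
e = improvement_pct(times["base_no_einline"], times["base_o2"])
# EC: early inlining with PGO, config (3) vs. config (2).
ec = improvement_pct(times["pgo_no_einline"], times["pgo"])
# Remaining part, read in this thread as the context-sensitivity effect.
cs = ec - e

print(f"E = {e:.1f}%, EC = {ec:.1f}%, EC - E = {cs:.1f}%")
# prints "E = 7.1%, EC = 31.3%, EC - E = 24.2%"
```

Subtracting percentages this way treats the two effects as additive, which is an approximation; the numbers are estimates of relative contribution rather than exact attributions.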

Ivan

Xinliang David Li via llvm-dev

2015-Sep-02 19:26 UTC

[llvm-dev] RFC: PGO Late instrumentation for LLVM

On Wed, Sep 2, 2015 at 12:10 PM, Ivan Baev <ibaev at codeaurora.org> wrote:
> If we compare times for:
> (2) profile-arcs w/o einline - 63.518 secs, v.s.
> (3) profile-arcs - 48.364 secs,
> we get about 31.3% improvement due to early inline with PGO.
>
> If we compare times for:
> (0) base w/o einline - 84.946, v.s.
> (1) base O2 - 79.310.
> we get about 7.1% improvement due to early inline without PGO.
>
> What can we attribute the difference of 24.2% (31.3 - 7.1) to?
> 31.3% is the total contribution of early inline with PGO.
> Is 24.2% the context-sensitivity part of it, meaning that the profile
> counts in the call graph are more precise during the inlining process,
> inlining decisions are better, etc.?
>
yes -- that is it.

David

