thr3ads.net - llvm dev - [llvm-dev] RFC: PGO Late instrumentation for LLVM [Sep 2015]


Diego Novillo via llvm-dev

2015-Aug-11 18:07 UTC

[llvm-dev] RFC: PGO Late instrumentation for LLVM

One aspect of this that I have not seen discussed is that middle-end
instrumentation makes PGO optimizations available to front-ends other than
Clang.

While I agree that FE instrumentation could be improved, it still requires
every FE to implement essentially the same common functionality.  Having
PGO instrumentation generated in the middle-end allows every FE to
automatically take advantage of PGO.

Additionally, some of the overhead imposed by FE instrumentation is not
really all that easy to get rid of.  You end up duplicating functionality
that is more naturally implemented in the middle end.

I see the two approaches as supplementary, rather than complementary.  One
does not negate the other.  Some of the optimizations we'd do in the FE,
may hurt coverage.  Instead, by instrumenting in the middle end, you can
focus exclusively on performance (coverage be damned).


Diego.

Sean Silva via llvm-dev

2015-Aug-12 05:11 UTC


[llvm-dev] RFC: PGO Late instrumentation for LLVM

On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> One aspect of this that I have not seen discussed is that middle-end
> instrumentation enables PGO optimizations to front-ends other than Clang.
>
> While I agree that FE instrumentation could be improved, it still requires
> every FE to implement essentially the same common functionality.  Having
> PGO instrumentation generated in the middle-end allows every FE to
> automatically take advantage of PGO.
>
This is a really good point, and I agree with it. We may have gotten off on
the wrong foot since Rong's email focused so heavily on comparing with the
frontend instrumentation. As far as I see it, Rong's proposal has a couple
different parts:

1. Infrastructure for IR-level instrumentation-based PGO
2. Changes to the pass pipeline so that a hypothetical IR-level
instrumentation-based PGO is more effective
3. MST algorithm with profile feedback for optimal placement of counter
updates.

I think 1. is a no-brainer, if only so that all LLVM clients can benefit
from PGO, and also (as you pointed out below) so that it can have an
exclusive focus on performance. If it is sufficiently flexible, it may even
make sense to restrict clang's frontend instrumentation-based profiling to
non-performance stuff, and have clang directly interoperate with the
IR-level PGO for performance-related PGO use cases, just like any other
frontend would.

Philip and Sanjoy, out of curiosity do you guys use your own
instrumentation placement for PGO? Is an IR-level PGO infrastructure
upstream something you guys would be interested in?

I think that 2. is something that once we have 1. we will be able to
evaluate better, but for now my opinion is that we should be able to make
good progress without digging into that.

I think that 3. is a no-brainer if it provides a really significant win,
but without 1. we can't really measure its effect in isolation. It also has
a usability problem since it requires feeding in an existing profile for
the *instrumented* build, but if the benefit is very significant this may
be worth it for some users. We will probably be able to easily refactor 1.
as needed into an MST approach that degrades gracefully to using static
heuristics in the absence of real profile information, so is not a
maintenance burden (maybe even helps by providing a good framework in which
to develop effective static heuristics).
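
For readers unfamiliar with the MST idea in 3., here is a minimal sketch of
the textbook technique, not the code from Rong's patch. It treats the CFG as
an undirected graph, builds a maximum spanning tree over estimated edge
weights, and puts counters only on the edges left outside the tree. The
function name, the weights, and the fake exit-to-entry edge are all
assumptions made for illustration.

```python
# Toy sketch: choose which CFG edges get counters via a maximum spanning tree.
# Edge weights are assumed frequency estimates (static guesses or an earlier
# profile); none of these names come from LLVM.

def edges_to_instrument(num_nodes, weighted_edges):
    """Return the edges left outside a maximum spanning tree (Kruskal)."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    instrumented = []
    # Visit the heaviest edges first so they tend to end up inside the tree.
    for src, dst, weight in sorted(weighted_edges, key=lambda e: -e[2]):
        a, b = find(src), find(dst)
        if a == b:
            instrumented.append((src, dst))  # closes a cycle: needs a counter
        else:
            parent[a] = b                    # tree edge: count is derivable
    return instrumented


# Diamond CFG: 0=entry, 1=then, 2=else, 3=exit, plus a fake exit->entry edge.
cfg = [(0, 1, 90), (0, 2, 10), (1, 3, 90), (2, 3, 10), (3, 0, 100)]
counters = edges_to_instrument(4, cfg)
```

With 5 edges and 4 nodes, only 5 - (4 - 1) = 2 edges carry counters; the
remaining counts can be recovered from flow conservation at each node.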

For the time being, I think we can avoid discussion of 2. and 3. until we
have more of 1. working. So I think it would be most productive if we focus
this discussion on 1.

> Additionally, some of the overhead imposed by FE instrumentation is not
> really all that easy to get rid of.  You end up duplicating functionality
> that is more naturally implemented in the middle end.
>
Yeah, I was looking into a couple of other simple approaches and quickly
found out that I was basically replicating much of the sort of logic that
the inliner already has.

-- Sean Silva

>
> I see the two approaches as supplementary, rather than complementary.  One
> does not negate the other.  Some of the optimizations we'd do in the FE,
> may hurt coverage.  Instead, by instrumenting in the middle end, you can
> focus exclusively on performance (coverage be damned).
>
>
> Diego.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org         http://llvm.cs.uiuc.edu
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Philip Reames via llvm-dev

2015-Aug-12 18:56 UTC


[llvm-dev] RFC: PGO Late instrumentation for LLVM

On 08/11/2015 10:11 PM, Sean Silva wrote:
> Philip and Sanjoy, out of curiosity do you guys use your own
> instrumentation placement for PGO? Is an IR-level PGO infrastructure
> upstream something you guys would be interested in?

We have entirely separate infrastructure for this.  An IR-level PGO
instrumentation infrastructure would be of no immediate benefit to me.
Even in the long term, we'd have some restrictions on the profile
collection points which are easy to reason about in the frontend, but
would be hard in the middle end.

Philip

Rong Xu via llvm-dev

2015-Aug-19 22:39 UTC


[llvm-dev] RFC: PGO Late instrumentation for LLVM

We collected more data to address some of the questions from the reviewers.
Note that this time we use clang itself as the benchmark. We chose clang
because we think it's a typical C++ program and the reviewers here have
good knowledge of the code base.

What we measure is running time for clang to compile a large preprocessed
source file (4.98M lines of .ii file), using different compilation modes.
All the numbers reported here are the average running time of 5 runs in
seconds.

*(1) Performance b/w late instrumentation v.s. not instrumenting single BB
functions*

We first compare the performance of the various instrumentation modes.
-----------------------------------------------------------------------------
  Config                   wall_time_for_instr   ratio_vs_base   profile_size
(1) base O2                     80.386             100.0%           --
(2) FE-based Instr             201.658             250.8%         65238880
(3) late Instr                 103.662             129.0%         14860144
(4) (3) + w/o pre-inline       199.924             248.7%         70762720
(5) (4) + Silva                119.904             149.2%         24499528

Config(5) used the simple heuristic that Sean Silva proposed: not
instrumenting single BB functions that contain less than 10 instructions
(excluding debug and phi stmts).
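
As a rough illustration of the Config(5) rule, here is a toy sketch. It
models a function as a list of basic blocks; the `Instr` class, the
`should_instrument` name and the opcode strings are hypothetical stand-ins,
not LLVM APIs.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    opcode: str             # e.g. "add", "phi", "ret" (illustrative names)
    is_debug: bool = False  # stands in for debug-info intrinsics


def should_instrument(blocks, threshold=10):
    """blocks: list of basic blocks, each a list of Instr.

    Skip single-BB functions with fewer than `threshold` instructions,
    not counting debug and phi instructions.
    """
    if len(blocks) != 1:
        return True
    counted = [i for i in blocks[0] if not i.is_debug and i.opcode != "phi"]
    return len(counted) >= threshold


# A tiny getter-style function: one block, three real instructions.
tiny = [[Instr("load"), Instr("dbg", is_debug=True), Instr("add"), Instr("ret")]]
# A single block with twelve real instructions.
big = [[Instr("add") for _ in range(11)] + [Instr("ret")]]
```

Here `should_instrument(tiny)` is false and `should_instrument(big)` is
true; any function with more than one block is always instrumented.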

We can see:
1) The simple heuristic of not instrumenting small single BB functions
improves instrumentation performance as expected.
2) Using the simple heuristic is still slower than late instrumentation with
pre-inlining: the latter is 15% faster.
3) Late instrumentation produces the smallest profile size: it's 39%
smaller than using the simple heuristic.

The result is expected as pre-inlining can handle more cases than the
simple heuristic. There is a significant performance gap between the simple
heuristic (5) and late instrumentation (3).
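
The two percentages quoted for (3) versus (5) can be reproduced from the
table. This is just a small sketch of the arithmetic; the variable names
are mine.

```python
# Wall times (seconds) and profile sizes taken from the table above.
late_instr_time = 103.662      # config (3)
heuristic_time = 119.904       # config (5)
late_profile_size = 14860144   # config (3)
heuristic_profile_size = 24499528  # config (5)

# How much faster the late-instrumented binary runs than the heuristic one.
speed_gain = heuristic_time / late_instr_time - 1

# How much smaller the late-instrumentation profile is.
size_reduction = 1 - late_profile_size / heuristic_profile_size

print(f"speed gain: {speed_gain:.1%}, profile size reduction: {size_reduction:.1%}")
```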

We also used a few larger internal benchmarks to further validate the above
result. The following table shows the slowdown compared to the base O2. The
labels (2) to (5) refer to the same config as in the previous table.
------------------------------------------------------
Program                (2)      (3)      (4)      (5)
C++benchmark16      -45.24%  -12.93%  -43.84%  -24.74%
C++benchmark17      -90.86%  -58.19%  -87.77%  -80.62%
C++benchmark18      -95.32%  -54.75%  -91.21%  -82.56%

We can see the same trend as in the clang benchmark: the simple heuristic (5)
recovers a lot of the performance loss compared with FE-based
instrumentation, but is still significantly worse than late
instrumentation (3).

*(2) Performance impact of context sensitivity*

LLVM does not use the profile information fully in the back-end
optimizations. For instance, inlining does not fully use the profile counts
-- it only marks functions with hot/cold attributes based on function entry
counts. To evaluate the impact of profile context sensitivity, GCC is used
in the experiment. Note that GCC PGO improves clang performance a lot more
than clang PGO.

First we summarize the methodology used in the experiment:
0)  build clang with GCC O2 without early inlining and measure clang's
performance. GCC early inlining (einline) is similar to pre-inline used by
late instrumentation.
1) build clang with GCC O2 with early inlining and measure performance.

The performance difference of 1) and 0) is denoted as E which measures the
contribution of early inlining.

2) build clang with GCC O2 + PGO without early inlining.
3) build clang with GCC O2 + PGO with early inlining.

The performance difference of 3) and 2) is denoted as EC. It consists of
roughly two parts: a) the early inlining contribution, and b) the context
sensitive profiling enabled by early inlining.

The contribution of context sensitive profiling can be estimated by EC - E
above.
-------------------------------------------------------------------------------
Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
(0) base w/o einline             84.946            1.000          0.934
(1) base O2                      79.310            1.071          1.000
(2) profile-arcs w/o einline     63.518            1.337          1.249
(3) profile-arcs                 48.364            1.756          1.640

We see the following:
1) GCC PGO with early inlining improves clang performance by 64.0% (v.s.
base O2 w/ early inline).
2) GCC PGO w/o early inlining improves clang performance by 33.7% (v.s.
base O2 w/o early inline).
3) Early inlining performance contribution is about 7.1%.
4) Profile context sensitivity contribution is estimated to be 23.2% (i.e.
64.0% - 33.7% - 7.1%), which is pretty significant.
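
The estimate in 4) can be recomputed directly from the wall times in the
table. This sketch follows the same subtraction as the list above; the
variable names are assumptions, and the unrounded times give a result that
differs slightly from the one obtained with the rounded percentages.

```python
# Wall times in seconds, from the table above.
t_base_no_einline = 84.946  # config (0)
t_base = 79.310             # config (1)
t_pgo_no_einline = 63.518   # config (2)
t_pgo = 48.364              # config (3)

# E: gain from early inlining alone, (1) vs (0).
early_inline_gain = t_base_no_einline / t_base - 1
# PGO gain without early inlining, (2) vs (0).
pgo_gain_no_einline = t_base_no_einline / t_pgo_no_einline - 1
# PGO gain with early inlining, (3) vs (1).
pgo_gain_einline = t_base / t_pgo - 1

# Remainder attributed to context-sensitive profiling.
context_gain = pgo_gain_einline - pgo_gain_no_einline - early_inline_gain
print(f"estimated context sensitivity contribution: {context_gain:.1%}")
```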

*(3) Pre-inline pass impact on the value profiling*

Again, we use GCC as the platform to estimate:

--------------------------------------------------------
  Config                            wall_time_for_instr
(2) profile-arcs                      115.720
(3) profile-arcs w/o einline          310.560
(4) profile-generate                  139.952
(5) profile-generate w/o einline      680.910

In GCC, -fprofile-generate enables -fprofile-arcs as well as value
profiling. The table above shows that with value profiling, the impact of
pre-inlining on instrumented binary performance is even larger. Without
value profiling, disabling pre-inlining increases runtime by about 170%
(115.720 to 310.560 seconds), while with value profiling the increase is
about 390% (139.952 to 680.910 seconds).
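
A quick sketch of how those two increases follow from the table (variable
names are mine):

```python
# Instrumented-binary wall times in seconds, from the table above.
arcs = 115.720                # profile-arcs
arcs_no_einline = 310.560     # profile-arcs w/o einline
generate = 139.952            # profile-generate
generate_no_einline = 680.910 # profile-generate w/o einline

# Relative runtime increase caused by disabling early inlining.
arcs_increase = arcs_no_einline / arcs - 1
value_increase = generate_no_einline / generate - 1

print(f"edge profiling only: +{arcs_increase:.2f}x")
print(f"with value profiling: +{value_increase:.2f}x")
```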


Xinliang David Li via llvm-dev

2015-Sep-01 21:10 UTC


[llvm-dev] RFC: PGO Late instrumentation for LLVM

This is a late reply -- the email somehow skipped my inbox.
> Philip and Sanjoy, out of curiosity do you guys use your own instrumentation
> placement for PGO? Is an IR-level PGO infrastructure upstream something you
> guys would be interested in?
>
> I think that 2. is something that once we have 1. we will be able to
> evaluate better, but for now my opinion is that we should be able to make
> good progress without digging into that.
>
> I think that 3. is a no-brainer if it provides a really significant win, but
> without 1. we can't really measure its effect in isolation. It also has a
> usability problem since it requires feeding in an existing profile for the
> *instrumented* build, but if the benefit is very significant this may be
> worth it for some users. We will probably be able to easily refactor 1. as
> needed into an MST approach that degrades gracefully to using static
> heuristics in the absence of real profile information, so is not a
> maintenance burden (maybe even helps by providing a good framework in which
> to develop effective static heuristics).
Regarding 3, I am not sure which usability issue you are referring to.
Can you elaborate?

thanks,

David



