thr3ads.net - llvm dev - [llvm-dev] PGO is ineffective for Rust

Teresa Johnson via llvm-dev

2019-Sep-12 16:31 UTC

[llvm-dev] PGO is ineffective for Rust - but why?

On Thu, Sep 12, 2019 at 8:18 AM Teresa Johnson <tejohnson at google.com>
wrote:
> I just have a couple suggestions off the top of my head:
> - have you tried using the new pass manager
> (-fexperimental-new-pass-manager)? That has access to additional analysis
> info during inlining and is able to make more precise PGO based inline
> decisions.
>
(although note the above shouldn't make the difference between no
performance gain and a typical PGO performance boost)

Another thing I just thought of - are you using -ffunction-sections and
-fdata-sections? These will allow for PGO based function layout in the
linker (assuming you are using lld or gold).

> - have you tried collecting profile data with and without PGO to see if
> you can compare where cycles are being spent? That's my usual way of
> debugging performance differences related to inlining or profile changes.
> - just a comment that it is odd you are getting better performance without
> the pre-inlining - which typically helps because you get better
> context-sensitive profile info. Maybe sanity check that the pre inlining is
> kicking in for both the profile gen and use passes?
>
> Teresa
>
> On Thu, Sep 12, 2019 at 2:18 AM Michael Woerister via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi everyone,
>>
>> As part of my work for Mozilla's Low Level Tools team I've
>> implemented PGO in the Rust compiler. The feature is
>> available since Rust 1.37 [1]. However, so far we have not
>> seen any actual performance gains from enabling PGO for
>> Rust code. Performance even seems to drop 1-3% with PGO
>> enabled. I wonder why that is and I'm hoping that someone
>> here might have experience debugging PGO effectiveness.
>>
>>
>> PGO in the Rust compiler
>> ------------------------
>>
>> The Rust compiler uses IR-level instrumentation (the
>> equivalent of Clang's `-fprofile-generate`/`-fprofile-use`).
>> This has worked pretty well and even enables doing PGO for
>> mixed Rust/C++ codebases when also using Clang.
>>
>> The Rust compiler has regression tests that make sure that:
>>
>> - instrumentation shows up in LLVM IR for the `generate` phase,
>>   and that
>>
>> - profiling data is actually used during the `use` phase, i.e.
>>   that cold functions get marked with `cold` and hot functions
>>   get marked with `inline`.
>>
>> I also verified manually that `branch_weights` are being set
>> in IR. So, from my perspective, the PGO implementation does
>> what it is supposed to do.
>>
>> However, as already mentioned, in all benchmarks I've seen so
>> far performance seems to stay the same at best and often even
>> suffers slightly. Which is surprising because for C++ code
>> using Clang's version of IR-level instrumentation & PGO brings
>> significant gains (up to 5-10% from what I've seen in
>> benchmarks for Firefox).
>>
>> One thing we noticed early on is that disabling the
>> pre-inlining pass (`-disable-preinline`) seems to consistently
>> improve the situation for Rust code. Doing that we sometimes
>> see performance wins of almost 1% over not using PGO. This
>> again is very different to C++ where disabling this pass
>> causes dramatic performance losses for the Firefox benchmarks.
>> And 1% performance improvement is still well below
>> expectations, I think.
>>
>> So my questions to you are:
>>
>> - Has anybody here observed something similar while
>>   working on or with PGO?
>>
>> - Are there certain known characteristics of LLVM IR code
>>   that inhibit PGO's effectiveness and that IR produced by
>>   `rustc` might exhibit?
>>
>> - Does anybody know of a good source that describes how to
>>   effectively debug a problem like this?
>>
>> - Does anybody know of a small example program in C/C++
>>   that is known to profit from PGO and that could be
>>   re-implemented in Rust for comparison?
>>
>> Thanks a lot for reading! Any help is appreciated.
>>
>> -Michael
>>
>> [1] https://blog.rust-lang.org/2019/08/15/Rust-1.37.0.html#profile-guided-optimization
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |

Michael Woerister via llvm-dev

2019-Sep-13 11:04 UTC


Thank you all a lot, Teresa, David, and Philip!

This is giving me quite a todo list of things to check and try out. I'll
report back here when I have some findings.


Michael Woerister via llvm-dev

2019-Sep-16 15:41 UTC


So one interesting observation has already come out of this: I
confirmed that `rustc` indeed uses `-ffunction-sections` and
`-fdata-sections` on all platforms except for macOS. When trying out
different linkers for a small test case [1], however, I found that
there were rather large differences in execution time:

ld (no PGO) = 172 ms
ld (PGO) = 196 ms

gold (no PGO) = 182 ms
gold (PGO) = 141 ms

lld (no PGO) = 193 ms
lld (PGO) = 171 ms

So `gold` and `lld` both profit from PGO quite a bit, while `ld`
linked programs are slower with PGO. I then noticed that branch
weights for `ld` were missing from most branches, while the counts for
the other linkers are correct. All of this suggests to me that
something goes wrong when `ld` tries to link in the profiling runtime.
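
To put the measurements above on a common scale, the relative change per linker can be worked out directly. The following is only a sketch of that arithmetic; the function name `percent_change` is made up for illustration, and the numbers are the ones listed above.

```rust
/// Relative change in run time when going from `baseline_ms` to `pgo_ms`,
/// in percent. A negative value means the PGO build is faster.
fn percent_change(baseline_ms: f64, pgo_ms: f64) -> f64 {
    (pgo_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // (linker, no-PGO ms, PGO ms), copied from the measurements above.
    let timings = [
        ("ld", 172.0, 196.0),
        ("gold", 182.0, 141.0),
        ("lld", 193.0, 171.0),
    ];
    for &(linker, baseline, pgo) in timings.iter() {
        println!("{:>4}: {:+.1}%", linker, percent_change(baseline, pgo));
    }
}
```

That works out to roughly 14% slower with `ld`, and roughly 22.5% and 11.4% faster with `gold` and `lld` respectively.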

I'll be investigating further.

[1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights
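
For readers who want something to experiment with, here is a minimal sketch of a program with one heavily biased branch, which is the kind of thing branch weights are meant to capture. It is not the contents of the repository linked above; the names `rare_path`, `process`, and `lcg`, the 95/5 split, and the input size are all assumptions made up for illustration, and no claim is made about how much PGO helps on it.

```rust
/// Rarely taken path, kept out of line so it stays a distinct call target.
#[inline(never)]
fn rare_path(v: u32) -> u64 {
    (v as u64 * v as u64) % 1_000_003
}

/// Sums the input, sending only values >= `threshold` through `rare_path`.
fn process(values: &[u32], threshold: u32) -> u64 {
    let mut sum = 0u64;
    for &v in values {
        if v >= threshold {
            sum += rare_path(v);
        } else {
            sum += v as u64;
        }
    }
    sum
}

/// Small deterministic generator (an LCG) so the sketch needs no crates.
fn lcg(state: &mut u32) -> u32 {
    *state = state.wrapping_mul(1664525).wrapping_add(1013904223);
    *state >> 16
}

fn main() {
    let mut state: u32 = 1;
    // Values in 0..100; with a threshold of 95 the rare path is taken
    // for roughly 5% of the elements.
    let values: Vec<u32> = (0..1_000_000).map(|_| lcg(&mut state) % 100).collect();
    println!("{}", process(&values, 95));
}
```

One way to try it, assuming a toolchain where `llvm-profdata` is available: build with `rustc -O -C profile-generate=<dir>`, run the binary, merge the raw profiles with `llvm-profdata merge -o <file> <dir>`, then rebuild with `rustc -O -C profile-use=<file>` and compare timings per linker.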



Teresa Johnson via llvm-dev

2019-Sep-16 17:07 UTC


Interesting. By ld do you mean GNU ld? I know GNU ld does "work" with
LLVM's gold plugin, but it's an untested combination and not recommended.
I wouldn't be surprised if there were some issues around it not passing
necessary info to the gold plugin.

Teresa


-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |

Xinliang David Li via llvm-dev

2019-Sep-16 17:40 UTC


Can you clarify whether the performance difference is caused by using
different linkers for the instrumentation build? If that is the case, try
dumping the sections of the resulting binaries and comparing the
__llvm_prf_* sections. Also check the arguments passed to the linker: they
should include -u__llvm_profile_runtime to force the profile runtime to be
linked in.

David
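
The two checks suggested above can be sketched as small helpers. This assumes the linker arguments and the section names have already been captured as strings by some other means (for example from a section dump of each binary); the function names are hypothetical, and the inputs in `main` are illustrative rather than output from a real build.

```rust
/// True if the linker arguments force the profile runtime symbol to be kept,
/// written either as `-u__llvm_profile_runtime` or as `-u __llvm_profile_runtime`.
fn forces_profile_runtime(args: &[&str]) -> bool {
    const SYMBOL: &str = "__llvm_profile_runtime";
    args.windows(2).any(|w| w[0] == "-u" && w[1] == SYMBOL)
        || args.iter().any(|a| a.strip_prefix("-u") == Some(SYMBOL))
}

/// Keeps only the profiling sections, i.e. names starting with `__llvm_prf_`.
fn profile_sections<'a>(sections: &[&'a str]) -> Vec<&'a str> {
    sections
        .iter()
        .copied()
        .filter(|s| s.starts_with("__llvm_prf_"))
        .collect()
}

fn main() {
    // Illustrative inputs only, not taken from a real link line or binary.
    let args = ["-o", "a.out", "-u__llvm_profile_runtime"];
    let with_gold = [".text", "__llvm_prf_cnts", "__llvm_prf_data", "__llvm_prf_names"];
    let with_ld = [".text", "__llvm_prf_names"];

    println!("runtime forced: {}", forces_profile_runtime(&args));
    println!("gold: {:?}", profile_sections(&with_gold));
    println!("ld:   {:?}", profile_sections(&with_ld));
}
```

A difference between the two filtered lists would point at the instrumentation build rather than at the optimized build.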

