Michael Woerister via llvm-dev
2019-Sep-12 09:18 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
Hi everyone,

As part of my work for Mozilla's Low Level Tools team I've implemented PGO in the Rust compiler. The feature has been available since Rust 1.37 [1]. However, so far we have not seen any actual performance gains from enabling PGO for Rust code. Performance even seems to drop 1-3% with PGO enabled. I wonder why that is, and I'm hoping that someone here might have experience debugging PGO effectiveness.

PGO in the Rust compiler
------------------------

The Rust compiler uses IR-level instrumentation (the equivalent of Clang's `-fprofile-generate`/`-fprofile-use`). This has worked pretty well and even enables doing PGO for mixed Rust/C++ codebases when also using Clang.

The Rust compiler has regression tests that make sure that:

- instrumentation shows up in LLVM IR for the `generate` phase, and that
- profiling data is actually used during the `use` phase, i.e. that cold functions get marked with `cold` and hot functions get marked with `inline`.

I also verified manually that `branch_weights` are being set in IR. So, from my perspective, the PGO implementation does what it is supposed to do.

However, as already mentioned, in all benchmarks I've seen so far performance stays the same at best and often even suffers slightly. This is surprising because, for C++ code, Clang's version of IR-level instrumentation and PGO brings significant gains (up to 5-10% from what I've seen in benchmarks for Firefox).

One thing we noticed early on is that disabling the pre-inlining pass (`-disable-preinline`) seems to consistently improve the situation for Rust code. Doing that, we sometimes see performance wins of almost 1% over not using PGO. This again is very different from C++, where disabling this pass causes dramatic performance losses for the Firefox benchmarks. And a 1% performance improvement is still well below expectations, I think.

So my questions to you are:

- Has anybody here observed something similar while working on or with PGO?
- Are there certain known characteristics of LLVM IR code that inhibit PGO's effectiveness and that IR produced by `rustc` might exhibit?
- Does anybody know of a good source that describes how to effectively debug a problem like this?
- Does anybody know of a small example program in C/C++ that is known to profit from PGO and that could be re-implemented in Rust for comparison?

Thanks a lot for reading! Any help is appreciated.

-Michael

[1] https://blog.rust-lang.org/2019/08/15/Rust-1.37.0.html#profile-guided-optimization
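As a starting point for the comparison program asked about above, here is a minimal sketch of a workload with one heavily biased branch and a rarely taken slow path. It is not taken from the thread and is not known to profit from PGO; the function names (`slow_path`, `fast_path`, `process`, `checksum`) and the 1-in-1024 bias are illustrative assumptions. The build commands in the comment use the `-Cprofile-generate` and `-Cprofile-use` flags that rustc stabilized in 1.37, and assume a `main` that calls `checksum`.

```rust
// Hypothetical micro-benchmark, not from the thread.
// Possible build steps, assuming this file is main.rs with a `main`
// that calls `checksum`:
//   rustc -O -Cprofile-generate=/tmp/pgo-data ./main.rs   # instrumented build
//   ./main                                                # writes .profraw files
//   llvm-profdata merge -o ./merged.profdata /tmp/pgo-data
//   rustc -O -Cprofile-use=./merged.profdata ./main.rs    # profile-guided build

/// Rarely taken path; kept out of line so a profile can show it as cold.
#[inline(never)]
pub fn slow_path(x: u64) -> u64 {
    let mut acc = x;
    for i in 0..64 {
        acc = acc.rotate_left(5) ^ i;
    }
    acc
}

/// Common, cheap path.
pub fn fast_path(x: u64) -> u64 {
    x.wrapping_mul(3).wrapping_add(1)
}

/// The branch is taken toward `slow_path` for only 1 in 1024 inputs.
pub fn process(x: u64) -> u64 {
    if x % 1024 == 0 {
        slow_path(x)
    } else {
        fast_path(x)
    }
}

/// Drives `process` over the range 0..n and folds the results.
pub fn checksum(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(process(x)))
}
```

Porting the same four functions to C++ and building both versions with the same profile workflow would give a like-for-like pair whose branch weights and block layout can be compared.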
Teresa Johnson via llvm-dev
2019-Sep-12 15:18 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
I just have a couple of suggestions off the top of my head:

- Have you tried using the new pass manager (-fexperimental-new-pass-manager)? It has access to additional analysis info during inlining and is able to make more precise PGO-based inline decisions.
- Have you tried collecting profile data with and without PGO to see if you can compare where cycles are being spent? That's my usual way of debugging performance differences related to inlining or profile changes.
- Just a comment that it is odd you are getting better performance without the pre-inlining, which typically helps because you get better context-sensitive profile info. Maybe sanity check that the pre-inlining is kicking in for both the profile gen and use passes?

Teresa

--
Teresa Johnson | Software Engineer | tejohnson at google.com |
Teresa Johnson via llvm-dev
2019-Sep-12 16:31 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
On Thu, Sep 12, 2019 at 8:18 AM Teresa Johnson <tejohnson at google.com> wrote:
> - have you tried using the new pass manager
> (-fexperimental-new-pass-manager)? That has access to additional analysis
> info during inlining and is able to make more precise PGO based inline
> decisions.

(Although note that the above shouldn't make the difference between no performance gain and a typical PGO performance boost.)

Another thing I just thought of: are you using -ffunction-sections and -fdata-sections? These will allow for PGO-based function layout in the linker (assuming you are using lld or gold).

--
Teresa Johnson | Software Engineer | tejohnson at google.com |
Xinliang David Li via llvm-dev
2019-Sep-12 17:14 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
A couple of things to look at:

1) Do you see any profile mismatch warnings?
2) Use the following options to dump text output of branch probabilities with source information, and sanity check whether they are good: -Rpass=pgo-instrumentation -mllvm -pgo-emit-branch-prob
3) Does Rust code have lots of indirect calls? Use the option -Rpass=pgo-icall-prom to see if any indirect call promotions are happening.
4) Collect perf stats data about taken branches. With PGO, the number should be much smaller. Otherwise, the block layout is not using any profile data.
5) Use llvm-profdata to dump the profile. What does it look like?
   a) llvm-profdata show --detailed-summary ...
   b) llvm-profdata show --topn=100 ...
   c) llvm-profdata show --all-functions --ic-targets ...

David
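For point 3, it may help to see where indirect calls typically come from in Rust source. The sketch below is an illustration, not code from the thread: the trait `Op` and the types `AddOne` and `Double` are made-up names. Calls through `dyn Op` go through a vtable and usually appear as indirect calls in LLVM IR unless the optimizer can devirtualize them, while the generic version is monomorphized into direct calls. Whether real-world Rust code contains enough such calls for indirect call promotion to matter is not established by the thread.

```rust
/// Hypothetical trait used only for this illustration.
pub trait Op {
    fn apply(&self, x: u64) -> u64;
}

pub struct AddOne;
pub struct Double;

impl Op for AddOne {
    fn apply(&self, x: u64) -> u64 {
        x.wrapping_add(1)
    }
}

impl Op for Double {
    fn apply(&self, x: u64) -> u64 {
        x.wrapping_mul(2)
    }
}

/// Dynamic dispatch: each `apply` call is looked up through the vtable,
/// so it is a candidate for profile-guided indirect call promotion.
pub fn run_dyn(ops: &[Box<dyn Op>], x: u64) -> u64 {
    ops.iter().fold(x, |acc, op| op.apply(acc))
}

/// Static dispatch: the compiler generates one copy per concrete `O`,
/// and `apply` is a direct call that ordinary inlining can handle.
pub fn run_static<O: Op>(ops: &[O], x: u64) -> u64 {
    ops.iter().fold(x, |acc, op| op.apply(acc))
}
```

Compiling something like `run_dyn` with `--emit=llvm-ir` and looking for calls through a loaded function pointer is one way to estimate how common indirect calls are in a given crate.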
Philip Reames via llvm-dev
2019-Sep-12 21:57 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
On 9/12/19 2:18 AM, Michael Woerister via llvm-dev wrote:
> I also verified manually that `branch_weights` are being set
> in IR. So, from my perspective, the PGO implementation does
> what it is supposed to do.

One thing missing here is profile-guided devirtualization. That's super significant for Java; it might be highly relevant for Rust as well. However, I'd still expect to see *some* positive delta with what you've got, so don't start here. Your immediate problem is likely something else.

> - Are there certain known characteristics of LLVM IR code
> that inhibit PGO's effectiveness and that IR produced by
> `rustc` might exhibit?

Have you checked to make sure *all* of your branches have weights, including the ones which don't directly correspond to Rust conditionals? If you left off branch weights from range checks or something similar (i.e. something with a ton of occurrences), that might be confusing the heuristics enough to explain your results.
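To make the "branches that don't correspond to Rust conditionals" point concrete, here is a small sketch with hypothetical function names. Slice indexing in Rust carries an implicit bounds check, which (when the optimizer cannot prove the index is in range) lowers to a compare-and-branch to a panic path even though the source has no `if`. The second function spells out a similar check explicitly. Whether such compiler-generated branches were actually missing weights in rustc's output is a question the thread raises but does not answer.

```rust
/// Implicit check: `data[i]` branches to a panic path when `i` is out
/// of range, although no conditional appears in the source.
pub fn sum_indexed(data: &[u32], idx: &[usize]) -> u32 {
    let mut total = 0u32;
    for &i in idx {
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Explicit check: `get` returns `None` for out-of-range indices, and
/// this version skips them instead of panicking.
pub fn sum_checked(data: &[u32], idx: &[usize]) -> u32 {
    idx.iter().fold(0u32, |acc, &i| match data.get(i) {
        Some(&v) => acc.wrapping_add(v),
        None => acc,
    })
}
```

Emitting IR for `sum_indexed` with `rustc -O --emit=llvm-ir` from a profile-use build and checking whether the branch to the panic block carries `!prof` metadata is one way to answer the question for a specific case.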
Hiroshi Yamauchi via llvm-dev
2019-Sep-17 18:23 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
On Thu, Sep 12, 2019 at 2:18 AM Michael Woerister via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I also verified manually that `branch_weights` are being set
> in IR. So, from my perspective, the PGO implementation does
> what it is supposed to do.

Are the 'function_entry_count' and the 'ProfileSummary' metadata included in the IR? I think some PGO passes may need them in order to trigger.
Michael Woerister via llvm-dev
2019-Sep-18 08:10 UTC
[llvm-dev] PGO is ineffective for Rust - but why?
> Are the 'function_entry_count' and the 'ProfileSummary' metadata included in the IR? I think some PGO passes may expect them to trigger.

Yes, they are present in the IR generated during the `profile-use` phase. For my test case you can take a look at the IR generated here:

generate-phase: https://github.com/michaelwoerister/rust-pgo-test-programs/blob/master/branch_weights/outputs/opt_lib_gen.lld.ll
use-phase: https://github.com/michaelwoerister/rust-pgo-test-programs/blob/master/branch_weights/outputs/opt_lib_use.lld.ll
source code: https://github.com/michaelwoerister/rust-pgo-test-programs/blob/master/branch_weights/opt_lib.rs

However, when using the right linker and thus not running into the GNU ld bug mentioned earlier, I'm seeing proper speedups with PGO for this test case. For other (larger) test cases I still don't see speedups, so I'll need to take a closer look at those.