Min-Yih Hsu via llvm-dev
2020-Sep-09 00:20 UTC
[llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info
We would like to propose a new feature that disables optimizations on IR functions considered “cold” by PGO profiles. The primary goal of this work is to improve code optimization speed (which also improves compilation and LTO speed) without much impact on target code performance.

The mechanism is pretty simple: in the second phase (i.e. the optimization phase) of PGO, we add the `optnone` attribute to functions that are considered “cold”, that is, functions with low profiling counts. A similar approach can be applied to loops. The rationale behind this idea is equally simple: if a given IR function will not be executed frequently, we shouldn’t waste time optimizing it. Similar approaches can be found in modern JIT compilers for dynamic languages (e.g. JavaScript and Python) that adopt a multi-tier compilation model: only “hot” functions or execution traces are brought to higher-tier compilers for aggressive optimization.

In addition to de-optimizing functions whose profiling counts are exactly zero (`-fprofile-deopt-cold`), we also provide a knob (`-fprofile-deopt-cold-percent=<X percent>`) to adjust the “cold threshold”: after sorting the profiling counts of all functions, this option de-optimizes functions whose count values sit in the lower X percent.

We evaluated this feature on the LLVM Test Suite (the Bitcode, SingleSource, and MultiSource sub-folders were selected). Both compilation speed and target program performance are measured by the number of instructions reported by Linux perf. The table below shows the percentage of compilation speed improvement and target performance overhead relative to a baseline that only uses (instrumentation-based) PGO.
Experiment Name          Compile Speedup    Target Overhead
DeOpt Cold Zero Count    5.13%              0.02%
DeOpt Cold 25%           8.06%              0.12%
DeOpt Cold 50%           13.32%             2.38%
DeOpt Cold 75%           17.53%             7.07%

(The “DeOpt Cold Zero Count” experiment only disables optimizations on functions whose profiling counts are exactly zero. The rest of the experiments disable optimizations on functions whose profiling counts are in the lower X%.)

We also did evaluations on FullLTO; here are the numbers:

Experiment Name          Link Time Speedup    Target Overhead
DeOpt Cold Zero Count    10.87%               1.29%
DeOpt Cold 25%           18.76%               1.50%
DeOpt Cold 50%           30.16%               3.94%
DeOpt Cold 75%           38.71%               8.97%

(The link time presented here includes the LTO and code generation time. We omit the compile time numbers since they are not particularly interesting in an LTO setup.)

From the above experiments we observed that the compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already got a 5~10% “free ride” on compilation / linking speed with barely any target performance penalty. We believe the above numbers justify this patch as useful for improving build time with little overhead.

Here are the patches for review:
* Modifications on LLVM instrumentation-based PGO: https://reviews.llvm.org/D87337
* Modifications on the Clang driver: https://reviews.llvm.org/D87338

Credit: This project was originally started by Paul Robinson <paul.robinson at sony.com> and Edward Dawson <Edd.Dawson at sony.com> from the Sony PlayStation compiler team. I picked it up when I was interning there this summer.

Thank you for reading.

-Min

-- 
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).
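[Editor's note: to make the percentile knob concrete, here is a minimal stand-alone sketch of the threshold selection described above. This is an illustrative model only, not the actual code in D87337; the function name and exact tie-breaking behavior are made up for the example.]

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Given every function's profile entry count, return the count value at
// or below which a function falls into the lower `Percent` of the sorted
// counts. In the real pass, such functions would receive the `optnone`
// attribute (which in LLVM IR also requires `noinline`) and be skipped
// by the optimization pipeline.
uint64_t coldCountThreshold(std::vector<uint64_t> Counts, unsigned Percent) {
  assert(Percent <= 100 && "percentage out of range");
  // Percent == 0 models plain -fprofile-deopt-cold: only functions that
  // were never executed (count == 0) are treated as cold.
  if (Counts.empty() || Percent == 0)
    return 0;
  std::sort(Counts.begin(), Counts.end());
  size_t Idx = (Counts.size() * Percent) / 100;
  if (Idx == 0)
    return 0;
  return Counts[Idx - 1];
}
```

For example, with counts {1, 2, 3, 4} and X = 50, the threshold is 2, so the two least-executed functions would be marked `optnone`.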
Renato Golin via llvm-dev
2020-Sep-09 08:03 UTC
[llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info
On Wed, 9 Sep 2020 at 01:21, Min-Yih Hsu via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> From the above experiments we observed that the compilation / link time
> improvement scaled linearly with the percentage of cold functions we
> skipped. Even if we only skipped functions that never got executed (i.e.
> had counter values equal to zero, which is effectively “0%”), we already
> got a 5~10% “free ride” on compilation / linking speed with barely any
> target performance penalty.

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Just looking at your test-suite numbers, not optimising functions "never used" during the profile run sounds like an obvious "default PGO behaviour" to me. The flag defining the percentage range is a good option for development builds.

I imagine you have run this on internal programs and found it beneficial too, not just on the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try this locally and spot any issues.

cheers,
--renato
Tobias Hieta via llvm-dev
2020-Sep-09 10:25 UTC
[llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info
Hello,

We use PGO to optimize clang itself. I can see if I have time to give this patch some testing. Anything special to look out for besides the compile benchmark and the time to build clang? Do you expect any changes in code size?

On Wed, Sep 9, 2020, 10:03 Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> On Wed, 9 Sep 2020 at 01:21, Min-Yih Hsu via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> From the above experiments we observed that the compilation / link time
>> improvement scaled linearly with the percentage of cold functions we
>> skipped. Even if we only skipped functions that never got executed (i.e.
>> had counter values equal to zero, which is effectively “0%”), we already
>> got a 5~10% “free ride” on compilation / linking speed with barely any
>> target performance penalty.
>
> Hi Min (Paul, Edd),
>
> This is great work! Small, clear patch, substantial impact, virtually no
> downsides.
>
> Just looking at your test-suite numbers, not optimising functions "never
> used" during the profile run sounds like an obvious "default PGO
> behaviour" to me. The flag defining the percentage range is a good option
> for development builds.
>
> I imagine you have run this on internal programs and found it beneficial
> too, not just on the LLVM test-suite (which is very small and
> non-representative). It would be nice if other groups that already use
> PGO could try this locally and spot any issues.
>
> cheers,
> --renato
Min-Yih Hsu via llvm-dev
2020-Sep-09 16:42 UTC
[llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info
Hi Renato,

On Wed, Sep 9, 2020 at 1:03 AM Renato Golin <rengolin at gmail.com> wrote:

> Hi Min (Paul, Edd),
>
> This is great work! Small, clear patch, substantial impact, virtually no
> downsides.

Thank you :-)

> Just looking at your test-suite numbers, not optimising functions "never
> used" during the profile run sounds like an obvious "default PGO
> behaviour" to me. The flag defining the percentage range is a good option
> for development builds.
>
> I imagine you have run this on internal programs and found it beneficial
> too, not just on the LLVM test-suite (which is very small and
> non-representative). It would be nice if other groups that already use
> PGO could try this locally and spot any issues.

Good point! We are aware that the LLVM Test Suite is quite "SPEC-alike" and leans toward scientific computation rather than real-world use cases. So we also ran experiments on the V8 JavaScript engine, which is certainly a huge code base and a good real-world example. It showed a 10~13% speed improvement in optimization + codegen time with up to 4% target performance overhead. (Note that, for some hacky reasons, many of the V8 source files spend over 80% or even 95% of their compilation time in the frontend, so measuring total compilation time would be heavily skewed and unable to reflect the impact of this feature.)

Best,
-Min

> cheers,
> --renato

-- 
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).
Modi Mo via llvm-dev
2020-Sep-10 01:18 UTC
[llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info
The 1.29% is pretty considerable for functions that should never be hit according to profile information. This can indicate that something is amiss with the profile quality and that certain hot functions are not getting caught. Alternatively, given the ~5% code size increase you mention in the other thread, the cold code may not be getting moved out to a cold page, so i-cache pollution ends up being a factor. I think it would be worthwhile to dig deeper into why there is any performance degradation at all on functions that should never be called.

Also, if you’re curious how to build clang itself with PGO, the documentation is here: https://llvm.org/docs/HowToBuildWithPGO.html

On 9/8/20, 5:21 PM, "llvm-dev on behalf of Min-Yih Hsu via llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of llvm-dev at lists.llvm.org> wrote:

> We also did evaluations on FullLTO; here are the numbers:
>
> Experiment Name          Link Time Speedup    Target Overhead
> DeOpt Cold Zero Count    10.87%               1.29%
> DeOpt Cold 25%           18.76%               1.50%
> DeOpt Cold 50%           30.16%               3.94%
> DeOpt Cold 75%           38.71%               8.97%
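[Editor's note: for readers following the HowToBuildWithPGO link above, the document describes a two-stage flow along roughly these lines. The CMake option names (`LLVM_BUILD_INSTRUMENTED`, `LLVM_PROFDATA_FILE`) are real as of LLVM 11, but treat this as an orientation sketch and consult the linked page for the supported procedure.]

```shell
# Stage 1: build an instrumented clang (IR-level instrumentation).
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS=clang \
      -DLLVM_BUILD_INSTRUMENTED=IR \
      ../llvm
ninja clang

# Train: compile a representative workload with the instrumented clang
# (each invocation writes a .profraw file), then merge the raw profiles.
llvm-profdata merge -output=clang.profdata profiles/*.profraw

# Stage 2: rebuild clang using the merged profile.
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS=clang \
      -DLLVM_PROFDATA_FILE=$(pwd)/clang.profdata \
      ../llvm
ninja clang
```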
Wenlei He via llvm-dev
2020-Sep-10 04:50 UTC
[llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info
1%+ overhead is indeed interesting. If you use lld as the linker (together with the new pass manager), you should be able to get a good profile-guided function-level layout, so dead functions are moved out of the hot pages.

This may also be related to a subtle pass-ordering issue. Pre-inline counts may not be super accurate, but we can’t use post-inline counts either, given that the CGSCC inliner runs halfway through the optimization pipeline. Looking at the patch, it seems the decision is made at PGO annotation time, which is between pre-instrumentation inlining and CGSCC inlining.

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Modi Mo via llvm-dev <llvm-dev at lists.llvm.org>
Reply-To: Modi Mo <modimo at fb.com>
Date: Wednesday, September 9, 2020 at 6:18 PM
To: Min-Yih Hsu <minyihh at uci.edu>, llvm-dev <llvm-dev at lists.llvm.org>, "cfe-dev (cfe-dev at lists.llvm.org)" <cfe-dev at lists.llvm.org>, Hongtao Yu <hoy at fb.com>
Subject: Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

> The 1.29% is pretty considerable for functions that should never be hit
> according to profile information. This can indicate that something is
> amiss with the profile quality and that certain hot functions are not
> getting caught. Alternatively, given the ~5% code size increase you
> mention in the other thread, the cold code may not be getting moved out
> to a cold page, so i-cache pollution ends up being a factor. I think it
> would be worthwhile to dig deeper into why there is any performance
> degradation at all on functions that should never be called.
>
> Also, if you’re curious how to build clang itself with PGO, the
> documentation is here: https://llvm.org/docs/HowToBuildWithPGO.html
>
> On 9/8/20, 5:21 PM, Min-Yih Hsu via llvm-dev wrote:
>
>> We also did evaluations on FullLTO; here are the numbers:
>>
>> Experiment Name          Link Time Speedup    Target Overhead
>> DeOpt Cold Zero Count    10.87%               1.29%
>> DeOpt Cold 25%           18.76%               1.50%
>> DeOpt Cold 50%           30.16%               3.94%
>> DeOpt Cold 75%           38.71%               8.97%
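[Editor's note: the profile-guided function layout Wenlei mentions can be exercised roughly as follows. The flag names (`-fuse-ld=lld`, `--symbol-ordering-file`, `-ffunction-sections`) come from clang and lld documentation; this is an illustrative sketch, not a recipe from the thread, and `main.c` / `hot_symbols.txt` are placeholder names.]

```shell
# Build with PGO and let lld order functions by profile-derived call
# graph weights: under -fprofile-use the compiler emits call graph
# profile metadata, and lld's call-graph profile sort (on by default)
# clusters hot functions together, pushing cold and dead functions
# away from the hot pages.
clang -O2 -fprofile-use=code.profdata -ffunction-sections \
      -fuse-ld=lld main.c -o main

# Alternatively, an explicit layout can be forced with a symbol
# ordering file that lists the hot symbols first:
clang -O2 -ffunction-sections -fuse-ld=lld \
      -Wl,--symbol-ordering-file=hot_symbols.txt main.c -o main
```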