Rong Xu via llvm-dev
2015-Sep-01 18:03 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
Justin, Sean and other people interested in this proposal,

I'm wondering if you have had a chance to read the new experiment results in my last email, sent 2 weeks ago. Can you share your thoughts, or do you have other tests that you want to run?

I'm in the final stage of preparing the patch. If you are OK with it, I can send out the patch soon.

Thanks,

-Rong

On Wed, Aug 19, 2015 at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:

> Thank you for sharing the data. I haven't been following the discussion,
> but this data made for very interesting reading on its own.
>
> Philip
>
>
> On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote:
>
> We collected more data to address some of the questions from the
> reviewers. Note that this time we use clang itself as the benchmark. We chose
> clang because we think it's a typical C++ program and the reviewers here
> have good knowledge of the code base.
>
> What we measure is the running time for clang to compile a large preprocessed
> source file (4.98M lines of .ii file), using different compilation modes.
> All the numbers reported here are the average running time of 5 runs, in
> seconds.
>
> *(1) Performance of late instrumentation vs. not instrumenting single BB
> functions*
>
> We first compare the performance of the various instrumentation configurations.
>
> ----------------------------------------------------------------------------
> Config                     wall_time_for_instr  ratio_vs_base  profile_size
> (1) base O2                             80.386         100.0%            --
> (2) FE-based Instr                     201.658         250.8%      65238880
> (3) late Instr                         103.662         129.0%      14860144
> (4) (3) + w/o pre-inline               199.924         248.7%      70762720
> (5) (4) + Silva                        119.904         149.2%      24499528
>
> Config (5) used the simple heuristic that Sean Silva proposed: not
> instrumenting single BB functions that contain fewer than 10 instructions
> (excluding debug and phi stmts).
>
> We can see:
> 1) The simple heuristic of not instrumenting small single BB functions
> improves instrumentation performance as expected.
> 2) Using the simple heuristic is still slower than late instrumentation with
> pre-inlining: the latter is 15% faster.
> 3) Late instrumentation produces the smallest profile size: it's 39%
> smaller than using the simple heuristic.
>
> The result is expected, as pre-inlining can handle more cases than the
> simple heuristic. There is a significant performance gap between the simple
> heuristic (5) and late instrumentation (3).
>
> We also used a few larger internal benchmarks to further validate the
> above result. The following table shows the slowdown compared to the base
> O2. The labels (2) to (5) refer to the same configs as in the previous table.
> ------------------------------------------------------
> Program             (2)       (3)       (4)       (5)
> C++benchmark16  -45.24%   -12.93%   -43.84%   -24.74%
> C++benchmark17  -90.86%   -58.19%   -87.77%   -80.62%
> C++benchmark18  -95.32%   -54.75%   -91.21%   -82.56%
>
>
> We can see the same trend as in the clang benchmark: the simple heuristic (5)
> recovers a lot of the performance loss compared with FE-based instrumentation,
> but is still significantly worse than late instrumentation (3).
>
> *(2) Performance impact of context sensitivity*
>
> LLVM does not use the profile information fully in the back-end
> optimizations; for instance, inlining does not fully use the profile counts
> -- it only marks hot/cold function attributes based on function entry
> counts. To evaluate the impact of profile context sensitivity, GCC is used
> in the experiment. Note that GCC PGO improves clang performance a lot more
> than clang PGO.
>
> First we summarize the methodology used in the experiment:
> 0) build clang with GCC O2 without early inlining and measure clang's
> performance. GCC early inlining (einline) is similar to the pre-inline used by
> late instrumentation.
> 1) build clang with GCC O2 with early inlining and measure performance.
>
> The performance difference of 1) and 0) is denoted as E, which measures the
> contribution of early inlining.
>
> 2) build clang with GCC O2 + PGO without early inlining.
> 3) build clang with GCC O2 + PGO with early inlining.
>
> The performance difference of 3) and 2) is denoted as EC. It consists of
> roughly two parts: a) the early inlining contribution, and b) context-sensitive
> profiling enabled by early inlining.
>
> The contribution of context-sensitive profiling can be estimated by EC - E
> above.
>
> -------------------------------------------------------------------------------
> Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
> (0) base w/o einline                     84.946           1.000           0.934
> (1) base O2                              79.310           1.071           1.000
> (2) profile-arcs w/o einline             63.518           1.337           1.249
> (3) profile-arcs                         48.364           1.756           1.640
>
> We see the following:
> 1) GCC PGO with early inlining improves clang performance by 64.0% (vs.
> base O2 w/ early inline).
> 2) GCC PGO w/o early inlining improves clang performance by 33.7% (vs.
> base O2 w/o early inline).
> 3) The early inlining performance contribution is about 7.1%.
> 4) The profile context sensitivity contribution is estimated to be 23.2% (i.e.
> 64.0% - 33.7% - 7.1%), which is pretty significant.
>
> *(3) Pre-inline pass impact on value profiling*
>
> Again, we use GCC as the platform to estimate:
>
> --------------------------------------------------------
> Config                              wall_time_for_instr
> (2) profile-arcs                                115.720
> (3) profile-arcs w/o einline                    310.560
> (4) profile-generate                            139.952
> (5) profile-generate w/o einline                680.910
>
> In GCC, -fprofile-generate does -fprofile-arcs as well as value
> profiling. The above table shows that with value profiling, the impact of
> pre-inlining is even larger for instrumented binary performance. Without
> value profiling, disabling pre-inlining increases runtime by 1.7x, while
> with value profiling, its impact is a 3.9x increase in runtime.
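
[Editorial sketch, not part of the original message.] The derived figures in the three experiments above (ratios vs. base, speedups, and the context-sensitivity estimate) follow from the raw wall-clock times by simple arithmetic. The snippet below recomputes them so the bookkeeping is easy to follow. The dictionary keys and function names are made up for this illustration; only the timings and profile sizes come from the tables.

```python
# Wall-clock seconds and profile sizes copied from the tables in the message above.
INSTR = {  # instrumented clang compiling the 4.98M-line .ii file
    "base O2": 80.386,
    "FE-based Instr": 201.658,
    "late Instr": 103.662,
    "late Instr w/o pre-inline": 199.924,
    "late Instr w/o pre-inline + Silva": 119.904,
}
PROFILE_SIZE = {
    "late Instr": 14860144,
    "late Instr w/o pre-inline + Silva": 24499528,
}
USE = {  # clang built with GCC in the profile-use experiment
    "base w/o einline": 84.946,
    "base O2": 79.310,
    "profile-arcs w/o einline": 63.518,
    "profile-arcs": 48.364,
}
GCC_INSTR = {  # GCC-instrumented clang, with and without value profiling
    "profile-arcs": 115.720,
    "profile-arcs w/o einline": 310.560,
    "profile-generate": 139.952,
    "profile-generate w/o einline": 680.910,
}


def speedup_pct(baseline_secs, measured_secs):
    """Improvement of `measured` over `baseline`, in percent (time ratio minus one)."""
    return (baseline_secs / measured_secs - 1.0) * 100.0


def size_reduction_pct(smaller, larger):
    """How much smaller `smaller` is than `larger`, in percent."""
    return (1.0 - smaller / larger) * 100.0


def context_sensitivity_estimate(use):
    """Return (PGO w/ einline, PGO w/o einline, einline alone, remainder), in percent."""
    pgo_with = speedup_pct(use["base O2"], use["profile-arcs"])
    pgo_without = speedup_pct(use["base w/o einline"], use["profile-arcs w/o einline"])
    einline = speedup_pct(use["base w/o einline"], use["base O2"])
    return pgo_with, pgo_without, einline, pgo_with - pgo_without - einline


if __name__ == "__main__":
    base = INSTR["base O2"]
    for name, secs in INSTR.items():
        print(f"{name:36s} {secs / base * 100:6.1f}% of base")
    late = "late Instr"
    silva = "late Instr w/o pre-inline + Silva"
    print(f"late vs. heuristic: {speedup_pct(INSTR[silva], INSTR[late]):.1f}% faster")
    print(f"profile size: {size_reduction_pct(PROFILE_SIZE[late], PROFILE_SIZE[silva]):.1f}% smaller")
    print("PGO w/ einline %.1f%%, w/o einline %.1f%%, einline %.1f%%, remainder %.1f%%"
          % context_sensitivity_estimate(USE))
    for mode in ("profile-arcs", "profile-generate"):
        extra = GCC_INSTR[mode + " w/o einline"] / GCC_INSTR[mode] - 1.0
        print(f"{mode}: disabling einline adds {extra:.1f}x runtime")
```

Computed from the unrounded timings, the remainder attributed to context sensitivity comes out at roughly 23 percentage points.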
>
>
> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>>
>> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>>> One aspect of this that I have not seen discussed is that middle-end
>>> instrumentation enables PGO optimizations for front-ends other than Clang.
>>>
>>> While I agree that FE instrumentation could be improved, it still
>>> requires every FE to implement essentially the same common functionality.
>>> Having PGO instrumentation generated in the middle-end allows every FE
>>> to automatically take advantage of PGO.
>>>
>>
>> This is a really good point, and I agree with it. We may have gotten off
>> on the wrong foot since Rong's email focused so heavily on comparing with
>> the frontend instrumentation. As far as I see it, Rong's proposal has a
>> couple of different parts:
>>
>> 1. Infrastructure for IR-level instrumentation-based PGO
>> 2. Changes to the pass pipeline so that a hypothetical IR-level
>> instrumentation-based PGO is more effective
>> 3. MST algorithm with profile feedback for optimal placement of counter
>> updates.
>>
>> I think 1. is a no-brainer, if only so that all LLVM clients can benefit
>> from PGO, and also (as you pointed out below) so that it can have an
>> exclusive focus on performance. If it is sufficiently flexible, it may even
>> make sense to restrict clang's frontend instrumentation-based profiling to
>> non-performance stuff, and have clang directly interoperate with the
>> IR-level PGO for performance-related PGO use cases, just like any other
>> frontend would.
>>
>> Philip and Sanjoy, out of curiosity do you guys use your own
>> instrumentation placement for PGO? Is an IR-level PGO infrastructure
>> upstream something you guys would be interested in?
>>
>> I think that 2. is something that once we have 1. we will be able to
>> evaluate better, but for now my opinion is that we should be able to make
>> good progress without digging into that.
>>
>> I think that 3. is a no-brainer if it provides a really significant win,
>> but without 1. we can't really measure its effect in isolation. It also has
>> a usability problem since it requires feeding in an existing profile for
>> the *instrumented* build, but if the benefit is very significant this may
>> be worth it for some users. We will probably be able to easily refactor 1.
>> as needed into an MST approach that degrades gracefully to using static
>> heuristics in the absence of real profile information, so it is not a
>> maintenance burden (maybe it even helps by providing a good framework in
>> which to develop effective static heuristics).
>>
>> For the time being, I think we can avoid discussion of 2. and 3. until we
>> have more of 1. working. So I think it would be most productive if we focus
>> this discussion on 1.
>>
>>
>>> Additionally, some of the overhead imposed by FE instrumentation is not
>>> really all that easy to get rid of. You end up duplicating functionality
>>> that is more naturally implemented in the middle end.
>>>
>>
>> Yeah, I was looking into a couple of other simple approaches and quickly
>> found out that I was basically replicating much of the sort of logic that
>> the inliner already has.
>>
>> -- Sean Silva
>>
>>
>>>
>>> I see the two approaches as supplementary, rather than complementary.
>>> One does not negate the other. Some of the optimizations we'd do in the
>>> FE may hurt coverage. Instead, by instrumenting in the middle end, you
>>> can focus exclusively on performance (coverage be damned).
>>>
>>>
>>> Diego.
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org http://llvm.cs.uiuc.edu
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
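
[Editorial sketch, not part of the original thread.] Item 3 in Sean's list refers to MST-based placement of counter updates. The snippet below illustrates the general textbook idea only and is not taken from Rong's patch: pick a maximum-weight spanning tree of the control-flow graph, put counters only on the edges outside the tree, and recover the tree-edge counts afterwards from flow conservation (at every block, the counts flowing in equal the counts flowing out). The toy CFG, the edge weights, the virtual exit-to-entry edge and all function names are assumptions made for the example.

```python
def max_spanning_tree(num_nodes, edges):
    """Kruskal over (src, dst, weight) edges treated as undirected.

    Returns the set of edge indices in the tree. Heavier (hotter) edges are
    preferred, so the edges left out, which get counters, tend to be cold.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    for idx in sorted(range(len(edges)), key=lambda i: -edges[i][2]):
        a, b = find(edges[idx][0]), find(edges[idx][1])
        if a != b:
            parent[a] = b
            tree.add(idx)
    return tree


def recover_counts(num_nodes, edges, tree, measured):
    """Derive tree-edge counts from the measured (non-tree) edge counts.

    Repeatedly finds a node with exactly one unknown incident edge and solves
    "sum of incoming counts == sum of outgoing counts" for that edge.
    """
    counts = dict(measured)
    unknown = set(tree)
    while unknown:
        progressed = False
        for node in range(num_nodes):
            pending = [i for i in unknown if node in edges[i][:2]]
            if len(pending) != 1:
                continue
            i = pending[0]
            inflow = sum(c for j, c in counts.items() if edges[j][1] == node)
            outflow = sum(c for j, c in counts.items() if edges[j][0] == node)
            counts[i] = outflow - inflow if edges[i][1] == node else inflow - outflow
            unknown.discard(i)
            progressed = True
        if not progressed:
            raise ValueError("measured counts do not determine the tree edges")
    return counts


# Toy if/else diamond: 0 = entry, 1 = then, 2 = else, 3 = exit.
# Weights are made-up static frequency estimates. The last edge is a virtual
# exit -> entry edge, which makes flow conservation hold at every node.
EDGES = [(0, 1, 80), (0, 2, 20), (1, 3, 80), (2, 3, 20), (3, 0, 100)]

if __name__ == "__main__":
    tree = max_spanning_tree(4, EDGES)
    instrumented = sorted(set(range(len(EDGES))) - tree)
    print("edges that need a counter:", [EDGES[i][:2] for i in instrumented])
    # Pretend the instrumented run took "then" 7 times and "else" 3 times.
    truth = {0: 7, 1: 3, 2: 7, 3: 3, 4: 10}
    measured = {i: truth[i] for i in instrumented}
    print("recovered counts:", recover_counts(4, EDGES, tree, measured))
```

With five edges and four blocks, the spanning tree holds three edges, so only two counters are needed. The "profile feedback" variant discussed in the thread would, presumably, replace the static weights with counts from an earlier profile.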
Sean Silva via llvm-dev
2015-Sep-01 18:47 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Tue, Sep 1, 2015 at 11:03 AM, Rong Xu <xur at google.com> wrote:

> Justin, Sean and other people interested in this proposal,
>
> I'm wondering if you have had a chance to read the new experiment results in my
> last email, sent 2 weeks ago. Can you share your thoughts, or do you have other
> tests that you want to run?
>

See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation pass makes sense (you didn't need to provide any performance data to support this; there are plenty of good reasons), but there are a couple independent parts. Have you been able to work on splitting out any of them?

>
> I'm in the final stage of preparing the patch. If you are OK with it, I can send
> out the patch soon.
>

I'm not sure what you mean by "the" patch. It seems pretty clear that there are multiple sub-parts to this. Could you send an RFC for part 1 that I described? We especially need to discuss the interface for frontends e.g. clang command line interface, when a user passes a profile file how do we thread that information back to the middle-end, details for the runtime interoperation (things like function hash will have different meaning between IR-level and Clang instrumentation), etc.

-- Sean Silva
Xinliang David Li via llvm-dev
2015-Sep-01 18:57 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Tue, Sep 1, 2015 at 11:47 AM, Sean Silva via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > > On Tue, Sep 1, 2015 at 11:03 AM, Rong Xu <xur at google.com> wrote: > >> Justin, Sean and other people interested in this proposal, >> >> I'm wondering if you have chances to read the new experiment results in >> my last email sent 2 weeks ago. Can you share you thoughts, or you have >> other tests that you want to to run? >> > > See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation > pass makes sense (you didn't need to provide any performance data to > support this; there are plenty of good reasons), but there are a couple > independent parts. Have you been able to work on splitting out any of them? > > >> >> I'm in the final stage of preparing the patch. If you are OK, I can sent >> out the patch soon. >> > > I'm not sure what you mean by "the" patch. It seems pretty clear that > there are multiple sub-parts to this. Could you send an RFC for part 1 that > I described? We especially need to discuss the interface for frontends e.g. > clang command line interface, when a user passes a profile file how do we > thread that information back to the middle-end, details for the runtime > interoperation (things like function hash will have different meaning > between IR-level and Clang instrumentation), etc. > >Those are good suggestions! thanks, David> -- Sean Silva > > >> >> Thanks, >> >> -Rong >> >> On Wed, Aug 19, 2015 at 5:18 PM, Philip Reames <listmail at philipreames.com >> > wrote: >> >>> Thank you for sharing the data. I haven't been following the >>> discussion, but this data made for very interesting reading on it's own. >>> >>> Philip >>> >>> >>> On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote: >>> >>> We collected more data to address some of the questions from the >>> reviewers. Note this time we use clang itself as the benchmark. 
We choose >>> clang because we think it's a typical C++ program and the reviewers here >>> have good knowledge of the code base. >>> >>> What we measure is running time for clang to compile a large >>> preprocessed source file (4.98M lines of .ii file), using different >>> compilation modes. All the numbers reported here are the average running >>> time of 5 runs in seconds. >>> >>> *(1) Performance b/w late instrumentation v.s. not instrumenting single >>> BB functions* >>> >>> We first compare various instrumentation performance. >>> >>> ---------------------------------------------------------------------------- >>> Config wall_time_for_instr ratio_vs_base >>> profile_size >>> (1) base O2 80.386 100.0% -- >>> (2) FE-based Instr 201.658 250.8% >>> 65238880 >>> (3) late Instr 103.662 129.0% >>> 14860144 >>> (4) (3) + w/o pre-inline 199.924 248.7% >>> 70762720 >>> (5) (4) + Silva 119.904 149.2% >>> 24499528 >>> >>> Config(5) used the simple heuristic that Sean Silva proposed: not >>> instrumenting single BB functions that contain less than 10 instructions >>> (excluding debug and phi stmts). >>> >>> We can see: >>> 1) Simple heuristic of not instrumenting small single BB functions >>> improves instrumentation performance as expected. >>> 2) Using simple heuristic is still slower than late instrumentation with >>> pre-inlining: the later is 15% faster. >>> 3) Late instrumentation produces the smallest profile size: it's 39% >>> smaller than using the simple heuristic. >>> >>> The result is expected as pre-inlining can handle more cases than the >>> simple heuristic. There is significant performance gap between the simple >>> heuristic (5) and late instrumentation (2). >>> >>> We also used a few larger internal benchmarks to further validate the >>> above result. The following table shows the slowdown compared to the base >>> O2. The labels (2) to (5) refer to the same config as in the previous table. 
>>> ------------------------------------------------------ >>> Program (2) (3) (4) (5) >>> C++benchmark16 -45.24% -12.93% -43.84% -24.74% >>> C++benchmark17 -90.86% -58.19% -87.77% -80.62% >>> C++benchmark18 -95.32% -54.75% -91.21% -82.56% >>> >>> >>> We can see the same trend as the clang benchmark: the simple heuristic >>> (5) recovers a lot of performance loss compared with FE base >>> instrumentation, but is still significantly worse than late instrumentation >>> (3). >>> >>> *(2) Performance impact of context sensitivity* >>> >>> LLVM does not use the profile information fully in the back-end >>> optimizations, for instance, inlining does not fully use the profile counts >>> -- it only marks hot/cold function attribute based on function entry >>> counts. To evaluate the impact of profile context sensitivity, GCC is used >>> in the experiment. Note that GCC PGO improves clang performance a lot more >>> than clang PGO. >>> >>> First we summarize the methodology used in the experiment: >>> 0) build clang with GCC O2 without early inlining and measure clang's >>> performance. GCC early inlining (einline) is similar to pre-inline used by >>> late instrumentation. >>> 1) build clang with GCC O2 with early inlining and measure performance. >>> >>> The performance difference of 1) and 0) is denoted as E which measures >>> the contribution of early inlining. >>> >>> 2) build clang with GCC O2 + PGO without early inlining. >>> 3) build clang with GCC O2 + PGO with early inlining. >>> >>> The performance difference of 3) and 2) is denoted as EC. It constitutes >>> roughly two parts a) early inlining contribution b) context sensitive >>> profiling enabled with early inlining. >>> >>> The contribution of context sensitive profiling can be estimated by EC - >>> E above. 
>>> >>> ------------------------------------------------------------------------------- >>> Config wall_time_for_use speedup_vs_(0) >>> speedup_vs_(1) >>> (0) base w/o einline 84.946 1.000 0.934 >>> (1) base O2 79.310 1.071 1.000 >>> (2) profile-arcs w/o einline 63.518 1.337 1.249 >>> (3) profile-arcs 48.364 1.756 1.640 >>> >>> We see the following: >>> 1) GCC PGO with early inlining improves clang performance by 64.0% (v.s. >>> base O2 w/ early inline). >>> 2) GCC PGO w/o early inlining improves clang performance by 33.7% (v.s. >>> base O2 w/o early inline). >>> 3) Early inlining performance contribution is about 7.1%. >>> 4) Profile context sensitivity contribution is estimated to be 22.2% >>> (i.e. 64.0% -33.7% - 7.1%), which is pretty significant. >>> >>> *(3) Pre-inline pass impact on the value profiling* >>> >>> Again, we use GCC as the platform to estimate: >>> >>> -------------------------------------------------------- >>> Config wall_time for_instr >>> (2) profile-arcs 115.720 >>> (3) profile-arcs w/o einline 310.560 >>> (4) profile-generate 139.952 >>> (5) profile-generate w/o einline 680.910 >>> >>> In GCC, -fprofile-generate does -fprofile-arcs as well as the value >>> profiling. The above table shows that with value profile, the impact of >>> pre-inlining is even larger for instrumented binary performance. Without >>> value profiling, disabling pre-inlining increases runtime by 1.7x, while >>> with value profiling, its impact is 3.9x increase in runtime. >>> >>> >>> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>>> >>>> >>>> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev < >>>> llvm-dev at lists.llvm.org> wrote: >>>> >>>>> One aspect of this that I have not seen discussed is that middle-end >>>>> instrumentation enables PGO optimizations to front-ends other than Clang. 
>>>>> >>>>> While I agree that FE instrumentation could be improved, it still >>>>> requires every FE to implement essentially the same common functionality. >>>>> Having PGO instrumentation generated in the middle-end, allows us every FE >>>>> to automatically take advantage of PGO. >>>>> >>>> >>>> This is a really good point, and I agree with it. We may have gotten >>>> off on the wrong foot since Rong's email focused so heavily on comparing >>>> with the frontend instrumentation. As far as I see it, Rong's proposal has >>>> a couple different parts: >>>> >>>> 1. Infrastructure for IR-level instrumentation-based PGO >>>> 2. Changes to the pass pipeline so that a hypothetical IR-level >>>> instrumentation-based PGO is more effective >>>> 3. MST algorithm with profile feedback for optimal placement of counter >>>> updates. >>>> >>>> I think 1. is a no-brainer, if only so that all LLVM clients can >>>> benefit from PGO, and also (as you pointed out below) so that it can have >>>> an exclusive focus on performance. If it is sufficiently flexible, it may >>>> even make sense to restrict clang's frontend instrumentation-based >>>> profiling to non-performance stuff, and have clang directly interoperate >>>> with the IR-level PGO for performance-related PGO use cases, just like any >>>> other frontend would. >>>> >>>> Philip and Sanjoy, out of curiosity do you guys use your own >>>> instrumentation placement for PGO? Is an IR-level PGO infrastructure >>>> upstream something you guys would be interested in? >>>> >>>> I think that 2. is something that once we have 1. we will be able to >>>> evaluate better, but for now my opinion is that we should be able to make >>>> good progress without digging into that. >>>> >>>> I think that 3. is a no-brainer if it provides a really significant >>>> win, but without 1. we can't really measure its effect in isolation. 
It >>>> also has a usability problem since it requires feeding in an existing >>>> profile for the *instrumented* build, but if the benefit is very >>>> significant this may be worth it for some users. We will probably be able >>>> to easily refactor 1. as needed into an MST approach that degrades >>>> gracefully to using static heuristics in the absence of real profile >>>> information, so is not a maintenance burden (maybe even helps by providing >>>> a good framework in which to develop effective static heuristics). >>>> >>>> For the time being, I think we can avoid discussion of 2. and 3. until >>>> we have more of 1. working. So I think it would be most productive if we >>>> focus this discussion on 1. >>>> >>>> >>>>> Additionally, some of the overhead imposed by FE instrumentation is >>>>> not really all that easy to get rid of. You end up duplicating >>>>> functionality that is more naturally implemented in the middle end. >>>>> >>>> >>>> Yeah, I was looking into a couple of other simple approaches and >>>> quickly found out that I was basically replicating much of the sort of >>>> logic that the inliner already has. >>>> >>>> -- Sean Silva >>>> >>>> >>>>> >>>>> I see the two approaches as supplementary, rather than complementary. >>>>> One does not negate the other. Some of the optimizations we'd do in the >>>>> FE, may hurt coverage. Instead, by instrumenting in the middle end, you >>>>> can focus exclusively on performance (coverage be damned). >>>>> >>>>> >>>>> Diego. 
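To make item 3 in the quoted list more concrete, here is a minimal sketch of spanning-tree counter placement that falls back to static guesses when no profile is available, in the spirit of the "degrades gracefully" remark above. It is not taken from Rong's patch. The function names (`edge_weights`, `choose_instrumented_edges`), the toy CFG, and the static weights are all hypothetical, and the sketch assumes the CFG is closed by an artificial exit-to-entry edge. The idea is to keep the heaviest edges in a maximum spanning tree and place counters only on the remaining edges.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def edge_weights(edges: List[Edge],
                 profile: Optional[Dict[Edge, int]] = None,
                 static_guess: Optional[Dict[Edge, float]] = None) -> Dict[Edge, float]:
    """Use measured counts when a profile exists, else static guesses (default 1)."""
    if profile is not None:
        return {e: float(profile.get(e, 0)) for e in edges}
    static_guess = static_guess or {}
    return {e: float(static_guess.get(e, 1)) for e in edges}


def _find(parent: Dict[str, str], x: str) -> str:
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def choose_instrumented_edges(edges: List[Edge],
                              weights: Dict[Edge, float]) -> List[Edge]:
    """Kruskal-style maximum spanning tree over the (undirected) CFG.

    Edges that end up in the tree need no counter; every edge that would
    close a cycle is returned and gets a counter instead.
    """
    parent: Dict[str, str] = {}
    for u, v in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
    instrumented: List[Edge] = []
    # Heaviest edges first; sorted() is stable, so ties keep input order.
    for e in sorted(edges, key=lambda e: -weights[e]):
        ru, rv = _find(parent, e[0]), _find(parent, e[1])
        if ru == rv:
            instrumented.append(e)   # closes a cycle: needs a counter
        else:
            parent[ru] = rv          # joins two components: tree edge
    return instrumented


# Toy CFG: a loop (A .. D) containing a diamond, plus the artificial
# exit -> entry edge. Weights are made-up static guesses.
EDGES: List[Edge] = [("entry", "A"), ("A", "B"), ("A", "C"), ("B", "D"),
                     ("C", "D"), ("D", "A"), ("D", "exit"), ("exit", "entry")]
STATIC_GUESS: Dict[Edge, float] = {("A", "B"): 8, ("B", "D"): 8, ("A", "C"): 2,
                                   ("C", "D"): 2, ("D", "A"): 9}
COUNTERS = choose_instrumented_edges(EDGES, edge_weights(EDGES, static_guess=STATIC_GUESS))
```

With 6 blocks and 8 edges, the tree keeps 5 edges and only 3 edges carry counters. Passing a real `profile` instead of `static_guess` changes only the weights, not the algorithm.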
Rong Xu via llvm-dev
2015-Sep-01 21:21 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Tue, Sep 1, 2015 at 11:47 AM, Sean Silva <chisophugis at gmail.com> wrote:
> On Tue, Sep 1, 2015 at 11:03 AM, Rong Xu <xur at google.com> wrote:
>> Justin, Sean and other people interested in this proposal,
>>
>> I'm wondering if you have had a chance to read the new experiment results in my last email, sent 2 weeks ago. Can you share your thoughts, or do you have other tests that you want to run?
>
> See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation pass makes sense (you didn't need to provide any performance data to support this; there are plenty of good reasons), but there are a couple of independent parts. Have you been able to work on splitting out any of them?

I re-read your comments from Aug 11.

> As far as I see it, Rong's proposal has a couple different parts:
>
> 1. Infrastructure for IR-level instrumentation-based PGO
> 2. Changes to the pass pipeline so that a hypothetical IR-level instrumentation-based PGO is more effective
> 3. MST algorithm with profile feedback for optimal placement of counter updates.

In my implementation, the MST algorithm is the main component of 1. The only IR change is to insert instrprof_increment intrinsic calls (which will be lowered in createInstrProfilingPass).

I'm not quite sure about 3. Do you mean the MST algorithm itself, or using one profile to guide the MST algorithm to get the optimal placement? I do have the code for both, but the latter was just for experimental purposes. It is going to be hard to use in real applications (for example, the profile-use would also need the bootstrap profile to read the real profile).

>> I'm in the final stage of preparing the patch. If you are OK, I can send out the patch soon.
>
> I'm not sure what you mean by "the" patch. It seems pretty clear that there are multiple sub-parts to this. Could you send an RFC for part 1 that I described? We especially need to discuss the interface for frontends, e.g. clang command line interface, when a user passes a profile file how do we thread that information back to the middle-end, details for the runtime interoperation (things like function hash will have different meaning between IR-level and Clang instrumentation), etc.

I agree with your approach. When I said "the patch", I really meant 'a series of patches'.

Thanks for the suggestion.

-Rong

> -- Sean Silva
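Rong's remark that the profile-use side "would also need the bootstrap profile" follows from how spanning-tree placement is read back: only the non-tree edges carry counters, so the reader has to know exactly which edges those were before it can rebuild the rest. The sketch below shows that rebuilding step under stated assumptions. It is not the code from the patch; `recover_counts` and the toy CFG are hypothetical, the CFG is assumed to be closed by an artificial exit-to-entry edge, and every self-loop is assumed to be measured directly. Counts for unmeasured edges are derived from flow conservation (counts entering a block equal counts leaving it).

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def recover_counts(edges: List[Edge], measured: Dict[Edge, int]) -> Dict[Edge, int]:
    """Derive counts for unmeasured edges from the measured ones.

    Repeatedly pick a block with exactly one unknown incident edge and
    solve "sum of incoming counts == sum of outgoing counts" for it.
    When the unmeasured edges form a spanning tree, this always finishes.
    """
    counts = dict(measured)
    nodes = {n for e in edges for n in e}
    progress = True
    while len(counts) < len(edges) and progress:
        progress = False
        for n in nodes:
            unknown = [e for e in edges if n in e and e not in counts]
            if len(unknown) != 1:
                continue
            e = unknown[0]
            inflow = sum(counts[x] for x in edges if x[1] == n and x in counts)
            outflow = sum(counts[x] for x in edges if x[0] == n and x in counts)
            # Unknown incoming edge: known_in + x == known_out.
            # Unknown outgoing edge: known_in == known_out + x.
            counts[e] = outflow - inflow if e[1] == n else inflow - outflow
            progress = True
    if len(counts) < len(edges):
        raise ValueError("measured edges do not determine the remaining counts")
    return counts


# Toy CFG: a loop (A .. D) containing a diamond, plus the artificial
# exit -> entry edge. Only three edges were instrumented.
EDGES: List[Edge] = [("entry", "A"), ("A", "B"), ("A", "C"), ("B", "D"),
                     ("C", "D"), ("D", "A"), ("D", "exit"), ("exit", "entry")]
MEASURED: Dict[Edge, int] = {("B", "D"): 7, ("C", "D"): 3, ("exit", "entry"): 1}
ALL_COUNTS = recover_counts(EDGES, MEASURED)
```

If the instrumented build had chosen its tree from a bootstrap profile, the reader would need the same bootstrap profile to reproduce `MEASURED`'s key set, which is the usability cost discussed in the thread.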