Randy Chapman
2015-May-27 20:31 UTC
[LLVMdev] FW: Capabilities of Clang's PGO (e.g. improving code density)
David,

Yes, that is very helpful. Thanks!

--randy

From: Xinliang David Li [mailto:xinliangli at gmail.com]
Sent: Wednesday, May 27, 2015 12:53 PM
To: Randy Chapman
Cc: Lee Hunt; llvmdev at cs.uiuc.edu
Subject: Re: FW: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)

On Wed, May 27, 2015 at 12:40 PM, Randy Chapman <randyc at microsoft.com> wrote:

> Hi David!
>
> Thanks again for your help! I was wondering if you could clarify one thing for me?
>
> > > I find mention of “hot arc” optimization (-fprofile-arcs), but I’m unclear if this is the same thing. Does Clang PGO do block reordering?
> >
> > It does reordering, but does not do splitting/partitioning.
>
> I take this to mean that PGO does block reordering within the function? I don’t see that the clang driver passes anything to the linker to drive function ordering at the linker level as well. Is there something there that I missed, or are you aware of any readily available tools to do so? If not, we’ve done some work locally on enabling that, which we will continue.

Ok. There are three reordering-related optimizations:

1) intra-procedural basic block reordering, to reduce branch cost, icache misses and front-end stalls.
2) function splitting/partitioning: splitting the rarely executed part of a function's code into unlikely.text sections.
3) function reordering based on affinity and hotness: reordering functions by the linker/plugin (guided by compiler annotations).

Clang currently only does 1).

Hope this clarifies.

thanks,
David

> Thanks ☺
> --randy

From: Xinliang David Li [mailto:xinliangli at gmail.com]
Sent: Wednesday, May 27, 2015 10:21 AM
To: Lee Hunt
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)

On Wed, May 27, 2015 at 10:11 AM, Lee Hunt <leehu at exchange.microsoft.com> wrote:

> Thanks! CIL [LeeHu] for a few comments…
>
> From: Xinliang David Li [mailto:xinliangli at gmail.com]
> Sent: Wednesday, May 27, 2015 9:29 AM
> To: Lee Hunt
> Cc: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)
>
> On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:
>
> Hello,
>
> I’m an engineer in Microsoft Office looking into possible advantages of using PGO for our Android applications.
>
> We at Microsoft have deep experience with Visual C++’s Profile Guided Optimization <https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx> and often see a 10% or greater reduction in the size of application code loaded after using PGO for key scenarios (e.g. application launch).

yes. This is true for GCC too. Clang's PGO does not shrink code size yet.

> [LeeHu] Note: I’m not talking about shrinking code size, but rather reordering it such that only ‘active’ branches within the profiled functions are grouped together in ‘hot’ code pages. This is a very big optimization for us in the VC++ toolchain's PGO.
>
> We also have the “/LTCG” flag, which is seemingly similar to the “-flto” Clang flag, that *does* shrink code by various means (dead code removal, common IL tree collapsing) because it can see all the object code for an entire produced target binary (e.g. .exe or .dll).
>
> Does -flto also shrink code?

That depends on other options used (e.g., -Os). With LTO, the compiler sees a larger scope and performs cross-module inlining and dead function elimination. It does have more opportunities to shrink code.

> Making application launch quick is very important to us, and reducing the number of code pages loaded helps with this goal.
>
> Before we dig into turning it on, I’m wondering if there’s any pre-existing research / case studies about possible code page reduction seen from other Clang PGO-enabled applications? It sounds like there are some possible instrumented-run performance problems due to counter contention, resulting in sluggish performance and perhaps skewed profile data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

Counter contention is one issue. Redundant counter updates are another major issue (due to the early instrumentation). We are working on the latter and see great speedups.

> I’d like an overview of the optimizations that PGO does, but I don’t find much from looking at the Clang PGO section: http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Profile data is not used in any IPA passes yet. It is used by post-inline optimizations though, including block layout, register allocation, etc.

> [LeeHu]: sorry for the naïve question, but what is IPA?

Inter-procedural analysis/optimizations.

> And what post-inline optimizations are currently being done? We’re currently using Clang 3.5 if that matters.
>
> For example, from reading different pages on how Clang PGO works, it’s unclear if it does “block reordering” (i.e. moving unexecuted code blocks to a distant code page, leaving only ‘hot’ executed code packed together for greater code density).

LLVM's block placement uses branch probability and frequency data, but there is no function splitting optimization yet.

> I find mention of “hot arc” optimization (-fprofile-arcs), but I’m unclear if this is the same thing. Does Clang PGO do block reordering?

It does reordering, but does not do splitting/partitioning.

David

> Thanks,
> --Lee

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
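Editor's illustration of optimization 1) from the list above (intra-procedural basic block reordering). This is only a minimal greedy sketch, not LLVM's actual block placement algorithm: it follows the hottest not-yet-placed successor so that hot edges become fall-throughs. The function names, the block names and the edge counts in `EDGES` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical profile: (source block, destination block) -> execution count.
EDGES = {
    ("entry", "error"): 10,
    ("entry", "hot"): 990,
    ("hot", "exit"): 990,
    ("error", "exit"): 10,
}


def layout_blocks(entry, edges):
    """Greedy layout: keep extending a chain through the hottest successor."""
    succs = defaultdict(list)
    for (src, dst), count in edges.items():
        succs[src].append((count, dst))

    order, placed = [], set()
    deferred = [(0, entry)]  # (edge count, block) pairs that may start a chain
    while deferred:
        # Start the next chain at the hottest deferred block.
        deferred.sort(key=lambda item: item[0], reverse=True)
        _, block = deferred.pop(0)
        while block is not None and block not in placed:
            placed.add(block)
            order.append(block)
            candidates = sorted(
                (item for item in succs[block] if item[1] not in placed),
                key=lambda item: item[0],
                reverse=True,
            )
            # The hottest successor becomes the fall-through; the rest wait.
            block = candidates[0][1] if candidates else None
            deferred.extend(candidates[1:])
    return order


def fallthrough_count(order, edges):
    """Sum the counts of edges whose destination directly follows the source."""
    position = {block: index for index, block in enumerate(order)}
    return sum(
        count
        for (src, dst), count in edges.items()
        if position[dst] == position[src] + 1
    )
```

Comparing `fallthrough_count(layout_blocks("entry", EDGES), EDGES)` with the same count for a source-order layout such as `["entry", "error", "hot", "exit"]` shows how much of the profiled control flow avoids a taken branch. Note that the cold `error` block still stays inside the function; moving it to a separate section would be optimization 2), which the thread says Clang did not do at the time.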
Lee Hunt
2015-May-27 23:56 UTC
[LLVMdev] FW: Capabilities of Clang's PGO (e.g. improving code density)
Yes, thanks David!

For the intra-procedural Basic Block Reordering, do you have any data as to how much improvement that gives speed-wise for any perf tests you’ve measured?

I’m thinking this may speed things up for things like application launch by a couple %.

For perf intensive code (e.g. spreadsheet recalc), I would expect it would be more.
Xinliang David Li
2015-May-28 06:08 UTC
[LLVMdev] FW: Capabilities of Clang's PGO (e.g. improving code density)
On Wed, May 27, 2015 at 4:56 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:

> Yes, thanks David!
>
> For the intra-procedural Basic Block Reordering, do you have any data as to how much improvement that gives speed-wise for any perf tests you’ve measured?

Yes. Most of the benchmarks we have see improvement with better layout; some improvements are small and some are large. Of course this also depends on the layout algorithm, which we are working on improving too.

> I’m thinking this may speed things up for things like application launch by a couple %.

Function reordering may be more important for this, and it needs a call-trace profile. Trace-based layout will reduce the number of page faults during program start.

David

> For perf intensive code (e.g. spreadsheet recalc), I would expect it would be more.
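Editor's illustration of the point about function reordering and page faults at program start. This is a rough sketch under stated assumptions, not a description of any linker's behavior: functions are assumed to be laid out contiguously in the given order, the page size is assumed to be 4096 bytes, and the function names, sizes and call trace are invented for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical function sizes in bytes, listed in original link order.
SIZES = {
    "main": 1024,
    "help": 4096,
    "init": 1024,
    "export": 4096,
    "draw": 1024,
    "print": 4096,
    "load": 1024,
}

# Hypothetical call trace recorded during application launch.
TRACE = ["main", "init", "draw", "init", "load"]


def startup_order(sizes, trace):
    """Place functions in first-call order, then the rest in original order."""
    seen, order = set(), []
    for name in trace:
        if name not in seen:
            seen.add(name)
            order.append(name)
    order.extend(name for name in sizes if name not in seen)
    return order


def pages_touched(order, sizes, used, page_size=PAGE_SIZE):
    """Count distinct code pages holding any byte of a function in `used`."""
    touched, offset = set(), 0
    for name in order:
        size = sizes[name]
        if name in used and size > 0:
            first = offset // page_size
            last = (offset + size - 1) // page_size
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)
```

With this made-up data, `pages_touched(list(SIZES), SIZES, set(TRACE))` counts the pages the launch path touches in the original order, and passing `startup_order(SIZES, TRACE)` as the order instead gives the count after trace-based reordering. A real tool would also have to account for alignment, section boundaries and affinity between functions, which this sketch ignores.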