Lee Hunt
2015-May-27 03:47 UTC
[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)
Hello,

I'm an engineer in Microsoft Office looking into the possible advantages of using PGO for our Android applications.

We at Microsoft have deep experience with Visual C++'s Profile Guided Optimization <https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx> and often see a 10% or greater reduction in the size of application code loaded after using PGO for key scenarios (e.g. application launch). Making applications launch quickly is very important to us, and reducing the number of code pages loaded helps with this goal.

Before we dig into turning it on, I'm wondering if there is any pre-existing research or case studies about the code page reduction seen in other Clang PGO-enabled applications? It sounds like there are some possible performance problems in instrumented runs due to counter contention, resulting in sluggish performance and perhaps skewed profile data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

I'd like an overview of the optimizations that PGO does, but I can't find much in the Clang PGO section: http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

For example, from reading different pages on Clang PGO, it's unclear whether it does "block reordering" (i.e. moving unexecuted code blocks to a distant code page, leaving only 'hot' executed code packed together for greater code density). I find mention of "hot arc" optimization (-fprofile-arcs), but I'm unclear whether this is the same thing. Does Clang PGO do block reordering?

Thanks,
--Lee
Diego Novillo
2015-May-27 14:42 UTC
[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)
On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:

> For example, from reading different pages on Clang PGO, it’s unclear if
> it does “block reordering” (i.e. moving unexecuted code blocks to a distant
> code page, leaving only ‘hot’ executed code packed together for greater code
> density). I find mention of “hot arc” optimization (-fprofile-arcs), but
> I’m unclear if this is the same thing. Does Clang PGO do block reordering?

A small clarification. Clang itself does not implement any optimizations. Clang limits itself to generating LLVM IR. The annotated IR is then used by some LLVM optimizers to guide decisions. At this time, only a few optimization passes use the profile information: block reordering and register allocation (to avoid spilling on cold paths).

There are no other significant transformations that use profiling information. We are working on that. Notably, we'd like to add profiling-based decisions to the inliner, loop optimizers and the vectorizer.

Diego.
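To make "block reordering" concrete, here is a minimal sketch of the kind of code it acts on. It is an illustration only: the function `sum_or_error` and its checker are hypothetical, and `__builtin_expect` (a GCC/Clang builtin) stands in for the branch weights that a real profile would supply. The assumption is that the optimizer may move the unlikely block out of the fall-through sequence while keeping it inside the same function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the thread: sums a buffer and treats a
// null pointer as a rare error. __builtin_expect marks the error branch
// as unlikely, which is the same kind of branch-weight information an
// instrumented profile would provide.
std::int64_t sum_or_error(const std::int32_t *data, std::size_t n) {
    if (__builtin_expect(data == nullptr, 0)) {
        // Cold path: block placement is free to move this block away
        // from the hot loop, but it remains part of this function.
        return -1;
    }
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];  // hot path
    }
    return total;
}

// Small self-check that exercises both paths.
void check_sum_or_error() {
    const std::int32_t values[] = {1, 2, 3};
    assert(sum_or_error(values, 3) == 6);
    assert(sum_or_error(nullptr, 0) == -1);
}
```

The annotation changes only the expected layout, not the result: both paths return the same values with or without the hint.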
Xinliang David Li
2015-May-27 16:29 UTC
[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)
On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:

> Hello,
>
> I’m an Engineer in Microsoft Office looking into possible advantages
> of using PGO for our Android Applications.
>
> We at Microsoft have deep experience with Visual C++’s Profile Guided
> Optimization <https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx>
> and often see 10% or more reduction in the size of application code loaded
> after using PGO for key scenarios (e.g. application launch).

Yes. This is true for GCC too. Clang's PGO does not shrink code size yet.

> Making application launch quickly is very important to us, and reducing
> the number of code pages loaded helps with this goal.
>
> Before we dig into turning it on, I’m wondering if there’s any
> pre-existing research / case studies about possible code page reduction
> seen from other Clang PGO-enabled applications? It sounds like there is
> some possible instrumented run performance problems due to counter
> contention resulting in sluggish performance and perhaps skewed profile
> data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

Counter contention is one issue. Redundant counter updates are another major issue (due to the early instrumentation). We are working on the latter and see great speedups.

> I’d like an overview of the optimizations that PGO does, but I don’t find
> much from looking at the Clang PGO section:
> http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Profile data is not used in any IPA passes yet. It is used by the post-inline optimizations, though, including block layout, the register allocator, etc.

> For example, from reading different pages on Clang PGO, it’s unclear
> if it does “block reordering” (i.e. moving unexecuted code blocks to a
> distant code page, leaving only ‘hot’ executed code packed together for
> greater code density).

LLVM's block placement uses branch probability and frequency data, but there is no function splitting optimization yet.

> I find mention of “hot arc” optimization (-fprofile-arcs), but I’m
> unclear if this is the same thing. Does Clang PGO do block reordering?

It does reordering, but does not do splitting/partitioning.

David

> Thanks,
>
> --Lee
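For readers new to instrumentation overhead, the following is a conceptual model only, written by hand and not what Clang actually emits. The names `g_counters` and `classify` are hypothetical. Each instrumented region bumps a slot in a shared array; because every thread would write the same slots, a multithreaded run contends on them. The last counter also shows one simple kind of redundancy, a count that can be derived from two others. Whether that is the redundancy meant in the message above is an assumption; the thread does not spell it out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Conceptual stand-in for profile counters; layout is hypothetical.
// [0] = function entry, [1] = 'then' branch, [2] = fall-through path.
static std::array<std::uint64_t, 3> g_counters{};

int classify(int x) {
    ++g_counters[0];      // function entry
    if (x < 0) {
        ++g_counters[1];  // 'then' branch
        return -1;
    }
    // Derivable count: always equals entry minus 'then', so it could
    // be computed afterwards instead of being stored on every call.
    ++g_counters[2];
    return 1;
}

// Single-threaded self-check. With several threads calling classify(),
// these plain increments would contend on (and race for) the same slots.
void check_classify() {
    g_counters.fill(0);
    classify(-5);
    classify(3);
    classify(7);
    assert(g_counters[0] == 3);
    assert(g_counters[1] == 1);
    assert(g_counters[2] == g_counters[0] - g_counters[1]);
}
```

Dropping the third counter and computing it from the other two would remove one memory update from the common path without losing any information.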
Lee Hunt
2015-May-27 17:11 UTC
[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)
Thanks! Comments inline, marked [LeeHu].

From: Xinliang David Li [mailto:xinliangli at gmail.com]
Sent: Wednesday, May 27, 2015 9:29 AM
To: Lee Hunt
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)

On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:

>> We at Microsoft have deep experience with Visual C++’s Profile Guided
>> Optimization <https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx>
>> and often see 10% or more reduction in the size of application code loaded
>> after using PGO for key scenarios (e.g. application launch).
>
> Yes. This is true for GCC too. Clang's PGO does not shrink code size yet.

[LeeHu] Note: I’m not talking about shrinking code size, but rather reordering it so that only the ‘active’ branches within the profiled functions are grouped together in ‘hot’ code pages. This is a very big optimization for us with PGO in the VC++ toolchain.

We also have the “/LTCG” flag (seemingly similar to Clang's “-flto” flag), which *does* shrink code by various means (dead code removal, common IL tree collapsing) because it can see all the object code for an entire produced target binary (e.g. .exe or .dll). Does -flto also shrink code?

>> Before we dig into turning it on, I’m wondering if there’s any
>> pre-existing research / case studies about possible code page reduction
>> seen from other Clang PGO-enabled applications? It sounds like there is
>> some possible instrumented run performance problems due to counter
>> contention resulting in sluggish performance and perhaps skewed profile
>> data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY
>
> Counter contention is one issue. Redundant counter updates are another
> major issue (due to the early instrumentation). We are working on the
> latter and see great speedups.

>> I’d like an overview of the optimizations that PGO does, but I don’t find
>> much from looking at the Clang PGO section:
>> http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
>
> Profile data is not used in any IPA passes yet. It is used by the
> post-inline optimizations, though, including block layout, the register
> allocator, etc.

[LeeHu] Sorry for the naïve question, but what is IPA? And which post-inline optimizations are currently being done? We’re currently using Clang 3.5, if that matters.

>> For example, from reading different pages on Clang PGO, it’s unclear
>> if it does “block reordering” (i.e. moving unexecuted code blocks to a
>> distant code page, leaving only ‘hot’ executed code packed together for
>> greater code density).
>
> LLVM's block placement uses branch probability and frequency data, but
> there is no function splitting optimization yet.

>> I find mention of “hot arc” optimization (-fprofile-arcs), but I’m
>> unclear if this is the same thing. Does Clang PGO do block reordering?
>
> It does reordering, but does not do splitting/partitioning.
>
> David
Duncan P. N. Exon Smith
2015-May-27 18:13 UTC
[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)
> On 2015 May 27, at 07:42, Diego Novillo <dnovillo at google.com> wrote:
>
> On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:
>
>> For example, from reading different pages on how Clang PGO, it’s unclear if
>> it does “block reordering” (i.e. moving unexecuted code blocks to a distant
>> code page, leaving only ‘hot’ executed code packed together for greater code
>> density). I find mention of “hot arc” optimization (-fprofile-arcs) , but
>> I’m unclear if this is the same thing. Does Clang PGO do block reordering?
>
> A small clarification. Clang itself does not implement any
> optimizations. Clang limits itself to generate LLVM IR. The
> annotated IR is then used by some LLVM optimizers to guide decisions.
> At this time, there are few optimization passes that use the profile
> information: block reordering and register allocation (to avoid
> spilling on cold paths).
>
> There are no other significant transformations that use profiling
> information. We are working on that. Notably, we'd like to add
> profiling-based decisions to the inliner

Just a quick note about the inliner. Although the inliner itself doesn't know how to use the profile, clang's IRGen has been modified to add an 'inlinehint' attribute to hot functions and the 'cold' attribute to cold functions. Indirectly, PGO does affect the inliner. (We'll remove this once the inliner does the right thing on its own.)

> , loop optimizers and the
> vectorizer.
>
> Diego.
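As a rough source-level analogue of those two attributes, the sketch below annotates functions by hand. All function names are hypothetical. `__attribute__((cold))` is a GCC/Clang extension; the `inline` keyword is, to my understanding, what Clang lowers to 'inlinehint' in the absence of a profile, but treat that mapping as an assumption. With PGO the attributes would be derived from measured counts rather than written by the programmer.

```cpp
#include <cassert>
#include <cstdio>

// Hand-written hint that this function is rarely executed.
__attribute__((cold)) int report_failure(int code) {
    std::fprintf(stderr, "failure %d\n", code);
    return -code;
}

// 'inline' is only a hint; the optimizer still decides whether the
// call is actually inlined.
inline int scale(int x) { return x * 2; }

int process(int x) {
    if (x < 0) {
        return report_failure(-x);  // rarely taken in this sketch
    }
    return scale(x);                // frequent path
}

// Self-check covering both the frequent and the rare path.
void check_process() {
    assert(process(21) == 42);
    assert(process(-3) == -3);
}
```

Neither annotation changes what the program computes; they only bias inlining and layout decisions, which is why they can stand in for profile data until the inliner consumes the profile directly.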