thr3ads.net - llvm dev - [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density) [May 2015]

Lee Hunt

2015-May-27 03:47 UTC

[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)

Hello -

I'm an engineer in Microsoft Office looking into possible advantages
of using PGO for our Android applications.

We at Microsoft have deep experience with Visual C++'s Profile Guided
Optimization<https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx> and
often see 10% or more reduction in the size of application code loaded after
using PGO for key scenarios (e.g. application launch).  Making applications
launch quickly is very important to us, and reducing the number of code pages
loaded helps with this goal.

Before we dig into turning it on, I'm wondering if there's any
pre-existing research / case studies about possible code page reduction seen
from other Clang PGO-enabled applications?  It sounds like there are some
possible performance problems in instrumented runs due to counter contention,
resulting in sluggish performance and perhaps skewed profile data:
https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY.  I'd like an
overview of the optimizations that PGO does, but I don't find much from
looking at the Clang PGO section:
http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization.

For example, from reading different pages on how Clang PGO works, it's unclear
if it does "block reordering" (i.e. moving unexecuted code blocks to a
distant code page, leaving only 'hot' executed code packed together for
greater code density).  I find mention of "hot arc" optimization
(-fprofile-arcs), but I'm unclear if this is the same thing.  Does Clang
PGO do block reordering?

Thanks,
--Lee

Diego Novillo

2015-May-27 14:42 UTC


On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:
> For example, from reading different pages on how Clang PGO works, it’s
> unclear if it does “block reordering” (i.e. moving unexecuted code blocks to
> a distant code page, leaving only ‘hot’ executed code packed together for
> greater code density).  I find mention of “hot arc” optimization
> (-fprofile-arcs), but I’m unclear if this is the same thing.  Does Clang PGO
> do block reordering?

A small clarification.  Clang itself does not implement any
optimizations.  Clang limits itself to generating LLVM IR.  The
annotated IR is then used by some LLVM optimizers to guide decisions.
At this time, only a few optimization passes use the profile
information: block reordering and register allocation (to avoid
spilling on cold paths).

There are no other significant transformations that use profiling
information. We are working on that.  Notably, we'd like to add
profiling-based decisions to the inliner, loop optimizers and the
vectorizer.


Diego.
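
Editorial illustration: the idea behind profile-driven block reordering can be sketched in a few lines. The following is a minimal greedy sketch under stated assumptions, not LLVM's actual block placement algorithm. The function name `layout_blocks`, the CFG encoding, and the edge counts are all hypothetical. Starting from the entry block, it always falls through to the most frequently taken unplaced successor, so rarely executed blocks end up at the end of the function.

```python
def layout_blocks(entry, edges):
    """Return a block order that keeps the hottest path contiguous.

    edges maps each block name to {successor: observed edge count}.
    This is a toy greedy heuristic, not LLVM's implementation.
    """
    blocks = set(edges)
    for succs in edges.values():
        blocks.update(succs)

    # Approximate each block's execution count by its incoming edge counts.
    freq = {b: 0 for b in blocks}
    for succs in edges.values():
        for succ, count in succs.items():
            freq[succ] += count

    order, placed = [], set()
    current = entry
    while current is not None:
        order.append(current)
        placed.add(current)
        candidates = {s: c for s, c in edges.get(current, {}).items()
                      if s not in placed}
        if candidates:
            # Fall through to the hottest unplaced successor (ties by name).
            current = min(candidates, key=lambda s: (-candidates[s], s))
        else:
            # Chain ended: continue with the hottest block not yet placed.
            remaining = [b for b in blocks if b not in placed]
            current = (min(remaining, key=lambda b: (-freq[b], b))
                       if remaining else None)
    return order


# Hypothetical profile: the error path is taken once in 1000 runs.
cfg = {
    "entry": {"check": 1000},
    "check": {"error": 1, "work": 999},
    "work": {"exit": 999},
    "error": {"exit": 1},
}
order = layout_blocks("entry", cfg)
print(order)
```

In this made-up profile the rarely taken `error` block is placed last, after the hot `entry`, `check`, `work`, `exit` chain.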

Xinliang David Li

2015-May-27 16:29 UTC


On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:
> Hello –
>
> I’m an engineer in Microsoft Office looking into possible advantages
> of using PGO for our Android applications.
>
> We at Microsoft have deep experience with Visual C++’s Profile Guided
> Optimization <https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx>
> and often see 10% or more reduction in the size of application code loaded
> after using PGO for key scenarios (e.g. application launch).

Yes. This is true for GCC too.  Clang's PGO does not shrink code size
yet.

> Making applications launch quickly is very important to us, and reducing
> the number of code pages loaded helps with this goal.
>
> Before we dig into turning it on, I’m wondering if there’s any
> pre-existing research / case studies about possible code page reduction
> seen from other Clang PGO-enabled applications?  It sounds like there are
> some possible performance problems in instrumented runs due to counter
> contention, resulting in sluggish performance and perhaps skewed profile
> data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

Counter contention is one issue. Redundant counter updates are another major
issue (due to the early instrumentation). We are working on the latter and
see great speedups.


> I’d like an overview of the optimizations that PGO does, but I don’t find
> much from looking at the Clang PGO section:
> http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Profile data is not used in any IPA passes yet. It is used by the
post-inline optimizations, though, including block layout and the register
allocator.


> For example, from reading different pages on how Clang PGO works, it’s
> unclear if it does “block reordering” (i.e. moving unexecuted code blocks to
> a distant code page, leaving only ‘hot’ executed code packed together for
> greater code density).

LLVM's block placement uses branch probability and frequency data, but
there is no function splitting optimization yet.

> I find mention of “hot arc” optimization (-fprofile-arcs), but I’m unclear
> if this is the same thing.  Does Clang PGO do block reordering?

It does reordering, but does not do splitting/partitioning.

David
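
Editorial illustration: the difference between reordering blocks inside a function and splitting cold code out of it shows up in the number of code pages touched. The sketch below is a back-of-the-envelope model, not a measurement. The 4096-byte page size, the function sizes, and the helper name `pages_touched` are assumptions chosen for illustration only.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_touched(layout, hot):
    """Count distinct pages containing at least one executed region.

    layout is a list of (name, size_in_bytes) in address order;
    hot is the set of region names that are actually executed.
    """
    touched = set()
    offset = 0
    for name, size in layout:
        if name in hot and size > 0:
            first = offset // PAGE_SIZE
            last = (offset + size - 1) // PAGE_SIZE
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)


# Four hypothetical functions, each with 1 KiB of hot code and 3 KiB of cold code.
hot_parts = [(f"f{i}.hot", 1024) for i in range(4)]
cold_parts = [(f"f{i}.cold", 3072) for i in range(4)]
hot_names = {name for name, _ in hot_parts}

# Reordering only: hot code leads each function, cold code still follows it.
reordered = [part for pair in zip(hot_parts, cold_parts) for part in pair]
# Splitting/partitioning: all hot parts are packed together, cold parts moved away.
split = hot_parts + cold_parts

print(pages_touched(reordered, hot_names))
print(pages_touched(split, hot_names))
```

Under these assumptions the reordered layout still touches one page per function (4 pages), while the split layout packs all the hot code into a single page.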


>
>
> Thanks,
>
> --Lee
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Lee Hunt

2015-May-27 17:11 UTC


Thanks! CIL [LeeHu] for a few comments…

From: Xinliang David Li [mailto:xinliangli at gmail.com]
Sent: Wednesday, May 27, 2015 9:29 AM
To: Lee Hunt
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code
density)

On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:
Hello –

I’m an engineer in Microsoft Office looking into possible advantages of
using PGO for our Android applications.

We at Microsoft have deep experience with Visual C++’s Profile Guided
Optimization <https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx>
and often see 10% or more reduction in the size of application code loaded after
using PGO for key scenarios (e.g. application launch).

Yes. This is true for GCC too.  Clang's PGO does not shrink code size
yet.

[LeeHu] Note: I’m not talking about shrinking code size, but rather reordering
it so that only the ‘active’ branches within the profiled functions are grouped
together in ‘hot’ code pages.  This is a very big optimization for us in the
VC++ toolchain’s PGO.
We also have the “/LTCG” flag, which is seemingly similar to Clang’s “-flto”
flag, that *does* shrink code by various means (dead code removal, common IL
tree collapsing) because it can see all the object code for an entire produced
target binary (e.g. .exe or .dll).
Does -flto also shrink code?

Making applications launch quickly is very important to us, and reducing the
number of code pages loaded helps with this goal.

Before we dig into turning it on, I’m wondering if there’s any pre-existing
research / case studies about possible code page reduction seen from other Clang
PGO-enabled applications?  It sounds like there are some possible performance
problems in instrumented runs due to counter contention, resulting in sluggish
performance and perhaps skewed profile data:
https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

Counter contention is one issue. Redundant counter updates are another major
issue (due to the early instrumentation). We are working on the latter and see
great speedups.

I’d like an overview of the optimizations that PGO does, but I don’t find much
from looking at the Clang PGO section:
http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Profile data is not used in any IPA passes yet. It is used by the post-inline
optimizations, though, including block layout and the register allocator.

[LeeHu]: Sorry for the naïve question, but what is IPA?  And which post-inline
optimizations are currently being done?  We’re currently using Clang 3.5, if
that matters.

For example, from reading different pages on how Clang PGO works, it’s unclear
if it does “block reordering” (i.e. moving unexecuted code blocks to a distant
code page, leaving only ‘hot’ executed code packed together for greater code
density).

LLVM's block placement uses branch probability and frequency data, but there
is no function splitting optimization yet.

 I find mention of “hot arc” optimization (-fprofile-arcs) , but I’m unclear if
this is the same thing.  Does Clang PGO do block reordering?

It does reordering, but does not do splitting/partitioning.

David

Thanks,
--Lee


Duncan P. N. Exon Smith

2015-May-27 18:13 UTC


> On 2015 May 27, at 07:42, Diego Novillo <dnovillo at google.com> wrote:
>
> On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <leehu at exchange.microsoft.com> wrote:
>
>> For example, from reading different pages on how Clang PGO works, it’s
>> unclear if it does “block reordering” (i.e. moving unexecuted code blocks
>> to a distant code page, leaving only ‘hot’ executed code packed together
>> for greater code density).  I find mention of “hot arc” optimization
>> (-fprofile-arcs), but I’m unclear if this is the same thing.  Does Clang
>> PGO do block reordering?
>
> A small clarification.  Clang itself does not implement any
> optimizations.  Clang limits itself to generating LLVM IR.  The
> annotated IR is then used by some LLVM optimizers to guide decisions.
> At this time, only a few optimization passes use the profile
> information: block reordering and register allocation (to avoid
> spilling on cold paths).
>
> There are no other significant transformations that use profiling
> information. We are working on that.  Notably, we'd like to add
> profiling-based decisions to the inliner

Just a quick note about the inliner.  Although the inliner itself
doesn't know how to use the profile, Clang's IRGen has been modified
to add the 'inlinehint' attribute to hot functions and the 'cold'
attribute to cold functions.  So PGO does affect the inliner,
indirectly.  (We'll remove this once the inliner does the right thing on
its own.)

> , loop optimizers and the
> vectorizer.
> 
> 
> Diego.
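
Editorial illustration: the attribute tagging described above can be pictured as a simple classification over per-function entry counts. This is a hedged sketch only. The function name `classify_functions`, the sample counts, and both threshold fractions are assumptions for illustration and should not be read as the values Clang actually uses.

```python
def classify_functions(entry_counts, hot_fraction=0.3, cold_fraction=0.01):
    """Tag functions as 'inlinehint', 'cold', or None from profile counts.

    entry_counts maps function name to how often it was entered.
    Both fractions are illustrative assumptions, relative to the
    count of the most frequently entered function.
    """
    max_count = max(entry_counts.values(), default=0)
    attrs = {}
    for name, count in entry_counts.items():
        if max_count and count >= hot_fraction * max_count:
            attrs[name] = "inlinehint"  # hot: nudge the inliner towards inlining
        elif count <= cold_fraction * max_count:
            attrs[name] = "cold"        # cold: discourage inlining
        else:
            attrs[name] = None          # neither: leave the function untagged
    return attrs


# Hypothetical profile for four functions.
profile = {"parse": 10000, "render": 4000, "helper": 500, "log_error": 5}
attrs = classify_functions(profile)
print(attrs)
```

With these made-up numbers, `parse` and `render` get `inlinehint`, `log_error` gets `cold`, and `helper` is left alone, so the inliner sees the profile only through the attributes.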
> 
