Hello David, thanks for the detailed response!

Do you have any tests that you use to measure PGO effectiveness? I tested clang 6.0 with the same sample that Jie Chen used in 2016, and both frontend-based and IR-based PGO actually make the code run slower. Average times:

clang++ -O3:                        3.15 sec
clang++ -O3 -fprofile-instr-use:    3.160 sec
clang++ -O3 -fprofile-use:          3.180 sec
g++ (7.3.0) -O3:                    3.640 sec
g++ (7.3.0) -O3 -fprofile-use:      2.92 sec

Do you have any idea what could be wrong? Are there any recommendations on when one should use PGO with clang and when it is better not to?

Thanks!

On 02/05/2018 09:38 AM, Xinliang David Li wrote:
> On Sun, Feb 4, 2018 at 9:59 PM, Victor Leschuk
> <vleschuk at accesssoftek.com> wrote:
>
> > Hello David!
> >
> > I have recently started getting acquainted with PGO in LLVM/clang
> > and found your e-mail thread:
> > http://lists.llvm.org/pipermail/llvm-dev/2016-May/099395.html
> > There you posted a nice list of optimizations that use profiling
> > and of those that could use it but don't. However, that thread is
> > about 2 years old. Could you please let me know whether there have
> > been any significant changes in this area since then?
>
> Yes, there have been quite a few changes since then. Here are some of
> the new features:
>
> * LLVM IR based PGO -- this is designed to maximize program
>   performance. The option to turn it on is
>   -fprofile-generate/-fprofile-use
> * value profiling support in PGO -- currently supports indirect call
>   target profiling and memcpy/memset size profiling and optimizations
> * Profile data is made available for the inliner to use (enabled only
>   for the new pass manager: -fexperimental-new-pass-manager)
> * Profile aware LICM is available -- implemented via a profile driven
>   code sinking pass
> * Partial inlining is made profile aware; Graham Yu also added
>   support for multiple region function outlining (with PGO)
> * BB layout heuristics are tuned with PGO
> * hotness driven function layout optimization
>
> There is pending work in the following areas:
> * profile aware loop vectorization, etc
> * control height reduction optimization (Hiroshi is working on this)
>
> ThinLTO also works well with PGO.
>
> Hope this helps.
>
> David
>
> > What I can tell you is that there are many missing ones (that can
> > benefit from profile): such as profile aware LICM (patch pending),
> > speculative PRE, loop unrolling, loop peeling, auto vectorization,
> > inlining, function splitting, function layout, function outlining,
> > profile driven size optimization, induction variable
> > optimization/strength reduction, stringOp
> > specialization/optimization/inlining, switch peeling/lowering etc.
> > The biggest profile users today include ralloc, BB layout, ifcvt,
> > shrinkwrapping etc, but there should be room for improvement there
> > too.
>
> > Thanks in advance!
> >
> > --
> > Best Regards,
> >
> > Victor Leschuk | Software Engineer | Access Softek

--
Best Regards,

Victor Leschuk | Software Engineer | Access Softek
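For reference, the average times reported at the top of this message can be expressed as relative changes against each compiler's plain -O3 build. The snippet below is only a small arithmetic sketch; the dictionary layout and the helper name `relative_change` are illustrative and not part of any tool mentioned in the thread.

```python
# Average run times in seconds, copied from the message above.
timings = {
    "clang++": {
        "-O3": 3.15,
        "-O3 -fprofile-instr-use": 3.160,
        "-O3 -fprofile-use": 3.180,
    },
    "g++ (7.3.0)": {
        "-O3": 3.640,
        "-O3 -fprofile-use": 2.92,
    },
}


def relative_change(baseline, measured):
    """Percentage change of `measured` relative to `baseline`.

    Positive means slower than the baseline, negative means faster.
    """
    return (measured - baseline) / baseline * 100.0


for compiler, runs in timings.items():
    base = runs["-O3"]
    for flags, seconds in runs.items():
        if flags != "-O3":
            print(f"{compiler} {flags}: {relative_change(base, seconds):+.2f}%")
```

On these numbers the two clang PGO builds come out under 1% slower than plain -O3, while the g++ PGO build is roughly 20% faster than its own -O3 baseline.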
Victor, thanks for the experiment.

My suspicion is that this is due to the remaining issues with block layout, especially with loop rotation (with PGO). Another problem is that tail duplication does not happen after loop rotation, which can limit the effectiveness of loop rotation.

I tried the internal option -mllvm -force-precise-rotation-cost and saw about a 10% speedup with -fprofile-use. This option turns on a more precise cost model when computing the rotation strategy, but it is not enabled by default.

+carrot who is working on this area.

thanks,

David

On Tue, Feb 6, 2018 at 1:37 PM, Victor Leschuk <vleschuk at accesssoftek.com> wrote:
> Hello David, thanks for detailed response!
>
> Do you have any tests that you use to measure the PGO effectiveness? I
> have tested clang version 6.0 with the same sample that Jie Chen used in
> 2016 and actually both frontend-based PGO and IR-based make code run
> slower, see the average time:
>
> clang++ -O3: 3.15 sec
> clang++ -O3 and -fprofile-instr-use: 3.160 sec
> clang++ -O3 and -fprofile-use: 3.180 sec
> g++ (7.3.0) -O3: 3.640 sec
> g++ (7.3.0) -O3 and -fprofile-use: 2.92 sec
>
> Do you have any idea what can be wrong? Maybe there are some
> recommendations in which cases one should use PGO with clang and when it is
> better not to do it?
>
> Thanks!
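For anyone trying to repeat the experiment discussed in this thread, the sketch below assembles the command lines for one instrument / run / merge / rebuild cycle. It only builds argument lists and does not execute anything. The file names (`sample.cpp`, `profiles`, `fe.profraw`, `merged.profdata`, `sample.instr`, `sample.pgo`) and the helper `pgo_commands` are hypothetical; the flags are the ones named in the thread plus the matching `-fprofile-instr-generate` flag and the `llvm-profdata merge` step, and this is not necessarily how the numbers above were produced.

```python
def pgo_commands(source, mode="ir", extra_use_flags=()):
    """Return argv lists for an instrument / run / merge / rebuild cycle.

    mode="ir"       -> -fprofile-generate / -fprofile-use
    mode="frontend" -> -fprofile-instr-generate / -fprofile-instr-use
    All file and directory names are placeholders chosen for illustration.
    """
    if mode == "ir":
        generate = "-fprofile-generate=profiles"  # raw profiles land in ./profiles
        raw_input = "profiles"
        use = "-fprofile-use=merged.profdata"
    elif mode == "frontend":
        generate = "-fprofile-instr-generate=fe.profraw"
        raw_input = "fe.profraw"
        use = "-fprofile-instr-use=merged.profdata"
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return [
        # 1. Build an instrumented binary.
        ["clang++", "-O3", generate, source, "-o", "sample.instr"],
        # 2. Run it on a representative workload to collect raw profiles.
        ["./sample.instr"],
        # 3. Merge the raw profiles into an indexed profile.
        ["llvm-profdata", "merge", "-output=merged.profdata", raw_input],
        # 4. Rebuild using the profile, plus any extra flags under test.
        ["clang++", "-O3", use, *extra_use_flags, source, "-o", "sample.pgo"],
    ]


# Example: IR-based PGO with the internal option mentioned above.
for argv in pgo_commands(
    "sample.cpp",
    mode="ir",
    extra_use_flags=("-mllvm", "-force-precise-rotation-cost"),
):
    print(" ".join(argv))
```

Keeping the extra flags as a separate parameter makes it straightforward to build the same source with and without the rotation-cost option and compare the two binaries.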
David, could you please clarify on which code you saw the 10% improvement? I have run numerous tests with and without this option, and it appears to have no effect on performance (to be concrete, I am talking about the old 2016 sample).

Maybe we could investigate it together? Just tell me where to start.

On 02/07/2018 02:11 AM, Xinliang David Li wrote:
> Victor, thanks for the experiment.
>
> My suspicion is it is due to the remaining issues with block layout --
> especially with loop rotation (with PGO). Another problem is that tail
> dup is not happening after loop rotation which can limit the
> effectiveness of loop rotation.
>
> I tried the internal option -mllvm -force-precise-rotation-cost and
> there is about 10% speedup with -fprofile-use. This option turns on
> more precise cost model when computing rotation strategy but it is not
> turned on by default.
>
> +carrot who is working on this area.
>
> thanks,
>
> David

--
Best Regards,

Victor Leschuk | Software Engineer | Access Softek