John Brawn via llvm-dev
2018-Aug-14  17:17 UTC
[llvm-dev] [RFC] Delaying phi-to-select transformation until later in the pass pipeline
Summary
======
I'm planning on adjusting SimplifyCFG so that it doesn't turn two-entry
phi
nodes into selects until later in the pass pipeline, to give passes which can
understand phis but not selects more opportunity to optimize. The thing I'm
trying to do which made me think of doing this is described below, but from the
benchmarking I've done it looks like this is overall a good idea regardless
of
if I manage to get that done or not.
Motivation
=========
My goal is to get clang to optimize some code containing a call to
std::min_element which is dereferenced, so something like:
  float min_element_example(float *data, int size)
  {
    return *std::min_element(data, data+size);
  }
which, after inlining a specialization, looks like:
  float min_element_example_inlined(float *first, float *last)
  {
    for (float *p = first; p != last; ++p)
    {
      if (*p < *first)
        first = p;
    }
    return *first;
  }
There are two loads in the loop, *p and *first, but actually the load *p can be
eliminated by using either the previous load *p or the previous *first,
depending on if the if-condition was taken or not. However the
"if (*p < *first) first = p" gets turned by simplifycfg into a
select and this
makes optimizing this a lot harder because you no longer have distinct paths
through the CFG.
I have some ideas on how to do the optimization (see my previous RFC
"Making GVN
able to visit the same block more than once" posted in April, though
I've
decided that the specific idea presented there isn't the right way to do
it),
but I think the first step is to make sure we don't have a select when we
try
to optimise this.
Approach
=======
I've posted a patch to https://reviews.llvm.org/D50723 showing what I'm
intending to do. An extra parameter is added to SimplifyCFG to control whether
two-entry phi nodes are converted into select, and this is set to false in all
instances before the end of module simplification. At the end of module
simplification we do SimplifyCFG, then Instcombine to optimise the selects that
are introduced, then EarlyCSE to eliminate common subexpressions introduced by
instcombine.
Benchmark Results
================
These are performance differences reported by LNT when running llvm-test-suite,
spec2000, and spec2006 at -O3 with and without the patch linked above (using
trunk llvm from a week or so ago).
AArch64 results on ARM Cortex-A72:
Performance Regressions - execution_time                              Change
SingleSource/Benchmarks/Shootout/Shootout-ary3                         9.48%
MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt                    3.79%
SingleSource/Benchmarks/CoyoteBench/huffbench                          1.40%
Performance Improvements - execution_time                             Change
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl              -23.74%
External/SPEC/CINT2000/256.bzip2/256.bzip2                            -9.82%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt               -9.57%
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt       -4.38%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -3.94%
MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl                   -3.44%
External/SPEC/CFP2006/453.povray/453.povray                           -2.50%
SingleSource/Benchmarks/Adobe-C++/stepanov_vector                     -1.49%
X86_64 results on Intel Xeon E5-2690:
Performance Regressions - execution_time           Change
MultiSource/Benchmarks/Ptrdist/yacr2/yacr2          5.62%
Performance Improvements - execution_time          Change
SingleSource/Benchmarks/Misc-C++/Large/sphereflake -4.43%
External/SPEC/CINT2006/456.hmmer/456.hmmer         -2.50%
External/SPEC/CINT2006/464.h264ref/464.h264ref     -1.60%
MultiSource/Benchmarks/nbench/nbench               -1.19%
SingleSource/Benchmarks/Adobe-C++/functionobjects  -1.07%
I had a brief look at the regressions and they all look to be caused by
getting bad luck with branch mispredictions: I looked into the Shootout-ary3 and
yacr2 cases and in both the hot code path was the same with and without the
patch, but with more mispredictions probably caused by changes elsewhere.
John
Krzysztof Parzyszek via llvm-dev
2018-Aug-14  17:45 UTC
[llvm-dev] [RFC] Delaying phi-to-select transformation until later in the pass pipeline
I think it would be good to have the CFG simplification options provided by TTI, including the limit on phi->select transformation. We have some code on Hexagon that would benefit not only from converting 1 phi to select (as is the limit now), but 4 of them. -Krzysztof On 8/14/2018 12:17 PM, John Brawn via llvm-dev wrote:> Summary > ======> > I'm planning on adjusting SimplifyCFG so that it doesn't turn two-entry phi > nodes into selects until later in the pass pipeline, to give passes which can > understand phis but not selects more opportunity to optimize. The thing I'm > trying to do which made me think of doing this is described below, but from the > benchmarking I've done it looks like this is overall a good idea regardless of > if I manage to get that done or not. > > Motivation > =========> > My goal is to get clang to optimize some code containing a call to > std::min_element which is dereferenced, so something like: > > float min_element_example(float *data, int size) > { > return *std::min_element(data, data+size); > } > > which, after inlining a specialization, looks like: > > float min_element_example_inlined(float *first, float *last) > { > for (float *p = first; p != last; ++p) > { > if (*p < *first) > first = p; > } > return *first; > } > > There are two loads in the loop, *p and *first, but actually the load *p can be > eliminated by using either the previous load *p or the previous *first, > depending on if the if-condition was taken or not. However the > "if (*p < *first) first = p" gets turned by simplifycfg into a select and this > makes optimizing this a lot harder because you no longer have distinct paths > through the CFG. > > I have some ideas on how to do the optimization (see my previous RFC "Making GVN > able to visit the same block more than once" posted in April, though I've > decided that the specific idea presented there isn't the right way to do it), > but I think the first step is to make sure we don't have a select when we try > to optimise this. > > Approach > =======> > I've posted a patch to https://reviews.llvm.org/D50723 showing what I'm > intending to do. An extra parameter is added to SimplifyCFG to control whether > two-entry phi nodes are converted into select, and this is set to false in all > instances before the end of module simplification. At the end of module > simplification we do SimplifyCFG, then Instcombine to optimise the selects that > are introduced, then EarlyCSE to eliminate common subexpressions introduced by > instcombine. > > Benchmark Results > ================> > These are performance differences reported by LNT when running llvm-test-suite, > spec2000, and spec2006 at -O3 with and without the patch linked above (using > trunk llvm from a week or so ago). > > AArch64 results on ARM Cortex-A72: > > Performance Regressions - execution_time Change > SingleSource/Benchmarks/Shootout/Shootout-ary3 9.48% > MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt 3.79% > SingleSource/Benchmarks/CoyoteBench/huffbench 1.40% > > Performance Improvements - execution_time Change > MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl -23.74% > External/SPEC/CINT2000/256.bzip2/256.bzip2 -9.82% > MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt -9.57% > MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt -4.38% > MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -3.94% > MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl -3.44% > External/SPEC/CFP2006/453.povray/453.povray -2.50% > SingleSource/Benchmarks/Adobe-C++/stepanov_vector -1.49% > > X86_64 results on Intel Xeon E5-2690: > > Performance Regressions - execution_time Change > MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 5.62% > > Performance Improvements - execution_time Change > SingleSource/Benchmarks/Misc-C++/Large/sphereflake -4.43% > External/SPEC/CINT2006/456.hmmer/456.hmmer -2.50% > External/SPEC/CINT2006/464.h264ref/464.h264ref -1.60% > MultiSource/Benchmarks/nbench/nbench -1.19% > SingleSource/Benchmarks/Adobe-C++/functionobjects -1.07% > > I had a brief look at the regressions and they all look to be caused by > getting bad luck with branch mispredictions: I looked into the Shootout-ary3 and > yacr2 cases and in both the hot code path was the same with and without the > patch, but with more mispredictions probably caused by changes elsewhere. > > John > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Sanjay Patel via llvm-dev
2018-Aug-14  19:39 UTC
[llvm-dev] [RFC] Delaying phi-to-select transformation until later in the pass pipeline
I didn't look closely at the new patch, but this appears to be a small extension to: https://reviews.llvm.org/D38566 ...and the GVN-based reasons for delaying transformation to 'select' are discussed in detail in the motivating bug for that patch: https://bugs.llvm.org/show_bug.cgi?id=34603 So this sounds like the right direction to me. Note that there was objection to the implementation (a pile of pass options vs. uniquely named passes). Here's another motivating bug where early transform to select prevents optimization: https://bugs.llvm.org/show_bug.cgi?id=36760 Is that case affected by this patch? On Tue, Aug 14, 2018 at 11:17 AM, John Brawn via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Summary > ======> > I'm planning on adjusting SimplifyCFG so that it doesn't turn two-entry phi > nodes into selects until later in the pass pipeline, to give passes which > can > understand phis but not selects more opportunity to optimize. The thing I'm > trying to do which made me think of doing this is described below, but > from the > benchmarking I've done it looks like this is overall a good idea > regardless of > if I manage to get that done or not. > > Motivation > =========> > My goal is to get clang to optimize some code containing a call to > std::min_element which is dereferenced, so something like: > > float min_element_example(float *data, int size) > { > return *std::min_element(data, data+size); > } > > which, after inlining a specialization, looks like: > > float min_element_example_inlined(float *first, float *last) > { > for (float *p = first; p != last; ++p) > { > if (*p < *first) > first = p; > } > return *first; > } > > There are two loads in the loop, *p and *first, but actually the load *p > can be > eliminated by using either the previous load *p or the previous *first, > depending on if the if-condition was taken or not. However the > "if (*p < *first) first = p" gets turned by simplifycfg into a select and > this > makes optimizing this a lot harder because you no longer have distinct > paths > through the CFG. > > I have some ideas on how to do the optimization (see my previous RFC > "Making GVN > able to visit the same block more than once" posted in April, though I've > decided that the specific idea presented there isn't the right way to do > it), > but I think the first step is to make sure we don't have a select when we > try > to optimise this. > > Approach > =======> > I've posted a patch to https://reviews.llvm.org/D50723 showing what I'm > intending to do. An extra parameter is added to SimplifyCFG to control > whether > two-entry phi nodes are converted into select, and this is set to false in > all > instances before the end of module simplification. At the end of module > simplification we do SimplifyCFG, then Instcombine to optimise the selects > that > are introduced, then EarlyCSE to eliminate common subexpressions > introduced by > instcombine. > > Benchmark Results > ================> > These are performance differences reported by LNT when running > llvm-test-suite, > spec2000, and spec2006 at -O3 with and without the patch linked above > (using > trunk llvm from a week or so ago). > > AArch64 results on ARM Cortex-A72: > > Performance Regressions - execution_time > Change > SingleSource/Benchmarks/Shootout/Shootout-ary3 > 9.48% > MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt > 3.79% > SingleSource/Benchmarks/CoyoteBench/huffbench > 1.40% > > Performance Improvements - execution_time > Change > MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl > -23.74% > External/SPEC/CINT2000/256.bzip2/256.bzip2 > -9.82% > MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt > -9.57% > MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt > -4.38% > MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt > -3.94% > MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl > -3.44% > External/SPEC/CFP2006/453.povray/453.povray > -2.50% > SingleSource/Benchmarks/Adobe-C++/stepanov_vector > -1.49% > > X86_64 results on Intel Xeon E5-2690: > > Performance Regressions - execution_time Change > MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 5.62% > > Performance Improvements - execution_time Change > SingleSource/Benchmarks/Misc-C++/Large/sphereflake -4.43% > External/SPEC/CINT2006/456.hmmer/456.hmmer -2.50% > External/SPEC/CINT2006/464.h264ref/464.h264ref -1.60% > MultiSource/Benchmarks/nbench/nbench -1.19% > SingleSource/Benchmarks/Adobe-C++/functionobjects -1.07% > > I had a brief look at the regressions and they all look to be caused by > getting bad luck with branch mispredictions: I looked into the > Shootout-ary3 and > yacr2 cases and in both the hot code path was the same with and without the > patch, but with more mispredictions probably caused by changes elsewhere. > > John > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180814/7cbb7212/attachment.html>
Philip Reames via llvm-dev
2018-Aug-15  19:38 UTC
[llvm-dev] [RFC] Delaying phi-to-select transformation until later in the pass pipeline
I'm concerned that we're focusing on one side of this.  Let me point out
a few concerns w/changing the canonical form here:
 1. LICM does not know how to hoist or sink regions.  It does know how
    to hoist and sink selects.
 2. InstCombine has limited support for triangles/diamonds, but fairly
    extensive support for selects.
 3. EarlyCSE and GVN do not know how to eliminate fully redundant
    triangles/diamonds.  PRE *may* get some of these cases, but that is
    by no means guaranteed or likely to be robust.
My personal opinion is that selects are the appropriate canonical form.
For the one of the specific cases mentioned, teaching GVN to do FRE and 
PRE for loads from selects of pointers just doesn't seem that painful.  
I would be really tempted to do that instead. Similarly, walking facts 
back from select uses in CVP seems generally useful since we have use 
dependent facts in a bunch of contexts, not just selects.  (Call 
arguments for instance, non-null from unconditional deref, etc..)
To be clear, I am raising concerns, not actively objecting to this.  If 
you want to move forward and commit to work through all of the issues 
identified I wont actively stand in the way.
Philip
On 08/14/2018 12:39 PM, Sanjay Patel via llvm-dev wrote:> I didn't look closely at the new patch, but this appears to be a small 
> extension to:
> https://reviews.llvm.org/D38566
> ...and the GVN-based reasons for delaying transformation to
'select'
> are discussed in detail in the motivating bug for that patch:
> https://bugs.llvm.org/show_bug.cgi?id=34603
>
> So this sounds like the right direction to me. Note that there was 
> objection to the implementation (a pile of pass options vs. uniquely 
> named passes).
>
> Here's another motivating bug where early transform to select prevents 
> optimization:
> https://bugs.llvm.org/show_bug.cgi?id=36760
>
> Is that case affected by this patch?
>
>
> On Tue, Aug 14, 2018 at 11:17 AM, John Brawn via llvm-dev 
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>
>     Summary
>     ======>
>     I'm planning on adjusting SimplifyCFG so that it doesn't turn
>     two-entry phi
>     nodes into selects until later in the pass pipeline, to give
>     passes which can
>     understand phis but not selects more opportunity to optimize. The
>     thing I'm
>     trying to do which made me think of doing this is described below,
>     but from the
>     benchmarking I've done it looks like this is overall a good idea
>     regardless of
>     if I manage to get that done or not.
>
>     Motivation
>     =========>
>     My goal is to get clang to optimize some code containing a call to
>     std::min_element which is dereferenced, so something like:
>
>       float min_element_example(float *data, int size)
>       {
>         return *std::min_element(data, data+size);
>       }
>
>     which, after inlining a specialization, looks like:
>
>       float min_element_example_inlined(float *first, float *last)
>       {
>         for (float *p = first; p != last; ++p)
>         {
>           if (*p < *first)
>             first = p;
>         }
>         return *first;
>       }
>
>     There are two loads in the loop, *p and *first, but actually the
>     load *p can be
>     eliminated by using either the previous load *p or the previous
>     *first,
>     depending on if the if-condition was taken or not. However the
>     "if (*p < *first) first = p" gets turned by simplifycfg
into a
>     select and this
>     makes optimizing this a lot harder because you no longer have
>     distinct paths
>     through the CFG.
>
>     I have some ideas on how to do the optimization (see my previous
>     RFC "Making GVN
>     able to visit the same block more than once" posted in April,
>     though I've
>     decided that the specific idea presented there isn't the right way
>     to do it),
>     but I think the first step is to make sure we don't have a select
>     when we try
>     to optimise this.
>
>     Approach
>     =======>
>     I've posted a patch to https://reviews.llvm.org/D50723
>     <https://reviews.llvm.org/D50723> showing what I'm
>     intending to do. An extra parameter is added to SimplifyCFG to
>     control whether
>     two-entry phi nodes are converted into select, and this is set to
>     false in all
>     instances before the end of module simplification. At the end of
>     module
>     simplification we do SimplifyCFG, then Instcombine to optimise the
>     selects that
>     are introduced, then EarlyCSE to eliminate common subexpressions
>     introduced by
>     instcombine.
>
>     Benchmark Results
>     ================>
>     These are performance differences reported by LNT when running
>     llvm-test-suite,
>     spec2000, and spec2006 at -O3 with and without the patch linked
>     above (using
>     trunk llvm from a week or so ago).
>
>     AArch64 results on ARM Cortex-A72:
>
>     Performance Regressions - execution_time           Change
>     SingleSource/Benchmarks/Shootout/Shootout-ary3                9.48%
>     MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt              
>       3.79%
>     SingleSource/Benchmarks/CoyoteBench/huffbench                 1.40%
>
>     Performance Improvements - execution_time          Change
>     MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl          
>       -23.74%
>     External/SPEC/CINT2000/256.bzip2/256.bzip2               -9.82%
>     MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt          
>        -9.57%
>     MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt  
>        -4.38%
>     MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt
>     -3.94%
>     MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl              
>      -3.44%
>     External/SPEC/CFP2006/453.povray/453.povray                -2.50%
>     SingleSource/Benchmarks/Adobe-C++/stepanov_vector              
>      -1.49%
>
>     X86_64 results on Intel Xeon E5-2690:
>
>     Performance Regressions - execution_time           Change
>     MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 5.62%
>
>     Performance Improvements - execution_time          Change
>     SingleSource/Benchmarks/Misc-C++/Large/sphereflake -4.43%
>     External/SPEC/CINT2006/456.hmmer/456.hmmer  -2.50%
>     External/SPEC/CINT2006/464.h264ref/464.h264ref  -1.60%
>     MultiSource/Benchmarks/nbench/nbench  -1.19%
>     SingleSource/Benchmarks/Adobe-C++/functionobjects -1.07%
>
>     I had a brief look at the regressions and they all look to be
>     caused by
>     getting bad luck with branch mispredictions: I looked into the
>     Shootout-ary3 and
>     yacr2 cases and in both the hot code path was the same with and
>     without the
>     patch, but with more mispredictions probably caused by changes
>     elsewhere.
>
>     John
>
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180815/c7885755/attachment.html>
Apparently Analagous Threads
- [RFC] Delaying phi-to-select transformation until later in the pass pipeline
- [RFC] Delaying phi-to-select transformation until later in the pass pipeline
- [LLVMdev] [RFC] AArch64: Should we disable GlobalMerge?
- [LLVMdev] Enabling the SLP vectorizer by default for -O3
- [LLVMdev] IR Passes and TargetTransformInfo: Straw Man