Arnold Schwaighofer
2014-Jan-28  01:22 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
In r200270 I added support to unroll conditional stores in the loop vectorizer.
It is currently off pending further benchmarking and can be enabled with
"-mllvm -vectorize-num-stores-pred=1".
Furthermore, I added a heuristic to unroll until load/store ports are saturated
("-mllvm -enable-loadstore-runtime-unroll") instead of the pure size-based
heuristic.
With those two, plus a patch that slightly changes the register heuristic (below),
libquantum’s three hot loops will unroll and goodness will ensue (at least for
libquantum).
commit 6b908b8b1084c97238cc642a3404a4285c21286f
Author: Arnold Schwaighofer <aschwaighofer at apple.com>
Date:   Mon Jan 27 13:21:55 2014 -0800
    Subtract one for loop induction variable. It is unlikely to be unrolled.
diff --git a/lib/Transforms/Vectorize/LoopVectorize.cpp b/lib/Transforms/Vectorize/LoopVectorize.cpp
index 7867495..978c5a1 100644
--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5142,8 +5142,8 @@ LoopVectorizationCostModel::selectUnrollFactor(bool OptForSize,
   // fit without causing spills. All of this is rounded down if necessary to be
   // a power of two. We want power of two unroll factors to simplify any
   // addressing operations or alignment considerations.
-  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
-                              R.MaxLocalUsers);
+  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /
+                              (R.MaxLocalUsers - 1));
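To make the effect of the one-line change above concrete, here is a minimal standalone sketch of the two formulas. It is not the LLVM code: `powerOf2Floor`, `unrollFactorOld` and `unrollFactorNew` are hypothetical helpers written for illustration, the register counts in the usage note are made-up inputs, and the `std::max(1u, ...)` guard against dividing by zero is an assumption added here that does not appear in the diff.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= x (0 for x == 0). Hypothetical stand-in
// for the PowerOf2Floor helper used in the diff.
unsigned powerOf2Floor(unsigned x) {
  if (x == 0)
    return 0;
  unsigned p = 1;
  while (p <= x / 2) // written this way so p * 2 can never overflow
    p *= 2;
  return p;
}

// Formula before the patch: every local user, including the induction
// variable, is assumed to need its own register per unrolled copy.
unsigned unrollFactorOld(unsigned targetNumRegisters,
                         unsigned loopInvariantRegs,
                         unsigned maxLocalUsers) {
  assert(maxLocalUsers >= 1 && targetNumRegisters >= loopInvariantRegs);
  return powerOf2Floor((targetNumRegisters - loopInvariantRegs) /
                       maxLocalUsers);
}

// Formula after the patch: one register is set aside for the induction
// variable, which is then not counted among the values to replicate.
unsigned unrollFactorNew(unsigned targetNumRegisters,
                         unsigned loopInvariantRegs,
                         unsigned maxLocalUsers) {
  assert(maxLocalUsers >= 1 && targetNumRegisters > loopInvariantRegs);
  // Assumption: clamp to 1 so a loop with a single local user does not
  // divide by zero. The diff itself has no such guard.
  unsigned users = std::max(1u, maxLocalUsers - 1);
  return powerOf2Floor((targetNumRegisters - loopInvariantRegs - 1) / users);
}
```

With the made-up inputs of 16 target registers, 1 loop-invariant register and 4 local users, the old formula gives PowerOf2Floor(15 / 4) = 2 while the new one gives PowerOf2Floor(14 / 3) = 4, so a small loop can cross the next power-of-two threshold.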
On Jan 21, 2014, at 11:46 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> 
> On Jan 21, 2014, at 6:18 AM, Diego Novillo <dnovillo at google.com> wrote:
> 
>> On 16/01/2014, 23:47 , Andrew Trick wrote:
>>> 
>>> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>>> 
>>>> Chandler also pointed me at the vectorizer, which has its own
>>>> unroller. However, the vectorizer only unrolls enough to serve the
>>>> target, it's not as general as the runtime-triggered unroller. From
>>>> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
>>>> avx targets). Additionally, the vectorizer only unrolls to aid
>>>> reduction variables. When I forced the vectorizer to unroll these
>>>> loops, the performance effects were nil.
>>> 
>>> Vectorization and partial unrolling (aka runtime unrolling) for ILP
>>> should be the same pass. The profitability analysis required in each
>>> case is very closely related, and you never want to do one before or
>>> after the other. The analysis makes sense even for targets without
>>> vector units. The “vector unroller” has an extra restriction (unlike the
>>> LoopUnroll pass) in that it must be able to interleave operations across
>>> iterations. This is usually a good thing to check before unrolling, but
>>> the compiler’s dependence analysis may be too conservative in some cases.
>> 
>> In addition to tuning the cost model, I found that the vectorizer does
>> not even choose to get that far into its analysis for some loops that I
>> need unrolled. In this particular case, there are three loops that need
>> to be unrolled to get the performance I'm looking for. Of the three, only
>> one gets far enough in the analysis to decide whether we unroll it or not.
>> 
> 
> I assume the other two loops are quantum_cnot's
> <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00054>
> and quantum_toffoli's
> <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00082>.
> 
> The problem for the unroller in the loop vectorizer is that it wants to
> if-convert those loops. The conditional store prevents if-conversion
> because we can’t introduce a store on a path where there was none before:
> <http://llvm.org/docs/Atomics.html#optimization-outside-atomic>.
> 
> for (…)
>  if (A[i] & mask)
>    A[i] = val
> 
> If we wanted the unroller in the vectorizer to handle such loops we would
> have to teach it to leave the store behind an if:
> 
> 
> for (…)
>  if (A[i] & mask)
>    A[i] = val
> 
> =>
> 
> for ( … i+=2) {
>   pred<0,1> = A[i:i+1] & mask<0, 1>
>   val<0,1> = ...
>   if (pred<0>)
>       A[i]   = val<0>
>   if (pred<1>)
>       A[i+1] = val<1>
> }
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
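The quoted transformation above can be sketched as ordinary C++ to show what "leaving the store behind an if" means after unrolling by two. This is an illustration under stated assumptions, not the vectorizer's output: the function names are hypothetical, `val` is taken as a parameter because the message elides how it is computed, and the scalar remainder loop for odd trip counts is added here because the quoted pseudocode does not show one.

```cpp
#include <cstddef>
#include <vector>

// Original loop: a store happens only on iterations where the predicate
// holds, so an element that fails the test is never written.
void condStoreScalar(std::vector<unsigned> &A, unsigned mask, unsigned val) {
  for (std::size_t i = 0; i < A.size(); ++i)
    if (A[i] & mask)
      A[i] = val;
}

// Unrolled by two: both predicates are computed up front (this is the part
// that could become a vector operation), but each store stays guarded by
// its own predicate, so no store appears on a path that had none before.
void condStoreUnrolled2(std::vector<unsigned> &A, unsigned mask,
                        unsigned val) {
  const std::size_t n = A.size();
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    const bool pred0 = (A[i] & mask) != 0;
    const bool pred1 = (A[i + 1] & mask) != 0;
    if (pred0)
      A[i] = val;
    if (pred1)
      A[i + 1] = val;
  }
  // Remainder iteration for odd lengths (an assumption of this sketch).
  for (; i < n; ++i)
    if (A[i] & mask)
      A[i] = val;
}
```

A full if-conversion would instead write something like `A[i] = pred ? val : A[i]` on every iteration, which introduces a store where the original program had none; that is the step the Atomics document linked above rules out.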
Chandler Carruth
2014-Jan-31  21:28 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
Hey Arnold,

I've completed some pretty thorough benchmarking and wanted to share the
results.

On Mon, Jan 27, 2014 at 5:22 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> Furthermore, I added a heuristic to unroll until load/store ports are
> saturated “-mllvm -enable-loadstore-runtime-unroll” instead of the pure size
> based heuristic.
>
> Those two together with a patch that slightly changes the register
> heuristic and libquantum’s three hot loops will unroll and goodness will
> ensue (at least for libquantum).

Both enabling loadstore runtime unrolling and the register heuristic
(enabled with -enable-ind-var-reg-heur) show no interesting regressions
(way below the noise) and a few nice benefits across all of the
applications I measure. I'd support enabling them right away and getting
more feedback from others. I've measured on both westmere and sandybridge,
with -march=x86-64 and -march=corei7-avx.

I don't have any ARM hardware to benchmark with, but I suspect you have
decent numbers there? We also have a nice LNT bot that will measure
anything we enable for ARM.

Finally, I've got some experimental results for x86 that show some
improvements and no significant regressions when I increase several target
thresholds. I'll start a new thread about that though.
Chandler Carruth
2014-Feb-01  12:02 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Fri, Jan 31, 2014 at 1:28 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Hey Arnold,
>
> I've completed some pretty thorough benchmarking and wanted to share the
> results.
>
> On Mon, Jan 27, 2014 at 5:22 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>> Furthermore, I added a heuristic to unroll until load/store ports are
>> saturated “-mllvm -enable-loadstore-runtime-unroll” instead of the pure size
>> based heuristic.
>>
>> Those two together with a patch that slightly changes the register
>> heuristic and libquantum’s three hot loops will unroll and goodness will
>> ensue (at least for libquantum).
>
> Both enabling loadstore runtime unrolling and the register heuristic
> (enabled with -enable-ind-var-reg-heur) show no interesting regressions
> (way below the noise) and a few nice benefits across all of the
> applications I measure. I'd support enabling them right away and getting
> more feedback from others. I've measured on both westmere and sandybridge,
> with -march=x86-64 and -march=corei7-avx.

I've now also measured -vectorize-num-stores-pred={1,2,4} both with and
without -enable-cond-stores-vec. There are some crashers when using these
currently. I may get a chance to reduce them soon, but I may not. However,
enough built and ran that I can give some rough numbers on our end.

With all permutations of these options I see a small improvement on a wide
range of benchmarks running on westmere (march pinned at SSE3 essentially).
I can't measure any real change between 1, 2, and 4. It's lost in the
noise. But all are a definite improvement.

The improvement is smaller on sandybridge for me, but still there, still
consistent across 1, 2, and 4. No binary size impact of note (under 0.01%
for *everything* discussed here).

When I target march=corei7-avx, I get no real performance change for these
flags. No regressions, no improvements. And still no code size changes.
Note that for this last round, I started with the baseline of
-enable-ind-var-reg-heur and -enable-loadstore-runtime-unroll, and added
the -vectorize-num-stores-pred and -enable-cond-stores-vec to them.

So unless you (or others) chime in with worrisome evidence, I think we
should probably turn all four of these on, with whatever value for
-vectorize-num-stores-pred looks good in your benchmarking.