Arnold Schwaighofer
2014-Jan-28 01:22 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
In r200270 I added support to unroll conditional stores in the loop vectorizer. It is currently off pending further benchmarking and can be enabled with "-mllvm -vectorize-num-stores-pred=1". Furthermore, I added a heuristic to unroll until the load/store ports are saturated, "-mllvm -enable-loadstore-runtime-unroll", instead of the purely size-based heuristic.

With those two, together with a patch that slightly changes the register heuristic, libquantum's three hot loops will unroll and goodness will ensue (at least for libquantum).

commit 6b908b8b1084c97238cc642a3404a4285c21286f
Author: Arnold Schwaighofer <aschwaighofer at apple.com>
Date:   Mon Jan 27 13:21:55 2014 -0800

    Subtract one for loop induction variable. It is unlikely to be unrolled.

diff --git a/lib/Transforms/Vectorize/LoopVectorize.cpp b/lib/Transforms/Vectorize/LoopVectorize.cpp
index 7867495..978c5a1 100644
--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5142,8 +5142,8 @@ LoopVectorizationCostModel::selectUnrollFactor(bool OptForSize,
   // fit without causing spills. All of this is rounded down if necessary to be
   // a power of two. We want power of two unroll factors to simplify any
   // addressing operations or alignment considerations.
-  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
-                              R.MaxLocalUsers);
+  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /
+                              (R.MaxLocalUsers - 1));

On Jan 21, 2014, at 11:46 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
> On Jan 21, 2014, at 6:18 AM, Diego Novillo <dnovillo at google.com> wrote:
>
>> On 16/01/2014, 23:47, Andrew Trick wrote:
>>>
>>> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>>> Chandler also pointed me at the vectorizer, which has its own
>>>> unroller. However, the vectorizer only unrolls enough to serve the
>>>> target; it's not as general as the runtime-triggered unroller. From
>>>> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
>>>> AVX targets). Additionally, the vectorizer only unrolls to aid
>>>> reduction variables. When I forced the vectorizer to unroll these
>>>> loops, the performance effects were nil.
>>>
>>> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The "vector unroller" has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler's dependence analysis may be too conservative in some cases.
>>
>> In addition to tuning the cost model, I found that the vectorizer does not even get far enough into its analysis for some of the loops I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to reach the decision of whether or not to unroll it.
>>
>
> I assume the other two loops are quantum_cnot's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00054> and quantum_toffoli's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00082>.
>
> The problem for the unroller in the loop vectorizer is that it wants to if-convert those loops. The conditional store prevents if-conversion because we can't introduce a store on a path where there was none before: <http://llvm.org/docs/Atomics.html#optimization-outside-atomic>.
>
> for (…)
>   if (A[i] & mask)
>     A[i] = val
>
> If we wanted the unroller in the vectorizer to handle such loops, we would have to teach it to leave the store behind an if:
>
> for (…)
>   if (A[i] & mask)
>     A[i] = val
>
> =>
>
> for (… i += 2) {
>   pred<0,1> = A[i:i+1] & mask<0,1>
>   val<0,1>  = ...
>   if (pred<0>)
>     A[i] = val<0>
>   if (pred<1>)
>     A[i+1] = val<1>
> }
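To make the register-heuristic tweak in the diff above concrete, here is a minimal standalone sketch (not LLVM code; the register counts are made up purely for illustration, and powerOf2Floor is a simplified stand-in for llvm::PowerOf2Floor) showing how subtracting one for the induction variable can raise the chosen unroll factor:

#include <cstdio>

// Simplified stand-in for llvm::PowerOf2Floor: largest power of two <= X.
static unsigned powerOf2Floor(unsigned X) {
  unsigned P = 1;
  while (P <= X / 2)
    P *= 2;
  return X ? P : 0;
}

int main() {
  // Assumed example values: 16 target registers, 2 loop-invariant values,
  // and at most 2 values live at any point in the loop body, one of which
  // is the induction variable.
  unsigned TargetNumRegisters = 16;
  unsigned LoopInvariantRegs = 2;
  unsigned MaxLocalUsers = 2;

  // Old formula: the induction variable is charged against every unrolled copy.
  unsigned Before = powerOf2Floor((TargetNumRegisters - LoopInvariantRegs) /
                                  MaxLocalUsers);        // (16-2)/2 = 7 -> 4

  // New formula: take the induction variable out of both the register budget
  // and the per-copy register pressure.
  unsigned After = powerOf2Floor((TargetNumRegisters - LoopInvariantRegs - 1) /
                                 (MaxLocalUsers - 1));   // 13/1 = 13 -> 8

  std::printf("UF before: %u, UF after: %u\n", Before, After);
  return 0;
}

With these example inputs the old formula settles on an unroll factor of 4 while the new one picks 8; the idea, per the commit message, is that the induction variable occupies one of the locally used registers but is unlikely to be replicated per unrolled copy, so it should not be charged against every copy.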
Chandler Carruth
2014-Jan-31 21:28 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
Hey Arnold,

I've completed some pretty thorough benchmarking and wanted to share the results.

On Mon, Jan 27, 2014 at 5:22 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> Furthermore, I added a heuristic to unroll until the load/store ports are
> saturated, "-mllvm -enable-loadstore-runtime-unroll", instead of the purely
> size-based heuristic.
>
> With those two, together with a patch that slightly changes the register
> heuristic, libquantum's three hot loops will unroll and goodness will
> ensue (at least for libquantum).

Enabling both load/store runtime unrolling and the register heuristic (via -enable-ind-var-reg-heur) shows no interesting regressions (way below the noise) and a few nice benefits across all of the applications I measure. I'd support enabling them right away and getting more feedback from others. I've measured on both Westmere and Sandy Bridge, with -march=x86-64 and -march=corei7-avx.

I don't have any ARM hardware to benchmark with, but I suspect you have decent numbers there? We also have a nice LNT bot that will measure anything we enable for ARM.

Finally, I've got some experimental results for x86 that show some improvements and no significant regressions when I increase several target thresholds. I'll start a new thread about that though.
Chandler Carruth
2014-Feb-01 12:02 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Fri, Jan 31, 2014 at 1:28 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Hey Arnold,
>
> I've completed some pretty thorough benchmarking and wanted to share the
> results.
>
> On Mon, Jan 27, 2014 at 5:22 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>> Furthermore, I added a heuristic to unroll until the load/store ports are
>> saturated, "-mllvm -enable-loadstore-runtime-unroll", instead of the purely
>> size-based heuristic.
>>
>> With those two, together with a patch that slightly changes the register
>> heuristic, libquantum's three hot loops will unroll and goodness will
>> ensue (at least for libquantum).
>
> Enabling both load/store runtime unrolling and the register heuristic
> (via -enable-ind-var-reg-heur) shows no interesting regressions
> (way below the noise) and a few nice benefits across all of the
> applications I measure. I'd support enabling them right away and getting
> more feedback from others. I've measured on both Westmere and Sandy Bridge,
> with -march=x86-64 and -march=corei7-avx.

I've now also measured -vectorize-num-stores-pred={1,2,4}, both with and without -enable-cond-stores-vec. There are some crashers when using these currently; I may get a chance to reduce them soon, but I may not. However, enough built and ran that I can give some rough numbers on our end.

With all permutations of these options I see a small improvement on a wide range of benchmarks running on Westmere (march pinned at SSE3, essentially). I can't measure any real change between 1, 2, and 4; it's lost in the noise. But all are a definite improvement. The improvement is smaller on Sandy Bridge for me, but still there, and still consistent across 1, 2, and 4. No binary size impact of note (under 0.01% for *everything* discussed here).

When I target -march=corei7-avx, I get no real performance change from these flags. No regressions, no improvements. And still no code size changes.

Note that for this last round, I started with the baseline of -enable-ind-var-reg-heur and -enable-loadstore-runtime-unroll, and added -vectorize-num-stores-pred and -enable-cond-stores-vec on top of them.

So unless you (or others) chime in with worrisome evidence, I think we should probably turn all four of these on, with whatever value for -vectorize-num-stores-pred looks good in your benchmarking.
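For reference, one hypothetical way to try all four knobs together is from the clang driver, assuming each is a cl::opt flag passed through with -mllvm as in Arnold's original message (gates.c here is just a stand-in for whichever source file is being compiled, and 1 for -vectorize-num-stores-pred is only an example value):

clang -O3 -c gates.c \
  -mllvm -enable-loadstore-runtime-unroll \
  -mllvm -enable-ind-var-reg-heur \
  -mllvm -enable-cond-stores-vec \
  -mllvm -vectorize-num-stores-pred=1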