Diego Novillo
2014-Jan-16 00:13 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
I am starting to use the sample profiler to analyze new performance opportunities. The loop unroller has popped up in several of the benchmarks I'm running, in particular libquantum, where there is a ~12% opportunity when the runtime unroller is triggered.

This helps functions like quantum_sigma_x (http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00149). The function accounts for ~20% of total runtime. By allowing the runtime unroller, we can speed up the program by about 12%.

I have been poking at the unroller a little bit. Currently, the runtime unroller is only triggered by a special flag or if the target asks for it in its unrolling preferences. We could also consult the block frequency information here: if the loop header has a higher relative frequency than the rest of the function, we'd enable runtime unrolling.

Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.

I'm currently looking at changing LoopUnroll::runOnLoop() to consult block frequency information for the loop header, to decide whether to try runtime unrolling for loops that don't have a constant trip count but could be partially peeled.

Does that sound reasonable?

Thanks. Diego.
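[For reference, the hot loop under discussion looks roughly like the following. This is a paraphrase of the linked libquantum source, not a verbatim copy: the real register types carry more fields (the IR dump later in the thread shows the 'state' field at struct index 1, after the amplitude), and the shift uses libquantum's own integer typedef.]

#include <cstdint>

// Simplified stand-ins for libquantum's register types; the real structs
// in qureg.h have more fields, e.g. the complex amplitude before 'state'.
struct quantum_reg_node { uint64_t state; };
struct quantum_reg { int size; quantum_reg_node *node; };

// Paraphrase of the hot loop in quantum_sigma_x (gates.c): flip bit
// 'target' in the basis-state word of every node.  The trip count,
// reg->size, is only known at run time, which is why only the runtime
// unroller (or the vectorizer) can transform this loop.
void quantum_sigma_x(int target, quantum_reg *reg)
{
  for (int i = 0; i < reg->size; i++)
    reg->node[i].state ^= (uint64_t)1 << target;
}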
Hal Finkel
2014-Jan-16 00:36 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.

It may be worth noting that the vectorizer's unrolling is modulo unrolling (in the sense that the iterations are maximally intermixed), and so is bound by register pressure considerations (especially in the default configuration, where CodeGen does not make use of AA, and so often cannot 'fix' an expensive unrolling that has increased register pressure too much).

The generic unroller, on the other hand, does concatenation unrolling, which has different benefits.

> I'm currently looking at changing LoopUnroll::runOnLoop() to consult block frequency information for the loop header, to decide whether to try runtime unrolling for loops that don't have a constant trip count but could be partially peeled.
>
> Does that sound reasonable?

This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.

The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
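[To make Hal's distinction concrete, here is a sketch of my own, not code from the thread. For a toy loop a[i] = b[i] + c, concatenation unrolling simply repeats the body back to back, while the vectorizer's modulo/interleaved unrolling mixes the operations of adjacent iterations, which keeps more values live at once and is therefore sensitive to register pressure. Both functions assume n is even; the real passes emit remainder code.]

// Concatenation unrolling (generic LoopUnroll): each iteration's body
// runs to completion before the next one starts.
void concat_unroll(int *a, const int *b, int c, int n) {
  for (int i = 0; i < n; i += 2) {
    a[i]     = b[i]     + c;
    a[i + 1] = b[i + 1] + c;
  }
}

// Modulo / interleaved unrolling (the vectorizer's unroller): the loads
// from both iterations are issued before either store, so more values are
// live simultaneously and register pressure rises -- the constraint Hal
// mentions.
void modulo_unroll(int *a, const int *b, int c, int n) {
  for (int i = 0; i < n; i += 2) {
    int t0 = b[i];
    int t1 = b[i + 1];
    a[i]     = t0 + c;
    a[i + 1] = t1 + c;
  }
}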
Chandler Carruth
2014-Jan-16 00:41 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:

> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.

I just also want to point out that we should really be *vectorizing* this loop as well. It's a great candidate for it AFAICS....
Diego Novillo
2014-Jan-16 00:41 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 4:36 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> It may be worth noting that the vectorizer's unrolling is modulo unrolling (in the sense that the iterations are maximally intermixed), and so is bound by register pressure considerations (especially in the default configuration, where CodeGen does not make use of AA, and so often cannot 'fix' an expensive unrolling that has increased register pressure too much).
>
> The generic unroller, on the other hand, does concatenation unrolling, which has different benefits.

Thanks.

> This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.
>
> The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

Right. If I force the runtime unroller, I get a mixed bag of speedups and slowdowns. Additionally, code size skyrockets. By using it only on the functions that have hot loops (as per the profile), we only unroll those that make a difference. In the case of libquantum, there is a grand total of 3 loops that need to be runtime unrolled.

Diego.
Nadav Rotem
2014-Jan-16 01:30 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
Hi Diego!

Thanks for looking at this!

> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables.

The vectorizer has a heuristic that is used to decide when to unroll. One of its rules is that reductions are profitable to unroll. The other rule is that small loops should also be unrolled.

> When I forced the vectorizer to unroll these loops, the performance effects were nil.

Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I wonder if 'size' is typically high or low.

Thanks,
Nadav
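[As a concrete illustration of the reduction rule — my own example, not from the thread: interleaving a reduction across several accumulators breaks the single loop-carried dependence chain, which is why reductions are considered profitable to unroll.]

// Why unrolling/interleaving a reduction helps: with one accumulator every
// add depends on the previous one; with two accumulators (interleave factor
// 2) the two chains proceed in parallel and the loop-carried latency is
// roughly halved.
long sum_unrolled_x2(const long *a, int n) {
  long s0 = 0, s1 = 0;
  int i = 0;
  for (; i + 1 < n; i += 2) {   // main loop: two independent chains
    s0 += a[i];
    s1 += a[i + 1];
  }
  if (i < n)                    // remainder iteration when n is odd
    s0 += a[i];
  return s0 + s1;
}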
Sean Silva
2014-Jan-16 03:38 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 7:13 PM, Diego Novillo <dnovillo at google.com> wrote:

> I am starting to use the sample profiler to analyze new performance opportunities. The loop unroller has popped up in several of the benchmarks I'm running, in particular libquantum, where there is a ~12% opportunity when the runtime unroller is triggered.

Pardon my ignorance, but what exactly does "runtime unroller" mean? In particular the "runtime" part of it. Just from the name I'm imagining JIT'ing an unrolled version on the fly, or choosing an unrolled version at runtime, but neither of those interpretations seems likely.

-- Sean Silva
Sean Silva
2014-Jan-16 03:43 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 8:30 PM, Nadav Rotem <nrotem at apple.com> wrote:

> The vectorizer has a heuristic that is used to decide when to unroll. One of its rules is that reductions are profitable to unroll. The other rule is that small loops should also be unrolled.
>
> Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I wonder if 'size' is typically high or low.

Yeah, can you produce a histogram of the values of `reg->size`? (or provide the raw data for me to analyze?).

-- Sean Silva
Hal Finkel
2014-Jan-16 04:50 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
> Pardon my ignorance, but what exactly does "runtime unroller" mean? In particular the "runtime" part of it. Just from the name I'm imagining JIT'ing an unrolled version on the fly, or choosing an unrolled version at runtime, but neither of those interpretations seems likely.

He's referring to the code in lib/Transforms/Utils/LoopUnrollRuntime.cpp -- which can be enabled by using the -unroll-runtime flag. The 'runtime' refers to the fact that the trip count is not known at compile time.

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
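[In other words — my illustration, not Hal's: with a trip count n known only at run time, the transformation preserves the loop's semantics by splitting the iteration space into a small remainder loop plus a main loop whose body is replicated. Roughly, for an unroll factor of 4:]

#include <cstdint>

// Sketch of runtime (unknown-trip-count) unrolling by a factor of 4 at the
// source level.  Whether the remainder iterations run before the main loop
// (a prologue, as LoopUnrollRuntime.cpp does) or after it is an
// implementation detail; the key point is that no constant trip count is
// required.
void xor_all_unrolled_by_4(uint64_t *state, int n, uint64_t mask) {
  int i = 0;
  for (int rem = n % 4; i < rem; i++)   // remainder ("prologue") loop
    state[i] ^= mask;
  for (; i < n; i += 4) {               // main unrolled loop
    state[i]     ^= mask;
    state[i + 1] ^= mask;
    state[i + 2] ^= mask;
    state[i + 3] ^= mask;
  }
}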
Diego Novillo
2014-Jan-16 16:16 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I wonder if 'size' is typically high or low.

No. The vectorizer stated that it wasn't going to bother with the loop because it wasn't profitable. Specifically:

LV: Checking a loop in "quantum_sigma_x"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: Found a write-only loop!
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Widest type: 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 %indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction: %3 = load i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction: %xor = xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 %xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction: %4 = trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 1 For instruction: %cmp = icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Scalar loop costs: 5.
LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 %indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction: %xor = xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction: store i64 %xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction: %4 = trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 2 For instruction: %cmp = icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Vector loop of width 2 costs: 7.
LV: Selecting VF = : 1.
LV: The target has 16 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #5 Interval # 2
LV(REG): At #6 Interval # 2
LV(REG): At #7 Interval # 2
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 3
LV(REG): LoopSize: 9
LV: Found a vectorizable loop (1) in gates.ll
LV: Unroll Factor is 1
LV: Vectorization is possible but not beneficial.

I poked briefly at the vectorizer code to see if there is anything that the profile data could've told it, but this loop did not meet the requirements for unrolling. And even if it did, the trip count is not constant and the unroll factor used by the vectorizer is pretty low. So, even if we vectorized it (or parts of it), I don't think the speedup would be significant.

What really helps this loop is to peel it a few times and do the remaining iterations in the loop.

Diego.
Andrew Trick
2014-Jan-17 04:47 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:

> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.

Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.

Currently, the cost model is conservative w.r.t. unrolling because we don't want to increase code size. But minimally, we should unroll until we can saturate the resources/ports; e.g., a loop with a single load should be unrolled x2 so we can do two loads per cycle. If you can come up with improved heuristics without generally impacting code size, that’s great.

Where we're currently looking for a constant trip count to avoid excessive unrolling, we could now look at the profiled trip count if it isn't constant.

-Andy
Diego Novillo
2014-Jan-21 14:18 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On 16/01/2014, 23:47, Andrew Trick wrote:

> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.

In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether we unroll it or not.

But I found a bigger issue. The loop optimizers run under the loop pass manager (I am still trying to wrap my head around that; I find it very odd and have not convinced myself why there is a separate manager for loops). Inside the loop pass manager, I am not allowed to call the block frequency analysis. Any attempt I make at scheduling the BF analysis sends the compiler into an infinite loop during initialization. Chandler suggested a way around the problem. I'll work on that first.

> Currently, the cost model is conservative w.r.t. unrolling because we don't want to increase code size. But minimally, we should unroll until we can saturate the resources/ports; e.g., a loop with a single load should be unrolled x2 so we can do two loads per cycle. If you can come up with improved heuristics without generally impacting code size, that’s great.

Oh, code size will always go up. That's pretty much unavoidable when you decide to unroll. The trick here is to only unroll select loops. The profiler does not tell you the trip count. What it will do is cause the loop header to be excessively heavy w.r.t. its parent in the block frequency analysis. In this particular case, you get something like:

---- Block Freqs ----
 entry = 1.0
 entry -> if.else = 0.375
 entry -> if.then = 0.625
 if.then = 0.625
 if.then -> if.end3 = 0.625
 if.else = 0.375
 if.else -> for.cond.preheader = 0.37487
 if.else -> if.end3 = 0.00006
 for.cond.preheader = 0.37487
 for.cond.preheader -> for.body.lr.ph = 0.37463
 for.cond.preheader -> for.end = 0.00018
 for.body.lr.ph = 0.37463
 for.body.lr.ph -> for.body = 0.37463
 for.body = 682.0
 for.body -> for.body = 681.65466
 for.body -> for.end = 0.34527
 for.end = 0.34545
 for.end -> if.end3 = 0.34545
 if.end3 = 0.9705

Notice how the head of the loop has weight 682, which is 682x the weight of its parent (the function entry, since this is an outermost loop). With static heuristics, this ratio is significantly lower (about 3x). When we see this, we can decide to unroll the loop.

Diego.
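[A minimal sketch of the heuristic being described, assuming a BlockFrequencyInfo-style interface and setting aside the loop-pass-manager problem Diego mentions. The function name and the 100x threshold are purely illustrative — picked to sit between the ~3x static estimate and the 682x profiled ratio above — and the exact analysis API has shifted across LLVM releases.]

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/LoopInfo.h"

using namespace llvm;

// Return true when the loop header executes far more often than the
// function entry block, the signal proposed for enabling runtime
// unrolling on loops without a constant trip count.
static bool isHotEnoughForRuntimeUnrolling(const Loop *L,
                                           const BlockFrequencyInfo &BFI,
                                           uint64_t MinHeaderToEntryRatio = 100) {
  const Function *F = L->getHeader()->getParent();
  uint64_t EntryFreq =
      BFI.getBlockFreq(&F->getEntryBlock()).getFrequency();
  uint64_t HeaderFreq =
      BFI.getBlockFreq(L->getHeader()).getFrequency();
  if (EntryFreq == 0)
    return false;
  return HeaderFreq / EntryFreq >= MinHeaderToEntryRatio;
}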