Diego Novillo
2014-Jan-16 00:13 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
I am starting to use the sample profiler to analyze new performance opportunities. The loop unroller has popped up in several of the benchmarks I'm running. In particular, libquantum. There is a ~12% opportunity when the runtime unroller is triggered.

This helps functions like quantum_sigma_x (http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00149). The function accounts for ~20% of total runtime. By allowing the runtime unroller, we can speed up the program by about 12%.

I have been poking at the unroller a little bit. Currently, the runtime unroller is only triggered by a special flag or if the target states it in the unrolling preferences. We could also consult the block frequency information here: if the loop header has a higher relative frequency than the rest of the function, then we'd enable runtime unrolling.

Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.

I'm currently looking at changing LoopUnroll::runOnLoop() to consult block frequency information for the loop header to decide whether to try runtime triggers for loops that don't have a constant trip count but could be partially peeled.

Does that sound reasonable?


Thanks.

Diego.
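For reference, the loop being discussed is essentially the following. This is a simplified, self-contained sketch based on the libquantum 0.2.4 gates.c linked above; the struct layouts are abbreviated and the amplitude type is simplified, so exact field types may differ from the real source.

  #include <cstdint>

  typedef std::uint64_t MAX_UNSIGNED;     // libquantum's basis-state type

  struct quantum_reg_node {               // amplitude type simplified here
    float amplitude;
    MAX_UNSIGNED state;
  };

  struct quantum_reg {                    // other fields of the real struct omitted
    int size;                             // number of basis states; known only at run time
    quantum_reg_node *node;
  };

  // Flip bit `target' in every basis state.  The trip count (reg->size) is
  // not a compile-time constant, which is why only the runtime unroller can
  // unroll this loop.
  void quantum_sigma_x(int target, quantum_reg *reg) {
    for (int i = 0; i < reg->size; i++)
      reg->node[i].state ^= ((MAX_UNSIGNED)1 << target);
  }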
Hal Finkel
2014-Jan-16 00:36 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
----- Original Message -----
> From: "Diego Novillo" <dnovillo at google.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Cc: nadav at apple.com
> Sent: Wednesday, January 15, 2014 6:13:27 PM
> Subject: [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
>
> Chandler also pointed me at the vectorizer, which has its own
> unroller. However, the vectorizer only unrolls enough to serve the
> target; it's not as general as the runtime-triggered unroller. From
> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
> AVX targets). Additionally, the vectorizer only unrolls to aid
> reduction variables. When I forced the vectorizer to unroll these
> loops, the performance effects were nil.

It may be worth noting that the vectorizer's unrolling is modulo unrolling (in the sense that the iterations are maximally intermixed), and so is bound by register pressure considerations (especially in the default configuration, where CodeGen does not make use of AA, and so often cannot 'fix' an expensive unrolling that has increased register pressure too much).

The generic unroller, on the other hand, does concatenation unrolling, which has different benefits.

> I'm currently looking at changing LoopUnroll::runOnLoop() to consult
> block frequency information for the loop header to decide whether to
> try runtime triggers for loops that don't have a constant trip count
> but could be partially peeled.
>
> Does that sound reasonable?

This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.

The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
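To make the distinction concrete, here is a small hand-written sketch (not output of either pass, and with made-up function and variable names) of the two unrolling styles applied to a simple loop:

  // Unroll-by-2, concatenation style (what the generic LoopUnroll pass does):
  // iteration i finishes before iteration i+1 starts, so few extra values
  // stay live across the body.
  void concat_unroll(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n - 1; i += 2) {   // remainder handling omitted
      a[i]     = b[i]     + c[i];
      a[i + 1] = b[i + 1] + c[i + 1];
    }
  }

  // Unroll-by-2, interleaved ("modulo") style (what the vectorizer's unroller
  // aims for): operations from both iterations are intermixed, which exposes
  // ILP but keeps more values live at once -- hence the register-pressure
  // bound mentioned above.
  void interleaved_unroll(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n - 1; i += 2) {   // remainder handling omitted
      int t0 = b[i];
      int t1 = b[i + 1];
      int u0 = c[i];
      int u1 = c[i + 1];
      a[i]     = t0 + u0;
      a[i + 1] = t1 + u1;
    }
  }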
Chandler Carruth
2014-Jan-16 00:41 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:

> Chandler also pointed me at the vectorizer, which has its own
> unroller. However, the vectorizer only unrolls enough to serve the
> target; it's not as general as the runtime-triggered unroller. From
> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
> AVX targets). Additionally, the vectorizer only unrolls to aid
> reduction variables. When I forced the vectorizer to unroll these
> loops, the performance effects were nil.

I just also want to point out that we should really be *vectorizing* this loop as well. It's a great candidate for it AFAICS....
Diego Novillo
2014-Jan-16 00:41 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 4:36 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> It may be worth noting that the vectorizer's unrolling is modulo
> unrolling (in the sense that the iterations are maximally intermixed),
> and so is bound by register pressure considerations (especially in the
> default configuration, where CodeGen does not make use of AA, and so
> often cannot 'fix' an expensive unrolling that has increased register
> pressure too much).
>
> The generic unroller, on the other hand, does concatenation unrolling,
> which has different benefits.

Thanks.

> This sounds good to me; I definitely feel that we should better exploit
> the generic unroller's capabilities.
>
> The last time that I tried enabling runtime unrolling (and partial
> unrolling) over the entire test suite on x86, there were many speedups
> and many slowdowns (although slightly more slowdowns than speedups).
> You seem to be suggesting that restricting runtime unrolling to known
> hot loops will eliminate many of the slowdowns. I'm certainly curious
> to see how that turns out.

Right. If I force the runtime unroller, I get a mixed bag of speedups and slowdowns. Additionally, code size skyrockets. By using it only on the functions that have hot loops (as per the profile), we only unroll those that make a difference. In the case of libquantum, there is a grand total of 3 loops that need to be runtime unrolled.


Diego.
Nadav Rotem
2014-Jan-16 01:30 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
Hi Diego!

Thanks for looking at this!

> Chandler also pointed me at the vectorizer, which has its own
> unroller. However, the vectorizer only unrolls enough to serve the
> target; it's not as general as the runtime-triggered unroller. From
> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
> AVX targets). Additionally, the vectorizer only unrolls to aid
> reduction variables.

The vectorizer has a heuristic that decides when to unroll. One of its rules is that reductions are profitable to unroll; another is that small loops should also be unrolled.

> When I forced the vectorizer to unroll these
> loops, the performance effects were nil.

Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I wonder if 'size' is typically high or low.

Thanks,
Nadav
Sean Silva
2014-Jan-16 03:38 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 7:13 PM, Diego Novillo <dnovillo at google.com> wrote:

> I am starting to use the sample profiler to analyze new performance
> opportunities. The loop unroller has popped up in several of the
> benchmarks I'm running. In particular, libquantum. There is a ~12%
> opportunity when the runtime unroller is triggered.

Pardon my ignorance, but what exactly does "runtime unroller" mean? In particular the "runtime" part of it. Just from the name I'm imagining JIT'ing an unrolled version on the fly, or choosing an unrolled version at runtime, but neither of those interpretations seems likely.

-- Sean Silva
Sean Silva
2014-Jan-16 03:43 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 8:30 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?
> I wonder if 'size' is typically high or low.

Yeah, can you produce a histogram of the values of `reg->size`? (Or provide the raw data for me to analyze?)

-- Sean Silva
Hal Finkel
2014-Jan-16 04:50 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
----- Original Message -----
> From: "Sean Silva" <silvas at purdue.edu>
> To: "Diego Novillo" <dnovillo at google.com>
> Cc: nadav at apple.com, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Wednesday, January 15, 2014 9:38:32 PM
> Subject: Re: [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
>
> Pardon my ignorance, but what exactly does "runtime unroller" mean?
> In particular the "runtime" part of it. Just from the name I'm
> imagining JIT'ing an unrolled version on the fly, or choosing an
> unrolled version at runtime, but neither of those interpretations
> seems likely.

He's referring to the code in lib/Transforms/Utils/LoopUnrollRuntime.cpp, which can be enabled by using the -unroll-runtime flag. The 'runtime' refers to the fact that the trip count is not known at compile time.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
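In other words, runtime unrolling of a loop with an unknown trip count looks roughly like the C-level sketch below. This is an illustration only, with made-up names; LoopUnrollRuntime.cpp performs the transform on IR, and the remainder iterations may be emitted before or after the unrolled body.

  // Before: trip count n is only known at run time.
  void before(long *p, long mask, int n) {
    for (int i = 0; i < n; i++)
      p[i] ^= mask;
  }

  // After runtime unrolling by 4: a short remainder loop runs n % 4
  // iterations, then the main loop executes four copies of the body per
  // trip.  No constant trip count is needed, unlike full/partial unrolling.
  void after(long *p, long mask, int n) {
    int i = 0;
    for (; i < n % 4; i++)          // remainder ("prologue") loop
      p[i] ^= mask;
    for (; i < n; i += 4) {         // unrolled main loop
      p[i]     ^= mask;
      p[i + 1] ^= mask;
      p[i + 2] ^= mask;
      p[i + 3] ^= mask;
    }
  }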
Diego Novillo
2014-Jan-16 16:16 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?
> I wonder if 'size' is typically high or low.

No. The vectorizer stated that it wasn't going to bother with the loop because it wasn't profitable. Specifically:

LV: Checking a loop in "quantum_sigma_x"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: Found a write-only loop!
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Widest type: 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 %indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor = xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 %xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction:   %4 = trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp = icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Scalar loop costs: 5.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 %indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor = xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64 %xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction:   %4 = trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp = icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Vector loop of width 2 costs: 7.
LV: Selecting VF = : 1.
LV: The target has 16 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #5 Interval # 2
LV(REG): At #6 Interval # 2
LV(REG): At #7 Interval # 2
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 3
LV(REG): LoopSize: 9
LV: Found a vectorizable loop (1) in gates.ll
LV: Unroll Factor is 1
LV: Vectorization is possible but not beneficial.

I poked briefly at the vectorizer code to see if there is anything that the profile data could've told it, but this loop did not meet the requirements for unrolling. And even if it did, the trip count is not constant and the unroll factor used by the vectorizer is pretty low. So, even if we vectorized it (or parts of it), I don't think the speedup would be significant.

What really helps this loop is to peel it a few times and do the remaining iterations in the loop.


Diego.
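(A hedged reading of the cost-model output above: the vectorizer chooses the width by comparing the scalar cost per iteration, 5 here, against the vector cost divided by the width. At VF 2 the instruction costs sum to 15, i.e. 7 per scalar iteration after integer division, and 7 > 5, so VF 1 wins. The expensive entries are the load and store of node[i].state, cost 6 each at VF 2 versus 1 scalar: because the state field is interleaved with the amplitude field in the node array, the accesses are strided rather than consecutive, so a width-2 vector access would have to be scalarized.)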
Andrew Trick
2014-Jan-17 04:47 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:

> Chandler also pointed me at the vectorizer, which has its own
> unroller. However, the vectorizer only unrolls enough to serve the
> target; it's not as general as the runtime-triggered unroller. From
> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
> AVX targets). Additionally, the vectorizer only unrolls to aid
> reduction variables. When I forced the vectorizer to unroll these
> loops, the performance effects were nil.

Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.

Currently, the cost model is conservative w.r.t. unrolling because we don't want to increase code size. But minimally, we should unroll until we can saturate the resources/ports; e.g. a loop with a single load should be unrolled x2 so we can do two loads per cycle. If you can come up with improved heuristics without generally impacting code size, that’s great.

Where we're currently looking for a constant trip count to avoid excessive unrolling, we could now look at profiled trip count if it isn't constant.

-Andy
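As a concrete reading of the "two loads per cycle" point, here is an editor's sketch with made-up names (not code from any LLVM pass); remainder handling is omitted for brevity:

  // One load per iteration: on a core that can issue two loads per cycle,
  // one load port sits idle and the single add chain serializes.
  long sum(const long *p, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
      s += p[i];
    return s;
  }

  // Unrolled (and interleaved) by 2 with two partial sums: both load ports
  // can be kept busy and the two add chains overlap.  Integer addition makes
  // the reassociation into two partial sums safe.
  long sum_unrolled_x2(const long *p, int n) {
    long s0 = 0, s1 = 0;
    for (int i = 0; i < n - 1; i += 2) {   // odd trailing element not handled
      s0 += p[i];
      s1 += p[i + 1];
    }
    return s0 + s1;
  }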
Diego Novillo
2014-Jan-21 14:18 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On 16/01/2014, 23:47, Andrew Trick wrote:

> Vectorization and partial unrolling (aka runtime unrolling) for ILP
> should be the same pass. The profitability analysis required in each
> case is very closely related, and you never want to do one before or
> after the other. The analysis makes sense even for targets without
> vector units. The “vector unroller” has an extra restriction (unlike
> the LoopUnroll pass) in that it must be able to interleave operations
> across iterations. This is usually a good thing to check before
> unrolling, but the compiler’s dependence analysis may be too
> conservative in some cases.

In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether we unroll it or not.

But I found a bigger issue. The loop optimizers run under the loop pass manager (I am still trying to wrap my head around that; I find it very odd and have not convinced myself why there is a separate manager for loops). Inside the loop pass manager, I am not allowed to call the block frequency analysis. Any attempt I make at scheduling BF analysis sends the compiler into an infinite loop during initialization. Chandler suggested a way around the problem. I'll work on that first.

> Currently, the cost model is conservative w.r.t. unrolling because we
> don't want to increase code size. But minimally, we should unroll
> until we can saturate the resources/ports; e.g. a loop with a single
> load should be unrolled x2 so we can do two loads per cycle. If you
> can come up with improved heuristics without generally impacting code
> size, that’s great.

Oh, code size will always go up. That's pretty much unavoidable when you decide to unroll. The trick here is to only unroll select loops.

The profiler does not tell you the trip count. What it will do is cause the loop header to be excessively heavy w.r.t. its parent in the block frequency analysis. In this particular case, you get something like:

---- Block Freqs ----
 entry = 1.0
   entry -> if.else = 0.375
   entry -> if.then = 0.625
 if.then = 0.625
   if.then -> if.end3 = 0.625
 if.else = 0.375
   if.else -> for.cond.preheader = 0.37487
   if.else -> if.end3 = 0.00006
 for.cond.preheader = 0.37487
   for.cond.preheader -> for.body.lr.ph = 0.37463
   for.cond.preheader -> for.end = 0.00018
 for.body.lr.ph = 0.37463
   for.body.lr.ph -> for.body = 0.37463
 for.body = 682.0
   for.body -> for.body = 681.65466
   for.body -> for.end = 0.34527
 for.end = 0.34545
   for.end -> if.end3 = 0.34545
 if.end3 = 0.9705

Notice how the head of the loop has weight 682, which is 682x the weight of its parent (the function entry, since this is an outermost loop). With static heuristics, this ratio is significantly lower (about 3x). When we see this, we can decide to unroll the loop.


Diego.
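A minimal sketch of what such a hotness check might look like, assuming a BlockFrequencyInfo for the enclosing function is available to the unroller (which is exactly the pass-manager problem mentioned above). The helper name and the threshold are made up for illustration; this is not the actual patch.

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Function.h"

  using namespace llvm;

  // Treat a loop as "hot enough" for runtime unrolling when its header
  // executes far more often than the function entry.  In the libquantum loop
  // above the ratio is ~682x, while static heuristics give roughly 3x, so
  // even a crude threshold separates the two.
  static bool isHotLoopHeader(const Loop &L, BlockFrequencyInfo &BFI,
                              uint64_t MinRatio = 100) {
    const BasicBlock *Header = L.getHeader();
    const Function *F = Header->getParent();
    uint64_t HeaderFreq = BFI.getBlockFreq(Header).getFrequency();
    uint64_t EntryFreq = BFI.getBlockFreq(&F->getEntryBlock()).getFrequency();
    return EntryFreq != 0 && HeaderFreq / EntryFreq >= MinRatio;
  }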