Diego Novillo
2014-Jan-21 14:18 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On 16/01/2014, 23:47, Andrew Trick wrote:
>
> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>
>> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.
>
> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.

In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether to unroll it.

But I found a bigger issue. The loop optimizers run under the loop pass manager (I am still trying to wrap my head around that; I find it very odd and have not convinced myself why there is a separate manager for loops). Inside the loop pass manager, I am not allowed to call the block frequency analysis. Any attempt I make at scheduling BF analysis sends the compiler into an infinite loop during initialization.
Chandler suggested a way around the problem. I'll work on that first.

> Currently, the cost model is conservative w.r.t. unrolling because we don't want to increase code size. But minimally, we should unroll until we can saturate the resources/ports. E.g., a loop with a single load should be unrolled x2 so we can do two loads per cycle. If you can come up with improved heuristics without generally impacting code size, that’s great.

Oh, code size will always go up. That's pretty much unavoidable when you decide to unroll. The trick here is to only unroll select loops.

The profiler does not tell you the trip count. What it will do is make the loop header excessively heavy with respect to its parent in the block frequency analysis. In this particular case, you get something like:

---- Block Freqs ----
 entry = 1.0
 entry -> if.else = 0.375
 entry -> if.then = 0.625
 if.then = 0.625
 if.then -> if.end3 = 0.625
 if.else = 0.375
 if.else -> for.cond.preheader = 0.37487
 if.else -> if.end3 = 0.00006
 for.cond.preheader = 0.37487
 for.cond.preheader -> for.body.lr.ph = 0.37463
 for.cond.preheader -> for.end = 0.00018
 for.body.lr.ph = 0.37463
 for.body.lr.ph -> for.body = 0.37463
 for.body = 682.0
 for.body -> for.body = 681.65466
 for.body -> for.end = 0.34527
 for.end = 0.34545
 for.end -> if.end3 = 0.34545
 if.end3 = 0.9705

Notice how the head of the loop has weight 682, which is 682x the weight of its parent (the function entry, since this is an outermost loop). With static heuristics, this ratio is significantly lower (about 3x). When we see this, we can decide to unroll the loop.

Diego.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140121/3d441a98/attachment.html>
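[Editor's sketch, assuming a hypothetical threshold: the trigger Diego describes can be expressed as a ratio test between the loop header's block frequency and its parent's. The function name and the 100x cutoff below are illustrative, not LLVM code; they are chosen only to sit well above the ~3x static estimate.]

```c
#include <stdbool.h>

/* Hypothetical sketch of the profile-based trigger described above.
 * An outermost loop becomes an unrolling candidate when its header's
 * block frequency is far heavier than its parent's (the function
 * entry).  The 100.0 cutoff is an assumption for illustration: static
 * heuristics produce ratios near 3x, while the profiled loop shows
 * 682x. */
static const double kHotLoopRatio = 100.0;

bool should_unroll_from_profile(double header_freq, double parent_freq) {
  if (parent_freq <= 0.0)
    return false;                 /* no meaningful ratio */
  return header_freq / parent_freq >= kHotLoopRatio;
}
```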
Arnold Schwaighofer
2014-Jan-21 19:46 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Jan 21, 2014, at 6:18 AM, Diego Novillo <dnovillo at google.com> wrote:
> On 16/01/2014, 23:47, Andrew Trick wrote:
>>
>> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>>> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.
>>
>> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.
>
> In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether to unroll it.

I assume the other two loops are quantum_cnot's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00054> and quantum_toffoli's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00082>. The problem for the unroller in the loop vectorizer is that it wants to if-convert those loops.

The conditional store prevents if-conversion because we can’t introduce a store on a path where there was none before: <http://llvm.org/docs/Atomics.html#optimization-outside-atomic>.

  for (…)
    if (A[i] & mask)
      A[i] = val

If we wanted the unroller in the vectorizer to handle such loops, we would have to teach it to leave the store behind an if:

  for (…)
    if (A[i] & mask)
      A[i] = val

  =>

  for (… i += 2) {
    pred<0,1> = A[i:i+1] & mask<0,1>
    val<0,1>  = ...
    if (pred<0>)
      A[i] = val<0>
    if (pred<1>)
      A[i+1] = val<1>
  }
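[Editor's sketch of the transformation Arnold describes, as a concrete C example on a libquantum-style loop; the function names are made up. The unrolled body computes both predicates up front but keeps each store guarded, so no execution path gains a store it did not have before.]

```c
#include <stddef.h>

/* Original loop shape: a store guarded by a data-dependent mask test. */
void flip_masked(unsigned *A, size_t n, unsigned mask, unsigned val) {
  for (size_t i = 0; i < n; i++)
    if (A[i] & mask)
      A[i] = val;
}

/* Unrolled-by-2 version: predicates are evaluated for both iterations,
 * but each store stays behind its own predicate, preserving the
 * memory-model requirement that no new store paths are introduced. */
void flip_masked_unroll2(unsigned *A, size_t n, unsigned mask, unsigned val) {
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    unsigned p0 = A[i] & mask;     /* pred<0> */
    unsigned p1 = A[i + 1] & mask; /* pred<1> */
    if (p0)
      A[i] = val;                  /* store remains predicated */
    if (p1)
      A[i + 1] = val;
  }
  for (; i < n; i++)               /* scalar remainder iteration */
    if (A[i] & mask)
      A[i] = val;
}
```

Both versions compute identical results; only the iteration structure differs.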
Andrew Trick
2014-Jan-21 21:31 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Jan 21, 2014, at 11:46 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
> On Jan 21, 2014, at 6:18 AM, Diego Novillo <dnovillo at google.com> wrote:
>
>> On 16/01/2014, 23:47, Andrew Trick wrote:
>>>
>>> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>>> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.
>>>
>>> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.
>>
>> In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether to unroll it.
>
> I assume the other two loops are quantum_cnot's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00054> and quantum_toffoli's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00082>.
>
> The problem for the unroller in the loop vectorizer is that it wants to if-convert those loops. The conditional store prevents if-conversion because we can’t introduce a store on a path where there was none before: <http://llvm.org/docs/Atomics.html#optimization-outside-atomic>.
>
>   for (…)
>     if (A[i] & mask)
>       A[i] = val
>
> If we wanted the unroller in the vectorizer to handle such loops, we would have to teach it to leave the store behind an if:
>
>   for (…)
>     if (A[i] & mask)
>       A[i] = val
>
>   =>
>
>   for (… i += 2) {
>     pred<0,1> = A[i:i+1] & mask<0,1>
>     val<0,1>  = ...
>     if (pred<0>)
>       A[i] = val<0>
>     if (pred<1>)
>       A[i+1] = val<1>
>   }

Thanks for confirming that the vectorizer bails early when it tries to if-convert. How hard would it be to allow the analysis to proceed in the presence of control flow? The analysis could conservatively ignore the control flow, and we could go ahead with unrolling while disabling vectorization. If we do allow the analysis to proceed, will it successfully disambiguate loads and stores from multiple iterations?

-Andy
Andrew Trick
2014-Jan-21 21:31 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Jan 21, 2014, at 6:18 AM, Diego Novillo <dnovillo at google.com> wrote:
> On 16/01/2014, 23:47, Andrew Trick wrote:
>>
>> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>>> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.
>>
>> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.
>
> In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether to unroll it.
>
> But I found a bigger issue. The loop optimizers run under the loop pass manager (I am still trying to wrap my head around that; I find it very odd and have not convinced myself why there is a separate manager for loops). Inside the loop pass manager, I am not allowed to call the block frequency analysis. Any attempt I make at scheduling BF analysis sends the compiler into an infinite loop during initialization.
>
> Chandler suggested a way around the problem. I'll work on that first.

It is very difficult to deal with the LoopPassManager. The concept doesn’t fit with typical loop passes, which may need to rerun function-level analyses and can affect code outside the current loop. One nice thing about it is that a pass can push a new loop onto the queue and rerun all managed loop passes just on that loop. In the past I’ve seen efforts to consolidate multiple loop passes into the same LoopPassManager, but I don’t really understand the benefit other than reducing the one-time overhead of instantiating multiple pass managers and some minor IR access locality improvements. Maybe Chandler’s work will help with the overhead of instantiating pass managers. Anyway, I see no reason that the vectorizer shouldn’t run in its own loop pass manager.

>> Currently, the cost model is conservative w.r.t. unrolling because we don't want to increase code size. But minimally, we should unroll until we can saturate the resources/ports. E.g., a loop with a single load should be unrolled x2 so we can do two loads per cycle. If you can come up with improved heuristics without generally impacting code size, that’s great.
>
> Oh, code size will always go up. That's pretty much unavoidable when you decide to unroll. The trick here is to only unroll select loops. The profiler does not tell you the trip count. What it will do is make the loop header excessively heavy with respect to its parent in the block frequency analysis. In this particular case, you get something like:
>
> ---- Block Freqs ----
>  entry = 1.0
>  entry -> if.else = 0.375
>  entry -> if.then = 0.625
>  if.then = 0.625
>  if.then -> if.end3 = 0.625
>  if.else = 0.375
>  if.else -> for.cond.preheader = 0.37487
>  if.else -> if.end3 = 0.00006
>  for.cond.preheader = 0.37487
>  for.cond.preheader -> for.body.lr.ph = 0.37463
>  for.cond.preheader -> for.end = 0.00018
>  for.body.lr.ph = 0.37463
>  for.body.lr.ph -> for.body = 0.37463
>  for.body = 682.0
>  for.body -> for.body = 681.65466
>  for.body -> for.end = 0.34527
>  for.end = 0.34545
>  for.end -> if.end3 = 0.34545
>  if.end3 = 0.9705
>
> Notice how the head of the loop has weight 682, which is 682x the weight of its parent (the function entry, since this is an outermost loop).

Yep. That’s exactly what I meant by the profiled trip count. We don’t get a known count or even a distribution, just an average. And we generally only care whether the loop has a *highish* or *lowish* trip count.

> With static heuristics, this ratio is significantly lower (about 3x).

I believe you, but I thought we would use 10x as a default estimated trip count.

> When we see this, we can decide to unroll the loop.

I’m sure we could be more aggressive than we are currently at O3. If the vectorizer unroller’s conditions are too strict, we could consider calling the standard partial unroller only on the loops that didn’t get vectorized.

-Andy

> Diego.
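[Editor's sketch of the "just an average" point, under the assumption that the profiled average trip count can be recovered by dividing the loop header's frequency by the frequency of the edge entering the loop. The function name is illustrative, not an LLVM API.]

```c
/* Illustrative only: estimate the average trip count from block
 * frequencies as header_freq / entry_edge_freq.  Plugging in the
 * numbers from Diego's dump (for.body = 682.0, entered with
 * frequency 0.37463) gives roughly 1820 iterations per entry. */
double estimated_trip_count(double header_freq, double entry_edge_freq) {
  if (entry_edge_freq <= 0.0)
    return 0.0;                   /* loop never entered in the profile */
  return header_freq / entry_edge_freq;
}
```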
Chandler Carruth
2014-Jan-21 22:01 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
Just to add a few notes...

On Tue, Jan 21, 2014 at 1:31 PM, Andrew Trick <atrick at apple.com> wrote:
>> Chandler suggested a way around the problem. I'll work on that first.
>
> It is very difficult to deal with the LoopPassManager. The concept doesn’t fit with typical loop passes, which may need to rerun function-level analyses and can affect code outside the current loop. One nice thing about it is that a pass can push a new loop onto the queue and rerun all managed loop passes just on that loop. In the past I’ve seen efforts to consolidate multiple loop passes into the same LoopPassManager, but I don’t really understand the benefit other than reducing the one-time overhead of instantiating multiple pass managers and some minor IR access locality improvements. Maybe Chandler’s work will help with the overhead of instantiating pass managers.

Maybe, but currently I don't really understand the ideal state for loop pass pipeline formation. Maybe I'll sit down with you, Owen, and/or others that are more deeply involved in the loop passes to get a good design in my head. But we're not quite there yet. =]

> Anyway, I see no reason that the vectorizer shouldn’t run in its own loop pass manager.

It turns out this is all moot: the loop vectorizer *already* doesn't run as part of these loop passes. It runs on its own during the late optimization phase. If it is a loop pass at all, I would just suggest making it a function pass and being done with it. Then there will be no problems here. Does that make sense to you guys as well?

> I’m sure we could be more aggressive than we are currently at O3. If the vectorizer unroller’s conditions are too strict, we could consider calling the standard partial unroller only on the loops that didn’t get vectorized.

I think we can be more aggressive across the board.
I've got a pretty good conservative benchmark suite and rig for increasing the aggressiveness of these kinds of optimizations (it is very sensitive to code size and consists of macro benchmarks that aren't usually bottlenecked on a single loop). I'm going to try some basic relaxing of the unrolling thresholds in the loop vectorizer to see if there is at least an *obvious* better baseline level. Then we can see if there is more detailed tuning to do.

However, I'd like to do that after the if-conversion issue is fixed if possible. I have bandwidth to do the evaluation of thresholds, but not really to hack on the if-conversion stuff. Any guesses if you guys will have time to look at that aspect?
Arnold Schwaighofer
2014-Jan-28 01:22 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
In r200270 I added support to unroll conditional stores in the loop vectorizer. It is currently off pending further benchmarking and can be enabled with "-mllvm -vectorize-num-stores-pred=1”. Furthermore, I added a heuristic to unroll until load/store ports are saturated, “-mllvm -enable-loadstore-runtime-unroll”, instead of the pure size-based heuristic. Those two, together with a patch that slightly changes the register heuristic, and libquantum’s three hot loops will unroll and goodness will ensue (at least for libquantum).

commit 6b908b8b1084c97238cc642a3404a4285c21286f
Author: Arnold Schwaighofer <aschwaighofer at apple.com>
Date:   Mon Jan 27 13:21:55 2014 -0800

    Subtract one for loop induction variable. It is unlikely to be unrolled.

diff --git a/lib/Transforms/Vectorize/LoopVectorize.cpp b/lib/Transforms/Vectorize/LoopVectorize.cpp
index 7867495..978c5a1 100644
--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5142,8 +5142,8 @@ LoopVectorizationCostModel::selectUnrollFactor(bool OptForSize,
   // fit without causing spills. All of this is rounded down if necessary to be
   // a power of two. We want power of two unroll factors to simplify any
   // addressing operations or alignment considerations.
-  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
-                              R.MaxLocalUsers);
+  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /
+                              (R.MaxLocalUsers - 1));

On Jan 21, 2014, at 11:46 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
> On Jan 21, 2014, at 6:18 AM, Diego Novillo <dnovillo at google.com> wrote:
>
>> On 16/01/2014, 23:47, Andrew Trick wrote:
>>>
>>> On Jan 15, 2014, at 4:13 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>>> Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it's not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.
>>>
>>> Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The “vector unroller” has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler’s dependence analysis may be too conservative in some cases.
>>
>> In addition to tuning the cost model, I found that the vectorizer does not even get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether to unroll it.
>
> I assume the other two loops are quantum_cnot's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00054> and quantum_toffoli's <http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00082>.
>
> The problem for the unroller in the loop vectorizer is that it wants to if-convert those loops. The conditional store prevents if-conversion because we can’t introduce a store on a path where there was none before: <http://llvm.org/docs/Atomics.html#optimization-outside-atomic>.
>
>   for (…)
>     if (A[i] & mask)
>       A[i] = val
>
> If we wanted the unroller in the vectorizer to handle such loops, we would have to teach it to leave the store behind an if:
>
>   for (…)
>     if (A[i] & mask)
>       A[i] = val
>
>   =>
>
>   for (… i += 2) {
>     pred<0,1> = A[i:i+1] & mask<0,1>
>     val<0,1>  = ...
>     if (pred<0>)
>       A[i] = val<0>
>     if (pred<1>)
>       A[i+1] = val<1>
>   }
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
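[Editor's sketch of the patched unroll-factor expression, to make the effect of the "- 1" terms concrete. The helper and the guard clauses are illustrative additions; they reproduce neither LLVM's PowerOf2Floor nor the surrounding clamping logic in selectUnrollFactor.]

```c
/* Largest power of two <= x (0 for x == 0); stand-in for PowerOf2Floor. */
unsigned power_of_2_floor(unsigned x) {
  unsigned p = 1;
  while (p <= x / 2)
    p *= 2;
  return x ? p : 0;
}

/* Sketch of the patched computation: divide the registers left after
 * loop-invariant values by the per-iteration register demand, with one
 * register set aside in both terms for the induction variable (which
 * is unlikely to be unrolled), then round down to a power of two. */
unsigned select_unroll_factor(unsigned target_regs,
                              unsigned invariant_regs,
                              unsigned max_local_users) {
  if (max_local_users <= 1 || target_regs <= invariant_regs + 1)
    return 1;                     /* guard: avoid division by zero/underflow */
  unsigned uf = power_of_2_floor((target_regs - invariant_regs - 1) /
                                 (max_local_users - 1));
  return uf ? uf : 1;             /* never unroll by less than 1 */
}
```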