Michael Zolotukhin
2015-Jan-23 20:05 UTC
[LLVMdev] [RFC] Heuristic for complete loop unrolling
Hi devs,

Recently I came across an interesting test case that LLVM failed to optimize well. The test does some image processing; as part of it, it traverses all the pixels and computes a value based on the adjacent pixels. The hot part looks like this:

    for (y = 0..height) {
      for (x = 0..width) {
        val = 0
        for (j = 0..5) {
          for (i = 0..5) {
            val += img[x+i,y+j] * weight[i,j]
          }
        }
      }
    }

And 'weight' is just a constant matrix with some coefficients.

If we unroll the two internal loops (each with trip count 5), then we can replace weight[i,j] with concrete constant values. In this particular case, many of the coefficients are actually 0 or 1, which enables huge code simplifications later on. But currently we unroll only the innermost loop, because unrolling both of them would exceed the threshold.

When deciding whether to unroll or not, we currently look only at the instruction count of the loop. My proposal is, on top of that, to check whether we can enable any later optimizations by unrolling - in this case, by replacing a load with a constant. Similar to what we do in the inlining heuristics, we can estimate how many instructions would potentially be eliminated after unrolling and adjust our threshold by this value.

I can imagine that it might also be useful for computations involving sparse constant matrices (like the identity matrix).

The attached patch (complete-unroll.patch) implements this feature, and with it we handle the original test case well.

Does it look good? Of course, any ideas, suggestions and other feedback are welcome!

Thanks,
Michael
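To make the effect concrete, here is a minimal C++ sketch of the transformation being described. It uses a hypothetical 3x3 kernel (smaller than the 5x5 in the message, purely for illustration) whose entries are all 0 or 1, and shows by hand what the optimizer could produce after completely unrolling both weight loops and constant-folding the weight loads:

```cpp
#include <cassert>

// Hypothetical 3x3 weight matrix; most entries are 0 or 1,
// like the sparse constant matrices described in the message.
static const int weight[3][3] = {
    {0, 1, 0},
    {1, 1, 1},
    {0, 1, 0},
};

// Straightforward loop form: every iteration loads weight[j][i]
// and performs a multiply, even when the coefficient is 0 or 1.
int convolve_loop(const int img[5][5], int x, int y) {
    int val = 0;
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 3; ++i)
            val += img[y + j][x + i] * weight[j][i];
    return val;
}

// What complete unrolling plus constant folding can produce:
// multiplies by 0 vanish entirely, multiplies by 1 become plain
// additions, and all the weight loads disappear.
int convolve_unrolled(const int img[5][5], int x, int y) {
    return img[y][x + 1]
         + img[y + 1][x] + img[y + 1][x + 1] + img[y + 1][x + 2]
         + img[y + 2][x + 1];
}
```

The two functions compute the same value; the unrolled form simply has far fewer instructions, which is the payoff the proposed heuristic tries to anticipate.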
Hi Michael,

This looks very much like the inner loops from convolve() in Geekbench... ;)

Interestingly, if we reroll that loop nest, the number of instructions inside goes down by a factor of 3, so we do decide to unroll them all. But that is just luck / a hack. I completely agree that our current unrolling heuristics don't take the right things into account (our inlining heuristics are similar!).

When thinking about this previously, I've been worried that there's a slippery slope where you could end up implementing half of the rest of the compiler's analysis passes in the unroller/inliner to check for opportunities. Do you have any thoughts on that? How do we keep it under control?

cc'ing Kevin, who has been doing a lot of work recently on the loop unroller.

Cheers,

James

On Fri Jan 23 2015 at 8:52:30 PM Michael Zolotukhin <mzolotukhin at apple.com> wrote:
> [full quote of the original message trimmed]
Hi Michael,

I’m unsure about other targets, but for the ones I work on, converting register-indexed loads into constant-indexed ones is a huge win, even in situations where the load cannot be constant-folded. Do you think there is any reasonable way to incorporate that into this change? Maybe a TTI hook?

—Owen

> On Jan 23, 2015, at 12:05 PM, Michael Zolotukhin <mzolotukhin at apple.com> wrote:
> [full quote of the original message trimmed]
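Owen's distinction can be illustrated with a small, hypothetical C++ sketch. Neither function's loads can be constant-folded (the table contents are unknown at compile time), but after complete unrolling every index becomes a compile-time constant, so each load can use a constant-offset addressing mode — which, as he says, is a win on some targets by itself:

```cpp
// Register-indexed form: the address of table[i] is computed from a
// loop-carried induction variable, so each load uses a register index.
int sum_indexed(const int table[4]) {
    int s = 0;
    for (int i = 0; i < 4; ++i)
        s += table[i];
    return s;
}

// After complete unrolling, every index is a literal constant; the
// values loaded are still unknown, but the addressing is now constant.
int sum_unrolled(const int table[4]) {
    return table[0] + table[1] + table[2] + table[3];
}
```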
Michael Zolotukhin
2015-Jan-23 22:20 UTC
[LLVMdev] [RFC] Heuristic for complete loop unrolling
> On Jan 23, 2015, at 1:27 PM, James Molloy <james at jamesmolloy.co.uk> wrote:
> Hi Michael,

Hi James,

> This looks very much like the inner loops from convolve() in Geekbench... ;)

Yes, the original example was from Geekbench, but I wasn’t sure if I could copy it :)

> Interestingly if we reroll that loop nest, the number of instructions inside goes down by a factor of 3 so we do decide to inline them all. But that is just luck / a hack. I completely agree that our current unrolling heuristics don't take the right things into account (our inlining heuristics are similar!).
>
> When thinking about this previously, I've been worried that there's a slippery slope where you could end up implementing half of the rest of the compiler's analysis passes in the unroller/inliner to check for opportunities. Do you have any thoughts on that? How do we keep it under control?

That’s a valid point. Theoretically, I think the best solution here would be to run the remaining passes in a ‘sandbox’ mode and decide what to do based on what that run shows. But it’s hard to tell when to stop in these sandbox runs: should we just run const-prop + InstCombine? Should we try other loop optimizations after that? Should we run all passes including the backend (e.g. unrolling has a huge impact on scheduling, so it could be really interesting)? And in any case, it would be quite expensive in compile time.

So right now it seems to me that the best solution is to model future optimizations with (hopefully simple) heuristics. I don’t think we should worry much that the code resembles const-prop, or another simple pass. However, if we find ourselves reimplementing the vectorizer, or, say, SROA, inside our heuristic, then probably something is wrong. As I said, we should either model such passes with simple heuristics, or think about interfaces in those passes that would allow dry runs.

E.g. another possible heuristic for the unroller would be the following: if after unrolling all loads/stores are consecutive, then it’s beneficial to unroll, because the code will probably be vectorized by SLP later. That’s a simple heuristic, and in some cases it will give the wrong advice. We can either accept that, or add an interface to SLP so that it can answer whether it could vectorize a given instruction sequence or not. That would be the second way: more accurate, but more expensive. In some cases the first approach would be acceptable, in others the second, and I think that should be decided on a case-by-case basis.

Michael

> [remainder of the quoted message trimmed]
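The consecutive-loads heuristic Michael sketches could look roughly like the following. This is an illustration only — the name and interface are hypothetical, not part of LLVM's unroller. The idea: after simulating complete unrolling, the constant byte offset of each load is known; if the offsets form one consecutive run, the SLP vectorizer is likely to combine them, so unrolling looks profitable:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the constant byte offsets the unrolled
// loads would access and the element size, report whether they form
// a single consecutive run (the cheap stand-in for "SLP would likely
// vectorize this sequence").
bool loadsBecomeConsecutive(std::vector<size_t> offsets, size_t elemSize) {
    if (offsets.size() < 2)
        return false;  // nothing to vectorize
    std::sort(offsets.begin(), offsets.end());
    for (size_t k = 1; k < offsets.size(); ++k)
        if (offsets[k] - offsets[k - 1] != elemSize)
            return false;  // a gap or overlap breaks the run
    return true;
}
```

As the message notes, this cheap check can give wrong advice; the more accurate (and more expensive) alternative is asking the SLP vectorizer itself via a dry-run interface.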
Michael Zolotukhin
2015-Jan-23 22:26 UTC
[LLVMdev] [RFC] Heuristic for complete loop unrolling
> On Jan 23, 2015, at 1:38 PM, Owen Anderson <resistor at mac.com> wrote:
> Hi Michael,
> I’m unsure about other targets, but for the ones I work on converting register-indexed loads into constant-indexed ones is a huge win, even in situations where the load cannot be constant folded. Do you think there is any reasonable way to incorporate that into this change? Maybe a TTI hook?

Hi Owen,

It’s not difficult to incorporate such customization into the patch. Actually, right now there is a magic constant in the patch which gives a weight to every potentially removed load instruction. It might make sense to make this constant target-dependent. And if we do that, we can also add a weight for loads that become constant-indexed (not necessarily constant-folded).

Michael

> [remainder of the quoted message trimmed]
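The bonus computation Michael describes might be sketched as follows. The structure, names, and weight values here are illustrative assumptions, not the constants used in the actual patch: the unrolling threshold is raised by a weight for every load that folds to a constant and, per Owen's extension, by a smaller weight for loads that merely gain a constant index:

```cpp
// Hypothetical per-target weights; a TTI hook could supply these.
struct UnrollBonusWeights {
    unsigned perFoldedLoad;        // load replaced outright by a constant
    unsigned perConstIndexedLoad;  // load that merely gains a constant index
};

// Raise the complete-unrolling threshold in proportion to the
// simplifications the unrolled body is expected to enable.
unsigned adjustedThreshold(unsigned baseThreshold,
                           unsigned foldedLoads,
                           unsigned constIndexedLoads,
                           const UnrollBonusWeights &w) {
    return baseThreshold + foldedLoads * w.perFoldedLoad +
           constIndexedLoads * w.perConstIndexedLoad;
}
```

For the 5x5 weight matrix in the original example, 25 foldable loads would raise the threshold substantially, letting both inner loops unroll where the raw instruction count alone would not.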
On Fri, Jan 23, 2015 at 8:05 PM, Michael Zolotukhin <mzolotukhin at apple.com> wrote:
> [full quote of the original message trimmed]

This is just a naive convolution algorithm implementation. Either the author doesn't care about performance, or the author is expecting the compiler to recognize the convolution and perform high-level loop optimizations.

I'm not a domain expert on loop optimization, but convolution is an operation of such *enormous* importance (FIR filtering, image processing, PDE solving, etc.) that I would expect this to be well researched already. I wouldn't expect a C compiler to have this optimization, but I would expect a Fortran compiler to have it. A quick search of the literature shows results well into the last millennium about optimizing this particular kind of loop (which Fortran calls a "stencil").

I actually think the general problem this thread has been talking about (how to determine whether a particular optimization is going to enable further optimizations) is very interesting in general, but I would take this particular example with a grain of salt, because it is a very specific, stylized computation.

-- Sean Silva
----- Original Message -----
> From: "Sean Silva" <chisophugis at gmail.com>
> Sent: Saturday, January 24, 2015 7:02:35 AM
> Subject: Re: [LLVMdev] [RFC] Heuristic for complete loop unrolling
>
> I'm not a domain expert on loop optimization, but convolution is an operation of such *enormous* importance (FIR filtering, image processing, PDE solving, etc.) that I would expect this to be well-researched already. I wouldn't expect a C compiler to have this optimization, but I would expect a fortran compiler to have it.

I strongly feel that we should aim for a convergence in these two sets of expectations. Especially given the literature in this area, it seems reasonable to investigate adding this kind of capability.

-Hal

> [remainder of the quoted message trimmed]

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
> On Jan 24, 2015, at 5:02 AM, Sean Silva <chisophugis at gmail.com> wrote:
> I actually think the general problem that this thread has been talking about (how to determine if a particular optimization is going to enable further optimizations) is very interesting in general, but I would take this particular example with a grain of salt because is a very specific stylized computation.

I see variations of this optimization come up in many contexts in graphics programming, not just convolutions. It's very common in code that operates on 4x4 matrices, which come up frequently in vertex shaders. Convolutions and stencils are particularly interesting in that there are *even more* optimizations you can do for them, typically with the goal of improving cache locality. But what Michael is proposing is useful for other classes of programs as well.

-Owen
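Owen's 4x4-matrix case can be sketched the same way as the convolution. The matrix below is a typical constant translation matrix — apart from the translation column, every entry is 0 or 1 — so after unrolling and constant-folding, the 16 multiplies and 12 adds of the loop form collapse to a handful of operations (the specific matrix is an illustrative assumption, not from the thread):

```cpp
// A constant 4x4 translation matrix: mostly 0s and 1s, much like the
// sparse weight matrix earlier in the thread.
static const float M[4][4] = {
    {1, 0, 0, 2},
    {0, 1, 0, 3},
    {0, 0, 1, 4},
    {0, 0, 0, 1},
};

// Loop form: 16 multiplies and 12 adds per transformed vector.
void xform_loop(const float v[4], float out[4]) {
    for (int r = 0; r < 4; ++r) {
        out[r] = 0;
        for (int c = 0; c < 4; ++c)
            out[r] += M[r][c] * v[c];
    }
}

// What unrolling plus constant folding of M can produce: three
// multiply-adds and a copy.
void xform_folded(const float v[4], float out[4]) {
    out[0] = v[0] + 2 * v[3];
    out[1] = v[1] + 3 * v[3];
    out[2] = v[2] + 4 * v[3];
    out[3] = v[3];
}
```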
Michael Zolotukhin
2015-Jan-24 20:26 UTC
[LLVMdev] [RFC] Heuristic for complete loop unrolling
> On Jan 24, 2015, at 5:02 AM, Sean Silva <chisophugis at gmail.com> wrote:
> I actually think the general problem that this thread has been talking about (how to determine if a particular optimization is going to enable further optimizations) is very interesting in general, but I would take this particular example with a grain of salt because is a very specific stylized computation.

Hi Sean,

I took this example as the most obvious one, but it’s not the only one. E.g. with the patch, SPEC2006/458.sjeng also gains 2% on x86. It’s a program that plays chess, and one of the subtasks of its algorithm is to find out whether a position is attacked or not. I haven’t investigated this gain thoroughly, but I saw a lot of constant arrays in the source code, so I believe this optimization just fired up there.

And even the original example is not an artificial one - as James has already mentioned, it’s based on Geekbench/sharpen and Geekbench/blur (they use the same implementation of the convolution algorithm). These two tests might gain 90% and 5% in performance, respectively, from such an optimization. And then there are also the vertex computations mentioned by Owen, which would also benefit from it.

I also have a feeling that this is a targeted approach, but since it helps a lot of applications from different areas, I believe it’s worthwhile. I’d be happy to hear about other viable alternatives, but currently this one looks optimal to me, since it’s very cheap to implement (and thus to maintain), doesn’t add much to compile time, and gives the expected results.

Michael

> [remainder of the quoted message trimmed]