Mehdi Amini via llvm-dev
2017-Jan-31 00:59 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
> On Jan 30, 2017, at 4:56 PM, Dehao Chen <dehao at google.com> wrote:
>
> On Mon, Jan 30, 2017 at 3:56 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 30, 2017 at 3:51 PM Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> On Jan 30, 2017, at 10:49 AM, Dehao Chen via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Currently, the loop full unroller shares the same default threshold as the loop dynamic unroller and partial unroller. This seems conservative because, unlike dynamic/partial unrolling, full unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for the loop full unroller. This will change the codegen of several SPEC CPU benchmarks:
>>
>> Code size:
>> 447.dealII    0.50%
>> 453.povray    0.42%
>> 433.milc      0.20%
>> 445.gobmk     0.32%
>> 403.gcc       0.05%
>> 464.h264ref   3.62%
>>
>> Compile time:
>> 447.dealII    0.22%
>> 453.povray   -0.16%
>> 433.milc      0.09%
>> 445.gobmk    -2.43%
>> 403.gcc       0.06%
>> 464.h264ref   3.21%
>>
>> Performance (on Intel Sandy Bridge):
>> 447.dealII   +0.07%
>> 453.povray   +1.79%
>> 433.milc     +1.02%
>> 445.gobmk    +0.56%
>> 403.gcc      -0.16%
>> 464.h264ref  -0.41%
>
> Can you clarify how to read these numbers? (I’m usually using +xx% to indicate a slowdown; it seems you’re doing the opposite?)
>
> As this is comparing SPEC scores instead of run time, +xx% here means speedup and -xx% means slowdown.
>
> So considering 464.h264ref, does it mean it is 3.6% slower to compile, gets 3.2% larger, and is 0.4% slower?
>
> That is correct. The 0.4% slowdown is in the run-to-run noise range.

Ok, thanks for the clarifications.
What about the noise on the improvements? How reliable are your other numbers on this aspect?

> Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path.)
>
> I would even wire the *unrolling* the other way: just suppress unrolling in cold paths to save binary size. Rolled loops seem like a generally good thing in cold code unless they are having some larger impact (i.e., the loop itself is more expensive than the unrolled form).
>
> Agree that we could suppress unrolling in the cold path to save code size. But that's orthogonal to the proposal here. This proposal focuses on O2 performance: shall we have a different (higher) full unroll threshold than dynamic/partial unroll?

I agree that this is (to some extent) orthogonal, and it makes sense to me to differentiate the threshold for full unroll and the dynamic/partial case.

Thanks,

—
Mehdi

> We can have a separate patch to further boost the threshold for hot loops and suppress unrolling for cold loops. One concern is that in order to check if a loop is hot/cold, we will need BFI for the loop pass. In the legacy loop pass manager, this will insert a function pass in the middle of a series of loop passes.
>
> Dehao
>
> Thanks,
>
> —
> Mehdi
>
>> Looks like the change has overall positive performance impact with very small code size/compile time overhead. Now the question is shall we make this change default in O2, or shall we leave it in O3? We would like to have more input from the community to make the decision.
>>
>> Thanks,
>>
>> Dehao
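For concreteness, the split being proposed can be sketched roughly as follows. This is an illustrative sketch only, assuming a simple cost-budget model: the struct and field names and the 150/300 defaults are drawn from the numbers discussed in this thread, not the exact code in D28368.

// Minimal sketch (not the actual LLVM code) of the idea in D28368: keep the
// existing threshold for dynamic/partial unrolling, but give full unrolling
// its own, larger budget. Names and defaults are illustrative assumptions.
#include <cstdint>

struct UnrollingPreferences {
  unsigned Threshold = 150;           // shared budget for dynamic/partial unrolling
  unsigned FullUnrollThreshold = 300; // proposed separate, larger budget for full unrolling
};

// Full unrolling is only possible when the trip count is a compile-time
// constant; it is then allowed when the fully unrolled cost fits the larger budget.
static bool shouldFullyUnroll(uint64_t ConstantTripCount, uint64_t LoopCost,
                              const UnrollingPreferences &UP) {
  if (ConstantTripCount == 0) // trip count unknown at compile time
    return false;
  return ConstantTripCount * LoopCost <= UP.FullUnrollThreshold;
}

// Dynamic/partial unrolling keeps the original, smaller budget, so the change
// does not grow loop bodies that have to stay resident in the LSD/ICache.
static bool fitsPartialUnrollBudget(uint64_t UnrolledCost,
                                    const UnrollingPreferences &UP) {
  return UnrolledCost <= UP.Threshold;
}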
Chandler Carruth via llvm-dev
2017-Jan-31 01:03 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
The general direction here seems fine (we should re-evaluate these kinds of thresholds from time to time).

The other high-level question I have here is: why 2x? Why not 4x? I think it would be good to show some data from a reasonable spread and where a reasonable point in the curve is to draw the line.

I'm also mildly worried about basing something like this on SPEC, which involves relatively small programs. It might be good to ask others to provide data as well on code size / performance tradeoffs for larger applications (Chrome, for example). And it might be worth writing down the methodology used to set the thresholds so we start gaining some consistency on this front.

On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
> On Jan 30, 2017, at 4:56 PM, Dehao Chen <dehao at google.com> wrote:
>
> On Mon, Jan 30, 2017 at 3:56 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 30, 2017 at 3:51 PM Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> On Jan 30, 2017, at 10:49 AM, Dehao Chen via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Currently, the loop full unroller shares the same default threshold as the loop dynamic unroller and partial unroller. This seems conservative because, unlike dynamic/partial unrolling, full unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for the loop full unroller. This will change the codegen of several SPEC CPU benchmarks:
>
> Code size:
> 447.dealII    0.50%
> 453.povray    0.42%
> 433.milc      0.20%
> 445.gobmk     0.32%
> 403.gcc       0.05%
> 464.h264ref   3.62%
>
> Compile time:
> 447.dealII    0.22%
> 453.povray   -0.16%
> 433.milc      0.09%
> 445.gobmk    -2.43%
> 403.gcc       0.06%
> 464.h264ref   3.21%
>
> Performance (on Intel Sandy Bridge):
> 447.dealII   +0.07%
> 453.povray   +1.79%
> 433.milc     +1.02%
> 445.gobmk    +0.56%
> 403.gcc      -0.16%
> 464.h264ref  -0.41%
>
> Can you clarify how to read these numbers? (I’m usually using +xx% to indicate a slowdown; it seems you’re doing the opposite?)
>
> As this is comparing SPEC scores instead of run time, +xx% here means speedup and -xx% means slowdown.
>
> So considering 464.h264ref, does it mean it is 3.6% slower to compile, gets 3.2% larger, and is 0.4% slower?
>
> That is correct. The 0.4% slowdown is in the run-to-run noise range.
>
> Ok, thanks for the clarifications.
> What about the noise on the improvements? How reliable are your other numbers on this aspect?
>
> Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path.)
>
> I would even wire the *unrolling* the other way: just suppress unrolling in cold paths to save binary size. Rolled loops seem like a generally good thing in cold code unless they are having some larger impact (i.e., the loop itself is more expensive than the unrolled form).
>
> Agree that we could suppress unrolling in the cold path to save code size. But that's orthogonal to the proposal here. This proposal focuses on O2 performance: shall we have a different (higher) full unroll threshold than dynamic/partial unroll?
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me to differentiate the threshold for full unroll and the dynamic/partial case.
>
> Thanks,
>
> —
> Mehdi
>
> We can have a separate patch to further boost the threshold for hot loops and suppress unrolling for cold loops. One concern is that in order to check if a loop is hot/cold, we will need BFI for the loop pass. In the legacy loop pass manager, this will insert a function pass in the middle of a series of loop passes.
>
> Dehao
>
> Thanks,
>
> —
> Mehdi
>
> Looks like the change has overall positive performance impact with very small code size/compile time overhead. Now the question is shall we make this change default in O2, or shall we leave it in O3? We would like to have more input from the community to make the decision.
>
> Thanks,
>
> Dehao
Chandler Carruth via llvm-dev
2017-Jan-31 01:06 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
> Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path.)
>
> I would even wire the *unrolling* the other way: just suppress unrolling in cold paths to save binary size. Rolled loops seem like a generally good thing in cold code unless they are having some larger impact (i.e., the loop itself is more expensive than the unrolled form).
>
> Agree that we could suppress unrolling in the cold path to save code size. But that's orthogonal to the proposal here. This proposal focuses on O2 performance: shall we have a different (higher) full unroll threshold than dynamic/partial unroll?
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me to differentiate the threshold for full unroll and the dynamic/partial case.

There is one issue that makes these not orthogonal.

If even *static* profile hints can reduce some of the code size increase caused by higher unrolling thresholds in non-cold code, we should factor that into the tradeoff when picking where the threshold goes.

However, getting PGO into the full unroller is currently challenging outside of the new pass manager. We already have some unfortunate hacks around this in LoopUnswitch that are making the port of it to the new PM more annoying.
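To make the interaction above concrete, here is a rough sketch of how an unroller could pick its full-unroll budget once BlockFrequencyInfo/ProfileSummaryInfo are available to it (which, as noted, is the awkward part under the legacy loop pass manager). The helper name, the exact ProfileSummaryInfo query spellings, and the 0x/2x factors are assumptions for illustration, not the patch under review or existing unroller policy.

// Rough sketch only: scale the full-unroll budget by profile hotness.
// The PSI query spellings and the 0x/2x choices are illustrative assumptions.
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"

using namespace llvm;

static unsigned pickFullUnrollThreshold(const Loop &L, ProfileSummaryInfo *PSI,
                                        BlockFrequencyInfo *BFI,
                                        unsigned DefaultThreshold) {
  // No profile available: fall back to the flat O2 default being debated.
  if (!PSI || !BFI || !PSI->hasProfileSummary())
    return DefaultThreshold;

  // Cold loop: keep it rolled and save the binary size.
  if (PSI->isColdBlock(L.getHeader(), BFI))
    return 0;

  // Hot loop: optionally spend extra code size; the factor is arbitrary here.
  if (PSI->isHotBlock(L.getHeader(), BFI))
    return 2 * DefaultThreshold;

  return DefaultThreshold;
}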
Dehao Chen via llvm-dev
2017-Jan-31 23:20 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
Re-collected the data at trunk head, now with stddev and more threshold data points:

Performance:
     stddev/mean     300      450      600      750
403       0.37%    0.11%    0.11%    0.09%    0.79%
433       0.14%    0.51%    0.25%   -0.63%   -0.29%
445       0.08%    0.48%    0.89%    0.12%    0.83%
447       0.16%    3.50%    2.69%    3.66%    3.59%
453       0.11%    1.49%    0.45%   -0.07%    0.78%
464       0.17%    0.75%    1.80%    1.86%    1.54%

Code size:
                     300      450      600      750
403                 0.56%    2.41%    2.74%    3.75%
433                 0.96%    2.84%    4.19%    4.87%
445                 2.16%    3.62%    4.48%    5.88%
447                 2.96%    5.09%    6.74%    8.89%
453                 0.94%    1.67%    2.73%    2.96%
464                 8.02%   13.50%   20.51%   26.59%

Compile time is proportional in these experiments and noisier, so I did not include it.

We see a >2% speedup on some Google-internal benchmarks when switching the threshold from 150 to 300.

Dehao

On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
>> Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path.)
>>
>> I would even wire the *unrolling* the other way: just suppress unrolling in cold paths to save binary size. Rolled loops seem like a generally good thing in cold code unless they are having some larger impact (i.e., the loop itself is more expensive than the unrolled form).
>>
>> Agree that we could suppress unrolling in the cold path to save code size. But that's orthogonal to the proposal here. This proposal focuses on O2 performance: shall we have a different (higher) full unroll threshold than dynamic/partial unroll?
>>
>> I agree that this is (to some extent) orthogonal, and it makes sense to me to differentiate the threshold for full unroll and the dynamic/partial case.
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints can reduce some of the code size increase caused by higher unrolling thresholds in non-cold code, we should factor that into the tradeoff when picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging outside of the new pass manager. We already have some unfortunate hacks around this in LoopUnswitch that are making the port of it to the new PM more annoying.