thr3ads.net - llvm dev - [llvm-dev] (RFC) Adjusting default loop fully unroll threshold [Feb 2017]

Chandler Carruth via llvm-dev

2017-Feb-02 00:47 UTC

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

I had suggested having size metrics from somewhat larger applications such
as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
binaries with rough size brackets?

On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com> wrote:
> With the new data points, any comments on whether this can justify setting
> the full unroll threshold to 300 (or any other number) in O2? I can collect
> more data points if it's helpful.
>
> Thanks,
> Dehao
>
> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com> wrote:
>
> I re-collected the data from trunk head, with stddev data and more threshold
> data points attached:
>
> Performance:
>
> Benchmark  stddev/mean  300     450     600     750
> 403        0.37%        0.11%   0.11%   0.09%   0.79%
> 433        0.14%        0.51%   0.25%   -0.63%  -0.29%
> 445        0.08%        0.48%   0.89%   0.12%   0.83%
> 447        0.16%        3.50%   2.69%   3.66%   3.59%
> 453        0.11%        1.49%   0.45%   -0.07%  0.78%
> 464        0.17%        0.75%   1.80%   1.86%   1.54%
>
> Code size:
>
> Benchmark  300     450     600     750
> 403        0.56%   2.41%   2.74%   3.75%
> 433        0.96%   2.84%   4.19%   4.87%
> 445        2.16%   3.62%   4.48%   5.88%
> 447        2.96%   5.09%   6.74%   8.89%
> 453        0.94%   1.67%   2.73%   2.96%
> 464        8.02%   13.50%  20.51%  26.59%
>
> Compile time is proportional in the experiments and more noisy, so I did
> not include it.
>
> We have >2% speedup on some Google internal benchmarks when switching the
> threshold from 150 to 300.
>
> Dehao
>
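One way to read the two tables together is to compare the average speedup against the average code-size growth at each threshold. The sketch below does that with the numbers copied from the tables. It assumes that positive performance figures are speedups, that both tables are relative to the same baseline build, and that a plain arithmetic mean over the six benchmarks is an acceptable summary; none of that is stated in the thread. The function names and the noise factor `k` are made up for illustration.

```python
# Figures copied from the tables in the message above (all in percent).
THRESHOLDS = (300, 450, 600, 750)

# stddev/mean column of the performance table, per benchmark.
NOISE = {403: 0.37, 433: 0.14, 445: 0.08, 447: 0.16, 453: 0.11, 464: 0.17}

# Performance change per threshold (assumed: positive means faster).
SPEEDUP = {
    403: (0.11, 0.11, 0.09, 0.79),
    433: (0.51, 0.25, -0.63, -0.29),
    445: (0.48, 0.89, 0.12, 0.83),
    447: (3.50, 2.69, 3.66, 3.59),
    453: (1.49, 0.45, -0.07, 0.78),
    464: (0.75, 1.80, 1.86, 1.54),
}

# Code-size growth per threshold.
SIZE_GROWTH = {
    403: (0.56, 2.41, 2.74, 3.75),
    433: (0.96, 2.84, 4.19, 4.87),
    445: (2.16, 3.62, 4.48, 5.88),
    447: (2.96, 5.09, 6.74, 8.89),
    453: (0.94, 1.67, 2.73, 2.96),
    464: (8.02, 13.50, 20.51, 26.59),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(threshold):
    """Return (mean speedup, mean size growth, speedup per unit of growth)."""
    i = THRESHOLDS.index(threshold)
    speed = mean(row[i] for row in SPEEDUP.values())
    size = mean(row[i] for row in SIZE_GROWTH.values())
    return speed, size, speed / size


def significant(benchmark, threshold, k=2.0):
    """True if the change exceeds k times the reported stddev/mean.

    The factor k=2.0 is an arbitrary illustrative cutoff, not from the thread.
    """
    i = THRESHOLDS.index(threshold)
    return abs(SPEEDUP[benchmark][i]) > k * NOISE[benchmark]


for t in THRESHOLDS:
    speed, size, ratio = summarize(t)
    print(f"{t}: speedup {speed:.2f}%, size growth {size:.2f}%, ratio {ratio:.2f}")
```

Under these assumptions the ratio of mean speedup to mean size growth is highest at 300 among the four thresholds measured, and several of the smaller per-benchmark changes fall within twice the reported stddev/mean.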
> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at google.com> wrote:
>
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>
>
> Another question is about PGO integration: is it already hooked there?
> Should we have a more aggressive threshold in a hot function? (Assuming
> we’re willing to spend some binary size there but not on the cold path).
>
>
> I would even wire the *unrolling* the other way: just suppress unrolling
> in cold paths to save binary size. Rolled loops seem like a generally good
> thing in cold code unless they are having some larger impact (i.e., the loop
> itself is more expensive than the unrolled form).
>
>
>
> Agree that we could suppress unrolling in cold paths to save code size. But
> that's orthogonal to the proposal here. This proposal focuses on O2
> performance: should we have a different (higher) threshold for full unrolling
> than for dynamic/partial unrolling?
>
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me
> to differentiate the threshold for full unroll and the dynamic/partial
> case.
>
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.
>
>
>
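The two ideas discussed above (a separate, higher threshold for full unrolling, and no unrolling at all in cold code) can be combined in a single budget function. The following is only a toy model of that policy, not LLVM's implementation: the names `Hotness`, `UnrollPolicy`, `unroll_budget` and `should_fully_unroll` are hypothetical, the cost estimate is a plain trip count times body cost, and the 150/300 defaults are simply the two values mentioned in the thread.

```python
from dataclasses import dataclass
from enum import Enum


class Hotness(Enum):
    COLD = "cold"
    UNKNOWN = "unknown"   # no profile information available
    HOT = "hot"


@dataclass(frozen=True)
class UnrollPolicy:
    # Hypothetical defaults: 150 is the existing threshold mentioned in the
    # thread, 300 is the value proposed for full unrolling.
    partial_threshold: int = 150
    full_threshold: int = 300


def unroll_budget(policy, hotness, full_unroll):
    """Cost budget for unrolling a loop, given its hotness and unroll kind."""
    if hotness is Hotness.COLD:
        return 0  # keep cold loops rolled to save binary size
    return policy.full_threshold if full_unroll else policy.partial_threshold


def should_fully_unroll(policy, hotness, trip_count, body_cost):
    """Decide on full unrolling with a naive cost: trip_count * body_cost."""
    budget = unroll_budget(policy, hotness, full_unroll=True)
    return budget > 0 and trip_count * body_cost <= budget


policy = UnrollPolicy()
print(should_fully_unroll(policy, Hotness.HOT, trip_count=8, body_cost=30))
print(should_fully_unroll(policy, Hotness.COLD, trip_count=8, body_cost=30))
```

In this model a loop with 8 iterations and a body cost of 30 fits the 300 budget when it is not cold, but would not fit a budget of 150. Making the budget depend on hotness is what couples the cold-path question to the choice of threshold.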

Xinliang David Li via llvm-dev

2017-Feb-02 00:57 UTC

clang, chrome, and some internal large apps are good candidates for size
metrics.

David

On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I had suggested having size metrics from somewhat larger applications such
> as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
> binaries with rough size brackets?

Mikhail Zolotukhin via llvm-dev

2017-Feb-02 02:08 UTC

> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> clang, chrome, and some internal large apps are good candidates for size
> metrics.

I'd also add the standard LLVM testsuite just because it's the suite
everyone in the community can use.

Michael
