thr3ads.net - llvm dev - [llvm-dev] (RFC) Adjusting default loop fully unroll threshold [Feb 2017]

Dehao Chen via llvm-dev

2017-Jan-31 23:20 UTC

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

I re-collected the data from trunk head, this time with stddev data and more
threshold data points:

Performance:

benchmark  stddev/mean     300     450     600     750
403              0.37%   0.11%   0.11%   0.09%   0.79%
433              0.14%   0.51%   0.25%  -0.63%  -0.29%
445              0.08%   0.48%   0.89%   0.12%   0.83%
447              0.16%   3.50%   2.69%   3.66%   3.59%
453              0.11%   1.49%   0.45%  -0.07%   0.78%
464              0.17%   0.75%   1.80%   1.86%   1.54%

Code size:

benchmark     300     450     600     750
403         0.56%   2.41%   2.74%   3.75%
433         0.96%   2.84%   4.19%   4.87%
445         2.16%   3.62%   4.48%   5.88%
447         2.96%   5.09%   6.74%   8.89%
453         0.94%   1.67%   2.73%   2.96%
464         8.02%  13.50%  20.51%  26.59%

Compile time is proportional in the experiments and noisier, so I did
not include it.

We see a >2% speedup on some Google-internal benchmarks when switching the
threshold from 150 to 300.
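
One way to read the two tables together is to check which performance deltas clear the measurement noise and to compare the average gain with the average size growth at each threshold. The sketch below is only illustrative: the numbers are transcribed from the tables in this message, while the cutoff of 2x stddev/mean, the use of a plain arithmetic mean, and the reading of positive performance values as speedups are assumptions made for the example, not something stated in the thread.

```python
# Values transcribed from the tables in this message (all in percent).
THRESHOLDS = [300, 450, 600, 750]

# benchmark -> (stddev/mean, [performance delta at each threshold])
PERF = {
    403: (0.37, [0.11, 0.11, 0.09, 0.79]),
    433: (0.14, [0.51, 0.25, -0.63, -0.29]),
    445: (0.08, [0.48, 0.89, 0.12, 0.83]),
    447: (0.16, [3.50, 2.69, 3.66, 3.59]),
    453: (0.11, [1.49, 0.45, -0.07, 0.78]),
    464: (0.17, [0.75, 1.80, 1.86, 1.54]),
}

# benchmark -> [code size increase at each threshold]
SIZE = {
    403: [0.56, 2.41, 2.74, 3.75],
    433: [0.96, 2.84, 4.19, 4.87],
    445: [2.16, 3.62, 4.48, 5.88],
    447: [2.96, 5.09, 6.74, 8.89],
    453: [0.94, 1.67, 2.73, 2.96],
    464: [8.02, 13.50, 20.51, 26.59],
}


def mean_delta(rows, threshold):
    """Arithmetic mean across benchmarks of the delta at one threshold."""
    i = THRESHOLDS.index(threshold)
    return sum(row[i] for row in rows) / len(rows)


def above_noise(threshold, k=2.0):
    """Benchmarks whose |delta| exceeds k * stddev/mean (k is an assumed cutoff)."""
    i = THRESHOLDS.index(threshold)
    return [b for b, (noise, deltas) in PERF.items() if abs(deltas[i]) > k * noise]


if __name__ == "__main__":
    perf_rows = [deltas for _, deltas in PERF.values()]
    size_rows = list(SIZE.values())
    for t in THRESHOLDS:
        print(
            f"threshold {t}: mean perf {mean_delta(perf_rows, t):+.2f}%, "
            f"mean size {mean_delta(size_rows, t):+.2f}%, "
            f"above noise: {above_noise(t)}"
        )
```

A different noise cutoff or a geometric mean would shift the details, so this is a reading aid rather than a verdict on which threshold to pick.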

Dehao

On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> Another question is about PGO integration: is it already hooked there?
>> Should we have a more aggressive threshold in a hot function? (Assuming
>> we’re willing to spend some binary size there but not on the cold path).
>>
>> I would even wire the *unrolling* the other way: just suppress unrolling
>> in cold paths to save binary size. Rolled loops seem like a generally good
>> thing in cold code unless they are having some larger impact (i.e., the loop
>> itself is more expensive than the unrolled form).
>>
>> Agree that we could suppress unrolling in cold paths to save code size.
>> But that's orthogonal to the proposal here. This proposal focuses on O2
>> performance: shall we have a different (higher) full unroll threshold than
>> dynamic/partial unroll.
>>
>> I agree that this is (to some extent) orthogonal, and it makes sense to
>> me to differentiate the threshold for full unroll and the dynamic/partial
>> case.
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.

Dehao Chen via llvm-dev

2017-Feb-02 00:33 UTC


With the new data points, any comments on whether this can justify setting
the full unroll threshold to 300 (or any other number) in O2? I can collect
more data points if that would help.

Thanks,
Dehao


Chandler Carruth via llvm-dev

2017-Feb-02 00:47 UTC


I had suggested collecting size metrics from somewhat larger applications, such
as Chrome, WebKit, or Firefox; clang itself; and maybe some of our internal
binaries, with rough size brackets?

