thr3ads.net - llvm dev - [llvm-dev] (RFC) Adjusting default loop fully unroll threshold [Feb 2017]

If this information is useful, please help other people find it:
Share via:

Hal Finkel via llvm-dev

2017-Feb-10 23:23 UTC

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

On 02/10/2017 05:21 PM, Dehao Chen wrote:> Thanks every for the comments.
>
> Do we have a decision here?
You're good to go as far as I'm concerned.

  -Hal
>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>>     Sorry if I missed it, but what machine/CPU are you using to
>>     collect the perf numbers?
>>
>>     I am concerned that what may be a win on a CPU that keeps a
>>     couple of hundred instructions in-flight and has many MB of
>>     caches will not hold for a small core.
>
>     In my experience, unrolling tends to help weaker cores even more
>     than stronger ones because it allows the instruction scheduler
>     more opportunities to hide latency. Obviously, instruction-cache
>     pressure is an important consideration, but the code size changes
>     here seems small.
>
>>
>>     Is the proposed change universal? Is there a way to undo it?
>
>     All of the unrolling thresholds should be target-adjustable using
>     the TTI::getUnrollingPreferences hook.
>
>      -Hal
>
>
>>
>>     On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev
>>     <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>>
>>         Ping... with the updated code size impact data, any more
>>         comments? Any more data that would be interesting to collect?
>>
>>         Thanks,
>>         Dehao
>>
>>         On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at
google.com
>>         <mailto:dehao at google.com>> wrote:
>>
>>             Here is the code size impact for clang, chrome and 24
>>             google internal benchmarks (name omited, 14 15 16 are
>>             encoding/decoding benchmarks similar as h264). There are
>>             2 columns, for threshold 300 and 450 respectively.
>>
>>             I also tested the llvm test suite. Changing the threshold
>>             to 300/450 does not affect code gen for any binary in the
>>             test suite.
>>
>>
>>
>>             	300 	450
>>             clang 	0.30% 	0.63%
>>             chrome 	0.00% 	0.00%
>>             1 	0.27% 	0.67%
>>             2 	0.44% 	0.93%
>>             3 	0.44% 	0.93%
>>             4 	0.26% 	0.53%
>>             5 	0.74% 	2.21%
>>             6 	0.74% 	2.21%
>>             7 	0.74% 	2.21%
>>             8 	0.46% 	1.05%
>>             9 	0.35% 	0.86%
>>             10 	0.35% 	0.86%
>>             11 	0.40% 	0.83%
>>             12 	0.32% 	0.65%
>>             13 	0.31% 	0.64%
>>             14 	4.52% 	8.23%
>>             15 	9.90% 	19.38%
>>             16 	9.90% 	19.38%
>>             17 	0.68% 	1.97%
>>             18 	0.21% 	0.48%
>>             19 	0.99% 	3.44%
>>             20 	0.19% 	0.46%
>>             21 	0.57% 	1.62%
>>             22 	0.37% 	1.05%
>>             23 	0.78% 	1.30%
>>             24 	0.51% 	1.54%
>>
>>
>>             On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via
>>             llvm-dev <llvm-dev at lists.llvm.org
>>             <mailto:llvm-dev at lists.llvm.org>> wrote:
>>
>>>                 On Feb 1, 2017, at 4:57 PM, Xinliang David Li via
>>>                 llvm-dev <llvm-dev at lists.llvm.org
>>>                 <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>
>>>                 clang, chrome, and some internal large apps are
good
>>>                 candidates for size metrics.
>>                 I'd also add the standard LLVM testsuite just
because
>>                 it's the suite everyone in the community can use.
>>
>>                 Michael
>>>
>>>                 David
>>>
>>>                 On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth
via
>>>                 llvm-dev <llvm-dev at lists.llvm.org
>>>                 <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>
>>>                     I had suggested having size metrics from
>>>                     somewhat larger applications such as Chrome,
>>>                     Webkit, or Firefox; clang itself; and maybe
some
>>>                     of our internal binaries with rough size
brackets?
>>>
>>>                     On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen
>>>                     <dehao at google.com <mailto:dehao at
google.com>> wrote:
>>>
>>>                         With the new data points, any comments on
>>>                         whether this can justify setting fully
>>>                         inline threshold to 300 (or any other
>>>                         number) in O2? I can collect more data
>>>                         points if it's helpful.
>>>
>>>                         Thanks,
>>>                         Dehao
>>>
>>>                         On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen
>>>                         <dehao at google.com <mailto:dehao at
google.com>>
>>>                         wrote:
>>>
>>>                             Recollected the data from trunk head
>>>                             with stddev data and more threshold
data
>>>                             points attached:
>>>
>>>                             Performance:
>>>
>>>                             	stddev/mean 	300 	450 	600 	750
>>>                             403 	0.37% 	0.11% 	0.11% 	0.09% 	0.79%
>>>                             433 	0.14% 	0.51% 	0.25% 	-0.63% 
-0.29%
>>>                             445 	0.08% 	0.48% 	0.89% 	0.12% 	0.83%
>>>                             447 	0.16% 	3.50% 	2.69% 	3.66% 	3.59%
>>>                             453 	0.11% 	1.49% 	0.45% 	-0.07% 	0.78%
>>>                             464 	0.17% 	0.75% 	1.80% 	1.86% 	1.54%
>>>
>>>
>>>                             Code size:
>>>
>>>                             	300 	450 	600 	750
>>>                             403 	0.56% 	2.41% 	2.74% 	3.75%
>>>                             433 	0.96% 	2.84% 	4.19% 	4.87%
>>>                             445 	2.16% 	3.62% 	4.48% 	5.88%
>>>                             447 	2.96% 	5.09% 	6.74% 	8.89%
>>>                             453 	0.94% 	1.67% 	2.73% 	2.96%
>>>                             464 	8.02% 	13.50% 	20.51% 	26.59%
>>>
>>>
>>>                             Compile time is proportional in the
>>>                             experiments and more noisy, so I did
not
>>>                             include it.
>>>
>>>                             We have >2% speedup on some google
>>>                             internal benchmarks when switching the
>>>                             threshold from 150 to 300.
>>>
>>>                             Dehao
>>>
>>>                             On Mon, Jan 30, 2017 at 5:06 PM,
>>>                             Chandler Carruth <chandlerc at
google.com
>>>                             <mailto:chandlerc at
google.com>> wrote:
>>>
>>>                                 On Mon, Jan 30, 2017 at 4:59 PM
>>>                                 Mehdi Amini <mehdi.amini at
apple.com
>>>                                 <mailto:mehdi.amini at
apple.com>> wrote:
>>>
>>>>
>>>>
>>>>                                             Another question is
>>>>                                             about PGO
integration:
>>>>                                             is it already
hooked
>>>>                                             there? Should we
have a
>>>>                                             more aggressive
>>>>                                             threshold in a hot
>>>>                                             function? (Assuming
>>>>                                             we’re willing to
spend
>>>>                                             some binary size
there
>>>>                                             but not on the cold
path).
>>>>
>>>>
>>>>                                         I would even wire the
>>>>                                         *unrolling* the other
way:
>>>>                                         just suppress unrolling
in
>>>>                                         cold paths to save
binary
>>>>                                         size. rolled loops seem
>>>>                                         like a generally good
thing
>>>>                                         in cold code unless
they
>>>>                                         are having some larger
>>>>                                         impact (IE, the loop
itself
>>>>                                         is more expensive than
the
>>>>                                         unrolled form).
>>>>
>>>>
>>>>
>>>>                                     Agree that we could
suppress
>>>>                                     unrolling in cold path to
save
>>>>                                     code size. But that's
>>>>                                     orthogonal with the propose
>>>>                                     here. This proposal focuses
on
>>>>                                     O2 performance: shall we
have
>>>>                                     different (higher) fully
unroll
>>>>                                     threshold than
dynamic/partial
>>>>                                     unroll.
>>>
>>>                                     I agree that this is (to some
>>>                                     extent) orthogonal, and it
makes
>>>                                     sense to me to differentiate
the
>>>                                     threshold for full unroll and
>>>                                     the dynamic/partial case.
>>>
>>>
>>>                                 There is one issue that makes these
>>>                                 not orthogonal.
>>>
>>>                                 If even *static* profile hints will
>>>                                 reduce some of the code size
>>>                                 increase caused by higher unrolling
>>>                                 thresholds for non-cold code, we
>>>                                 should factor that into the
tradeoff
>>>                                 in picking where the threshold
goes.
>>>
>>>                                 However, getting PGO into the full
>>>                                 unroller is currently challenging
>>>                                 outside of the new pass manager. We
>>>                                 already have some unfortunate hacks
>>>                                 around this in LoopUnswitch that
are
>>>                                 making the port of it to the new PM
>>>                                 more annoying.
>>>
>>>
>>>
>>>
>>>                     _______________________________________________
>>>                     LLVM Developers mailing list
>>>                     llvm-dev at lists.llvm.org
>>>                     <mailto:llvm-dev at lists.llvm.org>
>>>                    
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>                    
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>>
>>>
>>>                 _______________________________________________
>>>                 LLVM Developers mailing list
>>>                 llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>                
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>                
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>
>>
>>                 _______________________________________________
>>                 LLVM Developers mailing list
>>                 llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>                 http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>                
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>
>>
>>
>>
>>         _______________________________________________
>>         LLVM Developers mailing list
>>         llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>         http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>         <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>
>>
>>
>>
>>     _______________________________________________
>>     LLVM Developers mailing list
>>     llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170210/63eed20d/attachment-0001.html>

Dehao Chen via llvm-dev

2017-Feb-10 23:53 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thanks Hal, could you help approve https://reviews.llvm.org/D28368?

I'll hold off until early Tuesday in case other people have more concerns.

Thanks,
Dehao

On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>
> Thanks every for the comments.
>
> Do we have a decision here?
>
>
> You're good to go as far as I'm concerned.
>
>  -Hal
>
>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>>
>> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>>
>> Sorry if I missed it, but what machine/CPU are you using to collect the
>> perf numbers?
>>
>> I am concerned that what may be a win on a CPU that keeps a couple of
>> hundred instructions in-flight and has many MB of caches will not hold
for
>> a small core.
>>
>>
>> In my experience, unrolling tends to help weaker cores even more than
>> stronger ones because it allows the instruction scheduler more
>> opportunities to hide latency. Obviously, instruction-cache pressure is
an
>> important consideration, but the code size changes here seems small.
>>
>>
>> Is the proposed change universal? Is there a way to undo it?
>>
>>
>> All of the unrolling thresholds should be target-adjustable using the
>> TTI::getUnrollingPreferences hook.
>>
>>  -Hal
>>
>>
>>
>> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Ping... with the updated code size impact data, any more comments?
Any
>>> more data that would be interesting to collect?
>>>
>>> Thanks,
>>> Dehao
>>>
>>> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at
google.com> wrote:
>>>
>>>> Here is the code size impact for clang, chrome and 24 google
internal
>>>> benchmarks (name omited, 14 15 16 are encoding/decoding
benchmarks similar
>>>> as h264). There are 2 columns, for threshold 300 and 450
respectively.
>>>>
>>>> I also tested the llvm test suite. Changing the threshold to
300/450
>>>> does not affect code gen for any binary in the test suite.
>>>>
>>>>
>>>>
>>>> 300 450
>>>> clang 0.30% 0.63%
>>>> chrome 0.00% 0.00%
>>>> 1 0.27% 0.67%
>>>> 2 0.44% 0.93%
>>>> 3 0.44% 0.93%
>>>> 4 0.26% 0.53%
>>>> 5 0.74% 2.21%
>>>> 6 0.74% 2.21%
>>>> 7 0.74% 2.21%
>>>> 8 0.46% 1.05%
>>>> 9 0.35% 0.86%
>>>> 10 0.35% 0.86%
>>>> 11 0.40% 0.83%
>>>> 12 0.32% 0.65%
>>>> 13 0.31% 0.64%
>>>> 14 4.52% 8.23%
>>>> 15 9.90% 19.38%
>>>> 16 9.90% 19.38%
>>>> 17 0.68% 1.97%
>>>> 18 0.21% 0.48%
>>>> 19 0.99% 3.44%
>>>> 20 0.19% 0.46%
>>>> 21 0.57% 1.62%
>>>> 22 0.37% 1.05%
>>>> 23 0.78% 1.30%
>>>> 24 0.51% 1.54%
>>>>
>>>> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev
<
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev
<
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> clang, chrome, and some internal large apps are good
candidates for
>>>>> size metrics.
>>>>>
>>>>> I'd also add the standard LLVM testsuite just because
it's the suite
>>>>> everyone in the community can use.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> David
>>>>>
>>>>> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via
llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> I had suggested having size metrics from somewhat
larger applications
>>>>>> such as Chrome, Webkit, or Firefox; clang itself; and
maybe some of our
>>>>>> internal binaries with rough size brackets?
>>>>>>
>>>>>> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at
google.com> wrote:
>>>>>>
>>>>>>> With the new data points, any comments on whether
this can justify
>>>>>>> setting fully inline threshold to 300 (or any other
number) in O2? I can
>>>>>>> collect more data points if it's helpful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dehao
>>>>>>>
>>>>>>> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen
<dehao at google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Recollected the data from trunk head with stddev
data and more
>>>>>>> threshold data points attached:
>>>>>>>
>>>>>>> Performance:
>>>>>>>
>>>>>>> stddev/mean 300 450 600 750
>>>>>>> 403 0.37% 0.11% 0.11% 0.09% 0.79%
>>>>>>> 433 0.14% 0.51% 0.25% -0.63% -0.29%
>>>>>>> 445 0.08% 0.48% 0.89% 0.12% 0.83%
>>>>>>> 447 0.16% 3.50% 2.69% 3.66% 3.59%
>>>>>>> 453 0.11% 1.49% 0.45% -0.07% 0.78%
>>>>>>> 464 0.17% 0.75% 1.80% 1.86% 1.54%
>>>>>>> Code size:
>>>>>>>
>>>>>>> 300 450 600 750
>>>>>>> 403 0.56% 2.41% 2.74% 3.75%
>>>>>>> 433 0.96% 2.84% 4.19% 4.87%
>>>>>>> 445 2.16% 3.62% 4.48% 5.88%
>>>>>>> 447 2.96% 5.09% 6.74% 8.89%
>>>>>>> 453 0.94% 1.67% 2.73% 2.96%
>>>>>>> 464 8.02% 13.50% 20.51% 26.59%
>>>>>>> Compile time is proportional in the experiments and
more noisy, so I
>>>>>>> did not include it.
>>>>>>>
>>>>>>> We have >2% speedup on some google internal
benchmarks when
>>>>>>> switching the threshold from 150 to 300.
>>>>>>>
>>>>>>> Dehao
>>>>>>>
>>>>>>> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth
<
>>>>>>> chandlerc at google.com> wrote:
>>>>>>>
>>>>>>> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini
<mehdi.amini at apple.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Another question is about PGO integration: is it
already hooked
>>>>>>> there? Should we have a more aggressive threshold
in a hot function?
>>>>>>> (Assuming we’re willing to spend some binary size
there but not on the cold
>>>>>>> path).
>>>>>>>
>>>>>>>
>>>>>>> I would even wire the *unrolling* the other way:
just suppress
>>>>>>> unrolling in cold paths to save binary size. rolled
loops seem like a
>>>>>>> generally good thing in cold code unless they are
having some larger impact
>>>>>>> (IE, the loop itself is more expensive than the
unrolled form).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Agree that we could suppress unrolling in cold path
to save code
>>>>>>> size. But that's orthogonal with the propose
here. This proposal focuses on
>>>>>>> O2 performance: shall we have different (higher)
fully unroll threshold
>>>>>>> than dynamic/partial unroll.
>>>>>>>
>>>>>>>
>>>>>>> I agree that this is (to some extent) orthogonal,
and it makes sense
>>>>>>> to me to differentiate the threshold for full
unroll and the
>>>>>>> dynamic/partial case.
>>>>>>>
>>>>>>>
>>>>>>> There is one issue that makes these not orthogonal.
>>>>>>>
>>>>>>> If even *static* profile hints will reduce some of
the code size
>>>>>>> increase caused by higher unrolling thresholds for
non-cold code, we should
>>>>>>> factor that into the tradeoff in picking where the
threshold goes.
>>>>>>>
>>>>>>> However, getting PGO into the full unroller is
currently challenging
>>>>>>> outside of the new pass manager. We already have
some unfortunate hacks
>>>>>>> around this in LoopUnswitch that are making the
port of it to the new PM
>>>>>>> more annoying.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing listllvm-dev at
lists.llvm.orghttp://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170210/f93ca7ee/attachment-0001.html>

Sanjay Patel via llvm-dev

2017-Feb-12 16:34 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Since we can override the settings, I have no objections.

I still think it would be good to document here and in the review/commit
message which CPU model was used to acquire the experimental data. That
could be useful to anyone that comes along later and wants to reproduce
and/or compare to the original, motivating data.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170212/8dd45aea/attachment.html>

Dehao Chen via llvm-dev

2017-Feb-13 16:29 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thanks for the comment. The performance experiments were performed on Intel
Sandybridge. Updated this info to the patch description.

Dehao

On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> Since we can override the settings, I have no objections.
>
> I still think it would be good to document here and in the review/commit
> message which CPU model was used to acquire the experimental data. That
> could be useful to anyone that comes along later and wants to reproduce
> and/or compare to the original, motivating data.
>
> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com>
wrote:
>
>> Thanks Hal, could you help approve https://reviews.llvm.org/D28368?
>>
>> I'll hold off until early Tuesday in case other people have more
concerns.
>>
>> Thanks,
>> Dehao
>>
>> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>>
>>>
>>> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>>>
>>> Thanks every for the comments.
>>>
>>> Do we have a decision here?
>>>
>>>
>>> You're good to go as far as I'm concerned.
>>>
>>>  -Hal
>>>
>>>
>>> Dehao
>>>
>>> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at
anl.gov> wrote:
>>>
>>>>
>>>> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>>>>
>>>> Sorry if I missed it, but what machine/CPU are you using to
collect the
>>>> perf numbers?
>>>>
>>>> I am concerned that what may be a win on a CPU that keeps a
couple of
>>>> hundred instructions in-flight and has many MB of caches will
not hold for
>>>> a small core.
>>>>
>>>>
>>>> In my experience, unrolling tends to help weaker cores even
more than
>>>> stronger ones because it allows the instruction scheduler more
>>>> opportunities to hide latency. Obviously, instruction-cache
pressure is an
>>>> important consideration, but the code size changes here seems
small.
>>>>
>>>>
>>>> Is the proposed change universal? Is there a way to undo it?
>>>>
>>>>
>>>> All of the unrolling thresholds should be target-adjustable
using the
>>>> TTI::getUnrollingPreferences hook.
>>>>
>>>>  -Hal
>>>>
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> Ping... with the updated code size impact data, any more
comments? Any
>>>>> more data that would be interesting to collect?
>>>>>
>>>>> Thanks,
>>>>> Dehao
>>>>>
>>>>> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at
google.com> wrote:
>>>>>
>>>>>> Here is the code size impact for clang, chrome and 24
google internal
>>>>>> benchmarks (name omited, 14 15 16 are encoding/decoding
benchmarks similar
>>>>>> as h264). There are 2 columns, for threshold 300 and
450 respectively.
>>>>>>
>>>>>> I also tested the llvm test suite. Changing the
threshold to 300/450
>>>>>> does not affect code gen for any binary in the test
suite.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 300 450
>>>>>> clang 0.30% 0.63%
>>>>>> chrome 0.00% 0.00%
>>>>>> 1 0.27% 0.67%
>>>>>> 2 0.44% 0.93%
>>>>>> 3 0.44% 0.93%
>>>>>> 4 0.26% 0.53%
>>>>>> 5 0.74% 2.21%
>>>>>> 6 0.74% 2.21%
>>>>>> 7 0.74% 2.21%
>>>>>> 8 0.46% 1.05%
>>>>>> 9 0.35% 0.86%
>>>>>> 10 0.35% 0.86%
>>>>>> 11 0.40% 0.83%
>>>>>> 12 0.32% 0.65%
>>>>>> 13 0.31% 0.64%
>>>>>> 14 4.52% 8.23%
>>>>>> 15 9.90% 19.38%
>>>>>> 16 9.90% 19.38%
>>>>>> 17 0.68% 1.97%
>>>>>> 18 0.21% 0.48%
>>>>>> 19 0.99% 3.44%
>>>>>> 20 0.19% 0.46%
>>>>>> 21 0.57% 1.62%
>>>>>> 22 0.37% 1.05%
>>>>>> 23 0.78% 1.30%
>>>>>> 24 0.51% 1.54%
>>>>>>
>>>>>> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via
llvm-dev <
>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via
llvm-dev <
>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>> clang, chrome, and some internal large apps are
good candidates for
>>>>>>> size metrics.
>>>>>>>
>>>>>>> I'd also add the standard LLVM testsuite just
because it's the suite
>>>>>>> everyone in the community can use.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth
via llvm-dev <
>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>> I had suggested having size metrics from
somewhat larger
>>>>>>>> applications such as Chrome, Webkit, or
Firefox; clang itself; and maybe
>>>>>>>> some of our internal binaries with rough size
brackets?
>>>>>>>>
>>>>>>>> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen
<dehao at google.com> wrote:
>>>>>>>>
>>>>>>>>> With the new data points, any comments on
whether this can justify
>>>>>>>>> setting fully inline threshold to 300 (or
any other number) in O2? I can
>>>>>>>>> collect more data points if it's
helpful.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dehao
>>>>>>>>>
>>>>>>>>> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen
<dehao at google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Recollected the data from trunk head with
stddev data and more
>>>>>>>>> threshold data points attached:
>>>>>>>>>
>>>>>>>>> Performance:
>>>>>>>>>
>>>>>>>>> stddev/mean 300 450 600 750
>>>>>>>>> 403 0.37% 0.11% 0.11% 0.09% 0.79%
>>>>>>>>> 433 0.14% 0.51% 0.25% -0.63% -0.29%
>>>>>>>>> 445 0.08% 0.48% 0.89% 0.12% 0.83%
>>>>>>>>> 447 0.16% 3.50% 2.69% 3.66% 3.59%
>>>>>>>>> 453 0.11% 1.49% 0.45% -0.07% 0.78%
>>>>>>>>> 464 0.17% 0.75% 1.80% 1.86% 1.54%
>>>>>>>>> Code size:
>>>>>>>>>
>>>>>>>>> 300 450 600 750
>>>>>>>>> 403 0.56% 2.41% 2.74% 3.75%
>>>>>>>>> 433 0.96% 2.84% 4.19% 4.87%
>>>>>>>>> 445 2.16% 3.62% 4.48% 5.88%
>>>>>>>>> 447 2.96% 5.09% 6.74% 8.89%
>>>>>>>>> 453 0.94% 1.67% 2.73% 2.96%
>>>>>>>>> 464 8.02% 13.50% 20.51% 26.59%
>>>>>>>>> Compile time is proportional in the
experiments and more noisy, so
>>>>>>>>> I did not include it.
>>>>>>>>>
>>>>>>>>> We have >2% speedup on some google
internal benchmarks when
>>>>>>>>> switching the threshold from 150 to 300.
>>>>>>>>>
>>>>>>>>> Dehao
>>>>>>>>>
>>>>>>>>> On Mon, Jan 30, 2017 at 5:06 PM, Chandler
Carruth <
>>>>>>>>> chandlerc at google.com> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini
<mehdi.amini at apple.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Another question is about PGO integration:
is it already hooked
>>>>>>>>> there? Should we have a more aggressive
threshold in a hot function?
>>>>>>>>> (Assuming we’re willing to spend some
binary size there but not on the cold
>>>>>>>>> path).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would even wire the *unrolling* the other
way: just suppress
>>>>>>>>> unrolling in cold paths to save binary
size. rolled loops seem like a
>>>>>>>>> generally good thing in cold code unless
they are having some larger impact
>>>>>>>>> (IE, the loop itself is more expensive than
the unrolled form).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Agree that we could suppress unrolling in
cold path to save code
>>>>>>>>> size. But that's orthogonal with the
propose here. This proposal focuses on
>>>>>>>>> O2 performance: shall we have different
(higher) fully unroll threshold
>>>>>>>>> than dynamic/partial unroll.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I agree that this is (to some extent)
orthogonal, and it makes
>>>>>>>>> sense to me to differentiate the threshold
for full unroll and the
>>>>>>>>> dynamic/partial case.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is one issue that makes these not
orthogonal.
>>>>>>>>>
>>>>>>>>> If even *static* profile hints will reduce
some of the code size
>>>>>>>>> increase caused by higher unrolling
thresholds for non-cold code, we should
>>>>>>>>> factor that into the tradeoff in picking
where the threshold goes.
>>>>>>>>>
>>>>>>>>> However, getting PGO into the full unroller
is currently
>>>>>>>>> challenging outside of the new pass
manager. We already have some
>>>>>>>>> unfortunate hacks around this in
LoopUnswitch that are making the port of
>>>>>>>>> it to the new PM more annoying.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing listllvm-dev at
lists.llvm.orghttp://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170213/b10fe583/attachment-0001.html>

Chandler Carruth via llvm-dev

2017-Feb-13 18:56 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

FWIW, I'm good with the updated data, but I'd really like at least
someone
from Apple and someone from ARM to chime in here... CC-ing random people in
the hope it helps...

On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Thanks for the comment. The performance experiments were performed on
> Intel Sandybridge. Updated this info to the patch description.
>
> Dehao
> On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at
rotateright.com>
> wrote:
>
> Since we can override the settings, I have no objections.
>
> I still think it would be good to document here and in the review/commit
> message which CPU model was used to acquire the experimental data. That
> could be useful to anyone that comes along later and wants to reproduce
> and/or compare to the original, motivating data.
>
>
> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Thanks Hal, could you help approve https://reviews.llvm.org/D28368?
>
> I'll hold off until early Tuesday in case other people have more
concerns.
>
> Thanks,
> Dehao
>
> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>
> Thanks every for the comments.
>
> Do we have a decision here?
>
>
> You're good to go as far as I'm concerned.
>
>  -Hal
>
>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>
> Sorry if I missed it, but what machine/CPU are you using to collect the
> perf numbers?
>
> I am concerned that what may be a win on a CPU that keeps a couple of
> hundred instructions in-flight and has many MB of caches will not hold for
> a small core.
>
>
> In my experience, unrolling tends to help weaker cores even more than
> stronger ones because it allows the instruction scheduler more
> opportunities to hide latency. Obviously, instruction-cache pressure is an
> important consideration, but the code size changes here seems small.
>
>
> Is the proposed change universal? Is there a way to undo it?
>
>
> All of the unrolling thresholds should be target-adjustable using the
> TTI::getUnrollingPreferences hook.
>
>  -Hal
>
>
>
> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Ping... with the updated code size impact data, any more comments? Any
> more data that would be interesting to collect?
>
> Thanks,
> Dehao
>
> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Here is the code size impact for clang, chrome and 24 google internal
> benchmarks (name omited, 14 15 16 are encoding/decoding benchmarks similar
> as h264). There are 2 columns, for threshold 300 and 450 respectively.
>
> I also tested the llvm test suite. Changing the threshold to 300/450 does
> not affect code gen for any binary in the test suite.
>
>
>
> 300 450
> clang 0.30% 0.63%
> chrome 0.00% 0.00%
> 1 0.27% 0.67%
> 2 0.44% 0.93%
> 3 0.44% 0.93%
> 4 0.26% 0.53%
> 5 0.74% 2.21%
> 6 0.74% 2.21%
> 7 0.74% 2.21%
> 8 0.46% 1.05%
> 9 0.35% 0.86%
> 10 0.35% 0.86%
> 11 0.40% 0.83%
> 12 0.32% 0.65%
> 13 0.31% 0.64%
> 14 4.52% 8.23%
> 15 9.90% 19.38%
> 16 9.90% 19.38%
> 17 0.68% 1.97%
> 18 0.21% 0.48%
> 19 0.99% 3.44%
> 20 0.19% 0.46%
> 21 0.57% 1.62%
> 22 0.37% 1.05%
> 23 0.78% 1.30%
> 24 0.51% 1.54%
>
> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> clang, chrome, and some internal large apps are good candidates for size
> metrics.
>
> I'd also add the standard LLVM testsuite just because it's the
suite
> everyone in the community can use.
>
> Michael
>
>
> David
>
> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I had suggested having size metrics from somewhat larger applications such
> as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
> binaries with rough size brackets?
>
> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com>
wrote:
>
> With the new data points, any comments on whether this can justify setting
> fully inline threshold to 300 (or any other number) in O2? I can collect
> more data points if it's helpful.
>
> Thanks,
> Dehao
>
> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Recollected the data from trunk head with stddev data and more threshold
> data points attached:
>
> Performance:
>
> stddev/mean 300 450 600 750
> 403 0.37% 0.11% 0.11% 0.09% 0.79%
> 433 0.14% 0.51% 0.25% -0.63% -0.29%
> 445 0.08% 0.48% 0.89% 0.12% 0.83%
> 447 0.16% 3.50% 2.69% 3.66% 3.59%
> 453 0.11% 1.49% 0.45% -0.07% 0.78%
> 464 0.17% 0.75% 1.80% 1.86% 1.54%
> Code size:
>
> 300 450 600 750
> 403 0.56% 2.41% 2.74% 3.75%
> 433 0.96% 2.84% 4.19% 4.87%
> 445 2.16% 3.62% 4.48% 5.88%
> 447 2.96% 5.09% 6.74% 8.89%
> 453 0.94% 1.67% 2.73% 2.96%
> 464 8.02% 13.50% 20.51% 26.59%
> Compile time is proportional in the experiments and more noisy, so I did
> not include it.
>
> We have >2% speedup on some google internal benchmarks when switching
the
> threshold from 150 to 300.
>
> Dehao
>
> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at
google.com>
> wrote:
>
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at
apple.com> wrote:
>
>
>
> Another question is about PGO integration: is it already hooked there?
> Should we have a more aggressive threshold in a hot function? (Assuming
> we’re willing to spend some binary size there but not on the cold path).
>
>
> I would even wire the *unrolling* the other way: just suppress unrolling
> in cold paths to save binary size. rolled loops seem like a generally good
> thing in cold code unless they are having some larger impact (IE, the loop
> itself is more expensive than the unrolled form).
>
>
>
> Agree that we could suppress unrolling in cold path to save code size. But
> that's orthogonal with the propose here. This proposal focuses on O2
> performance: shall we have different (higher) fully unroll threshold than
> dynamic/partial unroll.
>
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me
> to differentiate the threshold for full unroll and the dynamic/partial
case.
>
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170213/6635f7d9/attachment-0001.html>

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Feb 2017 - (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Possibly Parallel Threads