Hal Finkel via llvm-dev
2017-Feb-08 06:24 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
> Sorry if I missed it, but what machine/CPU are you using to collect
> the perf numbers?
>
> I am concerned that what may be a win on a CPU that keeps a couple of
> hundred instructions in flight and has many MB of caches will not hold
> for a small core.

In my experience, unrolling tends to help weaker cores even more than
stronger ones because it allows the instruction scheduler more
opportunities to hide latency. Obviously, instruction-cache pressure is
an important consideration, but the code size changes here seem small.

> Is the proposed change universal? Is there a way to undo it?

All of the unrolling thresholds should be target-adjustable using the
TTI::getUnrollingPreferences hook.

 -Hal

> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> Ping... with the updated code size impact data, any more comments?
> Any more data that would be interesting to collect?
>
> Thanks,
> Dehao
>
> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:
>
> Here is the code size impact for clang, chrome, and 24 Google-internal
> benchmarks (names omitted; 14, 15, and 16 are encoding/decoding
> benchmarks similar to h264). There are two columns, for thresholds 300
> and 450 respectively.
>
> I also tested the LLVM test suite. Changing the threshold to 300/450
> does not affect codegen for any binary in the test suite.
>
>             300      450
> clang       0.30%    0.63%
> chrome      0.00%    0.00%
> 1           0.27%    0.67%
> 2           0.44%    0.93%
> 3           0.44%    0.93%
> 4           0.26%    0.53%
> 5           0.74%    2.21%
> 6           0.74%    2.21%
> 7           0.74%    2.21%
> 8           0.46%    1.05%
> 9           0.35%    0.86%
> 10          0.35%    0.86%
> 11          0.40%    0.83%
> 12          0.32%    0.65%
> 13          0.31%    0.64%
> 14          4.52%    8.23%
> 15          9.90%    19.38%
> 16          9.90%    19.38%
> 17          0.68%    1.97%
> 18          0.21%    0.48%
> 19          0.99%    3.44%
> 20          0.19%    0.46%
> 21          0.57%    1.62%
> 22          0.37%    1.05%
> 23          0.78%    1.30%
> 24          0.51%    1.54%
>
> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> clang, chrome, and some internal large apps are good candidates for
>> size metrics.
>
> I'd also add the standard LLVM test suite, just because it's the suite
> everyone in the community can use.
>
> Michael
>
>> David
>>
>> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> I had suggested having size metrics from somewhat larger applications
>> such as Chrome, WebKit, or Firefox; clang itself; and maybe some of
>> our internal binaries with rough size brackets?
>>
>> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com> wrote:
>>
>> With the new data points, any comments on whether this can justify
>> setting the full unroll threshold to 300 (or any other number) in O2?
>> I can collect more data points if it's helpful.
>>
>> Thanks,
>> Dehao
>>
>> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com> wrote:
>>
>> I re-collected the data from trunk head, with stddev data and more
>> threshold data points attached:
>>
>> Performance:
>>
>>       stddev/mean   300      450      600      750
>> 403   0.37%         0.11%    0.11%    0.09%    0.79%
>> 433   0.14%         0.51%    0.25%    -0.63%   -0.29%
>> 445   0.08%         0.48%    0.89%    0.12%    0.83%
>> 447   0.16%         3.50%    2.69%    3.66%    3.59%
>> 453   0.11%         1.49%    0.45%    -0.07%   0.78%
>> 464   0.17%         0.75%    1.80%    1.86%    1.54%
>>
>> Code size:
>>
>>       300      450      600      750
>> 403   0.56%    2.41%    2.74%    3.75%
>> 433   0.96%    2.84%    4.19%    4.87%
>> 445   2.16%    3.62%    4.48%    5.88%
>> 447   2.96%    5.09%    6.74%    8.89%
>> 453   0.94%    1.67%    2.73%    2.96%
>> 464   8.02%    13.50%   20.51%   26.59%
>>
>> Compile time changes are roughly proportional to the code size changes
>> in these experiments and are noisier, so I did not include them.
>>
>> We have >2% speedups on some Google-internal benchmarks when switching
>> the threshold from 150 to 300.
>>
>> Dehao
>>
>> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth
>> <chandlerc at google.com> wrote:
>>
>> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>> Another question is about PGO integration: is it already hooked up
>>> there? Should we have a more aggressive threshold in a hot function?
>>> (Assuming we're willing to spend some binary size there but not on
>>> the cold path.)
>>>
>>> I would even wire the *unrolling* the other way: just suppress
>>> unrolling in cold paths to save binary size. Rolled loops seem like
>>> a generally good thing in cold code unless they are having some
>>> larger impact (i.e., the loop itself is more expensive than the
>>> unrolled form).
>>>
>>> Agree that we could suppress unrolling in cold paths to save code
>>> size. But that's orthogonal to the proposal here. This proposal
>>> focuses on O2 performance: shall we have a different (higher) full
>>> unroll threshold than the dynamic/partial unroll threshold?
>>
>> I agree that this is (to some extent) orthogonal, and it makes sense
>> to me to differentiate the threshold for full unroll and the
>> dynamic/partial case.
>>
>> There is one issue that makes these not orthogonal.
>>
>> If even *static* profile hints will reduce some of the code size
>> increase caused by higher unrolling thresholds for non-cold code, we
>> should factor that into the tradeoff in picking where the threshold
>> goes.
>>
>> However, getting PGO into the full unroller is currently challenging
>> outside of the new pass manager. We already have some unfortunate
>> hacks around this in LoopUnswitch that are making the port of it to
>> the new PM more annoying.
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
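The target hook Hal refers to is the one the loop unroller queries
before applying its generic defaults. A minimal sketch of a per-target
override is shown below; the exact signature of getUnrollingPreferences
and the set of fields in TTI::UnrollingPreferences have shifted between
LLVM revisions, and "Foo"/"FooTTIImpl" are placeholder names for a
target, so treat the details as illustrative rather than authoritative.

  // Sketch of a per-target override, roughly what would live in
  // lib/Target/Foo/FooTargetTransformInfo.cpp. Field names follow
  // TTI::UnrollingPreferences as of approximately this era of LLVM.
  #include "FooTargetTransformInfo.h"  // hypothetical target header
  #include "llvm/Analysis/TargetTransformInfo.h"
  #include <algorithm>

  using namespace llvm;

  void FooTTIImpl::getUnrollingPreferences(Loop *L,
                                           TTI::UnrollingPreferences &UP) {
    // UP arrives pre-populated with the generic defaults; only adjust
    // what this target cares about.

    // Cost threshold for *fully* unrolling a loop with a known trip
    // count. A small in-order core with a modest I-cache can keep a
    // lower value even if the generic -O2 default moves to 300.
    UP.Threshold = std::min(UP.Threshold, 150u);

    // Keep partial/runtime unrolling conservative as well.
    UP.PartialThreshold = std::min(UP.PartialThreshold, 150u);
    UP.MaxCount = 4;     // at most 4 copies of the loop body
    UP.Runtime = false;  // no runtime unrolling of unknown trip counts
  }

If I remember the precedence correctly, explicit -unroll-* command-line
flags and per-loop unroll pragmas still override whatever the target
sets here, so an override like this only changes the default behavior
for that one target.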
Dehao Chen via llvm-dev
2017-Feb-10 23:21 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
Thanks, everyone, for the comments. Do we have a decision here?

Dehao
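For concreteness, the cold-path idea discussed upthread (suppress
unrolling where profile data says the code is cold, and possibly be
more aggressive where it is hot) would conceptually look something like
the sketch below. This is a hypothetical helper, not existing unroller
code; as noted in the thread, plumbing profile information into the
full unroller was still awkward outside the new pass manager.
ProfileSummaryInfo and BlockFrequencyInfo are the existing analyses
such a check would most likely build on.

  // Conceptual sketch only: back off unrolling for loops whose header
  // is cold according to PGO data. "adjustUnrollingForProfile" is a
  // hypothetical helper, not an existing pass entry point.
  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/Analysis/ProfileSummaryInfo.h"
  #include "llvm/Analysis/TargetTransformInfo.h"

  using namespace llvm;

  static void
  adjustUnrollingForProfile(Loop &L, ProfileSummaryInfo &PSI,
                            BlockFrequencyInfo &BFI,
                            TargetTransformInfo::UnrollingPreferences &UP) {
    if (!PSI.hasProfileSummary())
      return; // no profile data, leave the static thresholds alone

    if (PSI.isColdBlock(L.getHeader(), &BFI)) {
      // Cold loop: a rolled loop is fine, save the code size.
      UP.Threshold = 0;
      UP.PartialThreshold = 0;
      UP.Runtime = false;
    }
    // A hot-loop bonus (the PGO question raised upthread) would be the
    // symmetric case, e.g. scaling UP.Threshold up for hot headers.
  }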
Hal Finkel via llvm-dev
2017-Feb-10 23:23 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
On 02/10/2017 05:21 PM, Dehao Chen wrote:
> Thanks, everyone, for the comments.
>
> Do we have a decision here?

You're good to go as far as I'm concerned.

 -Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
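Finally, for anyone who wants to reproduce the kind of size/performance
sweep shown in the tables above: the threshold being adjusted is an
ordinary cl::opt inside the loop unroller, so it can be overridden per
compilation without patching LLVM. The definition below shows the
general shape only; the exact option spelling, default, and description
in LoopUnrollPass.cpp should be checked against the tree in use (150 is
the pre-change default cited in the thread).

  // General shape of the unroll-threshold flag in
  // llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (approximate, not
  // verbatim from the source).
  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  static cl::opt<unsigned> UnrollThreshold(
      "unroll-threshold", cl::init(150), cl::Hidden,
      cl::desc("The cost threshold for loop unrolling"));

Sweeping the threshold then amounts to passing the internal flag
through the driver, e.g. "clang -O2 -mllvm -unroll-threshold=300 foo.c"
or "opt -O2 -unroll-threshold=300 foo.ll", which is presumably how the
300/450/600/750 columns above were produced.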