thr3ads.net - llvm dev - [llvm-dev] (RFC) Adjusting default loop fully unroll threshold [Feb 2017]

Chandler Carruth via llvm-dev

2017-Feb-15 18:10 UTC

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thanks for running these Kristof!

I'd still like to hear from Apple, and if we can get a few more x86
micro-architectures covered that'd be great, but it looks like -O3 is
uncontroversial, and the question is whether this makes sense at O2...

To me, it would help a lot to know the actual breakdown of benchmarks such
as yours Kristof (as they seem to have more codesize impact than others
have mentioned). Specifically, are the runtime improvements correlated
with the codesize increases? And what are the absolute size deltas? For
*very* small benchmarks, a 5% code size fluctuation seems less concerning
than for a larger benchmark. If the larger code size changes are mostly
smaller benchmarks and reasonably correlated to the ones likely to see
improvement from the change (this seemed to be the case w/ Dehao's data on
x86 for example) that would to me indicate this makes sense at O2.
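
One rough way to put a number on the correlation question above is a Pearson coefficient over per-benchmark (code size increase, speedup) pairs. The sketch below is illustrative only: it uses the threshold-300 columns for the six benchmarks labelled 403 to 464 in Dehao's tables quoted further down in this message, the helper name `pearson` is an assumption of this sketch, and six data points are far too few to support a firm conclusion.

```python
import math

# Threshold-300 columns from Dehao's tables quoted later in this message
# (percent change relative to the baseline threshold).
speedup_pct = {"403": 0.11, "433": 0.51, "445": 0.48,
               "447": 3.50, "453": 1.49, "464": 0.75}
size_pct = {"403": 0.56, "433": 0.96, "445": 2.16,
            "447": 2.96, "453": 0.94, "464": 8.02}


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


labels = sorted(speedup_pct)
r = pearson([size_pct[k] for k in labels], [speedup_pct[k] for k in labels])
print(f"Pearson r, size increase vs. speedup at threshold 300: {r:.2f}")
```

A value near +1 would support the "size grows where speed improves" reading; a value near 0 would suggest the size cost is being paid without a matching benefit.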

Note that I'm fine if you have to list the benchmarks as "1, 2, 3, ..." or
whatever, much like we did for Google-internal benchmarks. It's still
useful to know the shape of the change.

On Tue, Feb 14, 2017 at 1:06 PM Kristof Beyls via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> I've run the patch on https://reviews.llvm.org/D28368 on the test-suite
> and other benchmarks, for AArch64 -O3 -fomit-frame-pointer, both for
> Cortex-A53 and Cortex-A57.
>
> The geomean over the few hundred programs in there is roughly the same for
> Cortex-A53 and Cortex-A57: a bit over 1% improvement in execution speed for
> a bit over 5% increase in code size.
> Obviously I wouldn't want this for optimization levels where code size is
> of any concern, like -Os or -Oz, but don't have a problem with this going
> in for other optimization levels where this isn't a concern.
>
> Thanks,
>
> Kristof
>
>
>
>
> On 13 Feb 2017, at 19:56, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> FWIW, I'm good with the updated data, but I'd really like at least someone
> from Apple and someone from ARM to chime in here... CC-ing random people in
> the hope it helps...
>
> On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Thanks for the comment. The performance experiments were run on
> Intel Sandy Bridge. I've added this info to the patch description.
>
> Dehao
> On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at rotateright.com>
> wrote:
>
> Since we can override the settings, I have no objections.
>
> I still think it would be good to document here and in the review/commit
> message which CPU model was used to acquire the experimental data. That
> could be useful to anyone that comes along later and wants to reproduce
> and/or compare to the original, motivating data.
>
> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com> wrote:
>
> Thanks Hal, could you help approve https://reviews.llvm.org/D28368?
>
> I'll hold off until early Tuesday in case other people have more concerns.
>
> Thanks,
> Dehao
>
> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>
> Thanks everyone for the comments.
>
> Do we have a decision here?
>
>
> You're good to go as far as I'm concerned.
>
>  -Hal
>
>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>
> Sorry if I missed it, but what machine/CPU are you using to collect the
> perf numbers?
>
> I am concerned that what may be a win on a CPU that keeps a couple of
> hundred instructions in-flight and has many MB of caches will not hold for
> a small core.
>
>
> In my experience, unrolling tends to help weaker cores even more than
> stronger ones because it allows the instruction scheduler more
> opportunities to hide latency. Obviously, instruction-cache pressure is an
> important consideration, but the code size changes here seem small.
>
>
> Is the proposed change universal? Is there a way to undo it?
>
>
> All of the unrolling thresholds should be target-adjustable using the
> TTI::getUnrollingPreferences hook.
>
>  -Hal
>
> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Ping... with the updated code size impact data, any more comments? Any
> more data that would be interesting to collect?
>
> Thanks,
> Dehao
>
> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:
>
> Here is the code size impact for clang, chrome and 24 google internal
> benchmarks (names omitted; 14, 15 and 16 are encoding/decoding benchmarks
> similar to h264). There are 2 columns, for threshold 300 and 450 respectively.
>
> I also tested the llvm test suite. Changing the threshold to 300/450 does
> not affect code gen for any binary in the test suite.
>
>
>
>              300      450
> clang      0.30%    0.63%
> chrome     0.00%    0.00%
> 1          0.27%    0.67%
> 2          0.44%    0.93%
> 3          0.44%    0.93%
> 4          0.26%    0.53%
> 5          0.74%    2.21%
> 6          0.74%    2.21%
> 7          0.74%    2.21%
> 8          0.46%    1.05%
> 9          0.35%    0.86%
> 10         0.35%    0.86%
> 11         0.40%    0.83%
> 12         0.32%    0.65%
> 13         0.31%    0.64%
> 14         4.52%    8.23%
> 15         9.90%   19.38%
> 16         9.90%   19.38%
> 17         0.68%    1.97%
> 18         0.21%    0.48%
> 19         0.99%    3.44%
> 20         0.19%    0.46%
> 21         0.57%    1.62%
> 22         0.37%    1.05%
> 23         0.78%    1.30%
> 24         0.51%    1.54%
>
> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> clang, chrome, and some internal large apps are good candidates for size
> metrics.
>
> I'd also add the standard LLVM testsuite just because it's the suite
> everyone in the community can use.
>
> Michael
>
>
> David
>
> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I had suggested having size metrics from somewhat larger applications such
> as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
> binaries with rough size brackets?
>
> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com> wrote:
>
> With the new data points, any comments on whether this can justify setting
> the full unroll threshold to 300 (or any other number) in O2? I can collect
> more data points if it's helpful.
>
> Thanks,
> Dehao
>
> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com> wrote:
>
> Recollected the data from trunk head with stddev data and more threshold
> data points attached:
>
> Performance:
>
>       stddev/mean      300      450      600      750
> 403         0.37%    0.11%    0.11%    0.09%    0.79%
> 433         0.14%    0.51%    0.25%   -0.63%   -0.29%
> 445         0.08%    0.48%    0.89%    0.12%    0.83%
> 447         0.16%    3.50%    2.69%    3.66%    3.59%
> 453         0.11%    1.49%    0.45%   -0.07%    0.78%
> 464         0.17%    0.75%    1.80%    1.86%    1.54%
>
> Code size:
>
>             300      450      600      750
> 403       0.56%    2.41%    2.74%    3.75%
> 433       0.96%    2.84%    4.19%    4.87%
> 445       2.16%    3.62%    4.48%    5.88%
> 447       2.96%    5.09%    6.74%    8.89%
> 453       0.94%    1.67%    2.73%    2.96%
> 464       8.02%   13.50%   20.51%   26.59%
>
> Compile time is proportional in the experiments and more noisy, so I did
> not include it.
>
> We have >2% speedup on some google internal benchmarks when switching the
> threshold from 150 to 300.
>
> Dehao
>
> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at google.com>
> wrote:
>
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>
>
> Another question is about PGO integration: is it already hooked there?
> Should we have a more aggressive threshold in a hot function? (Assuming
> we’re willing to spend some binary size there but not on the cold path).
>
>
> I would even wire the *unrolling* the other way: just suppress unrolling
> in cold paths to save binary size. Rolled loops seem like a generally good
> thing in cold code unless they are having some larger impact (i.e., the loop
> itself is more expensive than the unrolled form).
>
>
>
> Agree that we could suppress unrolling in cold paths to save code size. But
> that's orthogonal to the proposal here. This proposal focuses on O2
> performance: should we have a different (higher) threshold for full unrolling
> than for dynamic/partial unrolling?
>
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me
> to differentiate the threshold for full unroll and the dynamic/partial case.
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Kristof Beyls via llvm-dev

2017-Feb-16 10:46 UTC

On 15 Feb 2017, at 19:10, Chandler Carruth <chandlerc at gmail.com> wrote:

> Thanks for running these Kristof!
>
> I'd still like to hear from Apple, and if we can get a few more x86
> micro-architectures covered that'd be great, but it looks like -O3 is
> uncontroversial, and the question is whether this makes sense at O2...
>
> To me, it would help a lot to know the actual breakdown of benchmarks such as
> yours Kristof (as they seem to have more codesize impact than others have
> mentioned). Specifically, are the runtime improvements correlated with the
> codesize increases? And what are the absolute size deltas? For *very* small
> benchmarks, a 5% code size fluctuation seems less concerning than for a larger
> benchmark. If the larger code size changes are mostly smaller benchmarks and
> reasonably correlated to the ones likely to see improvement from the change
> (this seemed to be the case w/ Dehao's data on x86 for example) that would
> to me indicate this makes sense at O2.
>
> Note that I'm fine if you have to list the benchmarks as "1, 2, 3, ..." or
> whatever, much like we did for Google-internal benchmarks. It's still
> useful to know the shape of the change.

With this being data from a few hundred programs, I don't think listing the
data in a long table really helps in getting a feel for the overall structure of
the data.
Instead, I created a few scatter plots that hopefully help in getting a better
feel for the overall effect of the patch. The charts below are for the
Cortex-A57 numbers. I decided not to produce a chart for Cortex-A53 as the shape
of the data didn't seem very different. The optimization level used is -O3
-fomit-frame-pointer, targeting AArch64 linux.

The first chart shows relative code size increase (vertical axis) vs absolute
code size:
The biggest relative code size increases indeed didn't happen for the
biggest programs, but instead for a few programs weighing in at about 100KB.
I'm assuming the Google benchmark set covers much bigger programs than the
ones displayed here.
FWIW, the cluster of programs where code size increases between 60% and 80% with
a size of about 100KB, all come from MultiSource/Benchmarks/TSVC. Interestingly,
these programs seem to have float and double variants, e.g.
(MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt and
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl), and the code size
bloat only happens for the double variants. I think it may still be worthwhile
to check if this also happens on other architectures, and why it happens only
for the double-variants, not the float-variants.
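
To relate the relative increases above to absolute size deltas, the arithmetic is just baseline size times relative increase. The sketch below uses the roughly 100KB baseline and the 60% to 80% range quoted above, plus the roughly 5% geomean increase reported earlier in the thread; the helper name `absolute_delta_kb` and the 10 MB comparison binary are hypothetical and only there for illustration.

```python
def absolute_delta_kb(baseline_kb: float, relative_increase_pct: float) -> float:
    """Absolute growth in KB for a given baseline size and relative increase."""
    return baseline_kb * relative_increase_pct / 100.0


# TSVC double variants: about 100KB, growing by 60% to 80% (figures from above).
tsvc_low = absolute_delta_kb(100, 60)
tsvc_high = absolute_delta_kb(100, 80)

# Hypothetical 10 MB binary growing by the ~5% geomean figure.
large_binary = absolute_delta_kb(10 * 1024, 5)

print(f"TSVC-sized program: +{tsvc_low:.0f}KB to +{tsvc_high:.0f}KB")
print(f"Hypothetical 10 MB binary at 5%: +{large_binary:.0f}KB")
```

Under these assumptions the large relative increases on small programs translate to a smaller absolute delta than a modest relative increase on a large binary, which is why both views of the data are worth looking at.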

[image: unroll_codesize_absolute_vs_relative.png]

The second chart shows relative code size increase (vertical axis) vs relative
performance improvement (horizontal axis):
I manually checked the cause of the 3 biggest performance regressions
(proprietary benchmark1: -13.70%; MultiSource/Applications/hexxagon/hexxagon:
-10.10%; MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow -5.23%).
For the proprietary benchmark and hexxagon, the code generation didn't
change for the hottest parts, so probably is caused by micro-architectural
effects of code layout changes.
For fourinarow, there seemed to be a lot more spill/fill code, so probably due
to non-optimality of register allocation.

[image: unroll_codesize_vs_performance.png]

The third chart below just zooms in on the above chart to the -5% to 5%
performance improvement range:
[image: unroll_codesize_vs_performance_zoom.png]

Whether to enable the increase in unroll threshold only at O3 or also at O2: I
don't have a strong opinion based on the above data.
Maybe the compile time impact is what should be driving that discussion the
most? I'm afraid I don't have compile time numbers.
Ultimately, I guess this boils down to what exactly the difference is in intent
between O2 and O3, which seems like a never-ending discussion...

Hoping you find this useful,

Kristof


Chandler Carruth via llvm-dev

2017-Feb-16 23:45 UTC

First off, I just want to say wow and thank you. This kind of data is
amazing. =D

On Thu, Feb 16, 2017 at 2:46 AM Kristof Beyls <Kristof.Beyls at arm.com>
wrote:
> The biggest relative code size increases indeed didn't happen for the
> biggest programs, but instead for a few programs weighing in at about 100KB.
> I'm assuming the Google benchmark set covers much bigger programs than the
> ones displayed here.
> FWIW, the cluster of programs where code size increases between 60% and 80%
> with a size of about 100KB, all come from MultiSource/Benchmarks/TSVC.
> Interestingly, these programs seem to have float and double variants,  e.g.
> (MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt and
> MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl), and the code size
> bloat only happens for the double variants.
>
I think we should definitely look at this (as it seems likely to be a bug
somewhere), but I'm also not overly concerned with size regressions in the
TSVC benchmarks, which are unusually loop heavy and small. We've had
several other changes that caused big fluctuations here.


> I think it may still be worthwhile to check if this also happens on other
> architectures, and why it happens only for the double-variants, not the
> float-variants.
>
+1

> The second chart shows relative code size increase (vertical axis) vs
> relative performance improvement (horizontal axis):
> I manually checked the cause of the 3 biggest performance regressions
> (proprietary benchmark1: -13.70%;
> MultiSource/Applications/hexxagon/hexxagon: -10.10%;
> MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow -5.23%).
> For the proprietary benchmark and hexxagon, the code generation didn't
> change for the hottest parts, so probably is caused by micro-architectural
> effects of code layout changes.
>
This is always good to know, even though it is frustrating. =]

> For fourinarow, there seemed to be a lot more spill/fill code, so probably
> due to non-optimality of register allocation.
>
This is something we should probably look at. If you have the output lying
around, maybe file a PR about it?

> The third chart below just zooms in on the above chart to the -5% to 5%
> performance improvement range:
> [image: unroll_codesize_vs_performance_zoom.png]
>
>
> Whether to enable the increase in unroll threshold only at O3 or also at
> O2: I don't have a strong opinion based on the above data.
>
FWIW, this data seems to clearly indicate that we don't get performance
wins with any consistency when the code size goes up (and thus the change
has impact). As a consequence, I pretty strongly suspect that this should
be *just* used at O3 at least for now.

I see two further directions for Dehao that make sense here (at least to
me):
1) I suspect we should investigate *why* the size increases are happening
without helping speed. I can imagine some reasons that this would of course
happen (cold loops getting unrolled), but especially in light of the
oddities you point out above, I suspect there may be issues where more
unrolling is uncovering other problems and if we fix those other problems
the shape of things will be different. We should at least address the
issues you uncovered above.

2) If this turns out to be architecture specific (it seems that way at
least initially, but hard to tell for sure with different benchmark sets)
we might make AArch64 and x86 use different thresholds here. I'm skeptical
about this though. I suspect we should do #1, and we'll either get a
different shape, or just decide that O3 is more appropriate.
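
As a rough illustration of what per-architecture thresholds in point 2 would mean, here is a deliberately simplified model. It is not LLVM's actual cost model: it assumes the cost of a fully unrolled loop is simply trip count times per-iteration cost, and fully unrolls when that cost is below a threshold. The values 150 and 300 are the ones mentioned earlier in the thread; the per-target table, the target strings and the function names are hypothetical.

```python
# Baseline threshold the thread describes switching away from.
DEFAULT_FULL_UNROLL_THRESHOLD = 150

# Hypothetical per-target overrides, for illustration only.
TARGET_FULL_UNROLL_THRESHOLD = {"x86": 300}


def full_unroll_threshold(target: str) -> int:
    """Return the target's override if it has one, else the default."""
    return TARGET_FULL_UNROLL_THRESHOLD.get(target, DEFAULT_FULL_UNROLL_THRESHOLD)


def should_fully_unroll(trip_count: int, body_cost: int, target: str) -> bool:
    """Toy decision: unroll only if the estimated unrolled cost is under the threshold."""
    unrolled_cost = trip_count * body_cost  # simplifying assumption of this sketch
    return unrolled_cost < full_unroll_threshold(target)


# A loop with 16 iterations and a per-iteration cost of 12 (unrolled cost 192)
# is fully unrolled under the higher threshold but stays rolled under the default.
for target in ("x86", "aarch64"):
    print(target, should_fully_unroll(16, 12, target))
```

In this toy setup the same loop gets different treatment per target, which is the kind of divergence point 2 is skeptical about unless the benchmark data really is architecture specific.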

> Maybe the compile time impact is what should be driving that discussion
> the most? I'm afraid I don't have compile time numbers.
>
FWIW, I strongly suspect that for *this* change, compile time and code size
will be pretty precisely correlated. Dehao's data shows that to be true in
several cases certainly.

> Ultimately, I guess this boils down to what exactly the difference is in
> intent between O2 and O3, which seems like a never-ending discussion...
>
The definitions I am working from are here:
https://github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/Passes/PassBuilder.h#L81-L90

I've highlighted the part that makes me think O3 is better here: the code
size increases (and thus compile time increases) don't seem to correspond
to runtime improvements.

>
> Hoping you find this useful,
>
Very. Once again, this kind of data and analysis is awesome. =D