thr3ads.net - llvm dev - [llvm-dev] (RFC) Adjusting default loop fully unroll threshold [Feb 2017]

If this information is useful, please help other people find it:
Share via:

Chandler Carruth via llvm-dev

2017-Feb-13 18:56 UTC

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

FWIW, I'm good with the updated data, but I'd really like at least
someone
from Apple and someone from ARM to chime in here... CC-ing random people in
the hope it helps...

On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Thanks for the comment. The performance experiments were performed on
> Intel Sandybridge. Updated this info to the patch description.
>
> Dehao
> On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at
rotateright.com>
> wrote:
>
> Since we can override the settings, I have no objections.
>
> I still think it would be good to document here and in the review/commit
> message which CPU model was used to acquire the experimental data. That
> could be useful to anyone that comes along later and wants to reproduce
> and/or compare to the original, motivating data.
>
>
> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Thanks Hal, could you help approve https://reviews.llvm.org/D28368?
>
> I'll hold off until early Tuesday in case other people have more
concerns.
>
> Thanks,
> Dehao
>
> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>
> Thanks every for the comments.
>
> Do we have a decision here?
>
>
> You're good to go as far as I'm concerned.
>
>  -Hal
>
>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>
> Sorry if I missed it, but what machine/CPU are you using to collect the
> perf numbers?
>
> I am concerned that what may be a win on a CPU that keeps a couple of
> hundred instructions in-flight and has many MB of caches will not hold for
> a small core.
>
>
> In my experience, unrolling tends to help weaker cores even more than
> stronger ones because it allows the instruction scheduler more
> opportunities to hide latency. Obviously, instruction-cache pressure is an
> important consideration, but the code size changes here seems small.
>
>
> Is the proposed change universal? Is there a way to undo it?
>
>
> All of the unrolling thresholds should be target-adjustable using the
> TTI::getUnrollingPreferences hook.
>
>  -Hal
>
>
>
> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Ping... with the updated code size impact data, any more comments? Any
> more data that would be interesting to collect?
>
> Thanks,
> Dehao
>
> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Here is the code size impact for clang, chrome and 24 google internal
> benchmarks (name omited, 14 15 16 are encoding/decoding benchmarks similar
> as h264). There are 2 columns, for threshold 300 and 450 respectively.
>
> I also tested the llvm test suite. Changing the threshold to 300/450 does
> not affect code gen for any binary in the test suite.
>
>
>
> 300 450
> clang 0.30% 0.63%
> chrome 0.00% 0.00%
> 1 0.27% 0.67%
> 2 0.44% 0.93%
> 3 0.44% 0.93%
> 4 0.26% 0.53%
> 5 0.74% 2.21%
> 6 0.74% 2.21%
> 7 0.74% 2.21%
> 8 0.46% 1.05%
> 9 0.35% 0.86%
> 10 0.35% 0.86%
> 11 0.40% 0.83%
> 12 0.32% 0.65%
> 13 0.31% 0.64%
> 14 4.52% 8.23%
> 15 9.90% 19.38%
> 16 9.90% 19.38%
> 17 0.68% 1.97%
> 18 0.21% 0.48%
> 19 0.99% 3.44%
> 20 0.19% 0.46%
> 21 0.57% 1.62%
> 22 0.37% 1.05%
> 23 0.78% 1.30%
> 24 0.51% 1.54%
>
> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> clang, chrome, and some internal large apps are good candidates for size
> metrics.
>
> I'd also add the standard LLVM testsuite just because it's the
suite
> everyone in the community can use.
>
> Michael
>
>
> David
>
> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I had suggested having size metrics from somewhat larger applications such
> as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
> binaries with rough size brackets?
>
> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com>
wrote:
>
> With the new data points, any comments on whether this can justify setting
> fully inline threshold to 300 (or any other number) in O2? I can collect
> more data points if it's helpful.
>
> Thanks,
> Dehao
>
> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Recollected the data from trunk head with stddev data and more threshold
> data points attached:
>
> Performance:
>
> stddev/mean 300 450 600 750
> 403 0.37% 0.11% 0.11% 0.09% 0.79%
> 433 0.14% 0.51% 0.25% -0.63% -0.29%
> 445 0.08% 0.48% 0.89% 0.12% 0.83%
> 447 0.16% 3.50% 2.69% 3.66% 3.59%
> 453 0.11% 1.49% 0.45% -0.07% 0.78%
> 464 0.17% 0.75% 1.80% 1.86% 1.54%
> Code size:
>
> 300 450 600 750
> 403 0.56% 2.41% 2.74% 3.75%
> 433 0.96% 2.84% 4.19% 4.87%
> 445 2.16% 3.62% 4.48% 5.88%
> 447 2.96% 5.09% 6.74% 8.89%
> 453 0.94% 1.67% 2.73% 2.96%
> 464 8.02% 13.50% 20.51% 26.59%
> Compile time is proportional in the experiments and more noisy, so I did
> not include it.
>
> We have >2% speedup on some google internal benchmarks when switching
the
> threshold from 150 to 300.
>
> Dehao
>
> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at
google.com>
> wrote:
>
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at
apple.com> wrote:
>
>
>
> Another question is about PGO integration: is it already hooked there?
> Should we have a more aggressive threshold in a hot function? (Assuming
> we’re willing to spend some binary size there but not on the cold path).
>
>
> I would even wire the *unrolling* the other way: just suppress unrolling
> in cold paths to save binary size. rolled loops seem like a generally good
> thing in cold code unless they are having some larger impact (IE, the loop
> itself is more expensive than the unrolled form).
>
>
>
> Agree that we could suppress unrolling in cold path to save code size. But
> that's orthogonal with the propose here. This proposal focuses on O2
> performance: shall we have different (higher) fully unroll threshold than
> dynamic/partial unroll.
>
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me
> to differentiate the threshold for full unroll and the dynamic/partial
case.
>
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170213/6635f7d9/attachment-0001.html>

Gerolf Hoflehner via llvm-dev

2017-Feb-13 21:59 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

For threshold tuning I would like to see a little more insight than only running
the numbers, eg. what is the reason for the biggest gain + biggest loss? Maybe
another optimization could be improved and the unrolling factor is just hiding a
weakness (and fix that would be more beneficial).

For unrolling specifically I agree with Hal that the hooks should be target
specific. Actually, I go further and think they should be uArch specific. I have
no data or prove but would not be surprised to see a wider variety of numbers
when the thresholds are tested on a wide range of x86 machines.

My first thought also was along the lines of Matthias: do it at a higher opt
level e.g. O3 or possibly revisit/start thinking about O4.

Gerolf


> On Feb 13, 2017, at 10:56 AM, Chandler Carruth via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> FWIW, I'm good with the updated data, but I'd really like at least
someone from Apple and someone from ARM to chime in here... CC-ing random people
in the hope it helps...
> 
> On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> Thanks for the comment. The performance experiments were performed on Intel
Sandybridge. Updated this info to the patch description.
> 
> Dehao
> On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at rotateright.com
<mailto:spatel at rotateright.com>> wrote:
> Since we can override the settings, I have no objections. 
> 
> I still think it would be good to document here and in the review/commit
message which CPU model was used to acquire the experimental data. That could be
useful to anyone that comes along later and wants to reproduce and/or compare to
the original, motivating data.
> 
> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com
<mailto:dehao at google.com>> wrote:
> Thanks Hal, could you help approve https://reviews.llvm.org/D28368
<https://reviews.llvm.org/D28368>?
> 
> I'll hold off until early Tuesday in case other people have more
concerns.
> 
> Thanks,
> Dehao
> 
> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov
<mailto:hfinkel at anl.gov>> wrote:
> 
> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>> Thanks every for the comments.
>> 
>> Do we have a decision here?
> 
> You're good to go as far as I'm concerned.
> 
>  -Hal
> 
>> 
>> Dehao
>> 
>> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov
<mailto:hfinkel at anl.gov>> wrote:
>> 
>> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>>> Sorry if I missed it, but what machine/CPU are you using to collect
the perf numbers?
>>> 
>>> I am concerned that what may be a win on a CPU that keeps a couple
of hundred instructions in-flight and has many MB of caches will not hold for a
small core.
>> 
>> In my experience, unrolling tends to help weaker cores even more than
stronger ones because it allows the instruction scheduler more opportunities to
hide latency. Obviously, instruction-cache pressure is an important
consideration, but the code size changes here seems small.
>> 
>>> 
>>> Is the proposed change universal? Is there a way to undo it?
>> 
>> All of the unrolling thresholds should be target-adjustable using the
TTI::getUnrollingPreferences hook.
>> 
>>  -Hal
>> 
>> 
>>> 
>>> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>> Ping... with the updated code size impact data, any more comments?
Any more data that would be interesting to collect?
>>> 
>>> Thanks,
>>> Dehao
>>> 
>>> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com
<mailto:dehao at google.com>> wrote:
>>> Here is the code size impact for clang, chrome and 24 google
internal benchmarks (name omited, 14 15 16 are encoding/decoding benchmarks
similar as h264). There are 2 columns, for threshold 300 and 450 respectively.
>>> 
>>> I also tested the llvm test suite. Changing the threshold to
300/450 does not affect code gen for any binary in the test suite.
>>> 
>>> 
>>> 
>>> 300	450
>>> clang	0.30%	0.63%
>>> chrome	0.00%	0.00%
>>> 1	0.27%	0.67%
>>> 2	0.44%	0.93%
>>> 3	0.44%	0.93%
>>> 4	0.26%	0.53%
>>> 5	0.74%	2.21%
>>> 6	0.74%	2.21%
>>> 7	0.74%	2.21%
>>> 8	0.46%	1.05%
>>> 9	0.35%	0.86%
>>> 10	0.35%	0.86%
>>> 11	0.40%	0.83%
>>> 12	0.32%	0.65%
>>> 13	0.31%	0.64%
>>> 14	4.52%	8.23%
>>> 15	9.90%	19.38%
>>> 16	9.90%	19.38%
>>> 17	0.68%	1.97%
>>> 18	0.21%	0.48%
>>> 19	0.99%	3.44%
>>> 20	0.19%	0.46%
>>> 21	0.57%	1.62%
>>> 22	0.37%	1.05%
>>> 23	0.78%	1.30%
>>> 24	0.51%	1.54%
>>> 
>>> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>> 
>>>> clang, chrome, and some internal large apps are good candidates
for size metrics.
>>> I'd also add the standard LLVM testsuite just because it's
the suite everyone in the community can use.
>>> 
>>> Michael
>>>> 
>>>> David
>>>> 
>>>> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>> I had suggested having size metrics from somewhat larger
applications such as Chrome, Webkit, or Firefox; clang itself; and maybe some of
our internal binaries with rough size brackets?
>>>> 
>>>> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at
google.com <mailto:dehao at google.com>> wrote:
>>>> With the new data points, any comments on whether this can
justify setting fully inline threshold to 300 (or any other number) in O2? I can
collect more data points if it's helpful.
>>>> 
>>>> Thanks,
>>>> Dehao
>>>> 
>>>> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at
google.com <mailto:dehao at google.com>> wrote:
>>>> Recollected the data from trunk head with stddev data and more
threshold data points attached:
>>>> 
>>>> Performance:
>>>> 
>>>> stddev/mean	300	450	600	750
>>>> 403	0.37%	0.11%	0.11%	0.09%	0.79%
>>>> 433	0.14%	0.51%	0.25%	-0.63%	-0.29%
>>>> 445	0.08%	0.48%	0.89%	0.12%	0.83%
>>>> 447	0.16%	3.50%	2.69%	3.66%	3.59%
>>>> 453	0.11%	1.49%	0.45%	-0.07%	0.78%
>>>> 464	0.17%	0.75%	1.80%	1.86%	1.54%
>>>> Code size:
>>>> 
>>>> 300	450	600	750
>>>> 403	0.56%	2.41%	2.74%	3.75%
>>>> 433	0.96%	2.84%	4.19%	4.87%
>>>> 445	2.16%	3.62%	4.48%	5.88%
>>>> 447	2.96%	5.09%	6.74%	8.89%
>>>> 453	0.94%	1.67%	2.73%	2.96%
>>>> 464	8.02%	13.50%	20.51%	26.59%
>>>> Compile time is proportional in the experiments and more noisy,
so I did not include it.
>>>> 
>>>> We have >2% speedup on some google internal benchmarks when
switching the threshold from 150 to 300.
>>>> 
>>>> Dehao
>>>> 
>>>> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc
at google.com <mailto:chandlerc at google.com>> wrote:
>>>> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at
apple.com <mailto:mehdi.amini at apple.com>> wrote:
>>>>> 
>>>> 
>>>>> 
>>>>> Another question is about PGO integration: is it already
hooked there? Should we have a more aggressive threshold in a hot function?
(Assuming we’re willing to spend some binary size there but not on the cold
path).
>>>>> 
>>>>> I would even wire the *unrolling* the other way: just
suppress unrolling in cold paths to save binary size. rolled loops seem like a
generally good thing in cold code unless they are having some larger impact (IE,
the loop itself is more expensive than the unrolled form).
>>>>> 
>>>>> 
>>>>> Agree that we could suppress unrolling in cold path to save
code size. But that's orthogonal with the propose here. This proposal
focuses on O2 performance: shall we have different (higher) fully unroll
threshold than dynamic/partial unroll.
>>>> 
>>>> I agree that this is (to some extent) orthogonal, and it makes
sense to me to differentiate the threshold for full unroll and the
dynamic/partial case.
>>>> 
>>>> There is one issue that makes these not orthogonal.
>>>> 
>>>> If even *static* profile hints will reduce some of the code
size increase caused by higher unrolling thresholds for non-cold code, we should
factor that into the tradeoff in picking where the threshold goes.
>>>> 
>>>> However, getting PGO into the full unroller is currently
challenging outside of the new pass manager. We already have some unfortunate
hacks around this in LoopUnswitch that are making the port of it to the new PM
more annoying.
>>>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170213/22746def/attachment-0001.html>

Dehao Chen via llvm-dev

2017-Feb-13 22:16 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thanks for the inputs.

On Mon, Feb 13, 2017 at 1:59 PM, Gerolf Hoflehner <ghoflehner at
apple.com>
wrote:
> For threshold tuning I would like to see a little more insight than only
> running the numbers, eg. what is the reason for the biggest gain + biggest
> loss? Maybe another optimization could be improved and the unrolling factor
> is just hiding a weakness (and fix that would be more beneficial).
>
I analyzed the perf improvements on 2 google internal applications. The
benefits are simply from reduced dynamic instructions resulted from the
unrolled loop.

>
> For unrolling specifically I agree with Hal that the hooks should be
> target specific. Actually, I go further and think they should be uArch
> specific. I have no data or prove but would not be surprised to see a wider
> variety of numbers when the thresholds are tested on a wide range of x86
> machines.
>
I could provide more data on different x86 uArchs if necessary.

>
> My first thought also was along the lines of Matthias: do it at a higher
> opt level e.g. O3 or possibly revisit/start thinking about O4.
>
That was my first thought too. But after a second thought, if the code size
increase is small, and there is no data to show performance regression, I
could not think of a reason to block it from O2. Additionally, it does not
make sense to have the same threshold for full unroll and partial unroll
even in O2 (as reasoned in early posts of this thread). That's why I
started the patch and RFC to try to push this change to O2.

Thanks,
Dehao

>
> Gerolf
>
>
>
> On Feb 13, 2017, at 10:56 AM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> FWIW, I'm good with the updated data, but I'd really like at least
someone
> from Apple and someone from ARM to chime in here... CC-ing random people in
> the hope it helps...
>
> On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Thanks for the comment. The performance experiments were performed on
>> Intel Sandybridge. Updated this info to the patch description.
>>
>> Dehao
>> On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at
rotateright.com>
>> wrote:
>>
>> Since we can override the settings, I have no objections.
>>
>> I still think it would be good to document here and in the
review/commit
>> message which CPU model was used to acquire the experimental data. That
>> could be useful to anyone that comes along later and wants to reproduce
>> and/or compare to the original, motivating data.
>>
>>
>> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com>
wrote:
>>
>> Thanks Hal, could you help approve https://reviews.llvm.org/D28368?
>>
>> I'll hold off until early Tuesday in case other people have more
concerns.
>>
>> Thanks,
>> Dehao
>>
>> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>>
>>
>> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>>
>> Thanks every for the comments.
>>
>> Do we have a decision here?
>>
>>
>> You're good to go as far as I'm concerned.
>>
>>  -Hal
>>
>>
>> Dehao
>>
>> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>>
>>
>> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>>
>> Sorry if I missed it, but what machine/CPU are you using to collect the
>> perf numbers?
>>
>> I am concerned that what may be a win on a CPU that keeps a couple of
>> hundred instructions in-flight and has many MB of caches will not hold
for
>> a small core.
>>
>>
>> In my experience, unrolling tends to help weaker cores even more than
>> stronger ones because it allows the instruction scheduler more
>> opportunities to hide latency. Obviously, instruction-cache pressure is
an
>> important consideration, but the code size changes here seems small.
>>
>>
>> Is the proposed change universal? Is there a way to undo it?
>>
>>
>> All of the unrolling thresholds should be target-adjustable using the
>> TTI::getUnrollingPreferences hook.
>>
>>  -Hal
>>
>>
>>
>> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Ping... with the updated code size impact data, any more comments? Any
>> more data that would be interesting to collect?
>>
>> Thanks,
>> Dehao
>>
>> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>>
>> Here is the code size impact for clang, chrome and 24 google internal
>> benchmarks (name omited, 14 15 16 are encoding/decoding benchmarks
similar
>> as h264). There are 2 columns, for threshold 300 and 450 respectively.
>>
>> I also tested the llvm test suite. Changing the threshold to 300/450
does
>> not affect code gen for any binary in the test suite.
>>
>>
>>
>> 300 450
>> clang 0.30% 0.63%
>> chrome 0.00% 0.00%
>> 1 0.27% 0.67%
>> 2 0.44% 0.93%
>> 3 0.44% 0.93%
>> 4 0.26% 0.53%
>> 5 0.74% 2.21%
>> 6 0.74% 2.21%
>> 7 0.74% 2.21%
>> 8 0.46% 1.05%
>> 9 0.35% 0.86%
>> 10 0.35% 0.86%
>> 11 0.40% 0.83%
>> 12 0.32% 0.65%
>> 13 0.31% 0.64%
>> 14 4.52% 8.23%
>> 15 9.90% 19.38%
>> 16 9.90% 19.38%
>> 17 0.68% 1.97%
>> 18 0.21% 0.48%
>> 19 0.99% 3.44%
>> 20 0.19% 0.46%
>> 21 0.57% 1.62%
>> 22 0.37% 1.05%
>> 23 0.78% 1.30%
>> 24 0.51% 1.54%
>>
>> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> clang, chrome, and some internal large apps are good candidates for
size
>> metrics.
>>
>> I'd also add the standard LLVM testsuite just because it's the
suite
>> everyone in the community can use.
>>
>> Michael
>>
>>
>> David
>>
>> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> I had suggested having size metrics from somewhat larger applications
>> such as Chrome, Webkit, or Firefox; clang itself; and maybe some of our
>> internal binaries with rough size brackets?
>>
>> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com>
wrote:
>>
>> With the new data points, any comments on whether this can justify
>> setting fully inline threshold to 300 (or any other number) in O2? I
can
>> collect more data points if it's helpful.
>>
>> Thanks,
>> Dehao
>>
>> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com>
wrote:
>>
>> Recollected the data from trunk head with stddev data and more
threshold
>> data points attached:
>>
>> Performance:
>>
>> stddev/mean 300 450 600 750
>> 403 0.37% 0.11% 0.11% 0.09% 0.79%
>> 433 0.14% 0.51% 0.25% -0.63% -0.29%
>> 445 0.08% 0.48% 0.89% 0.12% 0.83%
>> 447 0.16% 3.50% 2.69% 3.66% 3.59%
>> 453 0.11% 1.49% 0.45% -0.07% 0.78%
>> 464 0.17% 0.75% 1.80% 1.86% 1.54%
>> Code size:
>>
>> 300 450 600 750
>> 403 0.56% 2.41% 2.74% 3.75%
>> 433 0.96% 2.84% 4.19% 4.87%
>> 445 2.16% 3.62% 4.48% 5.88%
>> 447 2.96% 5.09% 6.74% 8.89%
>> 453 0.94% 1.67% 2.73% 2.96%
>> 464 8.02% 13.50% 20.51% 26.59%
>> Compile time is proportional in the experiments and more noisy, so I
did
>> not include it.
>>
>> We have >2% speedup on some google internal benchmarks when
switching the
>> threshold from 150 to 300.
>>
>> Dehao
>>
>> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at
google.com>
>> wrote:
>>
>> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at
apple.com>
>> wrote:
>>
>>
>>
>> Another question is about PGO integration: is it already hooked there?
>> Should we have a more aggressive threshold in a hot function? (Assuming
>> we’re willing to spend some binary size there but not on the cold
path).
>>
>>
>> I would even wire the *unrolling* the other way: just suppress
unrolling
>> in cold paths to save binary size. rolled loops seem like a generally
good
>> thing in cold code unless they are having some larger impact (IE, the
loop
>> itself is more expensive than the unrolled form).
>>
>>
>>
>> Agree that we could suppress unrolling in cold path to save code size.
>> But that's orthogonal with the propose here. This proposal focuses
on O2
>> performance: shall we have different (higher) fully unroll threshold
than
>> dynamic/partial unroll.
>>
>>
>> I agree that this is (to some extent) orthogonal, and it makes sense to
>> me to differentiate the threshold for full unroll and the
dynamic/partial
>> case.
>>
>>
>> There is one issue that makes these not orthogonal.
>>
>> If even *static* profile hints will reduce some of the code size
increase
>> caused by higher unrolling thresholds for non-cold code, we should
factor
>> that into the tradeoff in picking where the threshold goes.
>>
>> However, getting PGO into the full unroller is currently challenging
>> outside of the new pass manager. We already have some unfortunate hacks
>> around this in LoopUnswitch that are making the port of it to the new
PM
>> more annoying.
>>
>>
>> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170213/308d46f0/attachment-0001.html>

Chandler Carruth via llvm-dev

2017-Feb-13 22:17 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

On Mon, Feb 13, 2017 at 2:06 PM Gerolf Hoflehner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> For unrolling specifically I agree with Hal that the hooks should be
> target specific. Actually, I go further and think they should be uArch
> specific.
>
They already are, it is just that no one has contributed a patch to use
this on x86 microarchitectures.

Until someone shows up with data showing that we need different tunings for
different microarchitectures, it doesn't make sense for us to just make up
numbers there.

On the (very limited) microarchitectures we have and can test on, we're not
seeing a need for microarchitectural tuning. But if others have different
data, that would of course be welcome. That's part of what we're looking
for in this thread.

> I have no data or prove but would not be surprised to see a wider variety
> of numbers when the thresholds are tested on a wide range of x86 machines.
>
Until we have data, I don't see how we can act on this though.

> My first thought also was along the lines of Matthias: do it at a higher
> opt level e.g. O3 or possibly revisit/start thinking about O4.
>
Why? What about the data presented means that this isn't appropriate at O2?
I'm fine if that's the answer, but I think we need to have a clear and
unambiguous rationale behind it. With the current data on this thread, the
code size and compile time impact seem *very small* except for very small
benchmarks, many of which actually show the performance improvement as well.
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170213/aebc122d/attachment.html>

Kristof Beyls via llvm-dev

2017-Feb-14 19:36 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

I've run the patch on https://reviews.llvm.org/D28368 on the test-suite and
other benchmarks, for AArch64 -O3 -fomit-frame-pointer, both for Cortex-A53 and
Cortex-A57.

The geomean over the few hundred programs in there is roughly the same for
Cortex-A53 and Cortex-A57: a bit over 1% improvement in execution speed for a
bit over 5% increase in code size.
Obviously I wouldn't want this for optimization levels where code size is of
any concern, like -Os or -Oz, but don't have a problem with this going in
for other optimization levels where this isn't a concern.

Thanks,

Kristof




On 13 Feb 2017, at 19:56, Chandler Carruth via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

FWIW, I'm good with the updated data, but I'd really like at least
someone from Apple and someone from ARM to chime in here... CC-ing random people
in the hope it helps...

On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Thanks for the comment. The performance experiments were performed on Intel
Sandybridge. Updated this info to the patch description.

Dehao
On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at
rotateright.com<mailto:spatel at rotateright.com>> wrote:
Since we can override the settings, I have no objections.

I still think it would be good to document here and in the review/commit message
which CPU model was used to acquire the experimental data. That could be useful
to anyone that comes along later and wants to reproduce and/or compare to the
original, motivating data.

On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:
Thanks Hal, could you help approve https://reviews.llvm.org/D28368?

I'll hold off until early Tuesday in case other people have more concerns.

Thanks,
Dehao

On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>> wrote:


On 02/10/2017 05:21 PM, Dehao Chen wrote:
Thanks every for the comments.

Do we have a decision here?

You're good to go as far as I'm concerned.

 -Hal


Dehao

On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>> wrote:


On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
Sorry if I missed it, but what machine/CPU are you using to collect the perf
numbers?

I am concerned that what may be a win on a CPU that keeps a couple of hundred
instructions in-flight and has many MB of caches will not hold for a small core.

In my experience, unrolling tends to help weaker cores even more than stronger
ones because it allows the instruction scheduler more opportunities to hide
latency. Obviously, instruction-cache pressure is an important consideration,
but the code size changes here seems small.


Is the proposed change universal? Is there a way to undo it?

All of the unrolling thresholds should be target-adjustable using the
TTI::getUnrollingPreferences hook.

 -Hal



On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Ping... with the updated code size impact data, any more comments? Any more data
that would be interesting to collect?

Thanks,
Dehao

On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:
Here is the code size impact for clang, chrome and 24 google internal benchmarks
(name omited, 14 15 16 are encoding/decoding benchmarks similar as h264). There
are 2 columns, for threshold 300 and 450 respectively.

I also tested the llvm test suite. Changing the threshold to 300/450 does not
affect code gen for any binary in the test suite.



        300     450
clang   0.30%   0.63%
chrome  0.00%   0.00%
1       0.27%   0.67%
2       0.44%   0.93%
3       0.44%   0.93%
4       0.26%   0.53%
5       0.74%   2.21%
6       0.74%   2.21%
7       0.74%   2.21%
8       0.46%   1.05%
9       0.35%   0.86%
10      0.35%   0.86%
11      0.40%   0.83%
12      0.32%   0.65%
13      0.31%   0.64%
14      4.52%   8.23%
15      9.90%   19.38%
16      9.90%   19.38%
17      0.68%   1.97%
18      0.21%   0.48%
19      0.99%   3.44%
20      0.19%   0.46%
21      0.57%   1.62%
22      0.37%   1.05%
23      0.78%   1.30%
24      0.51%   1.54%

On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

clang, chrome, and some internal large apps are good candidates for size
metrics.
I'd also add the standard LLVM testsuite just because it's the suite
everyone in the community can use.

Michael

David

On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
I had suggested having size metrics from somewhat larger applications such as
Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
binaries with rough size brackets?

On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:
With the new data points, any comments on whether this can justify setting fully
inline threshold to 300 (or any other number) in O2? I can collect more data
points if it's helpful.

Thanks,
Dehao

On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:
Recollected the data from trunk head with stddev data and more threshold data
points attached:

Performance:

        stddev/mean     300     450     600     750
403     0.37%   0.11%   0.11%   0.09%   0.79%
433     0.14%   0.51%   0.25%   -0.63%  -0.29%
445     0.08%   0.48%   0.89%   0.12%   0.83%
447     0.16%   3.50%   2.69%   3.66%   3.59%
453     0.11%   1.49%   0.45%   -0.07%  0.78%
464     0.17%   0.75%   1.80%   1.86%   1.54%

Code size:

        300     450     600     750
403     0.56%   2.41%   2.74%   3.75%
433     0.96%   2.84%   4.19%   4.87%
445     2.16%   3.62%   4.48%   5.88%
447     2.96%   5.09%   6.74%   8.89%
453     0.94%   1.67%   2.73%   2.96%
464     8.02%   13.50%  20.51%  26.59%

Compile time is proportional in the experiments and more noisy, so I did not
include it.

We have >2% speedup on some google internal benchmarks when switching the
threshold from 150 to 300.

Dehao

On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at
google.com<mailto:chandlerc at google.com>> wrote:
On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at
apple.com<mailto:mehdi.amini at apple.com>> wrote:


Another question is about PGO integration: is it already hooked there? Should we
have a more aggressive threshold in a hot function? (Assuming we’re willing to
spend some binary size there but not on the cold path).

I would even wire the *unrolling* the other way: just suppress unrolling in cold
paths to save binary size. rolled loops seem like a generally good thing in cold
code unless they are having some larger impact (IE, the loop itself is more
expensive than the unrolled form).


Agree that we could suppress unrolling in cold path to save code size. But
that's orthogonal with the propose here. This proposal focuses on O2
performance: shall we have different (higher) fully unroll threshold than
dynamic/partial unroll.

I agree that this is (to some extent) orthogonal, and it makes sense to me to
differentiate the threshold for full unroll and the dynamic/partial case.

There is one issue that makes these not orthogonal.

If even *static* profile hints will reduce some of the code size increase caused
by higher unrolling thresholds for non-cold code, we should factor that into the
tradeoff in picking where the threshold goes.

However, getting PGO into the full unroller is currently challenging outside of
the new pass manager. We already have some unfortunate hacks around this in
LoopUnswitch that are making the port of it to the new PM more annoying.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170214/09fb7087/attachment-0001.html>

Chandler Carruth via llvm-dev

2017-Feb-15 18:10 UTC

head link

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thanks for running these Kristof!

I'd still like to hear from Apple, and if we can get a few more x86
micro-architectures covered that'd be great, but it looks like -O3 is
uncontroversial, and the question is whether this makes sense at O2...

To me, it would help a lot to know the actual breakdown of benchmarks such
as yours Kristof (as they seem to have more codesize impact than others
have mentioned). Specificially, are the runtime improvements correlated
with the codesize increases? And what are the absolute size deltas? For
*very* small benchmarks, a 5% code size fluctuation seems less concerning
than for a larger benchmark. If the larger code size changes are mostly
smaller benchmarks and reasonably correlated to the ones likely to see
improvement from the change (this seemed to be the case w/ Dehao's data on
x86 for example) that would to me indicate this makes sense at O2.

Note that I'm fine if you have to list the benchmarks as "1, 2, 3,
..." or
whatever, much like we did for Google-internal benchmarks. It's still
useful to know the shape of the change.

On Tue, Feb 14, 2017 at 1:06 PM Kristof Beyls via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> I've run the patch on https://reviews.llvm.org/D28368 on the test-suite
> and other benchmarks, for AArch64 -O3 -fomit-frame-pointer, both for
> Cortex-A53 and Cortex-A57.
>
> The geomean over the few hundred programs in there is roughly the same for
> Cortex-A53 and Cortex-A57: a bit over 1% improvement in execution speed for
> a bit over 5% increase in code size.
> Obviously I wouldn't want this for optimization levels where code size
is
> of any concern, like -Os or -Oz, but don't have a problem with this
going
> in for other optimization levels where this isn't a concern.
>
> Thanks,
>
> Kristof
>
>
>
>
> On 13 Feb 2017, at 19:56, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> FWIW, I'm good with the updated data, but I'd really like at least
someone
> from Apple and someone from ARM to chime in here... CC-ing random people in
> the hope it helps...
>
> On Mon, Feb 13, 2017 at 8:30 AM Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Thanks for the comment. The performance experiments were performed on
> Intel Sandybridge. Updated this info to the patch description.
>
> Dehao
> On Sun, Feb 12, 2017 at 8:24 AM, Sanjay Patel <spatel at
rotateright.com>
> wrote:
>
> Since we can override the settings, I have no objections.
>
> I still think it would be good to document here and in the review/commit
> message which CPU model was used to acquire the experimental data. That
> could be useful to anyone that comes along later and wants to reproduce
> and/or compare to the original, motivating data.
>
> On Fri, Feb 10, 2017 at 4:53 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Thanks Hal, could you help approve https://reviews.llvm.org/D28368?
>
> I'll hold off until early Tuesday in case other people have more
concerns.
>
> Thanks,
> Dehao
>
> On Fri, Feb 10, 2017 at 3:23 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> On 02/10/2017 05:21 PM, Dehao Chen wrote:
>
> Thanks every for the comments.
>
> Do we have a decision here?
>
>
> You're good to go as far as I'm concerned.
>
>  -Hal
>
>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>
> Sorry if I missed it, but what machine/CPU are you using to collect the
> perf numbers?
>
> I am concerned that what may be a win on a CPU that keeps a couple of
> hundred instructions in-flight and has many MB of caches will not hold for
> a small core.
>
>
> In my experience, unrolling tends to help weaker cores even more than
> stronger ones because it allows the instruction scheduler more
> opportunities to hide latency. Obviously, instruction-cache pressure is an
> important consideration, but the code size changes here seems small.
>
>
> Is the proposed change universal? Is there a way to undo it?
>
>
> All of the unrolling thresholds should be target-adjustable using the
> TTI::getUnrollingPreferences hook.
>
>  -Hal
>
> On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Ping... with the updated code size impact data, any more comments? Any
> more data that would be interesting to collect?
>
> Thanks,
> Dehao
>
> On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Here is the code size impact for clang, chrome and 24 google internal
> benchmarks (name omited, 14 15 16 are encoding/decoding benchmarks similar
> as h264). There are 2 columns, for threshold 300 and 450 respectively.
>
> I also tested the llvm test suite. Changing the threshold to 300/450 does
> not affect code gen for any binary in the test suite.
>
>
>
> 300 450
> clang 0.30% 0.63%
> chrome 0.00% 0.00%
> 1 0.27% 0.67%
> 2 0.44% 0.93%
> 3 0.44% 0.93%
> 4 0.26% 0.53%
> 5 0.74% 2.21%
> 6 0.74% 2.21%
> 7 0.74% 2.21%
> 8 0.46% 1.05%
> 9 0.35% 0.86%
> 10 0.35% 0.86%
> 11 0.40% 0.83%
> 12 0.32% 0.65%
> 13 0.31% 0.64%
> 14 4.52% 8.23%
> 15 9.90% 19.38%
> 16 9.90% 19.38%
> 17 0.68% 1.97%
> 18 0.21% 0.48%
> 19 0.99% 3.44%
> 20 0.19% 0.46%
> 21 0.57% 1.62%
> 22 0.37% 1.05%
> 23 0.78% 1.30%
> 24 0.51% 1.54%
>
> On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Feb 1, 2017, at 4:57 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> clang, chrome, and some internal large apps are good candidates for size
> metrics.
>
> I'd also add the standard LLVM testsuite just because it's the
suite
> everyone in the community can use.
>
> Michael
>
>
> David
>
> On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I had suggested having size metrics from somewhat larger applications such
> as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal
> binaries with rough size brackets?
>
> On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen <dehao at google.com>
wrote:
>
> With the new data points, any comments on whether this can justify setting
> fully inline threshold to 300 (or any other number) in O2? I can collect
> more data points if it's helpful.
>
> Thanks,
> Dehao
>
> On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Recollected the data from trunk head with stddev data and more threshold
> data points attached:
>
> Performance:
>
> stddev/mean 300 450 600 750
> 403 0.37% 0.11% 0.11% 0.09% 0.79%
> 433 0.14% 0.51% 0.25% -0.63% -0.29%
> 445 0.08% 0.48% 0.89% 0.12% 0.83%
> 447 0.16% 3.50% 2.69% 3.66% 3.59%
> 453 0.11% 1.49% 0.45% -0.07% 0.78%
> 464 0.17% 0.75% 1.80% 1.86% 1.54%
> Code size:
>
> 300 450 600 750
> 403 0.56% 2.41% 2.74% 3.75%
> 433 0.96% 2.84% 4.19% 4.87%
> 445 2.16% 3.62% 4.48% 5.88%
> 447 2.96% 5.09% 6.74% 8.89%
> 453 0.94% 1.67% 2.73% 2.96%
> 464 8.02% 13.50% 20.51% 26.59%
> Compile time is proportional in the experiments and more noisy, so I did
> not include it.
>
> We have >2% speedup on some google internal benchmarks when switching
the
> threshold from 150 to 300.
>
> Dehao
>
> On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at
google.com>
> wrote:
>
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at
apple.com> wrote:
>
>
>
> Another question is about PGO integration: is it already hooked there?
> Should we have a more aggressive threshold in a hot function? (Assuming
> we’re willing to spend some binary size there but not on the cold path).
>
>
> I would even wire the *unrolling* the other way: just suppress unrolling
> in cold paths to save binary size. rolled loops seem like a generally good
> thing in cold code unless they are having some larger impact (IE, the loop
> itself is more expensive than the unrolled form).
>
>
>
> Agree that we could suppress unrolling in cold path to save code size. But
> that's orthogonal with the propose here. This proposal focuses on O2
> performance: shall we have different (higher) fully unroll threshold than
> dynamic/partial unroll.
>
>
> I agree that this is (to some extent) orthogonal, and it makes sense to me
> to differentiate the threshold for full unroll and the dynamic/partial
case.
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170215/1c41621b/attachment-0001.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Feb 2017 - (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Possibly Parallel Threads