thr3ads.net - llvm dev - [llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53 [May 2017]


Kristof Beyls via llvm-dev

2017-May-31 12:57 UTC

[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53

Motivation

At the moment, when targeting armv7a, clang defaults to generate code as if
-mcpu=cortex-a8 was specified.
When targeting armv8a, it defaults to generate code as if -mcpu=cortex-a53 was
specified.

This leads to surprising code generation, by the compiler optimizing for a
specific micro-architecture, whereas the intent from the user was probably to
generate code that is "blended" for all the cores implementing the
requested architecture. One example of a user being surprised like this is at
https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla's are not produced
to optimize for a Cortex-A8-specific micro-architectural behaviour, even though
the user didn't request to optimize specifically for Cortex-A8.

It would be much cleaner conceptually if clang would default to -mcpu=generic
when no specific cpu is specified.

What is the impact of this change on execution speed?

I think the main reason to be hesitant to change the default CPU for ARM to
-mcpu=generic is the potential impact on performance of generated code.

I've measured quite a wide selection of benchmarks with this change, on the
following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.

Impact on execution speed, for each core, when using -march=armv7a, after
changing the default cpu from cortex-a8 to generic is as follows.
A positive number means speedup, a negative number means slow-down. These are
the geomean results over 350 programs coming from benchmark suites such as the
test-suite, SPEC2000, SPEC2006 and a range of proprietary suites.

Cortex-A9: 0.96%
Cortex-A53: -0.64%
Cortex-A57: 1.04%
Cortex-A72: 1.17%
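
For readers unfamiliar with geomean reporting, the following is a minimal
sketch of how such a figure can be computed. The post does not spell out the
exact formula, so the per-program ratio (old time divided by new time) is an
assumption, and the timings below are made up for illustration; they are not
from the measurements above.

```python
import math


def geomean_speedup_percent(baseline_times, new_times):
    """Geometric-mean speedup of new over baseline, as a percentage.

    Positive means the new configuration is faster, negative means slower.
    Assumes speedup per program is baseline_time / new_time.
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two non-empty lists of equal length")
    # Sum of logs avoids overflow/underflow when multiplying many ratios.
    log_sum = sum(math.log(old / new)
                  for old, new in zip(baseline_times, new_times))
    geomean_ratio = math.exp(log_sum / len(baseline_times))
    return (geomean_ratio - 1.0) * 100.0


# Hypothetical execution times in seconds (illustrative only).
cortex_a8_default = [10.0, 20.0, 5.0]
generic_default = [9.9, 19.8, 5.1]
print(f"{geomean_speedup_percent(cortex_a8_default, generic_default):+.2f}%")
```

The geometric mean is the usual choice here because it treats a 2x speedup on
one program and a 2x slowdown on another as cancelling out, regardless of how
long each program runs.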

Impact on execution speed, for each core, when using -march=armv8a, after
changing the default cpu from cortex-a53 to generic:

(Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
Cortex-A53: -0.09%
Cortex-A57: -0.12%
Cortex-A72: 0.03%

Should we enable scheduling for an in-order core even for -mcpu=generic?

The above measurements show that the biggest negative impact is with
-march=armv7a on Cortex-A53: -0.64%.
It seems that the in-order Cortex-A53 core is losing quite a bit of performance
when the instructions aren't scheduled - which is to be expected.
Therefore, I also experimented with letting instructions be scheduled according
to the Cortex-A8 pipeline model, even for -mcpu=generic, trying to figure out if
it's beneficial to schedule instructions for an in-order core rather than
not trying to schedule them at all, for -mcpu=generic.

Measurement results:

-march=armv7a

Cortex-A9: 1.57% (up from 0.96%)
Cortex-A53: 0.47% (up from -0.64%)
Cortex-A57: 1.74% (up from 1.04%)
Cortex-A72: 1.72% (up from 1.17%)

-march=armv8a (Note that there isn't a pipeline model for Cortex-A53 in the
32-bit ARM backend):

(Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
Cortex-A53: 0.49% (up from -0.09%)
Cortex-A57: 0.09% (up from -0.12%)
Cortex-A72: 0.20% (up from 0.03%)

Conclusion: for all the in-order and out-of-order cores I measured, it's
beneficial to get the instructions scheduled using the Cortex-A8 pipeline model
in combination with -mcpu=generic.
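
The gain attributable to scheduling alone can be read off by subtracting the
two sets of numbers. The sketch below does that arithmetic using the figures
quoted in this message; the differences are in percentage points and are only
an approximation, since both columns are relative to the old default rather
than to each other.

```python
# Geomean speed changes (%) quoted above, relative to the old default CPU.
# Each tuple is (plain -mcpu=generic, -mcpu=generic + Cortex-A8 scheduling).
results = {
    "armv7a": {
        "Cortex-A9": (0.96, 1.57),
        "Cortex-A53": (-0.64, 0.47),
        "Cortex-A57": (1.04, 1.74),
        "Cortex-A72": (1.17, 1.72),
    },
    "armv8a": {
        "Cortex-A53": (-0.09, 0.49),
        "Cortex-A57": (-0.12, 0.09),
        "Cortex-A72": (0.03, 0.20),
    },
}


def scheduling_gain(table):
    """Percentage-point gain from adding Cortex-A8 scheduling, per arch/core."""
    return {
        arch: {core: round(sched - plain, 2)
               for core, (plain, sched) in cores.items()}
        for arch, cores in table.items()
    }


gains = scheduling_gain(results)
for arch, cores in gains.items():
    for core, gain in cores.items():
        print(f"-march={arch} {core}: {gain:+.2f} points")
```

Every entry comes out positive, which is the basis for the conclusion above.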


Taking into account the above measurements, my conclusions are:
1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53, for
-march=armv7a and -march=armv8a.
2. We probably want to let the compiler schedule instructions using the
Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup on
all cores tested.

Do people agree with these conclusions?
Any objections against implementing this?
Any other potential impact this may have that I forgot to consider above?

Thanks,

Kristof

Renato Golin via llvm-dev

2017-May-31 13:35 UTC

On 31 May 2017 at 13:57, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
> Taking into account the above measurements, my conclusions are:
> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
> for march=armv7a and march=armv8a.

Using -mcpu=native makes more sense to me, if at all possible to
detect, falling back to generic, which doesn't hurt.

> 2. We probably want to let the compiler schedule instructions using the
> Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup
> on all cores tested.

Same here, I'd use the schedule of the detected CPU, if any, or fall
back to A8 (which seems fine).

But yeah, it's time we get rid of the A8/A53 defaults.

While we're at it, we may think about ARMv7's NEON default. Generating
only VFP is slower on boards with NEON, but generating NEON crashes
with SIGILL on boards that don't have it.

I'd be happy if Clang could detect CPU/FPU and set the flags
accordingly, or fall back to "generic"/A8-schedule/VFP defaults.

cheers,
--renato
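
The detect-then-fall-back idea in the message above could look roughly like
the sketch below. It is purely illustrative: the helper names are
hypothetical, this is not something Clang does, and it assumes a 32-bit ARM
Linux host whose /proc/cpuinfo has a "Features" line listing "neon" when NEON
is present.

```python
def pick_fpu_default(cpuinfo_text):
    """Return "neon" if the cpuinfo text advertises NEON, else "vfp".

    Falls back to "vfp" when nothing can be detected, which is the
    conservative choice: VFP-only code is slower but will not SIGILL.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            if "neon" in value.split():
                return "neon"
    return "vfp"


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning "" if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


# On a non-ARM or non-Linux host this simply reports the fallback.
print(pick_fpu_default(read_cpuinfo()))
```

Note that a default derived this way depends on the machine the compiler runs
on, so the same command line can produce different code on different hosts.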

Eric Christopher via llvm-dev

2017-May-31 15:02 UTC

On Wed, May 31, 2017 at 6:35 AM Renato Golin <renato.golin at linaro.org> wrote:
> On 31 May 2017 at 13:57, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
> > Taking into account the above measurements, my conclusions are:
> > 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
> > for march=armv7a and march=armv8a.
>
> Using -mcpu=native makes more sense to me, if at all possible to
> detect, falling back to generic, which doesn't hurt.

Ultimately either solution is fine with me. If Kristof wanted to switch it
to generic while getting the autodetection stuff up that would also be ok.

-eric


Evandro Menezes via llvm-dev

2017-May-31 15:23 UTC

On 05/31/2017 08:35 AM, Renato Golin wrote:
> On 31 May 2017 at 13:57, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>> Taking into account the above measurements, my conclusions are:
>> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
>> for march=armv7a and march=armv8a.
> Using -mcpu=native makes more sense to me, if at all possible to
> detect, falling back to generic, which doesn't hurt.

For the sake of predictability, methinks that it'd make more sense for
the default to always mean the same thing for everyone, as Kristof
suggested.


-- 
Evandro Menezes

Evandro Menezes via llvm-dev

2017-May-31 15:25 UTC

Hi, Kristof.

I think that it makes sense.  Your results also somehow corroborate the 
model adopted in GCC for the generic tuning, especially WRT scheduling 
in order.

Thank you,

-- 
Evandro Menezes


Stephen Hines via llvm-dev

2017-May-31 15:57 UTC

Wow, these are some fantastic results! Android is definitely in favor of
fixing the defaults, so this proposal looks great from our perspective.

Thanks,
Steve


Kristof Beyls via llvm-dev

2017-Jun-01 06:37 UTC

Thanks to everyone for their feedback!
I saw pretty unanimous support for making -mcpu=generic the default and making
-mcpu=generic schedule for an in-order CPU (Cortex-A8 in this case).
I'll be making those changes shortly.

I think the comments also make clear that it's less obvious whether we'd
want -mcpu=native to become a default. It's probably good for some use
cases, but really not good for other use cases. I won't be making that
change, nor advocate for it.

Thanks!

Kristof


