thr3ads.net - llvm dev - [llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53 [Jun 2017]

Kristof Beyls via llvm-dev

2017-Jun-01 06:37 UTC

[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53

Thanks to everyone for their feedback!
I saw pretty unanimous support for making -mcpu=generic the default and making
-mcpu=generic schedule for an in-order CPU (Cortex-A8 in this case).
I'll be making those changes shortly.

I think the comments also make clear that it's less obvious whether we'd
want -mcpu=native to become a default. It's probably good for some use
cases, but really not good for other use cases. I won't be making that
change, nor will I advocate for it.

Thanks!

Kristof


On 31 May 2017, at 17:57, Stephen Hines <srhines at google.com> wrote:

Wow, these are some fantastic results! Android is definitely in favor of fixing
the defaults, so this proposal looks great from our perspective.

Thanks,
Steve

On Wed, May 31, 2017 at 5:57 AM, Kristof Beyls <Kristof.Beyls at arm.com> wrote:

Motivation

At the moment, when targeting armv7a, clang defaults to generating code as if
-mcpu=cortex-a8 was specified.
When targeting armv8a, it defaults to generating code as if -mcpu=cortex-a53
was specified.

This leads to surprising code generation: the compiler optimizes for a
specific micro-architecture, whereas the user probably intended to generate
code that is "blended" for all the cores implementing the requested
architecture. One example of a user being surprised like this is at
https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla instructions are not
produced, as an optimization for a Cortex-A8-specific micro-architectural
behaviour, even though the user didn't ask to optimize specifically for
Cortex-A8.

It would be much cleaner conceptually if clang defaulted to -mcpu=generic
when no specific CPU is specified.

What is the impact of this change on execution speed?

I think the main reason to be hesitant to change the default CPU for ARM to
-mcpu=generic is the potential impact on performance of generated code.

I've measured quite a wide selection of benchmarks with this change, on the
following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.

Impact on execution speed, for each core, when using -march=armv7a, after
changing the default cpu from cortex-a8 to generic, is as follows.
A positive number means a speedup, a negative number means a slowdown. These
are the geomean results over 350 programs coming from benchmark suites such as
the test-suite, SPEC2000, SPEC2006 and a range of proprietary suites.

Cortex-A9: 0.96%
Cortex-A53: -0.64%
Cortex-A57: 1.04%
Cortex-A72: 1.17%
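
For readers unfamiliar with the metric, a geomean speedup of this kind can be computed roughly as follows. This is a minimal sketch: it assumes the per-program speedup is defined as old execution time divided by new execution time, the timings are made-up placeholders rather than data from these measurements, and `geomean_speedup_percent` is a hypothetical helper name.

```python
import math


def geomean_speedup_percent(baseline_times, new_times):
    """Geometric mean of per-program speedups, as a percentage.

    Positive means the new configuration is faster, negative means slower.
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two non-empty lists of equal length")
    # Per-program speedup ratio: old time / new time (> 1 means faster).
    log_sum = sum(
        math.log(old / new) for old, new in zip(baseline_times, new_times)
    )
    geomean_ratio = math.exp(log_sum / len(baseline_times))
    return (geomean_ratio - 1.0) * 100.0


# Hypothetical execution times in seconds for three programs.
baseline = [10.0, 20.0, 5.0]   # e.g. built with the old default CPU
candidate = [9.9, 20.2, 4.9]   # e.g. built with -mcpu=generic
result = geomean_speedup_percent(baseline, candidate)
print(f"{result:+.2f}%")
```

The geometric mean is used rather than the arithmetic mean so that a program running twice as fast and another running twice as slow cancel out exactly.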

Impact on execution speed, for each core, when using -march=armv8a, after
changing the default cpu from cortex-a53 to generic:

(Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
Cortex-A53: -0.09%
Cortex-A57: -0.12%
Cortex-A72: 0.03%

Should we enable scheduling for an in-order core even for -mcpu=generic?

The above measurements show that the biggest negative impact is with
-march=armv7a on Cortex-A53: -0.64%.
It seems that the in-order Cortex-A53 core loses quite a bit of performance
when the instructions aren't scheduled, which is to be expected.
Therefore, I also experimented with scheduling instructions according to the
Cortex-A8 pipeline model, even for -mcpu=generic, to find out whether it's
beneficial to schedule instructions for an in-order core rather than not
scheduling them at all.
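
Why instruction order matters so much on an in-order core can be seen in a toy model. The sketch below is only an illustration: it models a hypothetical single-issue in-order pipeline with made-up latencies, not the Cortex-A8 or Cortex-A53 pipeline, and the register names and instruction tuples are invented for the example.

```python
def in_order_cycles(instructions, latency):
    """Cycle count for a toy single-issue, in-order core.

    Each instruction is (dest, sources, opcode). An instruction issues one
    cycle after the previous one at the earliest, and stalls until all of
    its source registers are ready.
    """
    ready = {}   # register -> cycle at which its value becomes available
    cycle = 0    # earliest cycle at which the next instruction may issue
    finish = 0
    for dest, sources, op in instructions:
        issue = max([cycle] + [ready.get(s, 0) for s in sources])
        done = issue + latency[op]
        ready[dest] = done
        cycle = issue + 1
        finish = max(finish, done)
    return finish


LATENCY = {"load": 3, "add": 1}  # assumed latencies, for illustration only

# Each add directly follows the load it depends on, so the core stalls twice.
unscheduled = [
    ("r0", ["p0"], "load"),
    ("r1", ["r0"], "add"),
    ("r2", ["p1"], "load"),
    ("r3", ["r2"], "add"),
]
# Same instructions, reordered so the second load overlaps the first.
scheduled = [unscheduled[0], unscheduled[2], unscheduled[1], unscheduled[3]]

unscheduled_cycles = in_order_cycles(unscheduled, LATENCY)
scheduled_cycles = in_order_cycles(scheduled, LATENCY)
print(unscheduled_cycles, scheduled_cycles)
```

An out-of-order core can find this overlap in hardware, which is why a schedule tuned for an in-order core tends to cost it little.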

Measurement results:

-march=armv7a

Cortex-A9: 1.57% (up from 0.96%)
Cortex-A53: 0.47% (up from -0.64%)
Cortex-A57: 1.74% (up from 1.04%)
Cortex-A72: 1.72% (up from 1.17%)

-march=armv8a (Note that there isn't a pipeline model for Cortex-A53 in the
32-bit ARM backend):

(Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
Cortex-A53: 0.49% (up from -0.09%)
Cortex-A57: 0.09% (up from -0.12%)
Cortex-A72: 0.20% (up from 0.03%)

Conclusion: for all the in-order and out-of-order cores I measured, it's
beneficial to get the instructions scheduled using the Cortex-A8 pipeline model
in combination with -mcpu=generic.
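
The two sets of figures above can be combined to show what the Cortex-A8 scheduling model contributes on its own. The sketch below only does arithmetic on the percentages quoted in this mail; it assumes both figures for a core are relative to the same baseline and the same set of programs, and the names `RESULTS` and `scheduling_gain_percent` are illustrative.

```python
# Geomean speedups (percent) versus the old default, copied from the mail.
# Each tuple is (generic without scheduling, generic with A8 scheduling).
RESULTS = {
    "armv7a": {
        "Cortex-A9": (0.96, 1.57),
        "Cortex-A53": (-0.64, 0.47),
        "Cortex-A57": (1.04, 1.74),
        "Cortex-A72": (1.17, 1.72),
    },
    "armv8a": {
        "Cortex-A53": (-0.09, 0.49),
        "Cortex-A57": (-0.12, 0.09),
        "Cortex-A72": (0.03, 0.20),
    },
}


def scheduling_gain_percent(unscheduled, scheduled):
    """Relative gain of the scheduled build over the unscheduled one."""
    return ((1 + scheduled / 100) / (1 + unscheduled / 100) - 1) * 100


gains = {
    (arch, core): scheduling_gain_percent(*pair)
    for arch, cores in RESULTS.items()
    for core, pair in cores.items()
}

for (arch, core), gain in gains.items():
    print(f"-march={arch} {core}: {gain:+.2f}%")

# Scheduling helps every measured core, and with scheduling enabled no core
# is slower than it was with the old default CPU.
all_gain = all(g > 0 for g in gains.values())
no_regression = all(
    scheduled > 0 for cores in RESULTS.values() for _, scheduled in cores.values()
)
```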


Taking into account the above measurements, my conclusions are:
1. We should make -mcpu=generic the default cpu, rather than Cortex-A8 or
Cortex-A53, for -march=armv7a and -march=armv8a.
2. We probably want to let the compiler schedule instructions using the
Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup on
all cores tested.

Do people agree with these conclusions?
Any objections against implementing this?
Any other potential impact this may have that I forgot to consider above?

Thanks,

Kristof


Renato Golin via llvm-dev

2017-Jun-01 09:17 UTC

On 1 June 2017 at 07:37, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
> I think the comments also make clear that it's less obvious whether we'd
> want -mcpu=native to become a default. It's probably good for some use
> cases, but really not good for other use cases. I won't be making that
> change, nor advocate for it.

That was just me and I am now thoroughly convinced it's not a good idea. :)

Please, proceed as planned.

Thanks Kristof, for the detailed investigation and everyone for their comments.

cheers,
--renato

Evandro Menezes via llvm-dev

2017-Jun-01 20:23 UTC

Hi, Kristof.

It sounds like a good plan, but one thing is not clear to me from your 
post.  Which pipeline model will be used for AArch64, A53's (i.e., none)?

Thank you,

-- 
Evandro Menezes

Kristof Beyls via llvm-dev

2017-Jun-20 14:05 UTC

Hi Evandro,

For now, I'm only looking at AArch32, not AArch64.
We could indeed also perform in-order scheduling for -mcpu=generic on AArch64,
and Cortex-A53 seems to be the best (and only) pipeline model available for
that.
But making that change would first require another extensive round of
benchmarking.

So in summary: I'll put the idea on my backlog, but I probably won't
have time to get all the benchmarking done in the very near future.

Thanks,

Kristof
