Kristof Beyls via llvm-dev
2017-May-31 12:57 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Motivation

At the moment, when targeting armv7a, clang defaults to generating code as if -mcpu=cortex-a8 was specified. When targeting armv8a, it defaults to generating code as if -mcpu=cortex-a53 was specified.

This leads to surprising code generation: the compiler optimizes for a specific micro-architecture, whereas the user's intent was probably to generate code that is "blended" for all the cores implementing the requested architecture. One example of a user being surprised like this is at https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla instructions are not generated, in order to optimize for a Cortex-A8-specific micro-architectural behaviour, even though the user didn't request to optimize specifically for Cortex-A8.

It would be much cleaner conceptually if clang defaulted to -mcpu=generic when no specific cpu is specified.

What is the impact of this change on execution speed?

I think the main reason to be hesitant to change the default CPU for ARM to -mcpu=generic is the potential impact on performance of generated code.

I've measured quite a wide selection of benchmarks with this change, on the following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.

Impact on execution speed, for each core, when using -march=armv7a, after changing the default cpu from cortex-a8 to generic is as follows. A positive number means speedup, a negative number means slow-down. These are the geomean results over 350 programs coming from benchmark suites such as the test-suite, SPEC2000, SPEC2006 and a range of proprietary suites.

  Cortex-A9:   0.96%
  Cortex-A53: -0.64%
  Cortex-A57:  1.04%
  Cortex-A72:  1.17%

Impact on execution speed, for each core, when using -march=armv8a, after changing the default cpu from cortex-a53 to generic:

  (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
  Cortex-A53: -0.09%
  Cortex-A57: -0.12%
  Cortex-A72:  0.03%

Should we enable scheduling for an in-order core even for -mcpu=generic?

The above measurements show that the biggest negative impact is with -march=armv7a on Cortex-A53: -0.64%. It seems that the in-order Cortex-A53 core loses quite a bit of performance when the instructions aren't scheduled, which is to be expected. Therefore, I also experimented with letting instructions be scheduled according to the Cortex-A8 pipeline model, even for -mcpu=generic, to figure out whether it's beneficial to schedule instructions for an in-order core rather than not scheduling them at all.

Measurement results:

-march=armv7a

  Cortex-A9:  1.57% (up from  0.96%)
  Cortex-A53: 0.47% (up from -0.64%)
  Cortex-A57: 1.74% (up from  1.04%)
  Cortex-A72: 1.72% (up from  1.17%)

-march=armv8a (Note that there isn't a pipeline model for Cortex-A53 in the 32-bit ARM backend):

  (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
  Cortex-A53: 0.49% (up from -0.09%)
  Cortex-A57: 0.09% (up from -0.12%)
  Cortex-A72: 0.20% (up from  0.03%)

Conclusion: for all the in-order and out-of-order cores I measured, it's beneficial to get the instructions scheduled using the Cortex-A8 pipeline model in combination with -mcpu=generic.

Taking into account the above measurements, my conclusions are:

1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53, for -march=armv7a and -march=armv8a.
2. We probably want to let the compiler schedule instructions using the Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup on all cores tested.

Do people agree with these conclusions?
Any objections against implementing this?
Any other potential impact this may have that I forgot to consider above?

Thanks,

Kristof
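The percentages in the RFC are geomeans over the benchmark programs. As a rough illustration of how such a figure can be derived, here is a minimal Python sketch. It assumes the per-program speedup is the ratio of old execution time to new execution time, which the RFC does not spell out; the function name and the timings are made up for illustration and are not taken from the measurements above.

```python
import math

def geomean_speedup_percent(baseline_times, new_times):
    """Geometric-mean speedup of the new build over the baseline, in percent.

    Assumption: speedup per program = baseline_time / new_time, so a value
    above 1.0 (a positive percentage) means the new build runs faster.
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two equally sized, non-empty lists")
    # Average the logs of the ratios, then exponentiate: the geometric mean.
    log_sum = sum(math.log(old / new)
                  for old, new in zip(baseline_times, new_times))
    ratio = math.exp(log_sum / len(baseline_times))
    return (ratio - 1.0) * 100.0

# Made-up execution times (seconds) for three hypothetical programs.
baseline = [10.0, 20.0, 5.0]   # e.g. built with the old default cpu
generic = [9.8, 20.4, 4.9]     # e.g. built with -mcpu=generic
print(f"{geomean_speedup_percent(baseline, generic):+.2f}%")
```

A geometric mean is the usual choice for averaging ratios because a 2x speedup on one program and a 2x slow-down on another cancel out exactly, which an arithmetic mean of ratios would not give.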
Renato Golin via llvm-dev
2017-May-31 13:35 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
On 31 May 2017 at 13:57, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
> Taking into account the above measurements, my conclusions are:
> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
> for march=armv7a and march=armv8a.

Using -mcpu=native makes more sense to me, if at all possible to detect, falling back to generic, which doesn't hurt.

> 2. We probably want to let the compiler schedule instructions using the
> Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup
> on all cores tested.

Same here, I'd use the schedule of the detected CPU, if any, or fall back to A8 (which seems fine).

But yeah, it's time we get rid of the A8/A53 defaults.

While we're at it, we may think about ARMv7's NEON default. Generating only VFP is slower on boards with NEON, but generating NEON crashes with SIGILL on boards that don't have it.

I'd be happy if Clang could detect CPU/FPU and set the flags accordingly, or fall back to "generic"/A8-schedule/VFP defaults.

cheers,
--renato
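The detect-or-fall-back idea for the NEON default could look roughly like the Python sketch below. This is only an illustration, not how Clang works or was proposed to work: it assumes a 32-bit ARM Linux host whose /proc/cpuinfo has a "Features" line listing "neon" when NEON is present, the function name is hypothetical, and the choice of -mfpu=vfpv3-d16 as the "VFP only" fallback is an assumption made for the example.

```python
def pick_fpu_flag(cpuinfo_text):
    """Return a conservative -mfpu flag from the text of /proc/cpuinfo.

    Assumption: 32-bit ARM Linux, where the "Features" line contains
    "neon" on cores that implement it.
    """
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            features.update(value.split())
    if "neon" in features:
        return "-mfpu=neon"
    # Nothing detected (or no NEON): stay with VFP so the binary cannot
    # hit SIGILL on a board without NEON. The exact VFP variant chosen
    # here is an assumption for this sketch.
    return "-mfpu=vfpv3-d16"

# Example input in the style of an ARMv7 board's /proc/cpuinfo.
sample = "processor\t: 0\nFeatures\t: half thumb fastmult vfp edsp neon vfpv3 tls\n"
print(pick_fpu_flag(sample))
```

Note that a host-detected default only helps when the build machine is also the machine that runs the binary; for cross-compilation the fallback branch is the one that matters.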
Eric Christopher via llvm-dev
2017-May-31 15:02 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
On Wed, May 31, 2017 at 6:35 AM Renato Golin <renato.golin at linaro.org> wrote:
> On 31 May 2017 at 13:57, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
> > Taking into account the above measurements, my conclusions are:
> > 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
> > for march=armv7a and march=armv8a.
>
> Using -mcpu=native makes more sense to me, if at all possible to
> detect, falling back to generic, which doesn't hurt.

Ultimately either solution is fine with me. If Kristof wanted to switch it to generic while getting the autodetection stuff up that would also be ok.

-eric

> > 2. We probably want to let the compiler schedule instructions using the
> > Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup
> > on all cores tested.
>
> Same here, I'd use the schedule of the detected CPU, if any, or fall
> back to A8 (which seems fine).
>
> But yeah, it's time we get rid of the A8/A53 defaults.
>
> While we're at it, we may think about ARMv7's NEON default. Generating
> only VFP is slower on boards with NEON, but generating NEON crashes
> with SIGILL on boards that don't have it.
>
> I'd be happy if Clang could detect CPU/FPU and set the flags
> accordingly, or fall back to "generic"/A8-schedule/VFP defaults.
>
> cheers,
> --renato
Evandro Menezes via llvm-dev
2017-May-31 15:23 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
On 05/31/2017 08:35 AM, Renato Golin wrote:
> On 31 May 2017 at 13:57, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>> Taking into account the above measurements, my conclusions are:
>> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
>> for march=armv7a and march=armv8a.
> Using -mcpu=native makes more sense to me, if at all possible to
> detect, falling back to generic, which doesn't hurt.

For the sake of predictability, methinks that it'd make more sense for the default to always mean the same thing for everyone, as Kristof suggested.

--
Evandro Menezes
Evandro Menezes via llvm-dev
2017-May-31 15:25 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Hi, Kristof.

I think that it makes sense. Your results also somehow corroborate the model adopted in GCC for the generic tuning, especially WRT scheduling in order.

Thank you,

--
Evandro Menezes
Stephen Hines via llvm-dev
2017-May-31 15:57 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Wow, these are some fantastic results! Android is definitely in favor of fixing the defaults, so this proposal looks great from our perspective.

Thanks,
Steve
Kristof Beyls via llvm-dev
2017-Jun-01 06:37 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Thanks to everyone for giving their feedback!

I saw pretty unanimous support for making -mcpu=generic the default and making -mcpu=generic schedule for an in-order CPU (Cortex-A8 in this case). I'll be making those changes shortly.

I think the comments also make clear that it's less obvious whether we'd want -mcpu=native to become a default. It's probably good for some use cases, but really not good for other use cases. I won't be making that change, nor advocating for it.

Thanks!

Kristof
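The outcome of the thread (generic becomes the default cpu, and generic is scheduled with the Cortex-A8 pipeline model) can be summarized in a small Python sketch. This is a toy model of the decision only, not Clang's driver code: the function names and the `proposed` switch are hypothetical, and the sketch deliberately does not model which scheduling model an explicitly named cpu gets.

```python
# Defaults described in the RFC as the behaviour before the change.
OLD_DEFAULT_CPU = {"armv7a": "cortex-a8", "armv8a": "cortex-a53"}

def effective_cpu(march, mcpu=None, proposed=True):
    """CPU the compiler tunes for; an explicit -mcpu always wins."""
    if mcpu is not None:
        return mcpu
    # No -mcpu given: the proposal picks generic for both architectures.
    return "generic" if proposed else OLD_DEFAULT_CPU[march]

def scheduling_model(cpu):
    """Pipeline model used for the generic cpu under the proposal.

    Returns None for any other cpu, meaning this sketch does not model it.
    """
    return "cortex-a8" if cpu == "generic" else None

for march in ("armv7a", "armv8a"):
    before = effective_cpu(march, proposed=False)
    after = effective_cpu(march)
    print(f"-march={march}: {before} -> {after}, "
          f"scheduled as {scheduling_model(after)}")
```

An explicit request such as `effective_cpu("armv7a", mcpu="cortex-a9")` is returned unchanged, which reflects that the discussion only concerns the case where the user names no cpu.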