Kristof Beyls via llvm-dev
2017-Jun-01 06:37 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Thanks to everyone for giving their feedback!

I saw pretty unanimous support for making -mcpu=generic the default and making -mcpu=generic schedule for an in-order CPU (Cortex-A8 in this case). I'll be making those changes shortly.

I think the comments also make clear that it's less obvious whether we'd want -mcpu=native to become a default. It's probably good for some use cases, but really not good for other use cases. I won't be making that change, nor advocating for it.

Thanks!

Kristof

On 31 May 2017, at 17:57, Stephen Hines <srhines at google.com> wrote:

Wow, these are some fantastic results! Android is definitely in favor of fixing the defaults, so this proposal looks great from our perspective.

Thanks,
Steve

On Wed, May 31, 2017 at 5:57 AM, Kristof Beyls <Kristof.Beyls at arm.com> wrote:

Motivation

At the moment, when targeting armv7a, clang defaults to generating code as if -mcpu=cortex-a8 was specified. When targeting armv8a, it defaults to generating code as if -mcpu=cortex-a53 was specified.

This leads to surprising code generation, with the compiler optimizing for a specific micro-architecture, whereas the intent of the user was probably to generate code that is "blended" for all the cores implementing the requested architecture. One example of a user being surprised like this is at https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla instructions are not produced, in order to optimize for a Cortex-A8-specific micro-architectural behaviour, even though the user didn't request optimization specifically for Cortex-A8.

It would be much cleaner conceptually if clang defaulted to -mcpu=generic when no specific cpu is specified.

What is the impact of this change on execution speed?

I think the main reason to be hesitant to change the default CPU for ARM to -mcpu=generic is the potential impact on performance of generated code.
I've measured quite a wide selection of benchmarks with this change, on the following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.

Impact on execution speed, for each core, when using -march=armv7a, after changing the default cpu from cortex-a8 to generic is as follows. A positive number means speedup, a negative number means slow-down. These are the geomean results over 350 programs coming from benchmark suites such as the test-suite, SPEC2000, SPEC2006 and a range of proprietary suites.

  Cortex-A9:   0.96%
  Cortex-A53: -0.64%
  Cortex-A57:  1.04%
  Cortex-A72:  1.17%

Impact on execution speed, for each core, when using -march=armv8a, after changing the default cpu from cortex-a53 to generic:

  (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
  Cortex-A53: -0.09%
  Cortex-A57: -0.12%
  Cortex-A72:  0.03%

Should we enable scheduling for an in-order core even for -mcpu=generic?

The above measurements show that the biggest negative impact is with -march=armv7a on Cortex-A53: -0.64%. It seems that the in-order Cortex-A53 core is losing quite a bit of performance when the instructions aren't scheduled, which is to be expected.

Therefore, I also experimented with letting instructions be scheduled according to the Cortex-A8 pipeline model, even for -mcpu=generic, to figure out whether it's beneficial to schedule instructions for an in-order core rather than not scheduling them at all.
Measurement results:

-march=armv7a

  Cortex-A9:  1.57% (up from 0.96%)
  Cortex-A53: 0.47% (up from -0.64%)
  Cortex-A57: 1.74% (up from 1.04%)
  Cortex-A72: 1.72% (up from 1.17%)

-march=armv8a (Note that there isn't a pipeline model for Cortex-A53 in the 32-bit ARM backend):

  (Cortex-A9 is an armv7a core, so can't execute armv8a binaries)
  Cortex-A53: 0.49% (up from -0.09%)
  Cortex-A57: 0.09% (up from -0.12%)
  Cortex-A72: 0.20% (up from 0.03%)

Conclusion: for all the in-order and out-of-order cores I measured, it's beneficial to get the instructions scheduled using the Cortex-A8 pipeline model in combination with -mcpu=generic.

Taking into account the above measurements, my conclusions are:

1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53, for -march=armv7a and -march=armv8a.
2. We probably want to let the compiler schedule instructions using the Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup on all cores tested.

Do people agree with these conclusions?
Any objections against implementing this?
Any other potential impact this may have that I forgot to consider above?

Thanks,

Kristof
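For readers who want to see how much of each result comes from the scheduling change alone, the two sets of geomean figures above can be subtracted per core. The following is only a minimal sketch: the numbers are copied from the measurements in the RFC, while the dictionary names and the helper scheduling_gain are hypothetical and not part of any LLVM tooling. The subtraction gives a percentage-point difference, which is a close approximation of the relative gain for values this small.

```python
# Geomean speedups in percent, copied from the RFC measurements.
# Positive = faster than the old default CPU, negative = slower.
# Names below are illustrative only (not from the original post).
GENERIC_UNSCHEDULED = {
    "armv7a": {"Cortex-A9": 0.96, "Cortex-A53": -0.64,
               "Cortex-A57": 1.04, "Cortex-A72": 1.17},
    "armv8a": {"Cortex-A53": -0.09, "Cortex-A57": -0.12,
               "Cortex-A72": 0.03},
}
GENERIC_A8_SCHEDULED = {
    "armv7a": {"Cortex-A9": 1.57, "Cortex-A53": 0.47,
               "Cortex-A57": 1.74, "Cortex-A72": 1.72},
    "armv8a": {"Cortex-A53": 0.49, "Cortex-A57": 0.09,
               "Cortex-A72": 0.20},
}


def scheduling_gain(before, after):
    """Return the percentage-point difference (after - before) per arch and core."""
    return {
        arch: {
            core: round(after[arch][core] - before[arch][core], 2)
            for core in before[arch]
        }
        for arch in before
    }


gains = scheduling_gain(GENERIC_UNSCHEDULED, GENERIC_A8_SCHEDULED)

# Print one line per architecture/core pair.
for arch, cores in gains.items():
    for core, gain in cores.items():
        print(f"-march={arch} {core}: {gain:+.2f} percentage points")
```

Every difference comes out positive, which matches the conclusion in the RFC that scheduling with the Cortex-A8 pipeline model helped on all cores tested.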
Renato Golin via llvm-dev
2017-Jun-01 09:17 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
On 1 June 2017 at 07:37, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
> I think the comments also make clear that it's less obvious whether we'd
> want -mcpu=native to become a default. It's probably good for some use
> cases, but really not good for other use cases. I won't be making that
> change, nor advocate for it.

That was just me and I am now thoroughly convinced it's not a good idea. :)

Please, proceed as planned. Thanks Kristof, for the detailed investigation and everyone for their comments.

cheers,
--renato
Evandro Menezes via llvm-dev
2017-Jun-01 20:23 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Hi, Kristof.

It sounds like a good plan, but one thing is not clear to me from your post. Which pipeline model will be used for AArch64, A53's (i.e., none)?

Thank you,

--
Evandro Menezes
Kristof Beyls via llvm-dev
2017-Jun-20 14:05 UTC
[llvm-dev] [RFC] Making -mcpu=generic the default for ARM armv7a and arm8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53
Hi Evandro,

For now, I'm only looking at AArch32, not AArch64. Indeed, we could also perform in-order scheduling for -mcpu=generic on AArch64. Cortex-A53 indeed seems to be the best/only choice available. But before making that change, that'll require another round of lots of benchmarking.

So in summary: I'll put the idea on my backlog, but I probably won't have time to get all the benchmarking done in the very near future.

Thanks,

Kristof