Craig Topper via llvm-dev
2017-Nov-13 22:15 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> On 11/11/2017 09:52 PM, UE US via llvm-dev wrote:
>
> If skylake is that bad at AVX2
>
> I don't think this says anything negative about AVX2, but AVX-512.
>
> it belongs in -mcpu / -march IMO.
>
> No. We'd still want to enable the architectural features for vector
> intrinsics and the like.

I took this to mean that the feature should be enabled by default for -march=skylake-avx512.

> Based on the current performance data we're seeing, we think we need to
> ultimately default skylake-avx512 to -mprefer-vector-width=256.
>
> Craig, is this for both integer and floating-point code?

I believe so, but I'll try to get confirmation from the people with more data.

> -Hal
>
> Most people will build for the standard x86_64-pc-linux or whatever
> anyway, and completely ignore the change. This will mainly affect those
> who build their own software and optimize for their system, and lots there
> have probably caught on to this already. I always thought that's what
> -march was made for, really.
>
> GNOMETOYS
>
> On Sat, Nov 11, 2017 at 10:25 AM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> Yes - I was thinking of FeatureFastScalarFSQRT / FeatureFastVectorFSQRT
>> which are used by isFsqrtCheap(). These were added to override the default
>> x86 sqrt estimate codegen with:
>> https://reviews.llvm.org/D21379
>>
>> But I'm not sure we really need that kind of hack. Can we adjust the
>> attribute in clang based on the target cpu? Ie, if you have something like:
>> $ clang -O2 -march=skylake-avx512 foo.c
>>
>> Then you can detect that in the clang driver and pass
>> -mprefer-vector-width=256 to clang codegen as an option? Clang codegen then
>> adds that function attribute to everything it outputs. Then, the
>> vectorizers and/or backend detect that attribute and adjust their behavior
>> based on it.

Do we have a precedent for setting a target independent flag from a target specific cpu string in the clang driver? Want to make sure I understand what the processing on such a thing would look like. Particularly to get the order right so the user can override it.

>> So I don't think we should be messing with any kind of type legality
>> checking because that stuff should all be correct already. We're just
>> choosing a vector size based on a pref. I think we should even allow the
>> pref to go bigger than a legal type. This came up somewhere on llvm-dev or
>> in a bug recently in the context of vector reductions.
>>
>> On Fri, Nov 10, 2017 at 6:04 PM, Craig Topper <craig.topper at gmail.com> wrote:
>>
>>> Are you referring to the X86TargetLowering::isFsqrtCheap hook?
>>>
>>> ~Craig
>>>
>>> On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>>>
>>>> We can tie a user preference / override to a CPU model. We do something
>>>> like that for square root estimates already (although it does use a
>>>> SubtargetFeature currently for x86; ideally, we'd key that off of something
>>>> in the CPU scheduler model).
>>>>
>>>> On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper <craig.topper at gmail.com> wrote:
>>>>
>>>>> I agree that a less x86 specific command line makes sense. I've been
>>>>> having internal discussions with gcc folks and they're evaluating
>>>>> switching to something like -mprefer-vector-width=128/256/512/none
>>>>>
>>>>> Based on the current performance data we're seeing, we think we need
>>>>> to ultimately default skylake-avx512 to -mprefer-vector-width=256. If we go
>>>>> with a target independent option/implementation is there some way we could
>>>>> still affect the default behavior in a target specific way?
>>>>>
>>>>> ~Craig
>>>>>
>>>>> On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>>>>>
>>>>>> It's clear from the Intel docs how this has evolved, but from a
>>>>>> compiler perspective, this isn't a Skylake "feature" :) ... nor an Intel
>>>>>> feature, nor an x86 feature.
>>>>>>
>>>>>> It's a generic programmer hint for any target with multiple potential
>>>>>> vector lengths.
>>>>>>
>>>>>> On x86, there's already a potential use case for this hint with a
>>>>>> different starting motivation: re-vectorization. That's where we take C
>>>>>> code that uses 128-bit vector intrinsics and selectively widen it to 256-
>>>>>> or 512-bit vector ops based on a newer CPU target than the code was
>>>>>> originally written for.
>>>>>>
>>>>>> I think it's just a matter of time before a customer requests the
>>>>>> same ability for another target (maybe they already have and I don't know
>>>>>> about it). So we should have a solution that recognizes that possibility.
>>>>>>
>>>>>> Note that having a target-independent implementation in the optimizer
>>>>>> doesn't preclude a flag alias in clang to maintain compatibility with gcc.
>>>>>>
>>>>>> On Tue, Nov 7, 2017 at 2:02 AM, Tobias Grosser via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
>>>>>>> > That's a very good point about the ordering of the command line options.
>>>>>>> > gcc's current implementation treats -mprefer-avx256 as "prefer 256 over
>>>>>>> > 512" and -mprefer-avx128 as "prefer 128 over 256". Which feels weird for
>>>>>>> > other reasons, but has less of an ordering ambiguity.
>>>>>>> >
>>>>>>> > -mprefer-avx128 has been in gcc for many years and predates the creation
>>>>>>> > of avx512. -mprefer-avx256 was added a couple months ago.
>>>>>>> >
>>>>>>> > We've had an internal conversation with the implementor of -mprefer-avx256
>>>>>>> > in gcc about making -mprefer-avx128 affect 512-bit vectors as well. I'll
>>>>>>> > bring up the ambiguity issue with them.
>>>>>>> >
>>>>>>> > Do we want to be compatible with gcc here?
>>>>>>>
>>>>>>> I certainly believe we would want to be compatible with gcc (if we use
>>>>>>> the same names).
>>>>>>>
>>>>>>> Best,
>>>>>>> Tobias
>>>>>>>
>>>>>>> > ~Craig
>>>>>>> >
>>>>>>> > On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher <echristo at gmail.com> wrote:
>>>>>>> >
>>>>>>> > > On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>> > >
>>>>>>> > >> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>> > >>
>>>>>>> > >>> Hello all,
>>>>>>> > >>>
>>>>>>> > >>> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128
>>>>>>> > >>> command line flags supported by latest GCC to clang. These flags will be
>>>>>>> > >>> used to limit the vector register size presented by TTI to the vectorizers.
>>>>>>> > >>> The backend will still be able to use wider registers for code written
>>>>>>> > >>> using the intrinsics in x86intrin.h. And the backend will still be able to
>>>>>>> > >>> use AVX512VL instructions and the additional XMM16-31 and YMM16-31
>>>>>>> > >>> registers.
>>>>>>> > >>>
>>>>>>> > >>> Motivation:
>>>>>>> > >>>
>>>>>>> > >>> -Using 512-bit operations on some Intel CPUs may cause a decrease in CPU
>>>>>>> > >>> frequency that may offset the gains from using the wider register size. See
>>>>>>> > >>> section 15.26 of Intel® 64 and IA-32 Architectures Optimization Reference
>>>>>>> > >>> Manual published October 2017.
>>>>>>> > >>
>>>>>>> > >> I note the doc mentions that 256-bit AVX operations also have the same
>>>>>>> > >> issue with reducing the CPU frequency, which is nice to see documented!
>>>>>>> > >>
>>>>>>> > >> There's also the issues discussed here
>>>>>>> > >> <http://www.agner.org/optimize/blog/read.php?i=165> (and elsewhere)
>>>>>>> > >> related to warm-up time for the 256-bit execution pipeline, which is
>>>>>>> > >> another issue with using wide-vector ops.
>>>>>>> > >>
>>>>>>> > >>> -The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture
>>>>>>> > >>> are only 256 bits wide. 512-bit instructions using these ALUs must use both
>>>>>>> > >>> ports. See section 2.1 of Intel® 64 and IA-32 Architectures Optimization
>>>>>>> > >>> Reference Manual published October 2017.
>>>>>>> > >>>
>>>>>>> > >>> Implementation Plan:
>>>>>>> > >>>
>>>>>>> > >>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not
>>>>>>> > >>> mapped to any CPU.
>>>>>>> > >>>
>>>>>>> > >>> -Add mprefer-avx256 and mprefer-avx128 and the corresponding
>>>>>>> > >>> -mno-prefer-avx128/256 options to clang's driver Options.td file. I believe
>>>>>>> > >>> this will allow clang to pass these straight through to the -target-feature
>>>>>>> > >>> attribute in IR.
>>>>>>> > >>>
>>>>>>> > >>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 is
>>>>>>> > >>> enabled and neither prefer-avx256 nor prefer-avx128 is set. Similarly return
>>>>>>> > >>> 256 if AVX is enabled and prefer-avx128 is not set.
>>>>>>> > >>
>>>>>>> > >> Instead of multiple flags that have difficult to understand intersecting
>>>>>>> > >> behavior, one flag with a value would be better. E.g., what should
>>>>>>> > >> "-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256" do? No matter the
>>>>>>> > >> answer, it's confusing. (Similarly with other such combinations). Just a
>>>>>>> > >> single arg "-mprefer-avx={128/256/512}" (with no "no" version) seems easier
>>>>>>> > >> to understand to me (keeping the same behavior as you mention: asking to
>>>>>>> > >> prefer a larger width than is supported by your architecture should be fine
>>>>>>> > >> but ignored).
>>>>>>> > >
>>>>>>> > > I agree with this. It's a little more plumbing as far as subtarget
>>>>>>> > > features etc (represent via an optional value or just various "set the avx
>>>>>>> > > width" features - the latter being easier, but uglier), however, it's
>>>>>>> > > probably the right thing to do.
>>>>>>> > >
>>>>>>> > > I was looking at this myself just a couple weeks ago and think this is the
>>>>>>> > > right direction (when and how to turn things off) - and probably makes
>>>>>>> > > sense to be a default for these architectures? We might end up needing to
>>>>>>> > > check a couple of additional TTI places, but it sounds like you're on top
>>>>>>> > > of it. :)
>>>>>>> > >
>>>>>>> > > Thanks very much for doing this work.
>>>>>>> > >
>>>>>>> > > -eric
>>>>>>> > >
>>>>>>> > >>> There may be some other backend changes needed, but I plan to address
>>>>>>> > >>> those as we find them.
>>>>>>> > >>>
>>>>>>> > >>> At a later point, consider making -mprefer-avx256 the default for
>>>>>>> > >>> Skylake Server due to the above mentioned performance considerations.
>>>>>>> > >>>
>>>>>>> > >>> Does this sound reasonable?
>>>>>>> > >>>
>>>>>>> > >>> *Latest Intel Optimization manual available here:
>>>>>>> > >>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>>>>>>> > >>>
>>>>>>> > >>> -Craig Topper
>>>>>>> > >>>
>>>>>>> > >>> _______________________________________________
>>>>>>> > >>> LLVM Developers mailing list
>>>>>>> > >>> llvm-dev at lists.llvm.org
>>>>>>> > >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
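The getRegisterBitWidth step of the implementation plan quoted above, combined with the single-valued preference suggested in the replies, might look roughly like the sketch below. It is an illustration only: VectorFeatures, maxLegalVectorBits and vectorizerRegisterBits are made-up names, not LLVM APIs. It assumes a 128-bit SSE baseline, and it assumes that a preference wider than the hardware supports is simply ignored, which is one of the behaviors discussed in the thread rather than a settled decision.

```cpp
#include <algorithm>

// Hypothetical stand-in for the subtarget's feature bits (not an LLVM type).
struct VectorFeatures {
  bool hasAVX;     // 256-bit vector registers are available
  bool hasAVX512;  // 512-bit vector registers are available
};

// Widest vector register the hardware supports, in bits.
// Assumption: every target considered here has at least 128-bit SSE.
unsigned maxLegalVectorBits(const VectorFeatures &features) {
  if (features.hasAVX512)
    return 512;
  if (features.hasAVX)
    return 256;
  return 128;
}

// Width reported to the vectorizers. preferredBits == 0 means "no preference".
// A preference can only narrow the result; asking for more than the hardware
// offers is ignored in this sketch.
unsigned vectorizerRegisterBits(const VectorFeatures &features,
                                unsigned preferredBits) {
  const unsigned legalBits = maxLegalVectorBits(features);
  if (preferredBits == 0)
    return legalBits;
  return std::min(legalBits, preferredBits);
}
```

With both AVX and AVX-512 available and a preference of 256, this reports 256 to the vectorizers. As the RFC notes, only the width presented to the vectorizers changes; code written with intrinsics can still use the wider registers.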
Sanjay Patel via llvm-dev
2017-Nov-13 23:45 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
On Mon, Nov 13, 2017 at 3:15 PM, Craig Topper <craig.topper at gmail.com> wrote:

> Do we have a precedent for setting a target independent flag from a target
> specific cpu string in the clang driver? Want to make sure I understand
> what the processing on such a thing would look like. Particularly to get
> the order right so the user can override it.

I think Clang::AddX86TargetArgs() has a target CPU in its arg list, so you could do some checking/adding in there, but I'm just guessing at what's the right way to do this - ask on cfe-dev?
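The ordering concern raised here (a CPU-derived default that the user can still override) can be sketched as follows. This is not clang driver code: resolvePreferredVectorWidth is a hypothetical helper working on plain strings, the 256-bit default for skylake-avx512 is the value proposed in this thread, and letting the last -mprefer-vector-width= occurrence win is an assumption about the desired behavior.

```cpp
#include <string>
#include <vector>

// Returns the preferred vector width in bits, or 0 for "no preference".
// Hypothetical helper for illustration; a real driver would use its own
// argument-list types and would diagnose unknown values.
unsigned resolvePreferredVectorWidth(const std::string &cpu,
                                     const std::vector<std::string> &args) {
  const std::string prefix = "-mprefer-vector-width=";

  // Step 1: start from the CPU-specific default (proposed, not confirmed).
  unsigned width = (cpu == "skylake-avx512") ? 256u : 0u;

  // Step 2: explicit user flags override the default; the last one wins.
  for (const std::string &arg : args) {
    if (arg.compare(0, prefix.size(), prefix) != 0)
      continue;  // not a -mprefer-vector-width= option
    const std::string value = arg.substr(prefix.size());
    if (value == "none")
      width = 0;
    else if (value == "128")
      width = 128;
    else if (value == "256")
      width = 256;
    else if (value == "512")
      width = 512;
    // Any other value is silently ignored in this sketch.
  }
  return width;
}
```

Applying the CPU default first and scanning the user's arguments afterwards means an explicit -mprefer-vector-width=512 or =none on the command line always takes precedence over the default tied to -march.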
Eric Christopher via llvm-dev
2017-Nov-13 23:49 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
On Mon, Nov 13, 2017 at 2:15 PM Craig Topper via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On 11/11/2017 09:52 PM, UE US via llvm-dev wrote:
>>
>> If skylake is that bad at AVX2
>>
>> I don't think this says anything negative about AVX2, but AVX-512.

Right. I think we're at AVX/AVX2 is "bad" on Haswell/Broadwell and AVX512 is "bad" on Skylake. At least in the "random autovectorization spread out" aspect.

>> it belongs in -mcpu / -march IMO.
>>
>> No. We'd still want to enable the architectural features for vector
>> intrinsics and the like.
>
> I took this to mean that the feature should be enabled by default for
> -march=skylake-avx512.

Agreed.

-eric

>> Based on the current performance data we're seeing, we think we need to
>> ultimately default skylake-avx512 to -mprefer-vector-width=256.
>>
>> Craig, is this for both integer and floating-point code?
>
> I believe so, but I'll try to get confirmation from the people with more
> data.
>
>> -Hal
>>
>> Most people will build for the standard x86_64-pc-linux or whatever
>> anyway, and completely ignore the change. This will mainly affect those
>> who build their own software and optimize for their system, and lots there
>> have probably caught on to this already. I always thought that's what
>> -march was made for, really.
>>
>> GNOMETOYS
Hal Finkel via llvm-dev
2017-Nov-13 23:54 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
On 11/13/2017 05:49 PM, Eric Christopher wrote:> > > On Mon, Nov 13, 2017 at 2:15 PM Craig Topper via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > > On 11/11/2017 09:52 PM, UE US via llvm-dev wrote: >> If skylake is that bad at AVX2 > > I don't think this says anything negative about AVX2, but AVX-512. > > > Right. I think we're at AVX/AVX2 is "bad" on Haswell/Broadwell and > AVX512 is "bad" on Skylake. At least in the "random autovectorization > spread out" aspect. > > > >> it belongs in -mcpu / -march IMO. > > No. We'd still want to enable the architectural features for > vector intrinsics and the like. > > > I took this to mean that the feature should be enabled by default > for -march=skylake-avx512. > > > > Agreed.Yes. Also, GNOMETOYS clarified to me (off list) that is what he meant. -Hal> > -eric > > > > >> Based on the current performance data we're seeing, we think >> we need to ultimately default skylake-avx512 to >> -mprefer-vector-width=256. > > Craig, is this for both integer and floating-point code? > > > I believe so, but I'll try to get confirmation from the people > with more data. > > > > -Hal > >> Most people will build for the standard x86_64-pc-linux or >> whatever anyway, and completely ignore the change. This will >> mainly affect those who build their own software and optimize >> for their system, and lots there have probably caught on to >> this already. I always thought that's what -march was made >> for, really. >> >> GNOMETOYS >> >> On Sat, Nov 11, 2017 at 10:25 AM, Sanjay Patel via llvm-dev >> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: >> >> Yes - I was thinking of FeatureFastScalarFSQRT / >> FeatureFastVectorFSQRT which are used by isFsqrtCheap(). 
>> These were added to override the default x86 sqrt >> estimate codegen with: >> https://reviews.llvm.org/D21379 >> >> But I'm not sure we really need that kind of hack. Can we >> adjust the attribute in clang based on the target cpu? >> Ie, if you have something like: >> $ clang -O2 -march=skylake-avx512 foo.c >> >> Then you can detect that in the clang driver and pass >> -mprefer-vector-width=256 to clang codegen as an option? >> Clang codegen then adds that function attribute to >> everything it outputs. Then, the vectorizers and/or >> backend detect that attribute and adjust their behavior >> based on it. >> > > Do we have a precedent for setting a target independent flag from > a target specific cpu string in the clang driver? Want to make > sure I understand what the processing on such a thing would look > like. Particularly to get the order right so the user can override it. > >> >> So I don't think we should be messing with any kind of >> type legality checking because that stuff should all be >> correct already. We're just choosing a vector size based >> on a pref. I think we should even allow the pref to go >> bigger than a legal type. This came up somewhere on >> llvm-dev or in a bug recently in the context of vector >> reductions. >> >> >> >> On Fri, Nov 10, 2017 at 6:04 PM, Craig Topper >> <craig.topper at gmail.com <mailto:craig.topper at gmail.com>> >> wrote: >> >> Are you referring to >> the X86TargetLowering::isFsqrtCheap hook? >> >> ~Craig >> >> On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel >> <spatel at rotateright.com >> <mailto:spatel at rotateright.com>> wrote: >> >> We can tie a user preference / override to a CPU >> model. We do something like that for square root >> estimates already (although it does use a >> SubtargetFeature currently for x86; ideally, we'd >> key that off of something in the CPU scheduler >> model). 
>>
>> On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper
>> <craig.topper at gmail.com> wrote:
>>
>> I agree that a less x86 specific command line option
>> makes sense. I've been having internal discussions
>> with gcc folks and they're evaluating switching to
>> something like -mprefer-vector-width=128/256/512/none
>>
>> Based on the current performance data we're seeing, we
>> think we need to ultimately default skylake-avx512 to
>> -mprefer-vector-width=256. If we go with a target
>> independent option/implementation, is there some way we
>> could still affect the default behavior in a target
>> specific way?
>>
>> ~Craig
>>
>> On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel
>> <spatel at rotateright.com> wrote:
>>
>> It's clear from the Intel docs how this has evolved, but
>> from a compiler perspective, this isn't a Skylake
>> "feature" :) ... nor an Intel feature, nor an x86 feature.
>>
>> It's a generic programmer hint for any target with
>> multiple potential vector lengths.
>>
>> On x86, there's already a potential use case for this hint
>> with a different starting motivation: re-vectorization.
>> That's where we take C code that uses 128-bit vector
>> intrinsics and selectively widen it to 256- or 512-bit
>> vector ops based on a newer CPU target than the code was
>> originally written for.
>>
>> I think it's just a matter of time before a customer
>> requests the same ability for another target (maybe they
>> already have and I don't know about it). So we should have
>> a solution that recognizes that possibility.
>>
>> Note that having a target-independent implementation in
>> the optimizer doesn't preclude a flag alias in clang to
>> maintain compatibility with gcc.
>>
>> On Tue, Nov 7, 2017 at 2:02 AM, Tobias Grosser via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
>> > That's a very good point about the ordering of the command line
>> > options. gcc's current implementation treats -mprefer-avx256 as
>> > "prefer 256 over 512" and -mprefer-avx128 as "prefer 128 over
>> > 256". Which feels weird for other reasons, but has less of an
>> > ordering ambiguity.
>> >
>> > -mprefer-avx128 has been in gcc for many years and predates the
>> > creation of avx512. -mprefer-avx256 was added a couple months ago.
>> >
>> > We've had an internal conversation with the implementor of
>> > -mprefer-avx256 in gcc about making -mprefer-avx128 affect
>> > 512-bit vectors as well. I'll bring up the ambiguity issue with
>> > them.
>> >
>> > Do we want to be compatible with gcc here?
>>
>> I certainly believe we would want to be compatible with gcc (if we
>> use the same names).
>>
>> Best,
>> Tobias
>>
>> >
>> > ~Craig
>> >
>> > On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher
>> > <echristo at gmail.com> wrote:
>> >
>> > > On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <
>> > > llvm-dev at lists.llvm.org> wrote:
>> > >
>> > >> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <
>> > >> llvm-dev at lists.llvm.org> wrote:
>> > >>
>> > >>> Hello all,
>> > >>>
>> > >>> I would like to propose adding the -mprefer-avx256 and
>> > >>> -mprefer-avx128 command line flags supported by latest GCC to
>> > >>> clang. These flags will be used to limit the vector register
>> > >>> size presented by TTI to the vectorizers.
>> > >>> The backend will still be able to use wider registers for
>> > >>> code written using the intrinsics in x86intrin.h. And the
>> > >>> backend will still be able to use AVX512VL instructions and
>> > >>> the additional XMM16-31 and YMM16-31 registers.
>> > >>>
>> > >>> Motivation:
>> > >>>
>> > >>> -Using 512-bit operations on some Intel CPUs may cause a
>> > >>> decrease in CPU frequency that may offset the gains from
>> > >>> using the wider register size. See section 15.26 of Intel® 64
>> > >>> and IA-32 Architectures Optimization Reference Manual
>> > >>> published October 2017.
>> > >>>
>> > >>
>> > >> I note the doc mentions that 256-bit AVX operations also have
>> > >> the same issue with reducing the CPU frequency, which is nice
>> > >> to see documented!
>> > >>
>> > >> There's also the issues discussed here
>> > >> <http://www.agner.org/optimize/blog/read.php?i=165> (and
>> > >> elsewhere) related to warm-up time for the 256-bit execution
>> > >> pipeline, which is another issue with using wide-vector ops.
>> > >>
>> > >> -The vector ALUs on ports 0 and 1 of the Skylake Server
>> > >>> microarchitecture are only 256-bits wide. 512-bit
>> > >>> instructions using these ALUs must use both ports. See
>> > >>> section 2.1 of Intel® 64 and IA-32 Architectures Optimization
>> > >>> Reference Manual published October 2017.
>> > >>>
>> > >>> Implementation Plan:
>> > >>>
>> > >>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in
>> > >>> X86.td not mapped to any CPU.
>> > >>>
>> > >>> -Add mprefer-avx256 and mprefer-avx128 and the corresponding
>> > >>> -mno-prefer-avx128/256 options to clang's driver Options.td
>> > >>> file. I believe this will allow clang to pass these straight
>> > >>> through to the -target-feature attribute in IR.
>> > >>>
>> > >>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if
>> > >>> AVX512 is enabled and prefer-avx256 and prefer-avx128 is not
>> > >>> set. Similarly return 256 if AVX is enabled and prefer-avx128
>> > >>> is not set.
>> > >>>
>> > >>
>> > >> Instead of multiple flags that have difficult to understand
>> > >> intersecting behavior, one flag with a value would be better.
>> > >> E.g., what should "-mprefer-avx256 -mprefer-avx128
>> > >> -mno-prefer-avx256" do? No matter the answer, it's confusing.
>> > >> (Similarly with other such combinations). Just a single arg
>> > >> "-mprefer-avx={128/256/512}" (with no "no" version) seems
>> > >> easier to understand to me (keeping the same behavior as you
>> > >> mention: asking to prefer a larger width than is supported by
>> > >> your architecture should be fine but ignored).
>> > >>
>> > >
>> > > I agree with this. It's a little more plumbing as far as
>> > > subtarget features etc (represent via an optional value or just
>> > > various "set the avx width" features - the latter being easier,
>> > > but uglier), however, it's probably the right thing to do.
>> > >
>> > > I was looking at this myself just a couple weeks ago and think
>> > > this is the right direction (when and how to turn things off) -
>> > > and probably makes sense to be a default for these
>> > > architectures? We might end up needing to check a couple of
>> > > additional TTI places, but it sounds like you're on top of
>> > > it. :)
>> > >
>> > > Thanks very much for doing this work.
>> > >
>> > > -eric
>> > >
>> > >>
>> > >> There may be some other backend changes needed, but I plan to
>> > >>> address those as we find them.
>> > >>>
>> > >>> At a later point, consider making -mprefer-avx256 the default
>> > >>> for Skylake Server due to the above mentioned performance
>> > >>> considerations.
>> > >>>
>> > >> Does this sound reasonable?
>> > >>>
>> > >>> *Latest Intel Optimization manual available here:
>> > >>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>> > >>>
>> > >>> -Craig Topper
>> > >>>
>> > >>> _______________________________________________
>> > >>> LLVM Developers mailing list
>> > >>> llvm-dev at lists.llvm.org
>> > >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
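For readers following the thread, here is a minimal sketch of how the two ideas discussed above might fit together: a per-CPU default for the preferred vector width that an explicit user option overrides, and a register width reported to the vectorizers that is the legal width capped by that preference. Everything in it is an assumption made for illustration. `SubtargetInfo`, `UserOption`, `defaultPreferredBits`, `resolvePreferredBits` and `registerBitWidthForVectorizer` are hypothetical names, not LLVM or clang APIs, and the skylake-avx512 default of 256 is only the default proposed in this thread. The sketch follows the suggestion that a preference wider than the hardware supports is accepted but ignored; it does not model the alternative raised above of letting the preference exceed a legal type.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for the subtarget's feature bits (not LLVM's API).
struct SubtargetInfo {
  bool HasAVX = false;    // 256-bit YMM registers available
  bool HasAVX512 = false; // 512-bit ZMM registers available
};

// Hypothetical model of a user-supplied -mprefer-vector-width=N option.
// Given == false means the option was not passed; Bits == 0 means "none".
struct UserOption {
  bool Given;
  unsigned Bits;
};

// Widest vector register the enabled features make legal, in bits.
unsigned widestLegalVectorBits(const SubtargetInfo &ST) {
  if (ST.HasAVX512)
    return 512;
  if (ST.HasAVX)
    return 256;
  return 128; // SSE2 baseline on x86-64
}

// Assumed per-CPU default preference; 0 means "no preference".
// The skylake-avx512 entry mirrors the default proposed in the thread.
unsigned defaultPreferredBits(const std::string &CPU) {
  return CPU == "skylake-avx512" ? 256u : 0u;
}

// An explicit user option always wins over the CPU-derived default.
unsigned resolvePreferredBits(const std::string &CPU, UserOption User) {
  return User.Given ? User.Bits : defaultPreferredBits(CPU);
}

// Width presented to the vectorizers: the legal width capped by the
// preference. A preference wider than the hardware is simply ignored.
unsigned registerBitWidthForVectorizer(const SubtargetInfo &ST,
                                       unsigned PreferredBits) {
  assert((PreferredBits == 0 || PreferredBits == 128 ||
          PreferredBits == 256 || PreferredBits == 512) &&
         "unexpected preferred vector width");
  unsigned Legal = widestLegalVectorBits(ST);
  return PreferredBits == 0 ? Legal : std::min(Legal, PreferredBits);
}
```

Under these assumptions, a skylake-avx512 build with no explicit option would report 256 bits to the vectorizers, while passing a 512-bit preference would restore 512. Intrinsics code is unaffected in this model because only the width reported to the vectorizers changes, not which types are legal.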