Hal Finkel via llvm-dev
2017-Nov-13 23:54 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
On 11/13/2017 05:49 PM, Eric Christopher wrote:
> On Mon, Nov 13, 2017 at 2:15 PM Craig Topper via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> On 11/11/2017 09:52 PM, UE US via llvm-dev wrote:
>>>> If skylake is that bad at AVX2
>>>
>>> I don't think this says anything negative about AVX2, but AVX-512.
>
> Right. I think we're at AVX/AVX2 is "bad" on Haswell/Broadwell and AVX512 is "bad" on Skylake. At least in the "random autovectorization spread out" aspect.
>
>>>> it belongs in -mcpu / -march IMO.
>>>
>>> No. We'd still want to enable the architectural features for vector intrinsics and the like.
>>
>> I took this to mean that the feature should be enabled by default for -march=skylake-avx512.
>
> Agreed.

Yes. Also, GNOMETOYS clarified to me (off list) that is what he meant.

 -Hal

> -eric
>
>>>> Based on the current performance data we're seeing, we think we need to ultimately default skylake-avx512 to -mprefer-vector-width=256.
>>>
>>> Craig, is this for both integer and floating-point code?
>>
>> I believe so, but I'll try to get confirmation from the people with more data.
>>
>>> -Hal
>>>
>>>> Most people will build for the standard x86_64-pc-linux or whatever anyway, and completely ignore the change. This will mainly affect those who build their own software and optimize for their system, and lots there have probably caught on to this already. I always thought that's what -march was made for, really.
>>>>
>>>> GNOMETOYS
>>>>
>>>> On Sat, Nov 11, 2017 at 10:25 AM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>> Yes - I was thinking of FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, which are used by isFsqrtCheap(). These were added to override the default x86 sqrt estimate codegen with:
>>>>> https://reviews.llvm.org/D21379
>>>>>
>>>>> But I'm not sure we really need that kind of hack. Can we adjust the attribute in clang based on the target cpu? I.e., if you have something like:
>>>>> $ clang -O2 -march=skylake-avx512 foo.c
>>>>>
>>>>> Then you can detect that in the clang driver and pass -mprefer-vector-width=256 to clang codegen as an option? Clang codegen then adds that function attribute to everything it outputs. Then, the vectorizers and/or backend detect that attribute and adjust their behavior based on it.
>>
>> Do we have a precedent for setting a target-independent flag from a target-specific cpu string in the clang driver? I want to make sure I understand what the processing on such a thing would look like, particularly to get the order right so the user can override it.
>>
>>>>> So I don't think we should be messing with any kind of type legality checking, because that stuff should all be correct already. We're just choosing a vector size based on a pref. I think we should even allow the pref to go bigger than a legal type. This came up somewhere on llvm-dev or in a bug recently in the context of vector reductions.
>>>>>
>>>>> On Fri, Nov 10, 2017 at 6:04 PM, Craig Topper <craig.topper at gmail.com> wrote:
>>>>>> Are you referring to the X86TargetLowering::isFsqrtCheap hook?
>>>>>>
>>>>>> ~Craig
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>>>>>>> We can tie a user preference / override to a CPU model. We do something like that for square root estimates already (although it does use a SubtargetFeature currently for x86; ideally, we'd key that off of something in the CPU scheduler model).
>>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper <craig.topper at gmail.com> wrote:
>>>>>>>> I agree that a less x86-specific command line makes sense. I've been having an internal discussion with gcc folks, and they're evaluating switching to something like -mprefer-vector-width=128/256/512/none.
>>>>>>>>
>>>>>>>> Based on the current performance data we're seeing, we think we need to ultimately default skylake-avx512 to -mprefer-vector-width=256. If we go with a target-independent option/implementation, is there some way we could still affect the default behavior in a target-specific way?
>>>>>>>>
>>>>>>>> ~Craig
>>>>>>>>
>>>>>>>> On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>>>>>>>>> It's clear from the Intel docs how this has evolved, but from a compiler perspective, this isn't a Skylake "feature" :) ... nor an Intel feature, nor an x86 feature.
>>>>>>>>>
>>>>>>>>> It's a generic programmer hint for any target with multiple potential vector lengths.
>>>>>>>>>
>>>>>>>>> On x86, there's already a potential use case for this hint with a different starting motivation: re-vectorization. That's where we take C code that uses 128-bit vector intrinsics and selectively widen it to 256- or 512-bit vector ops based on a newer CPU target than the code was originally written for.
>>>>>>>>>
>>>>>>>>> I think it's just a matter of time before a customer requests the same ability for another target (maybe they already have and I don't know about it). So we should have a solution that recognizes that possibility.
>>>>>>>>>
>>>>>>>>> Note that having a target-independent implementation in the optimizer doesn't preclude a flag alias in clang to maintain compatibility with gcc.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 7, 2017 at 2:02 AM, Tobias Grosser via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>> On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
>>>>>>>>>>> That's a very good point about the ordering of the command line options. gcc's current implementation treats -mprefer-avx256 as "prefer 256 over 512" and -mprefer-avx128 as "prefer 128 over 256", which feels weird for other reasons but has less of an ordering ambiguity.
>>>>>>>>>>>
>>>>>>>>>>> -mprefer-avx128 has been in gcc for many years and predates the creation of avx512. -mprefer-avx256 was added a couple of months ago.
>>>>>>>>>>>
>>>>>>>>>>> We've had an internal conversation with the implementor of -mprefer-avx256 in gcc about making -mprefer-avx128 affect 512-bit vectors as well. I'll bring up the ambiguity issue with them.
>>>>>>>>>>>
>>>>>>>>>>> Do we want to be compatible with gcc here?
>>>>>>>>>>
>>>>>>>>>> I certainly believe we would want to be compatible with gcc (if we use the same names).
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Tobias
>>>>>>>>>>
>>>>>>>>>>> ~Craig
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher <echristo at gmail.com> wrote:
>>>>>>>>>>>> On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>>>>> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128 command line flags supported by the latest GCC to clang. These flags will be used to limit the vector register size presented by TTI to the vectorizers. The backend will still be able to use wider registers for code written using the intrinsics in x86intrin.h, and the backend will still be able to use AVX512VL instructions and the additional XMM16-31 and YMM16-31 registers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Motivation:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Using 512-bit operations on some Intel CPUs may cause a decrease in CPU frequency that may offset the gains from using the wider register size. See section 15.26 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual published October 2017.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I note the doc mentions that 256-bit AVX operations also have the same issue with reducing the CPU frequency, which is nice to see documented!
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are also the issues discussed here <http://www.agner.org/optimize/blog/read.php?i=165> (and elsewhere) related to warm-up time for the 256-bit execution pipeline, which is another issue with using wide-vector ops.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture are only 256 bits wide. 512-bit instructions using these ALUs must use both ports. See section 2.1 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual published October 2017.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Implementation Plan:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not mapped to any CPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Add mprefer-avx256 and mprefer-avx128 and the corresponding -mno-prefer-avx128/256 options to clang's driver Options.td file. I believe this will allow clang to pass these straight through to the -target-feature attribute in IR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 is enabled and neither prefer-avx256 nor prefer-avx128 is set. Similarly, return 256 if AVX is enabled and prefer-avx128 is not set.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Instead of multiple flags that have difficult-to-understand intersecting behavior, one flag with a value would be better. E.g., what should "-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256" do? No matter the answer, it's confusing. (Similarly with other such combinations.) Just a single arg "-mprefer-avx={128/256/512}" (with no "no" version) seems easier to understand to me (keeping the same behavior as you mention: asking to prefer a larger width than is supported by your architecture should be fine but ignored).
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with this. It's a little more plumbing as far as subtarget features etc. (represent via an optional value or just various "set the avx width" features - the latter being easier, but uglier); however, it's probably the right thing to do.
>>>>>>>>>>>>
>>>>>>>>>>>> I was looking at this myself just a couple of weeks ago and think this is the right direction (when and how to turn things off) - and it probably makes sense to be a default for these architectures? We might end up needing to check a couple of additional TTI places, but it sounds like you're on top of it. :)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks very much for doing this work.
>>>>>>>>>>>>
>>>>>>>>>>>> -eric
>>>>>>>>>>>>
>>>>>>>>>>>>>> There may be some other backend changes needed, but I plan to address those as we find them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At a later point, consider making -mprefer-avx256 the default for Skylake Server due to the above-mentioned performance considerations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does this sound reasonable?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Latest Intel Optimization manual available here:
>>>>>>>>>>>>>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Craig Topper

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
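To make the implementation plan quoted above concrete, here is a minimal, self-contained C++ sketch of the getRegisterBitWidth() decision Craig describes. The Features struct and its PreferAVX256/PreferAVX128 fields are hypothetical stand-ins for the real X86 subtarget feature bits; this illustrates the proposed logic only and is not the actual patch.

  // Hypothetical sketch of the width reported by TTI to the vectorizers.
  // The prefer-avx features narrow what is reported; they do not disable
  // the wider instructions themselves.
  #include <cstdio>

  struct Features {            // stand-in for the X86 subtarget feature bits
    bool HasAVX512 = false;
    bool HasAVX = false;
    bool HasSSE1 = false;
    bool PreferAVX256 = false; // hypothetical prefer-avx256 feature
    bool PreferAVX128 = false; // hypothetical prefer-avx128 feature
  };

  unsigned getVectorRegisterBitWidth(const Features &ST) {
    if (ST.HasAVX512 && !ST.PreferAVX256 && !ST.PreferAVX128)
      return 512;
    if (ST.HasAVX && !ST.PreferAVX128)
      return 256;
    if (ST.HasSSE1)
      return 128;
    return 0;
  }

  int main() {
    Features Skylake;
    Skylake.HasAVX512 = Skylake.HasAVX = Skylake.HasSSE1 = true;
    Skylake.PreferAVX256 = true; // the default being discussed for skylake-avx512
    std::printf("TTI vector register width: %u bits\n",
                getVectorRegisterBitWidth(Skylake));
    return 0;
  }

With prefer-avx256 set, the vectorizers would be told 256 bits even though the backend could still emit 512-bit instructions for code written with the x86intrin.h intrinsics.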
Craig Topper via llvm-dev
2017-Nov-14 17:26 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
For the re-vectorization case mentioned by Sanjay: that seems like a different type of limit than what's being proposed here. For re-vectorization you want to remove smaller vector widths; this is removing larger vector widths. I don't think we want the -mprefer-vector-width=256 being proposed here to say we can't do 128-bit vectors alongside the 256-bit ones. Maybe this should be called -mlimit-vector-width?

It's not clear to me why re-vectorization would need a preference at all. Shouldn't we be able to decide that from the cost models? We go from scalar to vector today based on cost models. Why couldn't we go from vector to wider vector?

~Craig

On Mon, Nov 13, 2017 at 3:54 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Yes. Also, GNOMETOYS clarified to me (off list) that is what he meant.
Sanjay Patel via llvm-dev
2017-Nov-14 18:10 UTC
[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available
I haven't looked into actually implementing re-vectorization, so we may just want to ignore that possibility for now. But I imagined that re-vectorization could hit the same problem that we're trying to avoid here: if the cost models say that wider vectors are legal and cheaper, but the reality is that perf will suffer when using those wider vectors, then we want to avoid using the wider ops. The user pref/override will be taken into account when deciding if we should go wider.

In either scenario, we're not actually removing or limiting vector widths, right? They're still legal as far as the ISA is concerned. We're just avoiding those ops because the programmer and/or the CPU model says we'll do better with narrower ops.

On Tue, Nov 14, 2017 at 10:26 AM, Craig Topper via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> For the re-vectorization case mentioned by Sanjay: that seems like a different type of limit than what's being proposed here. For re-vectorization you want to remove smaller vector widths; this is removing larger vector widths. I don't think we want the -mprefer-vector-width=256 being proposed here to say we can't do 128-bit vectors alongside the 256-bit ones. Maybe this should be called -mlimit-vector-width?
>
> It's not clear to me why re-vectorization would need a preference at all. Shouldn't we be able to decide that from the cost models? We go from scalar to vector today based on cost models. Why couldn't we go from vector to wider vector?
>
> ~Craig
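As a rough illustration of the precedence being discussed (an explicit user preference overriding a CPU-model default, with the wider ops staying architecturally legal), here is a hedged, self-contained sketch. The effectiveVectorWidth() helper and its parameters are hypothetical and do not reflect a committed clang/LLVM design.

  // Sketch of combining a CPU-model default (e.g. 256 for skylake-avx512)
  // with an explicit user preference; the user's choice wins when both exist.
  #include <algorithm>
  #include <cstdio>
  #include <optional>

  unsigned effectiveVectorWidth(unsigned HardwareWidthBits,
                                std::optional<unsigned> CpuDefaultPref,
                                std::optional<unsigned> UserPref) {
    // An explicit user preference overrides the CPU-model default.
    std::optional<unsigned> Pref = UserPref ? UserPref : CpuDefaultPref;
    if (!Pref)
      return HardwareWidthBits;
    // The wider ops stay legal; we only avoid them when choosing a width.
    return std::min(HardwareWidthBits, *Pref);
  }

  int main() {
    // skylake-avx512 default of 256, no user override:
    std::printf("%u\n", effectiveVectorWidth(512, 256u, std::nullopt)); // 256
    // The user explicitly asks for 512 and overrides the CPU default:
    std::printf("%u\n", effectiveVectorWidth(512, 256u, 512u));         // 512
    // No preference at all: the full hardware width is used.
    std::printf("%u\n", effectiveVectorWidth(512, std::nullopt, std::nullopt)); // 512
    return 0;
  }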