thr3ads.net - llvm dev - [llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available [Nov 2017]

Craig Topper via llvm-dev

2017-Nov-11 01:04 UTC

[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

Are you referring to the X86TargetLowering::isFsqrtCheap hook?

~Craig

On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> We can tie a user preference / override to a CPU model. We do something
> like that for square root estimates already (although it does use a
> SubtargetFeature currently for x86; ideally, we'd key that off of
> something in the CPU scheduler model).
>
>
> On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper <craig.topper at gmail.com>
> wrote:
>
>> I agree that a less x86 specific command line makes sense. I've been
>> having internal discussions with gcc folks and they're evaluating
>> switching to something like -mprefer-vector-width=128/256/512/none
>>
>> Based on the current performance data we're seeing, we think we need to
>> ultimately default skylake-avx512 to -mprefer-vector-width=256. If we go
>> with a target independent option/implementation is there some way we
>> could still affect the default behavior in a target specific way?
>>
>> ~Craig
>>
>> On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel <spatel at rotateright.com>
>> wrote:
>>
>>> It's clear from the Intel docs how this has evolved, but from a compiler
>>> perspective, this isn't a Skylake "feature" :) ... nor an Intel feature,
>>> nor an x86 feature.
>>>
>>> It's a generic programmer hint for any target with multiple potential
>>> vector lengths.
>>>
>>> On x86, there's already a potential use case for this hint with a
>>> different starting motivation: re-vectorization. That's where we take C
>>> code that uses 128-bit vector intrinsics and selectively widen it to 256-
>>> or 512-bit vector ops based on a newer CPU target than the code was
>>> originally written for.
>>>
>>> I think it's just a matter of time before a customer requests the same
>>> ability for another target (maybe they already have and I don't know
>>> about it). So we should have a solution that recognizes that possibility.
>>>
>>> Note that having a target-independent implementation in the optimizer
>>> doesn't preclude a flag alias in clang to maintain compatibility with gcc.
>>>
>>>
>>>
>>> On Tue, Nov 7, 2017 at 2:02 AM, Tobias Grosser via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
>>>> > That's a very good point about the ordering of the command line
>>>> > options. gcc's current implementation treats -mprefer-avx256 as
>>>> > "prefer 256 over 512" and -mprefer-avx128 as "prefer 128 over 256".
>>>> > Which feels weird for other reasons, but has less of an ordering
>>>> > ambiguity.
>>>> >
>>>> > -mprefer-avx128 has been in gcc for many years and predates the
>>>> > creation of avx512. -mprefer-avx256 was added a couple months ago.
>>>> >
>>>> > We've had an internal conversation with the implementor of
>>>> > -mprefer-avx256 in gcc about making -mprefer-avx128 affect 512-bit
>>>> > vectors as well. I'll bring up the ambiguity issue with them.
>>>> >
>>>> > Do we want to be compatible with gcc here?
>>>>
>>>> I certainly believe we would want to be compatible with gcc (if we use
>>>> the same names).
>>>>
>>>> Best,
>>>> Tobias
>>>>
>>>> >
>>>> > ~Craig
>>>> >
>>>> > On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher <echristo at gmail.com>
>>>> > wrote:
>>>> >
>>>> > >
>>>> > >
>>>> > > On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <
>>>> > > llvm-dev at lists.llvm.org> wrote:
>>>> > >
>>>> > >> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <
>>>> > >> llvm-dev at lists.llvm.org> wrote:
>>>> > >>
>>>> > >>> Hello all,
>>>> > >>>
>>>> > >>> I would like to propose adding the -mprefer-avx256 and
>>>> > >>> -mprefer-avx128 command line flags supported by latest GCC to
>>>> > >>> clang. These flags will be used to limit the vector register size
>>>> > >>> presented by TTI to the vectorizers. The backend will still be
>>>> > >>> able to use wider registers for code written using the intrinsics
>>>> > >>> in x86intrin.h. And the backend will still be able to use AVX512VL
>>>> > >>> instructions and the additional XMM16-31 and YMM16-31 registers.
>>>> > >>>
>>>> > >>> Motivation:
>>>> > >>>
>>>> > >>> -Using 512-bit operations on some Intel CPUs may cause a decrease
>>>> > >>> in CPU frequency that may offset the gains from using the wider
>>>> > >>> register size. See section 15.26 of Intel® 64 and IA-32
>>>> > >>> Architectures Optimization Reference Manual published October
>>>> > >>> 2017.
>>>> > >>>
>>>> > >>
>>>> > >> I note the doc mentions that 256-bit AVX operations also have the
>>>> > >> same issue with reducing the CPU frequency, which is nice to see
>>>> > >> documented!
>>>> > >>
>>>> > >> There are also the issues discussed here
>>>> > >> <http://www.agner.org/optimize/blog/read.php?i=165> (and elsewhere)
>>>> > >> related to warm-up time for the 256-bit execution pipeline, which
>>>> > >> is another issue with using wide-vector ops.
>>>> > >>
>>>> > >>
>>>> > >>> -The vector ALUs on ports 0 and 1 of the Skylake Server
>>>> > >>> microarchitecture are only 256-bits wide. 512-bit instructions
>>>> > >>> using these ALUs must use both ports. See section 2.1 of Intel® 64
>>>> > >>> and IA-32 Architectures Optimization Reference Manual published
>>>> > >>> October 2017.
>>>> > >>>
>>>> > >>
>>>> > >>
>>>> > >>> Implementation Plan:
>>>> > >>>
>>>> > >>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in
>>>> > >>> X86.td not mapped to any CPU.
>>>> > >>>
>>>> > >>> -Add mprefer-avx256 and mprefer-avx128 and the corresponding
>>>> > >>> -mno-prefer-avx128/256 options to clang's driver Options.td file.
>>>> > >>> I believe this will allow clang to pass these straight through to
>>>> > >>> the -target-feature attribute in IR.
>>>> > >>>
>>>> > >>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if
>>>> > >>> AVX512 is enabled and prefer-avx256 and prefer-avx128 are not set.
>>>> > >>> Similarly return 256 if AVX is enabled and prefer-avx128 is not
>>>> > >>> set.
>>>> > >>>
>>>> > >>
>>>> > >> Instead of multiple flags that have difficult to understand
>>>> > >> intersecting behavior, one flag with a value would be better. E.g.,
>>>> > >> what should "-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256"
>>>> > >> do? No matter the answer, it's confusing. (Similarly with other
>>>> > >> such combinations). Just a single arg "-mprefer-avx={128/256/512}"
>>>> > >> (with no "no" version) seems easier to understand to me (keeping
>>>> > >> the same behavior as you mention: asking to prefer a larger width
>>>> > >> than is supported by your architecture should be fine but ignored).
>>>> > >>
>>>> > >>
>>>> > > I agree with this. It's a little more plumbing as far as subtarget
>>>> > > features etc (represent via an optional value or just various "set
>>>> > > the avx width" features - the latter being easier, but uglier),
>>>> > > however, it's probably the right thing to do.
>>>> > >
>>>> > > I was looking at this myself just a couple weeks ago and think this
>>>> > > is the right direction (when and how to turn things off) - and
>>>> > > probably makes sense to be a default for these architectures? We
>>>> > > might end up needing to check a couple of additional TTI places, but
>>>> > > it sounds like you're on top of it. :)
>>>> > >
>>>> > > Thanks very much for doing this work.
>>>> > >
>>>> > > -eric
>>>> > >
>>>> > >
>>>> > >>
>>>> > >>
>>>> > >>> There may be some other backend changes needed, but I plan to
>>>> > >>> address those as we find them.
>>>> > >>>
>>>> > >>> At a later point, consider making -mprefer-avx256 the default for
>>>> > >>> Skylake Server due to the above mentioned performance
>>>> > >>> considerations.
>>>> > >>>
>>>> > >>> Does this sound reasonable?
>>>> > >>>
>>>> > >>> *Latest Intel Optimization manual available here:
>>>> > >>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>>>> > >>>
>>>> > >>> -Craig Topper
>>>> > >>>
>>>> > >>> _______________________________________________
>>>> > >>> LLVM Developers mailing list
>>>> > >>> llvm-dev at lists.llvm.org
>>>> > >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> > >>>
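For readers skimming the quoted RFC, the width-selection rule from its implementation plan can be sketched in a few lines. This is only a model of the proposed logic, not LLVM code: the `Subtarget` class, its field names, and the 128-bit fallback are assumptions made for illustration (the RFC itself only spells out the 512 and 256 cases).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtarget:
    """Hypothetical stand-in for x86 subtarget state; not LLVM's API."""
    has_avx512: bool = False
    has_avx: bool = False        # assumed to be True whenever has_avx512 is
    prefer_avx256: bool = False  # models the proposed prefer-avx256 feature
    prefer_avx128: bool = False  # models the proposed prefer-avx128 feature


def register_bit_width(st: Subtarget) -> int:
    """Vector register width (in bits) reported to the vectorizers."""
    # 512 only if AVX512 is enabled and neither "prefer" feature is set.
    if st.has_avx512 and not (st.prefer_avx256 or st.prefer_avx128):
        return 512
    # 256 if AVX is enabled and prefer-avx128 is not set.
    if st.has_avx and not st.prefer_avx128:
        return 256
    # Assumption: fall back to a 128-bit baseline in every other case.
    return 128
```

Under this model, a target with AVX512 enabled and prefer-avx256 set reports 256 to the vectorizers, which is the behavior the RFC floats as a later default for Skylake Server. Code written with intrinsics is unaffected, since only the width reported to the vectorizers changes.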

Sanjay Patel via llvm-dev

2017-Nov-11 16:25 UTC

Yes - I was thinking of FeatureFastScalarFSQRT / FeatureFastVectorFSQRT
which are used by isFsqrtCheap(). These were added to override the default
x86 sqrt estimate codegen with:
https://reviews.llvm.org/D21379

But I'm not sure we really need that kind of hack. Can we adjust the
attribute in clang based on the target cpu? I.e., if you have something like:
$ clang -O2 -march=skylake-avx512 foo.c

Then you can detect that in the clang driver and pass
-mprefer-vector-width=256 to clang codegen as an option? Clang codegen then
adds that function attribute to everything it outputs. Then, the
vectorizers and/or backend detect that attribute and adjust their behavior
based on it.

So I don't think we should be messing with any kind of type legality
checking because that stuff should all be correct already. We're just
choosing a vector size based on a pref. I think we should even allow the
pref to go bigger than a legal type. This came up somewhere on llvm-dev or
in a bug recently in the context of vector reductions.
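To make the driver idea above concrete, here is a rough sketch of how a per-CPU default could coexist with an explicit user setting. It is not clang's driver code: the `DEFAULT_PREFERRED_WIDTH` table, the function name, and the "last occurrence wins" rule are assumptions for illustration; only the `-march=skylake-avx512` default of 256 and the `-mprefer-vector-width=128/256/512/none` spelling come from this thread.

```python
from typing import Optional, Sequence

# Hypothetical per-CPU defaults; the thread only suggests 256 for skylake-avx512.
DEFAULT_PREFERRED_WIDTH = {"skylake-avx512": 256}

MARCH = "-march="
PREFER = "-mprefer-vector-width="


def preferred_vector_width(args: Sequence[str]) -> Optional[int]:
    """Return the preferred vector width in bits, or None for no preference."""
    march = None
    explicit = None
    for arg in args:
        if arg.startswith(MARCH):
            march = arg[len(MARCH):]
        elif arg.startswith(PREFER):
            # Assumption: with a single valued flag, the last occurrence wins.
            explicit = arg[len(PREFER):]
    if explicit is not None:
        # An explicit user choice overrides any CPU-based default.
        return None if explicit == "none" else int(explicit)
    # No explicit flag: fall back to the CPU default, if the table has one.
    return DEFAULT_PREFERRED_WIDTH.get(march)
```

With the command line from the example above (`-O2 -march=skylake-avx512 foo.c`), this returns 256, and adding `-mprefer-vector-width=512` or `=none` overrides it. A single valued flag also sidesteps the ordering ambiguity raised earlier in the thread for the separate -mprefer-avx128/256 switches.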




UE US via llvm-dev

2017-Nov-12 03:52 UTC

If skylake is that bad at AVX2 it belongs in -mcpu / -march IMO. Most
people will build for the standard x86_64-pc-linux or whatever anyway, and
completely ignore the change. This will mainly affect those who build their
own software and optimize for their system, and lots there have probably
caught on to this already. I always thought that's what -march was made
for, really.

GNOMETOYS

