thr3ads.net - llvm dev - [llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available [Nov 2017]

Tobias Grosser via llvm-dev

2017-Nov-07 09:02 UTC

[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
> That's a very good point about the ordering of the command line options.
> gcc's current implementation treats -mprefer-avx256 as "prefer 256 over
> 512" and -mprefer-avx128 as "prefer 128 over 256". Which feels weird for
> other reasons, but has less of an ordering ambiguity.
>
> -mprefer-avx128 has been in gcc for many years and predates the creation
> of avx512. -mprefer-avx256 was added a couple months ago.
>
> We've had an internal conversation with the implementor of -mprefer-avx256
> in gcc about making -mprefer-avx128 affect 512-bit vectors as well. I'll
> bring up the ambiguity issue with them.
>
> Do we want to be compatible with gcc here?

I certainly believe we would want to be compatible with gcc (if we use
the same names).

Best,
Tobias
>
> ~Craig
>
> On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher <echristo at gmail.com> wrote:
>
> > On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> >> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >>
> >>> Hello all,
> >>>
> >>> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128
> >>> command line flags supported by latest GCC to clang. These flags will be
> >>> used to limit the vector register size presented by TTI to the vectorizers.
> >>> The backend will still be able to use wider registers for code written
> >>> using the intrinsics in x86intrin.h. And the backend will still be able to
> >>> use AVX512VL instructions and the additional XMM16-31 and YMM16-31
> >>> registers.
> >>>
> >>> Motivation:
> >>>
> >>> -Using 512-bit operations on some Intel CPUs may cause a decrease in CPU
> >>> frequency that may offset the gains from using the wider register size. See
> >>> section 15.26 of Intel® 64 and IA-32 Architectures Optimization Reference
> >>> Manual published October 2017.
> >>>
> >>
> >> I note the doc mentions that 256-bit AVX operations also have the same
> >> issue with reducing the CPU frequency, which is nice to see documented!
> >>
> >> There are also the issues discussed here
> >> <http://www.agner.org/optimize/blog/read.php?i=165> (and elsewhere) related
> >> to warm-up time for the 256-bit execution pipeline, which is another issue
> >> with using wide-vector ops.
> >>
> >>> -The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture
> >>> are only 256 bits wide. 512-bit instructions using these ALUs must use both
> >>> ports. See section 2.1 of Intel® 64 and IA-32 Architectures Optimization
> >>> Reference Manual published October 2017.
> >>>
> >>> Implementation Plan:
> >>>
> >>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not
> >>> mapped to any CPU.
> >>>
> >>> -Add -mprefer-avx256 and -mprefer-avx128 and the corresponding
> >>> -mno-prefer-avx128/256 options to clang's driver Options.td file. I believe
> >>> this will allow clang to pass these straight through to the -target-feature
> >>> attribute in IR.
> >>>
> >>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 is
> >>> enabled and neither prefer-avx256 nor prefer-avx128 is set. Similarly return
> >>> 256 if AVX is enabled and prefer-avx128 is not set.
> >>>
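
The getRegisterBitWidth rule in the plan above can be modelled in a few lines. This is only a sketch of the decision logic as the RFC words it, written in Python rather than LLVM's C++. The function name, the boolean parameters, and the 128-bit fallback (which assumes an SSE baseline) are illustrative assumptions, not the actual X86TTIImpl code.

```python
def register_bit_width(has_avx512: bool, has_avx: bool,
                       prefer_avx256: bool = False,
                       prefer_avx128: bool = False) -> int:
    """Vector width (in bits) reported to the vectorizers.

    Hypothetical model of the rule proposed in the RFC; the parameters
    stand in for subtarget feature bits.
    """
    # AVX-512 implies AVX, so treat either bit as "AVX is available".
    has_avx = has_avx or has_avx512

    # 512 only if AVX512 is enabled and neither "prefer" feature is set.
    if has_avx512 and not (prefer_avx256 or prefer_avx128):
        return 512
    # 256 if AVX is enabled and prefer-avx128 is not set.
    if has_avx and not prefer_avx128:
        return 256
    # Assumption: anything else falls back to 128-bit SSE registers.
    return 128
```

Under this model prefer-avx128 overrides prefer-avx256 when both are set, which is one possible reading of the plan; the thread goes on to discuss exactly this kind of interaction.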
> >>
> >> Instead of multiple flags that have difficult to understand intersecting
> >> behavior, one flag with a value would be better. E.g., what should
> >> "-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256" do? No matter the
> >> answer, it's confusing. (Similarly with other such combinations). Just a
> >> single arg "-mprefer-avx={128/256/512}" (with no "no" version) seems easier
> >> to understand to me (keeping the same behavior as you mention: asking to
> >> prefer a larger width than is supported by your architecture should be fine
> >> but ignored).
> >>
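
A single valued option of the kind suggested above is straightforward to resolve. The sketch below uses the "-mprefer-avx=" spelling from the suggestion, which is only a proposal in this thread and not an existing flag. It also assumes the last occurrence on the command line wins, a common driver convention that the thread does not actually specify. The helper name and its parameters are hypothetical.

```python
from typing import Iterable, Optional

PREFIX = "-mprefer-avx="          # spelling proposed in the thread, not a real flag
VALID_WIDTHS = (128, 256, 512)


def effective_width(args: Iterable[str], max_supported: int) -> int:
    """Resolve the vector width to use, given driver args and the widest
    width the target architecture supports (hypothetical helper)."""
    preferred: Optional[int] = None
    for arg in args:
        if arg.startswith(PREFIX):
            value = int(arg[len(PREFIX):])
            if value not in VALID_WIDTHS:
                raise ValueError(f"unsupported width in {arg!r}")
            preferred = value      # assumption: last occurrence wins
    # Asking for more than the architecture supports is accepted but ignored.
    if preferred is None or preferred > max_supported:
        return max_supported
    return preferred
```

With one value per option there is no "no" form to reason about, so a sequence such as "-mprefer-avx=256 -mprefer-avx=128" has a single unambiguous result under the last-wins assumption.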
> > I agree with this. It's a little more plumbing as far as subtarget
> > features etc (represent via an optional value or just various "set the avx
> > width" features - the latter being easier, but uglier), however, it's
> > probably the right thing to do.
> >
> > I was looking at this myself just a couple weeks ago and think this is the
> > right direction (when and how to turn things off) - and probably makes
> > sense to be a default for these architectures? We might end up needing to
> > check a couple of additional TTI places, but it sounds like you're on top
> > of it. :)
> >
> > Thanks very much for doing this work.
> >
> > -eric
> >
> >>> There may be some other backend changes needed, but I plan to address
> >>> those as we find them.
> >>>
> >>> At a later point, consider making -mprefer-avx256 the default for
> >>> Skylake Server due to the above mentioned performance considerations.
> >>>
> >>> Does this sound reasonable?
> >>>
> >>> *Latest Intel Optimization manual available here:
> >>> https://software.intel.com/en-us/articles/intel-sdm#optimization
> >>>
> >>> -Craig Topper
> >>>
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> llvm-dev at lists.llvm.org
> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Sanjay Patel via llvm-dev

2017-Nov-07 17:06 UTC

It's clear from the Intel docs how this has evolved, but from a compiler
perspective, this isn't a Skylake "feature" :) ... nor an Intel feature,
nor an x86 feature.

It's a generic programmer hint for any target with multiple potential
vector lengths.

On x86, there's already a potential use case for this hint with a different
starting motivation: re-vectorization. That's where we take C code that
uses 128-bit vector intrinsics and selectively widen it to 256- or 512-bit
vector ops based on a newer CPU target than the code was originally written
for.

I think it's just a matter of time before a customer requests the same
ability for another target (maybe they already have and I don't know about
it). So we should have a solution that recognizes that possibility.

Note that having a target-independent implementation in the optimizer
doesn't preclude a flag alias in clang to maintain compatibility with gcc.
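
As a rough illustration of that split, the sketch below maps the two gcc-style spellings onto a plain width hint and lets a target-independent helper pick a width from whatever a target offers. Everything here is hypothetical: the alias table, the function names, and the rule of falling back to the narrowest width when the hint is smaller than anything the target has. None of it is taken from clang or LLVM.

```python
from typing import Optional, Sequence

# gcc-compatible spellings mapped onto a generic width hint (in bits).
GCC_ALIASES = {"-mprefer-avx128": 128, "-mprefer-avx256": 256}


def width_hint_from_flag(flag: str) -> Optional[int]:
    """Return the generic hint a driver flag maps to, or None if it is
    not one of the aliases."""
    return GCC_ALIASES.get(flag)


def choose_width(target_widths: Sequence[int], hint: Optional[int]) -> int:
    """Target-independent selection: the widest supported width that does
    not exceed the hint, or the widest overall when there is no hint."""
    if hint is None:
        return max(target_widths)
    candidates = [w for w in target_widths if w <= hint]
    # Assumption: a hint below every supported width selects the narrowest.
    return max(candidates) if candidates else min(target_widths)
```

For example, choose_width((128, 256, 512), width_hint_from_flag("-mprefer-avx256")) selects 256, and the same helper works unchanged for a target whose widths are (64, 128).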



Craig Topper via llvm-dev

2017-Nov-09 23:21 UTC

I agree that a less x86-specific command line makes sense. I've been having
internal discussions with gcc folks and they're evaluating switching to
something like -mprefer-vector-width=128/256/512/none.

Based on the current performance data we're seeing, we think we need to
ultimately default skylake-avx512 to -mprefer-vector-width=256. If we go
with a target-independent option/implementation, is there some way we could
still affect the default behavior in a target-specific way?

~Craig
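
One way to picture the question is a generic option whose default comes from a per-CPU table. The sketch below is only a model of that idea: the table, the function name, and the reading of "none" as "no preference, use the widest available" are assumptions made for illustration, since the thread does not define them.

```python
from typing import Optional

# Hypothetical per-CPU defaults; only the skylake-avx512 entry is
# suggested by the thread.
CPU_DEFAULT_WIDTH = {"skylake-avx512": 256}

VALID_VALUES = ("128", "256", "512")


def preferred_vector_width(cpu: str,
                           flag_value: Optional[str] = None) -> Optional[int]:
    """Return the preferred width in bits, or None for "no preference".

    flag_value models the argument of -mprefer-vector-width=..., with
    None meaning the option was not given at all.
    """
    if flag_value is None:
        # No explicit request: fall back to the target-specific default.
        return CPU_DEFAULT_WIDTH.get(cpu)
    if flag_value == "none":
        # Assumption: "none" clears any preference, including the default.
        return None
    if flag_value not in VALID_VALUES:
        raise ValueError(f"unsupported width {flag_value!r}")
    return int(flag_value)
```

In this model an explicit option always wins over the CPU default, so users on skylake-avx512 could still opt back in to 512-bit vectorization.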

