thr3ads.net - llvm dev - [llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available [Nov 2017]


Craig Topper via llvm-dev

2017-Nov-01 23:35 UTC

[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

Hello all,



I would like to propose adding the -mprefer-avx256 and -mprefer-avx128
command line flags supported by the latest GCC to clang. These flags will be
used to limit the vector register size presented by TTI to the vectorizers.
The backend will still be able to use wider registers for code written
using the intrinsics in x86intrin.h. And the backend will still be able to
use AVX512VL instructions and the additional XMM16-31 and YMM16-31
registers.



Motivation:

-Using 512-bit operations on some Intel CPUs may cause a decrease in CPU
frequency that may offset the gains from using the wider register size. See
section 15.26 of Intel® 64 and IA-32 Architectures Optimization Reference
Manual published October 2017.

-The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture
are only 256 bits wide. 512-bit instructions using these ALUs must use both
ports. See section 2.1 of Intel® 64 and IA-32 Architectures Optimization
Reference Manual published October 2017.



Implementation Plan:

-Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not
mapped to any CPU.

-Add mprefer-avx256 and mprefer-avx128 and the corresponding
-mno-prefer-avx128/256 options to clang's driver Options.td file. I believe
this will allow clang to pass these straight through to the -target-feature
attribute in IR.

-Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 is
enabled and neither prefer-avx256 nor prefer-avx128 is set. Similarly, return
256 if AVX is enabled and prefer-avx128 is not set.
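
The width selection described in the last item can be sketched roughly as follows. This is a minimal illustration, not the actual X86TTIImpl code: the SubtargetSketch struct, its field names, and the function name are hypothetical, and the 128-bit fallback is an assumption added for illustration (the proposal only spells out the 512 and 256 cases).

```cpp
#include <cassert>

// Hypothetical stand-in for the X86 subtarget. The field names are
// illustrative only and are not taken from the LLVM source tree.
struct SubtargetSketch {
  bool HasAVX = false;
  bool HasAVX512 = false;     // assumed to imply HasAVX in real hardware
  bool PreferAVX256 = false;  // models the proposed prefer-avx256 feature
  bool PreferAVX128 = false;  // models the proposed prefer-avx128 feature
};

// Vector register width, in bits, that would be reported to the vectorizers.
unsigned getVectorRegisterBitWidth(const SubtargetSketch &ST) {
  // 512 only when AVX512 is enabled and neither preference is set.
  if (ST.HasAVX512 && !ST.PreferAVX256 && !ST.PreferAVX128)
    return 512;
  // 256 when AVX is enabled and prefer-avx128 is not set.
  if (ST.HasAVX && !ST.PreferAVX128)
    return 256;
  // Assumption for this sketch: everything else falls back to 128.
  return 128;
}
```

In this sketch, a subtarget with AVX512 (and AVX) enabled reports 512 by default, 256 once PreferAVX256 is set, and 128 once PreferAVX128 is set, regardless of PreferAVX256.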



There may be some other backend changes needed, but I plan to address those
as we find them.


At a later point, consider making -mprefer-avx256 the default for Skylake
Server due to the above-mentioned performance considerations.



Does this sound reasonable?



*Latest Intel Optimization manual available here:
https://software.intel.com/en-us/articles/intel-sdm#optimization


-Craig Topper

Hal Finkel via llvm-dev

2017-Nov-02 00:32 UTC


On 11/01/2017 06:35 PM, Craig Topper via llvm-dev wrote:
>
> Hello all,
>
> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128 
> command line flags supported by latest GCC to clang. These flags will 
> be used to limit the vector register size presented by TTI to the 
> vectorizers. The backend will still be able to use wider registers for 
> code written using the intrinsics in x86intrin.h. And the backend 
> will still be able to use AVX512VL instructions and the additional 
> XMM16-31 and YMM16-31 registers.
>
> Motivation:
>
> -Using 512-bit operations on some Intel CPUs may cause a decrease in 
> CPU frequency that may offset the gains from using the wider register 
> size. See section 15.26 of Intel® 64 and IA-32 Architectures 
> Optimization Reference Manual published October 2017.
>
I'd certainly like to see these options (especially for this reason).

  -Hal
> -The vector ALUs on ports 0 and 1 of the Skylake Server 
> microarchitecture are only 256-bits wide. 512-bit instructions using 
> these ALUs must use both ports. See section 2.1 of Intel® 64 and IA-32 
> Architectures Optimization Reference Manual published October 2017.
>
> Implementation Plan:
>
> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td 
> not mapped to any CPU.
>
> -Add mprefer-avx256 and mprefer-avx128 and the corresponding 
> -mno-prefer-avx128/256 options to clang's driver Options.td file. I 
> believe this will allow clang to pass these straight through to the 
> -target-feature attribute in IR.
>
> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 
> is enabled and prefer-avx256 and prefer-avx128 is not set. Similarly 
> return 256 if AVX is enabled and prefer-avx128 is not set.
>
>
> There may be some other backend changes needed, but I plan to address 
> those as we find them.
>
>
> At a later point, consider making -mprefer-avx256 the default for 
> Skylake Server due to the above mentioned performance considerations.
>
> Does this sound reasonable?
>
> *Latest Intel Optimization manual available here: 
> https://software.intel.com/en-us/articles/intel-sdm#optimization
>
>
> -Craig Topper
>
>
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Tobias Grosser via llvm-dev

2017-Nov-02 11:57 UTC


Hi Craig,

this sounds like a good idea.

Best,
Tobias


Craig Topper via llvm-dev

2017-Nov-02 22:44 UTC


Reviews of the initial plumbing have been posted

https://reviews.llvm.org/D39575
https://reviews.llvm.org/D39576

~Craig


James Y Knight via llvm-dev

2017-Nov-03 02:04 UTC


On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hello all,
>
>
>
> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128
> command line flags supported by latest GCC to clang. These flags will be
> used to limit the vector register size presented by TTI to the vectorizers.
> The backend will still be able to use wider registers for code written
> using the intrinsics in x86intrin.h. And the backend will still be able to
> use AVX512VL instructions and the additional XMM16-31 and YMM16-31
> registers.
>
>
>
> Motivation:
>
> -Using 512-bit operations on some Intel CPUs may cause a decrease in CPU
> frequency that may offset the gains from using the wider register size. See
> section 15.26 of Intel® 64 and IA-32 Architectures Optimization Reference
> Manual published October 2017.
>
I note the doc mentions that 256-bit AVX operations also have the same
issue with reducing the CPU frequency, which is nice to see documented!

There are also the issues discussed here
<http://www.agner.org/optimize/blog/read.php?i=165> (and elsewhere) related
to warm-up time for the 256-bit execution pipeline, which is another issue
with using wide-vector ops.


> -The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture
> are only 256-bits wide. 512-bit instructions using these ALUs must use both
> ports. See section 2.1 of Intel® 64 and IA-32 Architectures Optimization
> Reference Manual published October 2017.
>
> Implementation Plan:
>
> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not
> mapped to any CPU.
>
> -Add mprefer-avx256 and mprefer-avx128 and the corresponding
> -mno-prefer-avx128/256 options to clang's driver Options.td file. I believe
> this will allow clang to pass these straight through to the -target-feature
> attribute in IR.
>
> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 is
> enabled and prefer-avx256 and prefer-avx128 is not set. Similarly return
> 256 if AVX is enabled and prefer-avx128 is not set.
>
Instead of multiple flags that have difficult-to-understand intersecting
behavior, one flag with a value would be better. E.g., what should
"-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256" do? No matter the
answer, it's confusing (similarly with other such combinations). Just a
single arg "-mprefer-avx={128/256/512}" (with no "no" version) seems easier
to understand to me (keeping the same behavior as you mention: asking to
prefer a larger width than is supported by your architecture should be fine
but ignored).
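
To make the single-value suggestion concrete, here is a rough sketch of how such an option could be resolved. Everything in it is hypothetical: the function names, the "-mprefer-avx=" spelling (taken from the suggestion above, not from any existing compiler), and the last-occurrence-wins rule, which is an assumption the thread does not settle. It needs C++17 for std::optional.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical parser for a single-valued option spelled -mprefer-avx=N.
// Assumption: when the option is repeated, the last occurrence wins.
std::optional<unsigned>
parsePreferredWidth(const std::vector<std::string> &Args) {
  const std::string Prefix = "-mprefer-avx=";
  std::optional<unsigned> Width;
  for (const std::string &Arg : Args) {
    if (Arg.compare(0, Prefix.size(), Prefix) != 0)
      continue;  // not our option
    const std::string Value = Arg.substr(Prefix.size());
    if (Value == "128")
      Width = 128;
    else if (Value == "256")
      Width = 256;
    else if (Value == "512")
      Width = 512;
    // Other values are silently skipped in this sketch; a real driver
    // would presumably emit a diagnostic instead.
  }
  return Width;
}

// A preference wider than the hardware supports is accepted but has no
// effect, matching the "fine but ignored" behavior described above.
unsigned effectiveWidth(std::optional<unsigned> Preferred,
                        unsigned MaxSupported) {
  return Preferred ? std::min(*Preferred, MaxSupported) : MaxSupported;
}
```

With one value there is no combination to reason about: under the assumptions above, "-mprefer-avx=256 -mprefer-avx=128" on an AVX512 target resolves to 128, and "-mprefer-avx=512" on an AVX2-only target resolves to 256.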




> There may be some other backend changes needed, but I plan to address those
> as we find them.
>
> At a later point, consider making -mprefer-avx256 the default for Skylake
> Server due to the above mentioned performance considerations.
>
> Does this sound reasonable?
>
> *Latest Intel Optimization manual available here:
> https://software.intel.com/en-us/articles/intel-sdm#optimization
>
> -Craig Topper
>

Eric Christopher via llvm-dev

2017-Nov-03 02:18 UTC


On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hello all,
>>
>>
>>
>> I would like to propose adding the -mprefer-avx256 and -mprefer-avx128
>> command line flags supported by latest GCC to clang. These flags will be
>> used to limit the vector register size presented by TTI to the vectorizers.
>> The backend will still be able to use wider registers for code written
>> using the intrinsics in x86intrin.h. And the backend will still be able to
>> use AVX512VL instructions and the additional XMM16-31 and YMM16-31
>> registers.
>>
>> Motivation:
>>
>> -Using 512-bit operations on some Intel CPUs may cause a decrease in CPU
>> frequency that may offset the gains from using the wider register size. See
>> section 15.26 of Intel® 64 and IA-32 Architectures Optimization Reference
>> Manual published October 2017.
>>
>
> I note the doc mentions that 256-bit AVX operations also have the same
> issue with reducing the CPU frequency, which is nice to see documented!
>
> There's also the issues discussed here <
> http://www.agner.org/optimize/blog/read.php?i=165> (and elsewhere)
> related to warm-up time for the 256-bit execution pipeline, which is
> another issue with using wide-vector ops.
>
>> -The vector ALUs on ports 0 and 1 of the Skylake Server microarchitecture
>> are only 256-bits wide. 512-bit instructions using these ALUs must use both
>> ports. See section 2.1 of Intel® 64 and IA-32 Architectures Optimization
>> Reference Manual published October 2017.
>>
>> Implementation Plan:
>>
>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in X86.td not
>> mapped to any CPU.
>>
>> -Add mprefer-avx256 and mprefer-avx128 and the corresponding
>> -mno-prefer-avx128/256 options to clang's driver Options.td file. I believe
>> this will allow clang to pass these straight through to the -target-feature
>> attribute in IR.
>>
>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if AVX512 is
>> enabled and prefer-avx256 and prefer-avx128 is not set. Similarly return
>> 256 if AVX is enabled and prefer-avx128 is not set.
>>
>
> Instead of multiple flags that have difficult to understand intersecting
> behavior, one flag with a value would be better. E.g., what should
> "-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256" do? No matter the
> answer, it's confusing. (Similarly with other such combinations). Just a
> single arg "-mprefer-avx={128/256/512}" (with no "no" version) seems easier
> to understand to me (keeping the same behavior as you mention: asking to
> prefer a larger width than is supported by your architecture should be fine
> but ignored).
>

I agree with this. It's a little more plumbing as far as subtarget features
etc. (represent via an optional value or just various "set the avx width"
features - the latter being easier, but uglier); however, it's probably the
right thing to do.

I was looking at this myself just a couple weeks ago and think this is the
right direction (when and how to turn things off) - and probably makes
sense to be a default for these architectures? We might end up needing to
check a couple of additional TTI places, but it sounds like you're on top
of it. :)

Thanks very much for doing this work.

-eric

>
>> There may be some other backend changes needed, but I plan to address
>> those as we find them.
>>
>> At a later point, consider making -mprefer-avx256 the default for Skylake
>> Server due to the above mentioned performance considerations.
>>
>> Does this sound reasonable?
>>
>> *Latest Intel Optimization manual available here:
>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>>
>> -Craig Topper
