Florian Weimer via llvm-dev
2020-Jul-10 17:30 UTC
[llvm-dev] New x86-64 micro-architecture levels
Most Linux distributions still compile against the original x86-64 baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel EM64T compatibility).

There has been an attempt to use the existing AT_PLATFORM-based loading mechanism in the glibc dynamic linker to enable a selection of optimized libraries. But the general selection mechanism in glibc is problematic:

  hwcaps subdirectory selection in the dynamic loader
  <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>

We also have the problem that the glibc version of "haswell" is distinct from GCC's -march=haswell (and presumably other compilers):

  Definition of "haswell" platform is inconsistent with GCC
  <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>

And that the selection criteria are not what people expect:

  Epyc and other current AMD CPUs do not select the "haswell" platform subdirectory
  <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>

Since the hwcaps-based selection does not work well regardless of architecture (even in cases where the kernel provides glibc with data), I worked on a new mechanism that does not have the problems associated with the old mechanism:

  [PATCH 00/30] RFC: elf: glibc-hwcaps support
  <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>

(Don't be concerned that these patches have not been reviewed; we are busy preparing the glibc 2.32 release, and these changes do not alter the glibc ABI itself, so they do not have immediate priority. I'm fairly confident that a version of these changes will make it into glibc 2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat Enterprise Linux 8.4. Debian as well, but I have never done anything like it there, so I don't know if the patches will be accepted.)

Out of the box, this should work fairly well for IBM POWER and Z, where there is a clear progression of silicon versions (at least on paper; virtualization may blur the picture somewhat).

However, for x86, we do not have such a clear progression of micro-architecture versions. This is not just a result of the AMD/Intel competition, but also due to ongoing product differentiation within one chip vendor. I think we need these levels broadly for the following reasons:

* Selecting on individual CPU features (similar to the old hwcaps mechanism) in glibc has scalability issues, particularly for LD_LIBRARY_PATH processing.

* Developers need guidance about useful targets for optimization. I think there is value in limiting the choices, in the sense that “if you are able to test three builds in total, these are the things you should build”.

* glibc and the compilers should align in their definition of the levels, so that developers can use an -march= option to build for a particular level that is recognized by glibc. This is why I think the description of the levels should go into the psABI supplement.

* A preference order for these levels avoids falling back to the K8 baseline if the platform progresses to a new version due to glibc/kernel/hypervisor/hardware upgrades.

I'm including a proposal for the levels below. I use single letters for them, but I expect that the concrete implementation of this proposal will use names like “x86-100”, “x86-101”, like in the glibc patch referenced above. (But we can discuss other approaches.)

I looked at various machines in the Red Hat labs and talked to Intel and AMD engineers about this, but this concrete proposal is based on my own analysis of the situation.
I excluded CPU features related to cryptography and cache management, including hardware transactional memory, and CPU timing. I assume that we will see some of these features being disabled by the firmware or the kernel over time. That would eliminate entire levels from selection, which is not desirable. For cryptographic code, I expect that localized selection of an optimized implementation works because such code tends to be isolated blocks, running for dozens of cycles each time, not something that gets scattered all over the place by the compiler.

We previously discussed not emitting VZEROUPPER at later levels, but I don't think this is beneficial because the ABI does not have callee-saved vector registers, so it can only be useful with local functions (or whatever LTO considers local), where there is no ABI impact anyway.

I did not include FSGSBASE because the FS base is already available at %fs:0. Changing the FS base in userspace breaks too much, so the main benefit is the tighter encoding of rdfsbase, which seems very slim.

Not covered in this are tuning decisions. I think we can benefit from some variance in this area between implementations; it should not affect correctness. 32-bit support is also a separate matter.

* Level A

  CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3

  This is one step above the K8 baseline and corresponds to a mainline CPU model ca. 2008 to 2011. It is also implemented by recent-ish generations of Intel Atom server CPUs (although I haven't tested the latest version). A 32-bit variant would have to list many additional CPU features here.

* Level B

  AVX, plus everything in level A.

  This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.

  For AVX and some of the following features, it is assumed that the run-time selection takes full support coverage (from silicon to the kernel) into account.

* Level C

  AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.

  This is close to what glibc currently calls "haswell".

* Level D

  AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in level C.

  This is the AVX-512 level implemented by Xeon Scalable Processors, not the Xeon Phi variant.

glibc (or an alternative loader implementation) would search for libraries starting at level D, going back to level A, and finally the baseline implementation in the default library location.

I expect that some distributions will also use these levels to set a baseline for the entire distribution (i.e., everything would be built to level A or maybe even level C), and these libraries would then be installed in the default location.

I'll be glad if I can get any feedback on this proposal. I plan to turn it into a merge request for the x86-64 psABI document eventually.

Thanks,
Florian
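[As an illustration of the run-time selection described in the proposal, and of the requirement that it take "full support coverage (from silicon to the kernel)" into account, here is a minimal, hypothetical C sketch of a Level C check using raw CPUID plus XGETBV. The bit positions follow the Intel/AMD manuals; the helper name and structure are invented for illustration, and this is not glibc's actual implementation, which also handles feature masking and other corner cases. It assumes a GCC or Clang recent enough to provide __get_cpuid_count in <cpuid.h>.]

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if the CPU and the kernel provide everything in Level C
   (AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus the Level A/B
   features checked below).  */
static bool
level_c_supported (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* CPUID leaf 1, ECX feature bits.  */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return false;
  const unsigned int leaf1_ecx =
      (1u << 0)    /* SSE3 */
    | (1u << 9)    /* SSSE3 */
    | (1u << 12)   /* FMA */
    | (1u << 13)   /* CMPXCHG16B */
    | (1u << 19)   /* SSE4.1 */
    | (1u << 20)   /* SSE4.2 */
    | (1u << 22)   /* MOVBE */
    | (1u << 23)   /* POPCNT */
    | (1u << 27)   /* OSXSAVE: XGETBV is usable */
    | (1u << 28)   /* AVX */
    | (1u << 29);  /* F16C */
  if ((ecx & leaf1_ecx) != leaf1_ecx)
    return false;

  /* CPUID leaf 7, subleaf 0, EBX feature bits.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  const unsigned int leaf7_ebx =
      (1u << 3)    /* BMI1 */
    | (1u << 5)    /* AVX2 */
    | (1u << 8);   /* BMI2 */
  if ((ebx & leaf7_ebx) != leaf7_ebx)
    return false;

  /* CPUID leaf 0x80000001, ECX: LAHF/SAHF in 64-bit mode, LZCNT (ABM).  */
  if (!__get_cpuid (0x80000001, &eax, &ebx, &ecx, &edx))
    return false;
  const unsigned int ext_ecx = (1u << 0) | (1u << 5);
  if ((ecx & ext_ecx) != ext_ecx)
    return false;

  /* XGETBV: the kernel must have enabled XMM and YMM state saving,
     otherwise AVX/AVX2 instructions fault even though CPUID reports them.  */
  unsigned int xcr0_lo, xcr0_hi;
  __asm__ ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
  return (xcr0_lo & 0x6) == 0x6;
}

int
main (void)
{
  puts (level_c_supported () ? "Level C available" : "Level C not available");
  return 0;
}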
Joseph Myers via llvm-dev
2020-Jul-10 19:14 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, 10 Jul 2020, Florian Weimer via Gcc wrote:

> * Level A
>
> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>
> This is one step above the K8 baseline and corresponds to a mainline CPU
> model ca. 2008 to 2011. It is also implemented by recent-ish
> generations of Intel Atom server CPUs (although I haven't tested the
> latest version). A 32-bit variant would have to list many additional
> CPU features here.

FWIW, this is also the limit of what can be run under QEMU emulation, as QEMU lacks support for AVX and newer instruction set features.

On the other hand, virtual machines seem liable to report something closer to the K8 baseline to the guest OS, missing the level A features, even when the underlying hardware supports everything in level B or level C.

-- 
Joseph S. Myers
joseph at codesourcery.com
H.J. Lu via llvm-dev
2020-Jul-10 21:42 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer at redhat.com> wrote:

> [full quote of the proposal above omitted]
>
> I'll be glad if I can get any feedback on this proposal. I plan to turn
> it into a merge request for the x86-64 psABI document eventually.

Looks good. I like it. My only concerns are:

1. Names like “x86-100”, “x86-101”: what features do they support?

2. I have a library with AVX2 and FMA; which directory should it go in?

Can we pass such info to ld.so and have ld.so print out the best directory name?

-- 
H.J.
Allan Sandfeld Jensen via llvm-dev
2020-Jul-11 07:40 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Friday, 10 July 2020 19:30:09 CEST Florian Weimer via Gcc wrote:

> glibc (or an alternative loader implementation) would search for
> libraries starting at level D, going back to level A, and finally the
> baseline implementation in the default library location.
>
> I expect that some distributions will also use these levels to set a
> baseline for the entire distribution (i.e., everything would be built to
> level A or maybe even level C), and these libraries would then be
> installed in the default location.
>
> I'll be glad if I can get any feedback on this proposal. I plan to turn
> it into a merge request for the x86-64 psABI document eventually.

Sounds good, though if I could dream, I would also love a partial replacement option, so that you could have a generic x86-64 binary that only had some AVX2-optimized replacement functions in a supplementary library.

Perhaps this could be implemented by marking the library as a partial replacement, so the dynamic linker would also load the base or lower libraries, except for functions already resolved.

You could also add a level E for the AVX-512 instructions in Ice Lake and above. The VBMI1/2 instructions would likely be useful for autovectorization in GCC.

'Allan
Richard Biener via llvm-dev
2020-Jul-13 06:23 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, Jul 10, 2020 at 11:45 PM H.J. Lu via Gcc <gcc at gcc.gnu.org> wrote:

> On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer at redhat.com> wrote:
> > [full quote of the proposal above omitted]
> >
> > I'll be glad if I can get any feedback on this proposal. I plan to turn
> > it into a merge request for the x86-64 psABI document eventually.
>
> Looks good. I like it.

Likewise.
Btw, did you check that VIA family chips slot into Level A at least? Where do AMD bdverN chips slot in?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”: what features do they support?

Indeed, I didn't get the -100, -101 part. On the GCC side I'd have suggested -march=generic-{A,B,C,D}, implying the respective -mtune.

Do the patches end up annotating ELF binaries with the architecture level, and does ld.so check that info? For example, IIRC there's a penalty for switching between VEX and non-VEX encoded instructions, so even on AVX-capable hardware it might be profitable to use non-AVX libraries if the program is using only architecture level A. On that note, does architecture level B+ suggest using VEX encoding everywhere?

It would indeed be nice to have the architecture levels documented in the psABI.

> 2. I have a library with AVX2 and FMA; which directory should it go in?

Eventually GCC/gas could annotate objects with the lowest architecture level that is applicable?

Thanks for doing this,
Richard.

> Can we pass such info to ld.so and have ld.so print out the best directory
> name?
>
> --
> H.J.
Florian Weimer via llvm-dev
2020-Jul-13 06:49 UTC
[llvm-dev] New x86-64 micro-architecture levels
* H. J. Lu:

> Looks good. I like it.

Thanks. What do you think about Level B? Should we keep it?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”: what features do they support?

I think we can add more diagnostic output to ld.so --help. My patch does not show individual CPU flags, but I agree this could be useful. (It's not needed for the legacy HWCAP subdirectories because, in general, those are named and defined by the kernel, not by individually named CPU feature flags.)

> 2. I have a library with AVX2 and FMA; which directory should it go in?
>
> Can we pass such info to ld.so and have ld.so print out the best directory
> name?

I think this would require generating matching GNU property notes (listing the CPU features required by the binary). Once we have that, we can add something to binutils or indeed ld.so to analyze them and print the recommended directory. But I think this is something that could come later.

We can also write a GCC header which looks at macros such as __AVX2__ and prints a #warning with the recommended directory name. Checking for excess flags will be tricky in this context, though, and if we miss something, a wrong recommendation will be the result.

Thanks,
Florian
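[For illustration, a hypothetical sketch of such a header follows. The subdirectory names are placeholders reusing the level letters from this thread (the real names were still under discussion), the macro checks mirror the level definitions above, and, as noted, this cannot detect excess flags such as an extra -mavx512vbmi on top of Level D.]

#if defined __AVX512F__ && defined __AVX512BW__ && defined __AVX512CD__ \
    && defined __AVX512DQ__ && defined __AVX512VL__
# warning "recommended subdirectory: glibc-hwcaps/<level D>"
#elif defined __AVX2__ && defined __BMI__ && defined __BMI2__ \
    && defined __F16C__ && defined __FMA__ && defined __LZCNT__ \
    && defined __MOVBE__
# warning "recommended subdirectory: glibc-hwcaps/<level C>"
#elif defined __AVX__
# warning "recommended subdirectory: glibc-hwcaps/<level B>"
#elif defined __SSE4_2__ && defined __POPCNT__
# warning "recommended subdirectory: glibc-hwcaps/<level A>"
#else
# warning "recommended subdirectory: default library directory"
#endif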
Florian Weimer via llvm-dev
2020-Jul-13 06:58 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Allan Sandfeld Jensen:

> On Friday, 10 July 2020 19:30:09 CEST Florian Weimer via Gcc wrote:
>> glibc (or an alternative loader implementation) would search for
>> libraries starting at level D, going back to level A, and finally the
>> baseline implementation in the default library location.
>>
>> I expect that some distributions will also use these levels to set a
>> baseline for the entire distribution (i.e., everything would be built to
>> level A or maybe even level C), and these libraries would then be
>> installed in the default location.
>>
>> I'll be glad if I can get any feedback on this proposal. I plan to turn
>> it into a merge request for the x86-64 psABI document eventually.
>
> Sounds good, though if I could dream, I would also love a partial
> replacement option, so that you could have a generic x86-64 binary
> that only had some AVX2-optimized replacement functions in a
> supplementary library.
>
> Perhaps this could be implemented by marking the library as a partial
> replacement, so the dynamic linker would also load the base or lower
> libraries, except for functions already resolved.

I think you can do something like it today, at least from the glibc dynamic loader perspective. Programs link against the soname of the optimized shared object (which can be empty), and that shared object depends on the object with the fallback implementation. A special link-only shared object containing just the ABI under the front soname (that of the optimized library) would be used via a linker object, so that it is not possible to accidentally link against the wrong soname.

For non-versioned symbols, this setup has worked since forever. For versioned symbols, delegating from the optimized to the unoptimized library needs at least glibc 2.30, with commit f0b2132b35248c1f4a80 ("ld.so: Support moving versioned symbols between sonames [BZ #24741]"), although some of us have backported this commit into earlier releases.

Where this falls flat is support for LTO and -fno-semantic-interposition. Some care is needed to make precisely the right set of symbols interposable.

But to be honest, I'm not sure if this entire mechanism is a big improvement over function multi-versioning.

Thanks,
Florian
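[For comparison, the function multi-versioning route mentioned here can look like the following with GCC's target_clones attribute, which emits several clones of a function plus an IFUNC resolver that picks one at load time. This is a minimal sketch; the function and its body are made up for illustration.]

#include <stddef.h>
#include <stdio.h>

/* GCC builds a "default" clone and an AVX2 clone of this function and
   selects between them via an IFUNC resolver when the program starts.  */
__attribute__ ((target_clones ("avx2", "default")))
double
dot (const double *a, const double *b, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i)
    sum += a[i] * b[i];      /* the AVX2 clone can vectorize this loop */
  return sum;
}

int
main (void)
{
  double a[4] = { 1, 2, 3, 4 }, b[4] = { 4, 3, 2, 1 };
  printf ("%f\n", dot (a, b, 4));
  return 0;
}

[Unlike per-level library directories, this keeps everything in one binary, but each hot function has to be annotated and compiled per target.]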
Florian Weimer via llvm-dev
2020-Jul-13 07:55 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Joseph Myers:

> On Fri, 10 Jul 2020, Florian Weimer via Gcc wrote:
>
>> * Level A
>>
>> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>>
>> This is one step above the K8 baseline and corresponds to a mainline CPU
>> model ca. 2008 to 2011. It is also implemented by recent-ish
>> generations of Intel Atom server CPUs (although I haven't tested the
>> latest version). A 32-bit variant would have to list many additional
>> CPU features here.
>
> FWIW, this is also the limit of what can be run under QEMU emulation, as
> QEMU lacks support for AVX and newer instruction set features.

Oh, I had forgotten about that. I should have Cc:ed the QEMU folks as well. We'll need to make sure that we have matching CPU models in QEMU/libvirt, even for the levels that do not have TCG support.

valgrind is another consumer, but in my tests, it was mostly okay with AVX2 code (but that was without auto-vectorization). AVX-512 is a different matter, but that is also much further out.

> On the other hand, virtual machines seem liable to report something closer
> to the K8 baseline to the guest OS, missing the level A features, even
> when the underlying hardware supports everything in level B or level C.

They do this to support migration. I suspect that in many cases, those are just configuration errors. That's why I want at least one major distribution to switch to Level C as the baseline, to clean the pipes. Then even those distributions that depend on run-time selection for performance-critical code will benefit. 8-/

Thanks,
Florian
Mark Wielaard via llvm-dev
2020-Jul-15 14:38 UTC
[llvm-dev] New x86-64 micro-architecture levels
Hi Florian,

I understand you want to discuss the x86_64 micro-architecture levels only in this thread, but it would be nice to have a similar discussion for other architectures.

One thing that wasn't clear to me from this proposal is how the glibc dynamic loader checks for the CPU feature flags. This is important for valgrind, since it can communicate those through different means: cpuid interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM at the moment), and of course we can generate SIGILL for unsupported instructions. We currently don't intercept /proc/cpuinfo (but could).

I think it is important to be precise here, because in the past this has sometimes caused confusion, for example about how to check correctly for avx, lzcnt, or fma[4] support.

Thanks,

Mark

P.S. I don't particularly like the numbered names, but well, bike-shed...
H.J. Lu via llvm-dev
2020-Jul-15 14:45 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Wed, Jul 15, 2020 at 7:38 AM Mark Wielaard <mark at klomp.org> wrote:
>
> Hi Florian,
>
> I understand you want to discuss the x86_64 micro-architecture levels
> only in this thread, but it would be nice to have a similar discussion
> for other architectures.
>
> One thing that wasn't clear to me from this proposal is how the glibc
> dynamic loader checks for the CPU feature flags. This is important for
> valgrind, since it can communicate those through different means: cpuid
> interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
> at the moment), and of course we can generate SIGILL for unsupported
> instructions. We currently don't intercept /proc/cpuinfo (but could).

In a library, we can use <sys/platform/x86.h>:

https://sourceware.org/pipermail/libc-alpha/2020-June/115546.html

In GCC, we can use __builtin_cpu_supports. <sys/platform/x86.h> supports all features, and __builtin_cpu_supports in GCC 11 supports all features which GCC has codegen for.

> I think it is important to be precise here, because in the past this
> has sometimes caused confusion, for example about how to check correctly
> for avx, lzcnt, or fma[4] support.
>
> Thanks,
>
> Mark
>
> P.S. I don't particularly like the numbered names, but well, bike-shed...

-- 
H.J.
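[As a sketch of the __builtin_cpu_supports route, applied to the AVX2 + FMA example from earlier in the thread: the function names here are made up, and only the builtins and their "avx2"/"fma" arguments are real GCC interfaces; <sys/platform/x86.h> from the patch above offers comparable checks for code that cannot use the builtins.]

#include <stdio.h>

static void transform_avx2_fma (void) { puts ("using the AVX2+FMA code path"); }
static void transform_generic (void)  { puts ("using the generic code path"); }

/* Selected once at startup; defaults to the portable implementation.  */
static void (*transform) (void) = transform_generic;

__attribute__ ((constructor))
static void
select_transform (void)
{
  /* Make sure the CPU detection data is initialized; needed when this
     runs before other initialization code.  */
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    transform = transform_avx2_fma;
}

int
main (void)
{
  transform ();
  return 0;
}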
Florian Weimer via llvm-dev
2020-Jul-15 14:56 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Mark Wielaard:

> One thing that wasn't clear to me from this proposal is how the glibc
> dynamic loader checks for the CPU feature flags. This is important for
> valgrind, since it can communicate those through different means: cpuid
> interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
> at the moment), and of course we can generate SIGILL for unsupported
> instructions. We currently don't intercept /proc/cpuinfo (but could).

glibc uses CPUID in combination with XGETBV. There is also a masking feature which I have not reviewed, but given that it only takes features away, I don't think it matters to valgrind.

Thanks,
Florian
Mallappa, Premachandra via llvm-dev
2020-Jul-21 16:05 UTC
[llvm-dev] New x86-64 micro-architecture levels
[AMD Public Use]

Hi Florian,

> I'm including a proposal for the levels below. I use single letters for them, but I expect that the concrete implementation of this proposal will use
> names like “x86-100”, “x86-101”, like in the glibc patch referenced above. (But we can discuss other approaches.)

Personally I am not a big fan of this, for two reasons:
1. it uses just "x86" in the name on x86_64 as well;
2. 100/101 are not very intuitive.

> * Level A
...
> * Level B
> This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.

Yes, agreed; the delta is too small and can be clubbed into A or C.

> * Level C
> * Level D

The others are in line with what we expect as a logical grouping. As you mentioned, it is not easy to tackle this.

We would also like to have dynamic loader support for "zen" / "zen2" as a variant of Level D that takes preference over Level D and may hold super-optimized libraries from AMD or other vendors. These libraries are expected to be optimized according to micro-architectural details, not just the ISA. Probably we can discuss this on the hwcaps thread.

-Prem
Florian Weimer via llvm-dev
2020-Jul-21 18:04 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Premachandra Mallappa:

> [AMD Public Use]
>
> Hi Florian,
>
>> I'm including a proposal for the levels below. I use single letters for them, but I expect that the concrete implementation of this proposal will use
>> names like “x86-100”, “x86-101”, like in the glibc patch referenced above. (But we can discuss other approaches.)
>
> Personally I am not a big fan of this, for two reasons:
> 1. it uses just "x86" in the name on x86_64 as well

That's deliberate, so that we can use the same x86-* names for 32-bit library selection (once we define matching micro-architecture levels there). GCC has -m32 -march=x86-64 for K8 without 3DNow! (essentially the shared x86-64/EM64T baseline), but I find this a bit confusing.

> 2. 100/101 are not very intuitive

Any suggestions? The advantage is that these numbers show a strong preference ordering. They do not make false suggestions about feature sets (if we named Level C "x86-avx2", it would still be wrong for glibc to load libraries found in that directory just because a system has AVX2 support, because the libraries might also need FMA, based on the Level C definition). On the GCC side, it avoids confusion between -mavx2 and -march=x86-avx2.

If numbers are out, what should we use instead? x86-sse4, x86-avx2, x86-avx512? Would that work?

>> * Level A
> ...
>> * Level B
>> This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.
>
> Yes, agreed; the delta is too small and can be clubbed into A or C.

Let's merge Level B into Level C then?

>> * Level C
>> * Level D
>
> The others are in line with what we expect as a logical grouping.

Thanks.

> We would also like to have dynamic loader support for "zen" / "zen2" as a
> variant of Level D that takes preference over Level D and may hold
> super-optimized libraries from AMD or other vendors.

*That* shouldn't be too hard to implement if we can nail down the selection criteria. Let's call this Zen-specific Level C x86-zen-avx2 for the sake of exposition.

What's going to be difficult is the choice for a hypothetical Zen successor that's compatible feature-flag-wise with Level D. Basically, there are two choices here:

* Level D wins because it's the more powerful ISA.

* x86-zen-avx2 wins because it has the Zen architecture optimizations.

There's also a related issue with Level C vs. x86-zen-avx2, depending on how we implement the Zen detection for AMD family numbers in the glibc dynamic linker. What I mean by this: glibc detects that this is a Level C-capable Zen-type CPU, but it's not one of the family/model numbers that were hard-coded into the glibc sources. What should we do then? Should we still prefer the x86-zen-avx2 library over the Level C library?

> These libraries are expected to be optimized according to
> micro-architectural details, not just the ISA.

If it's supposed to be generally useful, we really need to document the selection criteria for the subdirectory and make sure that it matches what these libraries actually require at run time in terms of ISA. I want to avoid two things here specifically: a hardware upgrade results in crashes because we incorrectly load an incompatible library; and, if possible, a hardware upgrade (or kernel/hypervisor upgrade that exposes more of the actual hardware) causes us to drop optimizations, so that users experience a performance regression.

With the levels I proposed, these aspects are covered. But if we start to create vendor-specific forks in the feature progression, things get complicated.
Do you think we need to figure this out in this iteration? If yes, then I really need a semi-formal description of the selection criteria for this x86-zen-avx2 directory, so that I can pass it along with my psABI proposal.

Thanks,
Florian