thr3ads.net - llvm dev - [llvm-dev] New x86-64 micro-architecture levels [Jul 2020]

Richard Biener via llvm-dev

2020-Jul-13 06:23 UTC

[llvm-dev] New x86-64 micro-architecture levels

On Fri, Jul 10, 2020 at 11:45 PM H.J. Lu via Gcc <gcc at gcc.gnu.org> wrote:
>
> On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer at redhat.com> wrote:
> >
> > Most Linux distributions still compile against the original x86-64
> > baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
> > EM64T compatibility).
> >
> > There has been an attempt to use the existing AT_PLATFORM-based loading
> > mechanism in the glibc dynamic linker to enable a selection of optimized
> > libraries.  But the general selection mechanism in glibc is problematic:
> >
> >   hwcaps subdirectory selection in the dynamic loader
> >   <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>
> >
> > We also have the problem that the glibc version of "haswell" is distinct
> > from GCC's -march=haswell (and presumably other compilers):
> >
> >   Definition of "haswell" platform is inconsistent with GCC
> >   <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>
> >
> > And that the selection criteria are not what people expect:
> >
> >   Epyc and other current AMD CPUs do not select the "haswell" platform
> >   subdirectory
> >   <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>
> >
> > Since the hwcaps-based selection does not work well regardless of
> > architecture (even in cases the kernel provides glibc with data), I
> > worked on a new mechanism that does not have the problems associated
> > with the old mechanism:
> >
> >   [PATCH 00/30] RFC: elf: glibc-hwcaps support
> >   <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>
> >
> > (Don't be concerned that these patches have not been reviewed; we are
> > busy preparing the glibc 2.32 release, and these changes do not alter
> > the glibc ABI itself, so they do not have immediate priority.  I'm
> > fairly confident that a version of these changes will make it into glibc
> > 2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
> > Enterprise Linux 8.4.  Debian as well, but I have never done anything
> > like it there, so I don't know if the patches will be accepted.)
> >
> > Out of the box, this should work fairly well for IBM POWER and Z, where
> > there is a clear progression of silicon versions (at least on paper
> > —virtualization may blur the picture somewhat).
> >
> > However, for x86, we do not have such a clear progression of
> > micro-architecture versions.  This is not just as a result of the
> > AMD/Intel competition, but also due to ongoing product differentiation
> > within one chip vendor.  I think we need these levels broadly for the
> > following reasons:
> >
> > * Selecting on individual CPU features (similar to the old hwcaps
> >   mechanism) in glibc has scalability issues, particularly for
> >   LD_LIBRARY_PATH processing.
> >
> > * Developers need guidance about useful targets for optimization.  I
> >   think there is value in limiting the choices, in the sense that “if
> >   you are able to test three builds in total, these are the things you
> >   should build”.
> >
> > * glibc and the compilers should align in their definition of the
> >   levels, so that developers can use an -march= option to build for a
> >   particular level that is recognized by glibc.  This is why I think the
> >   description of the levels should go into the psABI supplement.
> >
> > * A preference order for these levels avoids falling back to the K8
> >   baseline if the platform progresses to a new version due to
> >   glibc/kernel/hypervisor/hardware upgrades.
> >
> > I'm including a proposal for the levels below.  I use single letters for
> > them, but I expect that the concrete implementation of this proposal
> > will use names like “x86-100”, “x86-101”, like in the glibc patch
> > referenced above.  (But we can discuss other approaches.)
> >
> > I looked at various machines in the Red Hat labs and talked to Intel and
> > AMD engineers about this, but this concrete proposal is based on my own
> > analysis of the situation.  I excluded CPU features related to
> > cryptography and cache management, including hardware transactional
> > memory, and CPU timing.  I assume that we will see some of these
> > features being disabled by the firmware or the kernel over time.  That
> > would eliminate entire levels from selection, which is not desirable.
> > For cryptographic code, I expect that localized selection of an
> > optimized implementation works because such code tends to be isolated
> > blocks, running for dozens of cycles each time, not something that gets
> > scattered all over the place by the compiler.
> >
> > We previously discussed not emitting VZEROUPPER at later levels, but I
> > don't think this is beneficial because the ABI does not have
> > callee-saved vector registers, so it can only be useful with local
> > functions (or whatever LTO considers local), where there is no ABI
> > impact anyway.
> >
> > I did not include FSGSBASE because the FS base is already available at
> > %fs:0.  Changing the FS base in userspace breaks too much, so the main
> > benefit is the tighter encoding of rdfsbase, which seems very slim.
> >
> > Not covered in this are tuning decisions.  I think we can benefit from
> > some variance in this area between implementations; it should not affect
> > correctness.  32-bit support is also a separate matter.
> >
> > * Level A
> >
> > CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
> >
> > This is one step above the K8 baseline and corresponds to a mainline CPU
> > model ca. 2008 to 2011.  It is also implemented by recent-ish
> > generations of Intel Atom server CPUs (although I haven't tested the
> > latest version).  A 32-bit variant would have to list many additional
> > CPU features here.
> >
> > * Level B
> >
> > AVX, plus everything in level A.
> >
> > This step is so small that it probably can be dropped, unless the
> > benefits from using VEX encoding are truly significant.
> >
> > For AVX and some of the following features, it is assumed that the
> > run-time selection takes full support coverage (from silicon to the
> > kernel) into account.
> >
> > * Level C
> >
> > AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.
> >
> > This is close to what glibc currently calls "haswell".
> >
> > * Level D
> >
> > AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
> > level C.
> >
> > This is the AVX-512 level implemented by Xeon Scalable Processors, not
> > the Xeon Phi variant.
> >
> >
> > glibc (or an alternative loader implementation) would search for
> > libraries starting at level D, going back to level A, and finally the
> > baseline implementation in the default library location.
> >
> > I expect that some distributions will also use these levels to set a
> > baseline for the entire distribution (i.e., everything would be built to
> > level A or maybe even level C), and these libraries would then be
> > installed in the default location.
> >
> > I'll be glad if I can get any feedback on this proposal.  I plan to turn
> > it into a merge request for the x86-64 psABI document eventually.
> >
>
> Looks good.  I like it.

Likewise.  Btw, did you check that VIA family chips slot into Level A
at least?  Where do AMD bdverN slot in?

>  My only concerns are
>
> 1. Names like “x86-100”, “x86-101”, what features do they support?

Indeed I didn't get the -100, -101 part.  On the GCC side I'd have
suggested -march=generic-{A,B,C,D} implying the respective
-mtune.

Do the patches end up annotating ELF binaries with the architecture
level and does ld.so check that info?

For example IIRC there's a penalty to switch between VEX and
not VEX encoded instructions so even on AVX capable hardware
it might be profitable to use non-AVX libraries if the program is
using only architecture level A?

On that side, does architecture level B+ suggest using VEX encoding
everywhere?  It would be indeed nice to have the architecture levels
documented in the psABI.

> 2. I have a library with AVX2 and FMA, which directory should it go?

Eventually GCC/gas can annotate objects with the lowest architecture
level that is applicable?

Thanks for doing this,
Richard.

> Can we pass such info to ld.so and ld.so prints out the best directory
> name?
>
> --
> H.J.

Florian Weimer via llvm-dev

2020-Jul-13 07:40 UTC

[llvm-dev] New x86-64 micro-architecture levels

* Richard Biener:

>> Looks good.  I like it.
>
> Likewise.  Btw, did you check that VIA family chips slot into Level A
> at least?

Those seem to lack SSE4.2, so they land in the baseline.

> Where do AMD bdverN slot in?

bdver1 to bdver3 (as defined by GCC) should land in Level B (so Level A
if that is dropped).  bdver4 and znver1 (and later) should land in
Level C.

>>  My only concerns are
>>
>> 1. Names like “x86-100”, “x86-101”, what features do they support?
>
> Indeed I didn't get the -100, -101 part.  On the GCC side I'd have
> suggested -march=generic-{A,B,C,D} implying the respective
> -mtune.

With literal A, B, C, D, or are they just placeholders?  If not literal
levels, then what should we use there?

I like the simplicity of numbers.  I used letters in the proposal to
avoid confusion if we alter the proposal by dropping levels, shifting
the meaning of those that come later.  I expect to switch back to
numbers again for the final version.

> Do the patches end up annotating ELF binaries with the architecture
> level and does ld.so check that info?

This is a separate feature that H.J. has been working on.

> For example IIRC there's a penalty to switch between VEX and
> not VEX encoded instructions so even on AVX capable hardware
> it might be profitable to use non-AVX libraries if the program is
> using only architecture level A?

But this is impossible to know in general.  It may also be possible that
the library contains an inner loop that can be nicely vectorized with
AVX instructions, but not with SSE4.2 instructions and earlier.  Then
preferring the non-AVX version would be a mistake.

Regarding the transition penalty, I believe this is mostly addressed by
those VZEROUPPER instructions?  I've already explained why I think those
aren't a viable optimization target, given the current calling
convention.

My glibc patches already provide a way to mask subdirectories which
would otherwise be selected, so manual optimization is still possible.

> On that side, does architecture level B+ suggest using VEX encoding
> everywhere?  It would be indeed nice to have the architecture levels
> documented in the psABI.

I think this falls under optimization, and I really did not want to
discuss.

If there is a plan to change/amend the calling convention and some of
the levels should prefer to that, it's a different matter, of course.
(glibc can only give you four callee-saved 256-bit wide registers
easily, though, more would need close cooperation with GCC.)

The new glibc-hwcaps scheme in glibc scales a bit better than the old
one, so we do not have to settle this immediately and could add
additional subdirectories for objects that follow new calling convention
requirements.

>> 2. I have a library with AVX2 and FMA, which directory should it go?
>
> Eventually GCC/gas can annotate objects with the lowest architecture
> level that is applicable?

H.J. has patches for ELF program properties.  I think
GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information.  This
proposal and the glibc patches are independent of that.

If that function ever gets deployed, I plan to add those notes to
ld.so.cache, so that ld.so can select shared objects based on them (or
any allocated ELF note, really).  Efficient LD_LIBRARY_PATH support is
not possible, I think, so those designated glibc-hwcaps subdirectories
still have a place.

Thanks,
Florian
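
The search order described in this thread (start at the best level the machine supports, walk back down to level A, then fall back to the default library location, with the option to mask individual subdirectories) could be sketched as follows in Python. This is only an illustration under stated assumptions: the `glibc-hwcaps/level-X` directory layout, the function name, and the `masked` parameter are invented for this example and are not taken from the glibc patches, which are expected to use different names.

```python
import posixpath

# Placeholder level letters from the proposal, lowest to highest.
LEVEL_ORDER = ["A", "B", "C", "D"]


def search_path(libdir, cpu_level, masked=()):
    """List directories to probe, best level first, default location last.

    cpu_level is one of "A".."D", or None for a baseline-only CPU.
    masked holds level letters whose subdirectories should be skipped.
    The "glibc-hwcaps/level-X" layout is a made-up naming scheme.
    """
    dirs = []
    if cpu_level is not None:
        # Every level up to and including the CPU's level is usable.
        usable = LEVEL_ORDER[: LEVEL_ORDER.index(cpu_level) + 1]
        for level in reversed(usable):
            if level not in masked:
                dirs.append(posixpath.join(libdir, "glibc-hwcaps", "level-" + level))
    # The baseline build in the default location is always the last resort.
    dirs.append(libdir)
    return dirs
```

For a level C machine, `search_path("/usr/lib64", "C")` lists the level C, B, and A subdirectories in that order, followed by `/usr/lib64` itself. Passing `masked={"B"}` drops only the level B entry, which mirrors the manual override mentioned above.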

Jan Beulich via llvm-dev

2020-Jul-13 07:47 UTC

[llvm-dev] New x86-64 micro-architecture levels

On 13.07.2020 09:40, Florian Weimer wrote:
> * Richard Biener:
>>> 2. I have a library with AVX2 and FMA, which directory should it go?
>>
>> Eventually GCC/gas can annotate objects with the lowest architecture
>> level that is applicable?
>
> H.J. has patches for ELF program properties.  I think
> GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information.  This
> proposal and the glibc patches are independent of that.

From (partly just halfway) recent discussions with H.J. I gained
the understanding that the piece we're aiming at getting to work
properly is the recording of GNU_PROPERTY_X86_FEATURE_2_*, not
so much GNU_PROPERTY_X86_ISA_1_*. If the ISA one is to be used as
a basis here, a lot of new flags will need adding (and properly
setting) first, I think.

Jan

Richard Biener via llvm-dev

2020-Jul-13 08:57 UTC

[llvm-dev] New x86-64 micro-architecture levels

On Mon, Jul 13, 2020 at 9:40 AM Florian Weimer <fweimer at redhat.com> wrote:
>
> * Richard Biener:
>
> >> Looks good.  I like it.
> >
> > Likewise.  Btw, did you check that VIA family chips slot into Level A
> > at least?
>
> Those seem to lack SSE4.2, so they land in the baseline.
>
> > Where do AMD bdverN slot in?
>
> bdver1 to bdver3 (as defined by GCC) should land in Level B (so Level A
> if that is dropped).  bdver4 and znver1 (and later) should land in
> Level C.
>
> >>  My only concerns are
> >>
> >> 1. Names like “x86-100”, “x86-101”, what features do they support?
> >
> > Indeed I didn't get the -100, -101 part.  On the GCC side I'd have
> > suggested -march=generic-{A,B,C,D} implying the respective
> > -mtune.
>
> With literal A, B, C, D, or are they just placeholders?  If not literal
> levels, then what should we use there?
>
> I like the simplicity of numbers.  I used letters in the proposal to
> avoid confusion if we alter the proposal by dropping levels, shifting
> the meaning of those that come later.  I expect to switch back to
> numbers again for the final version.

They are indeed placeholders, though I somehow prefer letters to
numbers.  But this is really bike-shedding territory.  Good documentation
on the tools side will be more important, as well as consistent spelling
between tool sets, possibly driven by a good choice from within the
psABI document.

Richard.
