Richard Biener via llvm-dev
2020-Jul-13 06:23 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, Jul 10, 2020 at 11:45 PM H.J. Lu via Gcc <gcc at gcc.gnu.org> wrote:
>
> On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer at redhat.com> wrote:
> >
> > Most Linux distributions still compile against the original x86-64
> > baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
> > EM64T compatibility).
> >
> > There has been an attempt to use the existing AT_PLATFORM-based loading
> > mechanism in the glibc dynamic linker to enable a selection of optimized
> > libraries. But the general selection mechanism in glibc is problematic:
> >
> >   hwcaps subdirectory selection in the dynamic loader
> >   <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>
> >
> > We also have the problem that the glibc version of "haswell" is distinct
> > from GCC's -march=haswell (and presumably other compilers):
> >
> >   Definition of "haswell" platform is inconsistent with GCC
> >   <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>
> >
> > And that the selection criteria are not what people expect:
> >
> >   Epyc and other current AMD CPUs do not select the "haswell" platform
> >   subdirectory
> >   <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>
> >
> > Since the hwcaps-based selection does not work well regardless of
> > architecture (even in cases the kernel provides glibc with data), I
> > worked on a new mechanism that does not have the problems associated
> > with the old mechanism:
> >
> >   [PATCH 00/30] RFC: elf: glibc-hwcaps support
> >   <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>
> >
> > (Don't be concerned that these patches have not been reviewed; we are
> > busy preparing the glibc 2.32 release, and these changes do not alter
> > the glibc ABI itself, so they do not have immediate priority. I'm
> > fairly confident that a version of these changes will make it into glibc
> > 2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
> > Enterprise Linux 8.4. Debian as well, but I have never done anything
> > like it there, so I don't know if the patches will be accepted.)
> >
> > Out of the box, this should work fairly well for IBM POWER and Z, where
> > there is a clear progression of silicon versions (at least on paper;
> > virtualization may blur the picture somewhat).
> >
> > However, for x86, we do not have such a clear progression of
> > micro-architecture versions. This is not just as a result of the
> > AMD/Intel competition, but also due to ongoing product differentiation
> > within one chip vendor. I think we need these levels broadly for the
> > following reasons:
> >
> > * Selecting on individual CPU features (similar to the old hwcaps
> >   mechanism) in glibc has scalability issues, particularly for
> >   LD_LIBRARY_PATH processing.
> >
> > * Developers need guidance about useful targets for optimization. I
> >   think there is value in limiting the choices, in the sense that “if
> >   you are able to test three builds in total, these are the things you
> >   should build”.
> >
> > * glibc and the compilers should align in their definition of the
> >   levels, so that developers can use an -march= option to build for a
> >   particular level that is recognized by glibc. This is why I think the
> >   description of the levels should go into the psABI supplement.
> >
> > * A preference order for these levels avoids falling back to the K8
> >   baseline if the platform progresses to a new version due to
> >   glibc/kernel/hypervisor/hardware upgrades.
> >
> > I'm including a proposal for the levels below. I use single letters for
> > them, but I expect that the concrete implementation of this proposal
> > will use names like “x86-100”, “x86-101”, like in the glibc patch
> > referenced above. (But we can discuss other approaches.)
> >
> > I looked at various machines in the Red Hat labs and talked to Intel and
> > AMD engineers about this, but this concrete proposal is based on my own
> > analysis of the situation. I excluded CPU features related to
> > cryptography and cache management, including hardware transactional
> > memory, and CPU timing. I assume that we will see some of these
> > features being disabled by the firmware or the kernel over time. That
> > would eliminate entire levels from selection, which is not desirable.
> > For cryptographic code, I expect that localized selection of an
> > optimized implementation works because such code tends to be isolated
> > blocks, running for dozens of cycles each time, not something that gets
> > scattered all over the place by the compiler.
> >
> > We previously discussed not emitting VZEROUPPER at later levels, but I
> > don't think this is beneficial because the ABI does not have
> > callee-saved vector registers, so it can only be useful with local
> > functions (or whatever LTO considers local), where there is no ABI
> > impact anyway.
> >
> > I did not include FSGSBASE because the FS base is already available at
> > %fs:0. Changing the FS base in userspace breaks too much, so the main
> > benefit is the tighter encoding of rdfsbase, which seems very slim.
> >
> > Not covered in this are tuning decisions. I think we can benefit from
> > some variance in this area between implementations; it should not affect
> > correctness. 32-bit support is also a separate matter.
> >
> > * Level A
> >
> > CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
> >
> > This is one step above the K8 baseline and corresponds to a mainline CPU
> > model ca. 2008 to 2011. It is also implemented by recent-ish
> > generations of Intel Atom server CPUs (although I haven't tested the
> > latest version). A 32-bit variant would have to list many additional
> > CPU features here.
> >
> > * Level B
> >
> > AVX, plus everything in level A.
> >
> > This step is so small that it probably can be dropped, unless the
> > benefits from using VEX encoding are truly significant.
> >
> > For AVX and some of the following features, it is assumed that the
> > run-time selection takes full support coverage (from silicon to the
> > kernel) into account.
> >
> > * Level C
> >
> > AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.
> >
> > This is close to what glibc currently calls "haswell".
> >
> > * Level D
> >
> > AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
> > level C.
> >
> > This is the AVX-512 level implemented by Xeon Scalable Processors, not
> > the Xeon Phi variant.
> >
> >
> > glibc (or an alternative loader implementation) would search for
> > libraries starting at level D, going back to level A, and finally the
> > baseline implementation in the default library location.
> >
> > I expect that some distributions will also use these levels to set a
> > baseline for the entire distribution (i.e., everything would be built to
> > level A or maybe even level C), and these libraries would then be
> > installed in the default location.
> >
> > I'll be glad if I can get any feedback on this proposal. I plan to turn
> > it into a merge request for the x86-64 psABI document eventually.
> >
>
> Looks good. I like it.

Likewise. Btw, did you check that VIA family chips slot into Level A
at least?

Where do AMD bdverN slot in?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”, what features do they support?

Indeed I didn't get the -100, -101 part. On the GCC side I'd have
suggested -march=generic-{A,B,C,D} implying the respective
-mtune.

Do the patches end up annotating ELF binaries with the architecture
level and does ld.so check that info?

For example IIRC there's a penalty to switch between VEX and
not VEX encoded instructions so even on AVX capable hardware
it might be profitable to use non-AVX libraries if the program is
using only architecture level A?

On that side, does architecture level B+ suggest using VEX encoding
everywhere? It would be indeed nice to have the architecture levels
documented in the psABI.

> 2. I have a library with AVX2 and FMA, which directory should it go?

Eventually GCC/gas can annotate objects with the lowest architecture
level that is applicable?

Thanks for doing this,
Richard.

> Can we pass such info to ld.so and ld.so prints out the best directory
> name?
>
> --
> H.J.
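
A minimal sketch of the level selection described in the proposal above, written in Python purely for illustration. The feature lists are copied from the proposal. The names `LEVELS`, `supported_levels`, `search_order` and the label "baseline" are assumptions made for this example and do not correspond to anything in glibc or GCC.

```python
# Feature lists are copied from the quoted proposal; the letters A-D are the
# proposal's placeholders. Function names and the "baseline" label are
# hypothetical, and this is not how glibc implements the search.
LEVELS = [
    ("A", {"CMPXCHG16B", "LAHF/SAHF", "POPCNT", "SSE3", "SSE4.1", "SSE4.2", "SSSE3"}),
    ("B", {"AVX"}),
    ("C", {"AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT", "MOVBE"}),
    ("D", {"AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ", "AVX512VL"}),
]


def supported_levels(cpu_features):
    """Return the levels a CPU satisfies, lowest first.

    Each level includes everything below it, so the walk stops at the
    first level with a missing feature.
    """
    available = set(cpu_features)
    supported = []
    for name, added in LEVELS:
        if not added <= available:
            break
        supported.append(name)
    return supported


def search_order(cpu_features):
    """Highest supported level first, then the default library location."""
    return supported_levels(cpu_features)[::-1] + ["baseline"]
```

Because the levels are cumulative, a CPU that reports AVX2 but not AVX would still stop at level A in this sketch; a real loader would also have to account for kernel and hypervisor support, as the proposal notes.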
Florian Weimer via llvm-dev
2020-Jul-13 07:40 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Richard Biener:

>> Looks good. I like it.
>
> Likewise. Btw, did you check that VIA family chips slot into Level A
> at least?

Those seem to lack SSE4.2, so they land in the baseline.

> Where do AMD bdverN slot in?

bdver1 to bdver3 (as defined by GCC) should land in Level B (so Level A
if that is dropped). bdver4 and znver1 (and later) should land in
Level C.

>> My only concerns are
>>
>> 1. Names like “x86-100”, “x86-101”, what features do they support?
>
> Indeed I didn't get the -100, -101 part. On the GCC side I'd have
> suggested -march=generic-{A,B,C,D} implying the respective
> -mtune.

With literal A, B, C, D, or are they just placeholders? If not literal
levels, then what should we use there?

I like the simplicity of numbers. I used letters in the proposal to
avoid confusion if we alter the proposal by dropping levels, shifting
the meaning of those that come later. I expect to switch back to
numbers again for the final version.

> Do the patches end up annotating ELF binaries with the architecture
> level and does ld.so check that info?

This is a separate feature that H.J. has been working on.

> For example IIRC there's a penalty to switch between VEX and
> not VEX encoded instructions so even on AVX capable hardware
> it might be profitable to use non-AVX libraries if the program is
> using only architecture level A?

But this is impossible to know in general. It may also be possible that
the library contains an inner loop that can be nicely vectorized with
AVX instructions, but not with SSE4.2 instructions and earlier. Then
preferring the non-AVX version would be a mistake.

Regarding the transition penalty, I believe this is mostly addressed by
those VZEROUPPER instructions? I've already explained why I think those
aren't a viable optimization target, given the current calling
convention.

My glibc patches already provide a way to mask subdirectories which
would otherwise be selected, so manual optimization is still possible.

> On that side, does architecture level B+ suggest using VEX encoding
> everywhere? It would be indeed nice to have the architecture levels
> documented in the psABI.

I think this falls under optimization, and I really did not want to
discuss that. If there is a plan to change/amend the calling convention
and some of the levels should prefer to that, it's a different matter,
of course. (glibc can only give you four callee-saved 256-bit wide
registers easily, though, more would need close cooperation with GCC.)

The new glibc-hwcaps scheme in glibc scales a bit better than the old
one, so we do not have to settle this immediately and could add
additional subdirectories for objects that follow new calling convention
requirements.

>> 2. I have a library with AVX2 and FMA, which directory should it go?
>
> Eventually GCC/gas can annotate objects with the lowest architecture
> level that is applicable?

H.J. has patches for ELF program properties. I think
GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information. This
proposal and the glibc patches are independent of that.

If that function ever gets deployed, I plan to add those notes to
ld.so.cache, so that ld.so can select shared objects based on them (or
any allocated ELF note, really). Efficient LD_LIBRARY_PATH support is
not possible, I think, so those designated glibc-hwcaps subdirectories
still have a place.

Thanks,
Florian
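
One way to read H.J.'s second question (a library that needs AVX2 and FMA) is to place the library at the lowest level whose cumulative feature set covers everything it requires. The thread does not settle this, so the Python sketch below is only an illustration of that reading. `lowest_level` is a hypothetical helper, not a GCC, gas, or glibc interface, and features that the proposal does not list (including baseline ones, which the thread never enumerates) are reported as unknown.

```python
# Placeholder levels and feature lists as given in the proposal.
# `lowest_level` is a hypothetical helper for illustration only.
LEVELS = [
    ("A", {"CMPXCHG16B", "LAHF/SAHF", "POPCNT", "SSE3", "SSE4.1", "SSE4.2", "SSSE3"}),
    ("B", {"AVX"}),
    ("C", {"AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT", "MOVBE"}),
    ("D", {"AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ", "AVX512VL"}),
]


def lowest_level(required_features):
    """Return the lowest level whose cumulative features cover the input.

    Returns "baseline" when nothing beyond the baseline is required, and
    None when a required feature is not named in any level of the proposal.
    """
    required = set(required_features)
    if not required:
        return "baseline"
    covered = set()
    for name, added in LEVELS:
        covered |= added  # each level includes everything below it
        if required <= covered:
            return name
    return None
```

Under these assumptions, `lowest_level({"AVX2", "FMA"})` returns "C", which matches the level where the proposal introduces both features.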
Jan Beulich via llvm-dev
2020-Jul-13 07:47 UTC
[llvm-dev] New x86-64 micro-architecture levels
On 13.07.2020 09:40, Florian Weimer wrote:
> * Richard Biener:
>>> 2. I have a library with AVX2 and FMA, which directory should it go?
>>
>> Eventually GCC/gas can annotate objects with the lowest architecture
>> level that is applicable?
>
> H.J. has patches for ELF program properties. I think
> GNU_PROPERTY_X86_ISA_1_NEEDED would convey this information. This
> proposal and the glibc patches are independent of that.

From (partly just halfway) recent discussions with H.J. I gained
the understanding that the piece we're aiming at getting to work
properly is the recording of GNU_PROPERTY_X86_FEATURE_2_*, not so
much GNU_PROPERTY_X86_ISA_1_*. If the ISA one is to be used as a
basis here, a lot of new flags will need adding (and properly
setting) first, I think.

Jan
Richard Biener via llvm-dev
2020-Jul-13 08:57 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Mon, Jul 13, 2020 at 9:40 AM Florian Weimer <fweimer at redhat.com> wrote:
>
> * Richard Biener:
>
> >> Looks good. I like it.
> >
> > Likewise. Btw, did you check that VIA family chips slot into Level A
> > at least?
>
> Those seem to lack SSE4.2, so they land in the baseline.
>
> > Where do AMD bdverN slot in?
>
> bdver1 to bdver3 (as defined by GCC) should land in Level B (so Level A
> if that is dropped). bdver4 and znver1 (and later) should land in
> Level C.
>
> >> My only concerns are
> >>
> >> 1. Names like “x86-100”, “x86-101”, what features do they support?
> >
> > Indeed I didn't get the -100, -101 part. On the GCC side I'd have
> > suggested -march=generic-{A,B,C,D} implying the respective
> > -mtune.
>
> With literal A, B, C, D, or are they just placeholders? If not literal
> levels, then what should we use there?
>
> I like the simplicity of numbers. I used letters in the proposal to
> avoid confusion if we alter the proposal by dropping levels, shifting
> the meaning of those that come later. I expect to switch back to
> numbers again for the final version.

They are indeed placeholders, though I somehow prefer letters to numbers.
But this is really bike-shedding territory. Good documentation on the
tools side will be more important, as well as consistent spelling between
tool sets, possibly driven by a good choice from within the psABI document.

Richard.