Florian Weimer via llvm-dev
2020-Jul-10 17:30 UTC
[llvm-dev] New x86-64 micro-architecture levels
Most Linux distributions still compile against the original x86-64 baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel EM64T compatibility).

There has been an attempt to use the existing AT_PLATFORM-based loading mechanism in the glibc dynamic linker to enable a selection of optimized libraries. But the general selection mechanism in glibc is problematic:

  hwcaps subdirectory selection in the dynamic loader
  <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>

We also have the problem that the glibc version of "haswell" is distinct from GCC's -march=haswell (and presumably other compilers):

  Definition of "haswell" platform is inconsistent with GCC
  <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>

And that the selection criteria are not what people expect:

  Epyc and other current AMD CPUs do not select the "haswell" platform subdirectory
  <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>

Since the hwcaps-based selection does not work well regardless of architecture (even in cases where the kernel provides glibc with data), I worked on a new mechanism that does not have the problems associated with the old mechanism:

  [PATCH 00/30] RFC: elf: glibc-hwcaps support
  <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>

(Don't be concerned that these patches have not been reviewed; we are busy preparing the glibc 2.32 release, and these changes do not alter the glibc ABI itself, so they do not have immediate priority. I'm fairly confident that a version of these changes will make it into glibc 2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat Enterprise Linux 8.4. Debian as well, but I have never done anything like it there, so I don't know if the patches will be accepted.)

Out of the box, this should work fairly well for IBM POWER and Z, where there is a clear progression of silicon versions (at least on paper; virtualization may blur the picture somewhat).

However, for x86, we do not have such a clear progression of micro-architecture versions. This is not just a result of the AMD/Intel competition, but also due to ongoing product differentiation within one chip vendor. I think we need these levels broadly for the following reasons:

* Selecting on individual CPU features (similar to the old hwcaps mechanism) in glibc has scalability issues, particularly for LD_LIBRARY_PATH processing.

* Developers need guidance about useful targets for optimization. I think there is value in limiting the choices, in the sense that “if you are able to test three builds in total, these are the things you should build”.

* glibc and the compilers should align in their definition of the levels, so that developers can use an -march= option to build for a particular level that is recognized by glibc. This is why I think the description of the levels should go into the psABI supplement.

* A preference order for these levels avoids falling back to the K8 baseline if the platform progresses to a new version due to glibc/kernel/hypervisor/hardware upgrades.

I'm including a proposal for the levels below. I use single letters for them, but I expect that the concrete implementation of this proposal will use names like “x86-100”, “x86-101”, like in the glibc patch referenced above. (But we can discuss other approaches.)

I looked at various machines in the Red Hat labs and talked to Intel and AMD engineers about this, but this concrete proposal is based on my own analysis of the situation.
I excluded CPU features related to cryptography and cache management, including hardware transactional memory, and CPU timing. I assume that we will see some of these features being disabled by the firmware or the kernel over time. That would eliminate entire levels from selection, which is not desirable. For cryptographic code, I expect that localized selection of an optimized implementation works because such code tends to be isolated blocks, running for dozens of cycles each time, not something that gets scattered all over the place by the compiler.

We previously discussed not emitting VZEROUPPER at later levels, but I don't think this is beneficial because the ABI does not have callee-saved vector registers, so it can only be useful with local functions (or whatever LTO considers local), where there is no ABI impact anyway.

I did not include FSGSBASE because the FS base is already available at %fs:0. Changing the FS base in userspace breaks too much, so the main benefit is the tighter encoding of rdfsbase, which seems very slim.

Not covered in this are tuning decisions. I think we can benefit from some variance in this area between implementations; it should not affect correctness. 32-bit support is also a separate matter.

* Level A

  CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3

  This is one step above the K8 baseline and corresponds to a mainline CPU model ca. 2008 to 2011. It is also implemented by recent-ish generations of Intel Atom server CPUs (although I haven't tested the latest version). A 32-bit variant would have to list many additional CPU features here.

* Level B

  AVX, plus everything in level A.

  This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.

  For AVX and some of the following features, it is assumed that the run-time selection takes full support coverage (from silicon to the kernel) into account.

* Level C

  AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.

  This is close to what glibc currently calls "haswell".

* Level D

  AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in level C.

  This is the AVX-512 level implemented by Xeon Scalable Processors, not the Xeon Phi variant.

glibc (or an alternative loader implementation) would search for libraries starting at level D, going back to level A, and finally the baseline implementation in the default library location.

I expect that some distributions will also use these levels to set a baseline for the entire distribution (i.e., everything would be built to level A or maybe even level C), and these libraries would then be installed in the default location.

I'll be glad if I can get any feedback on this proposal. I plan to turn it into a merge request for the x86-64 psABI document eventually.

Thanks,
Florian
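[As an illustration of the run-time selection described in the proposal, and of the requirement that it take "full support coverage (from silicon to the kernel)" into account, here is a minimal, hypothetical C sketch of a Level C check using raw CPUID plus XGETBV. The bit positions follow the Intel/AMD manuals; the helper name and structure are invented for illustration, and this is not glibc's actual implementation, which also handles feature masking and other corner cases. It assumes a GCC or Clang recent enough to provide __get_cpuid_count in <cpuid.h>.]

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if the CPU and the kernel provide everything in Level C
   (AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus the Level A/B
   features checked below).  */
static bool
level_c_supported (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* CPUID leaf 1, ECX feature bits.  */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return false;
  const unsigned int leaf1_ecx =
      (1u << 0)    /* SSE3 */
    | (1u << 9)    /* SSSE3 */
    | (1u << 12)   /* FMA */
    | (1u << 13)   /* CMPXCHG16B */
    | (1u << 19)   /* SSE4.1 */
    | (1u << 20)   /* SSE4.2 */
    | (1u << 22)   /* MOVBE */
    | (1u << 23)   /* POPCNT */
    | (1u << 27)   /* OSXSAVE: XGETBV is usable */
    | (1u << 28)   /* AVX */
    | (1u << 29);  /* F16C */
  if ((ecx & leaf1_ecx) != leaf1_ecx)
    return false;

  /* CPUID leaf 7, subleaf 0, EBX feature bits.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  const unsigned int leaf7_ebx =
      (1u << 3)    /* BMI1 */
    | (1u << 5)    /* AVX2 */
    | (1u << 8);   /* BMI2 */
  if ((ebx & leaf7_ebx) != leaf7_ebx)
    return false;

  /* CPUID leaf 0x80000001, ECX: LAHF/SAHF in 64-bit mode, LZCNT (ABM).  */
  if (!__get_cpuid (0x80000001, &eax, &ebx, &ecx, &edx))
    return false;
  const unsigned int ext_ecx = (1u << 0) | (1u << 5);
  if ((ecx & ext_ecx) != ext_ecx)
    return false;

  /* XGETBV: the kernel must have enabled XMM and YMM state saving,
     otherwise AVX/AVX2 instructions fault even though CPUID reports them.  */
  unsigned int xcr0_lo, xcr0_hi;
  __asm__ ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
  return (xcr0_lo & 0x6) == 0x6;
}

int
main (void)
{
  puts (level_c_supported () ? "Level C available" : "Level C not available");
  return 0;
}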
Joseph Myers via llvm-dev
2020-Jul-10 19:14 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, 10 Jul 2020, Florian Weimer via Gcc wrote:

> * Level A
>
> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>
> This is one step above the K8 baseline and corresponds to a mainline CPU
> model ca. 2008 to 2011. It is also implemented by recent-ish
> generations of Intel Atom server CPUs (although I haven't tested the
> latest version). A 32-bit variant would have to list many additional
> CPU features here.

FWIW, this is also the limit of what can be run under QEMU emulation, as QEMU lacks support for AVX and newer instruction set features.

On the other hand, virtual machines seem liable to report something closer to the K8 baseline to the guest OS, missing the level A features, even when the underlying hardware supports everything in level B or level C.

-- 
Joseph S. Myers
joseph at codesourcery.com
H.J. Lu via llvm-dev
2020-Jul-10 21:42 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer at redhat.com> wrote:

> [full quote of the proposal above omitted]
>
> I'll be glad if I can get any feedback on this proposal. I plan to turn
> it into a merge request for the x86-64 psABI document eventually.

Looks good. I like it. My only concerns are:

1. Names like “x86-100”, “x86-101”: what features do they support?

2. I have a library with AVX2 and FMA; which directory should it go in?

Can we pass such info to ld.so and have ld.so print out the best directory name?

-- 
H.J.
Allan Sandfeld Jensen via llvm-dev
2020-Jul-11 07:40 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Friday, 10 July 2020 19:30:09 CEST Florian Weimer via Gcc wrote:

> glibc (or an alternative loader implementation) would search for
> libraries starting at level D, going back to level A, and finally the
> baseline implementation in the default library location.
>
> I expect that some distributions will also use these levels to set a
> baseline for the entire distribution (i.e., everything would be built to
> level A or maybe even level C), and these libraries would then be
> installed in the default location.
>
> I'll be glad if I can get any feedback on this proposal. I plan to turn
> it into a merge request for the x86-64 psABI document eventually.

Sounds good, though if I could dream, I would also love a partial replacement option, so that you could have a generic x86-64 binary that only had some AVX2-optimized replacement functions in a supplementary library.

Perhaps this could be implemented by marking the library as a partial replacement, so the dynamic linker would also load the base or lower libraries, except for functions already resolved.

You could also add a level E for the AVX-512 instructions in Ice Lake and above. The VBMI1/2 instructions would likely be useful for autovectorization in GCC.

'Allan
Richard Biener via llvm-dev
2020-Jul-13 06:23 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Fri, Jul 10, 2020 at 11:45 PM H.J. Lu via Gcc <gcc at gcc.gnu.org> wrote:

> On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer at redhat.com> wrote:
> > [full quote of the proposal above omitted]
> >
> > I'll be glad if I can get any feedback on this proposal. I plan to turn
> > it into a merge request for the x86-64 psABI document eventually.
>
> Looks good. I like it.

Likewise.
Btw, did you check that VIA family chips slot into Level A at least? Where do AMD bdverN chips slot in?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”: what features do they support?

Indeed, I didn't get the -100, -101 part. On the GCC side I'd have suggested -march=generic-{A,B,C,D}, implying the respective -mtune.

Do the patches end up annotating ELF binaries with the architecture level, and does ld.so check that info? For example, IIRC there's a penalty for switching between VEX and non-VEX encoded instructions, so even on AVX-capable hardware it might be profitable to use non-AVX libraries if the program is using only architecture level A. On that note, does architecture level B+ suggest using VEX encoding everywhere?

It would indeed be nice to have the architecture levels documented in the psABI.

> 2. I have a library with AVX2 and FMA; which directory should it go in?

Eventually GCC/gas could annotate objects with the lowest architecture level that is applicable?

Thanks for doing this,
Richard.

> Can we pass such info to ld.so and have ld.so print out the best directory
> name?
>
> --
> H.J.
Florian Weimer via llvm-dev
2020-Jul-13 06:49 UTC
[llvm-dev] New x86-64 micro-architecture levels
* H. J. Lu:

> Looks good. I like it.

Thanks. What do you think about Level B? Should we keep it?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”: what features do they support?

I think we can add more diagnostic output to ld.so --help. My patch does not show individual CPU flags, but I agree this could be useful. (It's not needed for the legacy HWCAP subdirectories because, in general, those are named and defined by the kernel, not by individually named CPU feature flags.)

> 2. I have a library with AVX2 and FMA; which directory should it go in?
>
> Can we pass such info to ld.so and have ld.so print out the best directory
> name?

I think this would require generating matching GNU property notes (listing the CPU features required by the binary). Once we have that, we can add something to binutils or indeed ld.so to analyze them and print the recommended directory. But I think this is something that could come later.

We can also write a GCC header which looks at macros such as __AVX2__ and prints a #warning with the recommended directory name. Checking for excess flags will be tricky in this context, though, and if we miss something, a wrong recommendation will be the result.

Thanks,
Florian
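[For illustration, a hypothetical sketch of such a header follows. The subdirectory names are placeholders reusing the level letters from this thread (the real names were still under discussion), the macro checks mirror the level definitions above, and, as noted, this cannot detect excess flags such as an extra -mavx512vbmi on top of Level D.]

#if defined __AVX512F__ && defined __AVX512BW__ && defined __AVX512CD__ \
    && defined __AVX512DQ__ && defined __AVX512VL__
# warning "recommended subdirectory: glibc-hwcaps/<level D>"
#elif defined __AVX2__ && defined __BMI__ && defined __BMI2__ \
    && defined __F16C__ && defined __FMA__ && defined __LZCNT__ \
    && defined __MOVBE__
# warning "recommended subdirectory: glibc-hwcaps/<level C>"
#elif defined __AVX__
# warning "recommended subdirectory: glibc-hwcaps/<level B>"
#elif defined __SSE4_2__ && defined __POPCNT__
# warning "recommended subdirectory: glibc-hwcaps/<level A>"
#else
# warning "recommended subdirectory: default library directory"
#endif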
Florian Weimer via llvm-dev
2020-Jul-13 06:58 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Allan Sandfeld Jensen:

> On Friday, 10 July 2020 19:30:09 CEST Florian Weimer via Gcc wrote:
>> glibc (or an alternative loader implementation) would search for
>> libraries starting at level D, going back to level A, and finally the
>> baseline implementation in the default library location.
>>
>> I expect that some distributions will also use these levels to set a
>> baseline for the entire distribution (i.e., everything would be built to
>> level A or maybe even level C), and these libraries would then be
>> installed in the default location.
>>
>> I'll be glad if I can get any feedback on this proposal. I plan to turn
>> it into a merge request for the x86-64 psABI document eventually.
>
> Sounds good, though if I could dream, I would also love a partial
> replacement option, so that you could have a generic x86-64 binary
> that only had some AVX2-optimized replacement functions in a
> supplementary library.
>
> Perhaps this could be implemented by marking the library as a partial
> replacement, so the dynamic linker would also load the base or lower
> libraries, except for functions already resolved.

I think you can do something like it today, at least from the glibc dynamic loader perspective. Programs link against the soname of the optimized shared object (which can be empty), and that shared object depends on the object with the fallback implementation. A special link-only shared object containing just the ABI under the front soname (that of the optimized library) would be used via a linker object, so that it is not possible to accidentally link against the wrong soname.

For non-versioned symbols, this setup has worked since forever. For versioned symbols, delegating from the optimized to the unoptimized library needs at least glibc 2.30, with commit f0b2132b35248c1f4a80 ("ld.so: Support moving versioned symbols between sonames [BZ #24741]"), although some of us have backported this commit into earlier releases.

Where this falls flat is support for LTO and -fno-semantic-interposition. Some care is needed to make precisely the right set of symbols interposable.

But to be honest, I'm not sure if this entire mechanism is a big improvement over function multi-versioning.

Thanks,
Florian
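[For comparison, the function multi-versioning route mentioned here can look like the following with GCC's target_clones attribute, which emits several clones of a function plus an IFUNC resolver that picks one at load time. This is a minimal sketch; the function and its body are made up for illustration.]

#include <stddef.h>
#include <stdio.h>

/* GCC builds a "default" clone and an AVX2 clone of this function and
   selects between them via an IFUNC resolver when the program starts.  */
__attribute__ ((target_clones ("avx2", "default")))
double
dot (const double *a, const double *b, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i)
    sum += a[i] * b[i];      /* the AVX2 clone can vectorize this loop */
  return sum;
}

int
main (void)
{
  double a[4] = { 1, 2, 3, 4 }, b[4] = { 4, 3, 2, 1 };
  printf ("%f\n", dot (a, b, 4));
  return 0;
}

[Unlike per-level library directories, this keeps everything in one binary, but each hot function has to be annotated and compiled per target.]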
Florian Weimer via llvm-dev
2020-Jul-13 07:55 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Joseph Myers:

> On Fri, 10 Jul 2020, Florian Weimer via Gcc wrote:
>
>> * Level A
>>
>> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>>
>> This is one step above the K8 baseline and corresponds to a mainline CPU
>> model ca. 2008 to 2011. It is also implemented by recent-ish
>> generations of Intel Atom server CPUs (although I haven't tested the
>> latest version). A 32-bit variant would have to list many additional
>> CPU features here.
>
> FWIW, this is also the limit of what can be run under QEMU emulation, as
> QEMU lacks support for AVX and newer instruction set features.

Oh, I had forgotten about that. I should have Cc:ed the QEMU folks as well. We'll need to make sure that we have matching CPU models in QEMU/libvirt, even for the levels that do not have TCG support.

valgrind is another consumer, but in my tests, it was mostly okay with AVX2 code (but that was without auto-vectorization). AVX-512 is a different matter, but that is also much further out.

> On the other hand, virtual machines seem liable to report something closer
> to the K8 baseline to the guest OS, missing the level A features, even
> when the underlying hardware supports everything in level B or level C.

They do this to support migration. I suspect that in many cases, those are just configuration errors. That's why I want at least one major distribution to switch to Level C as the baseline, to clean the pipes. Then even those distributions that depend on run-time selection for performance-critical code will benefit. 8-/

Thanks,
Florian
Mark Wielaard via llvm-dev
2020-Jul-15 14:38 UTC
[llvm-dev] New x86-64 micro-architecture levels
Hi Florian,

I understand you want to discuss the x86_64 micro-architecture levels only in this thread, but it would be nice to have a similar discussion for other architectures.

One thing that wasn't clear to me from this proposal is how the glibc dynamic loader checks for the CPU feature flags. This is important for valgrind, since it can communicate those through different means: cpuid interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM at the moment), and of course we can generate SIGILL for unsupported instructions. We currently don't intercept /proc/cpuinfo (but could).

I think it is important to be precise here, because in the past this has sometimes caused confusion, for example about how to check correctly for avx, lzcnt, or fma[4] support.

Thanks,

Mark

P.S. I don't particularly like the numbered names, but well, bike-shed...
H.J. Lu via llvm-dev
2020-Jul-15 14:45 UTC
[llvm-dev] New x86-64 micro-architecture levels
On Wed, Jul 15, 2020 at 7:38 AM Mark Wielaard <mark at klomp.org> wrote:
>
> Hi Florian,
>
> I understand you want to discuss the x86_64 micro-architecture levels
> only in this thread, but it would be nice to have a similar discussion
> for other architectures.
>
> One thing that wasn't clear to me from this proposal is how the glibc
> dynamic loader checks for the CPU feature flags. This is important for
> valgrind, since it can communicate those through different means: cpuid
> interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
> at the moment), and of course we can generate SIGILL for unsupported
> instructions. We currently don't intercept /proc/cpuinfo (but could).

In a library, we can use <sys/platform/x86.h>:

https://sourceware.org/pipermail/libc-alpha/2020-June/115546.html

In GCC, we can use __builtin_cpu_supports. <sys/platform/x86.h> supports all features, and __builtin_cpu_supports in GCC 11 supports all features which GCC has codegen for.

> I think it is important to be precise here, because in the past this
> has sometimes caused confusion, for example about how to check correctly
> for avx, lzcnt, or fma[4] support.
>
> Thanks,
>
> Mark
>
> P.S. I don't particularly like the numbered names, but well, bike-shed...

-- 
H.J.
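[As a sketch of the __builtin_cpu_supports route, applied to the AVX2 + FMA example from earlier in the thread: the function names here are made up, and only the builtins and their "avx2"/"fma" arguments are real GCC interfaces; <sys/platform/x86.h> from the patch above offers comparable checks for code that cannot use the builtins.]

#include <stdio.h>

static void transform_avx2_fma (void) { puts ("using the AVX2+FMA code path"); }
static void transform_generic (void)  { puts ("using the generic code path"); }

/* Selected once at startup; defaults to the portable implementation.  */
static void (*transform) (void) = transform_generic;

__attribute__ ((constructor))
static void
select_transform (void)
{
  /* Make sure the CPU detection data is initialized; needed when this
     runs before other initialization code.  */
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    transform = transform_avx2_fma;
}

int
main (void)
{
  transform ();
  return 0;
}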
Florian Weimer via llvm-dev
2020-Jul-15 14:56 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Mark Wielaard:

> One thing that wasn't clear to me from this proposal is how the glibc
> dynamic loader checks for the CPU feature flags. This is important for
> valgrind, since it can communicate those through different means: cpuid
> interception, auxv AT_HWCAP/AT_HWCAP2 interception (but not AT_PLATFORM
> at the moment), and of course we can generate SIGILL for unsupported
> instructions. We currently don't intercept /proc/cpuinfo (but could).

glibc uses CPUID in combination with XGETBV. There is also a masking feature which I have not reviewed, but given that it only takes features away, I don't think it matters to valgrind.

Thanks,
Florian
Mallappa, Premachandra via llvm-dev
2020-Jul-21 16:05 UTC
[llvm-dev] New x86-64 micro-architecture levels
[AMD Public Use]

Hi Florian,

> I'm including a proposal for the levels below. I use single letters for them, but I expect that the concrete implementation of this proposal will use
> names like “x86-100”, “x86-101”, like in the glibc patch referenced above. (But we can discuss other approaches.)

Personally I am not a big fan of this, for two reasons:
1. it uses just "x86" in the name on x86_64 as well;
2. 100/101 are not very intuitive.

> * Level A
...
> * Level B
> This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.

Yes, agreed; the delta is too small and can be clubbed into A or C.

> * Level C
> * Level D

The others are in line with what we expect as a logical grouping. As you mentioned, it is not easy to tackle this.

We would also like to have dynamic loader support for "zen" / "zen2" as a variant of Level D that takes preference over Level D and may hold super-optimized libraries from AMD or other vendors. These libraries are expected to be optimized according to micro-architectural details, not just the ISA. Probably we can discuss this on the hwcaps thread.

-Prem
Florian Weimer via llvm-dev
2020-Jul-21 18:04 UTC
[llvm-dev] New x86-64 micro-architecture levels
* Premachandra Mallappa:

> [AMD Public Use]
>
> Hi Florian,
>
>> I'm including a proposal for the levels below. I use single letters for them, but I expect that the concrete implementation of this proposal will use
>> names like “x86-100”, “x86-101”, like in the glibc patch referenced above. (But we can discuss other approaches.)
>
> Personally I am not a big fan of this, for two reasons:
> 1. it uses just "x86" in the name on x86_64 as well

That's deliberate, so that we can use the same x86-* names for 32-bit library selection (once we define matching micro-architecture levels there). GCC has -m32 -march=x86-64 for K8 without 3DNow! (essentially the shared x86-64/EM64T baseline), but I find this a bit confusing.

> 2. 100/101 are not very intuitive

Any suggestions? The advantage is that these numbers show a strong preference ordering. They do not make false suggestions about feature sets (if we named Level C "x86-avx2", it would still be wrong for glibc to load libraries found in that directory just because a system has AVX2 support, because the libraries might also need FMA, based on the Level C definition). On the GCC side, it avoids confusion between -mavx2 and -march=x86-avx2.

If numbers are out, what should we use instead? x86-sse4, x86-avx2, x86-avx512? Would that work?

>> * Level A
> ...
>> * Level B
>> This step is so small that it probably can be dropped, unless the benefits from using VEX encoding are truly significant.
>
> Yes, agreed; the delta is too small and can be clubbed into A or C.

Let's merge Level B into Level C then?

>> * Level C
>> * Level D
>
> The others are in line with what we expect as a logical grouping.

Thanks.

> We would also like to have dynamic loader support for "zen" / "zen2" as a
> variant of Level D that takes preference over Level D and may hold
> super-optimized libraries from AMD or other vendors.

*That* shouldn't be too hard to implement if we can nail down the selection criteria. Let's call this Zen-specific Level C x86-zen-avx2 for the sake of exposition.

What's going to be difficult is the choice for a hypothetical Zen successor that's compatible feature-flag-wise with Level D. Basically, there are two choices here:

* Level D wins because it's the more powerful ISA.

* x86-zen-avx2 wins because it has the Zen architecture optimizations.

There's also a related issue with Level C vs. x86-zen-avx2, depending on how we implement the Zen detection for AMD family numbers in the glibc dynamic linker. What I mean by this: glibc detects that this is a Level C-capable Zen-type CPU, but it's not one of the family/model numbers that were hard-coded into the glibc sources. What should we do then? Should we still prefer the x86-zen-avx2 library over the Level C library?

> These libraries are expected to be optimized according to
> micro-architectural details, not just the ISA.

If it's supposed to be generally useful, we really need to document the selection criteria for the subdirectory and make sure that it matches what these libraries actually require at run time in terms of ISA. I want to avoid two things here specifically: a hardware upgrade results in crashes because we incorrectly load an incompatible library; and, if possible, a hardware upgrade (or kernel/hypervisor upgrade that exposes more of the actual hardware) causes us to drop optimizations, so that users experience a performance regression.

With the levels I proposed, these aspects are covered. But if we start to create vendor-specific forks in the feature progression, things get complicated.
Do you think we need to figure this out in this iteration? If yes, then I really need a semi-formal description of the selection criteria for this x86-zen-avx2 directory, so that I can pass it along with my psABI proposal.

Thanks,
Florian