James Y Knight via llvm-dev
2020-Aug-30  23:10 UTC
[llvm-dev] Proposal to remove MMX support.
I recently diagnosed a bug in someone else's software, which turned out to be due to incorrect MMX intrinsics usage: if you use any of the x86 intrinsics that accept or return __m64 values, then you, the *programmer*, are required to call _mm_empty() before using any x87 floating point instructions or leaving the function. I was aware that this was required at the assembly level, but not that the compiler forced users to deal with this when using intrinsics.

This is a real nasty footgun -- if you get this wrong, your program doesn't crash -- no, that would be too easy! Instead, every x87 instruction will simply result in a NaN value.

Even more unfortunately, it is currently impossible to correctly use _mm_empty() to resolve the problem, because the compiler has no restrictions against placing x87 FPU operations between an MMX instruction and the EMMS instruction.

Of course, I didn't discover any of this -- it was already well-known...just not to me. But let's actually fix it.

*Existing bugs*:
llvm.org/PR35982 <https://bugs.llvm.org/show_bug.cgi?id=35982> -- POSTRAScheduler disarrange emms and mmx instruction
llvm.org/PR41029 <https://bugs.llvm.org/show_bug.cgi?id=41029> -- The __m64 not passed according to i386 ABI
llvm.org/PR42319 <https://bugs.llvm.org/show_bug.cgi?id=42319> -- Add pass to insert EMMS/FEMMS instructions to separate MMX and X87 states
llvm.org/PR42320 <https://bugs.llvm.org/show_bug.cgi?id=42320> -- Implement MMX intrinsics with SSE equivalents

*Proposal*
We should re-implement all the currently-MMX intrinsics in Clang's *mmintrin.h headers by using the existing SSE/SSE2 compiler builtins, on both x86-32 and x86-64, and then *delete the MMX implementation of these intrinsics*. We would thus stop supporting the use of these intrinsics unless SSE2 is also enabled. I've created a preliminary patch for these header changes, https://reviews.llvm.org/D86855.

Sometime later, we should then remove the MMX intrinsics in LLVM IR. (Only the intrinsics -- the machine-instruction and register definitions for MMX should be kept indefinitely for use by assembly code.) That raises the question of bitcode compat. Maybe we do something to prevent new use of the intrinsics, but keep the implementations around for bitcode compatibility for a while longer?

We might also consider defaulting to -mno-mmx for new compilations in x86-64, which would have the additional effect of disabling the "y" constraint in inline-asm. (MMX instructions could still exist in the binary, but they'd need to be entirely contained within an inline-asm blob.)

Unfortunately, given the ABI requirement in x86-32 to use MMX registers for 8-byte-vector arguments and returns -- which we've been violating for 7 years -- we probably cannot simply use -mno-mmx by default on x86-32. Unless, of course, we decide that we might as well just continue violating the ABI indefinitely. (Why bother to be correct, after the better part of a decade being incorrect...)

*Impact*
- No more %mm* register usage on x86-64, other than via inline-asm. No more %mm* register usage on x86-32, other than inline-asm and when calling a function that takes/returns 8-byte vectors (assuming we fix the ABI-compliance issue).
- Since the default compiler flags include SSE2, most code will switch to using SSE2 instructions instead of MMX instructions when using intrinsics, and continue to compile fine. It'll also likely be faster, since MMX instructions are legacy, and not optimized in CPUs anymore.
- Code explicitly disabling SSE2 (e.g. -mno-sse2 or -march=pentium2) will stop compiling if it requires MMX intrinsics.
- Code using the intrinsics will run faster, especially on x86-64, where the vectors are passed around in xmm registers and are currently copied to mm registers just to run a legacy instruction. But even without that, the MMX instructions also just have less throughput than the SSE2 variants on modern CPUs.

*Alternatives*
We could keep both implementations of the functions in mmintrin.h, in order to preserve the ability to use the intrinsics when compiling for a CPU without SSE2.

However, this doesn't seem worthwhile to me -- we're talking about dropping the ability to generate vectorized code using compiler intrinsics for Intel Pentium MMX, Pentium II, and Pentium III (released 1997-1999), as well as AMD K6 and K7 series chips of around the same timeframe.

We could also keep the clang headers mostly unmodified, and make the LLVM IR builtins themselves expand to SSE2 instructions. I believe GCC has effectively chosen this option. That seems less desirable; it'll be more complex to implement at that level, versus in the headers, and doesn't leave a path towards eliminating the builtins in the future.
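For illustration only, here is a minimal sketch of the idea behind the proposal: an operation on four packed 16-bit lanes in a 64-bit value (what the MMX intrinsic _mm_add_pi16 computes) can be carried out with SSE2 intrinsics on the low half of an xmm register, so no %mm register is touched and no EMMS is needed. The helper name add_pi16_via_sse2 and the use of uint64_t in place of __m64 are assumptions made for this example; they are not taken from D86855, and the actual header change may look quite different.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not from the patch): add four packed 16-bit lanes
   held in 64-bit values, using only SSE2 instructions. */
static uint64_t add_pi16_via_sse2(uint64_t a, uint64_t b)
{
    /* Load each 64-bit value into the low half of an xmm register;
       the upper 64 bits are zeroed. */
    __m128i va = _mm_loadl_epi64((const __m128i *)&a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)&b);

    /* Lane-wise 16-bit add with wraparound, as the MMX PADDW would do. */
    __m128i sum = _mm_add_epi16(va, vb);

    /* Store only the low 64 bits back out. */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, sum);
    return out;
}
```

Each 16-bit lane wraps independently, so adding 1 to a lane holding 0xFFFF gives 0 in that lane without carrying into its neighbour.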
Eli Friedman via llvm-dev
2020-Aug-31  19:02 UTC
[llvm-dev] Proposal to remove MMX support.
Broadly speaking, I see two problems with implicitly enabling MMX emulation on a target that has SSE2:

1. The interaction with inline asm. Inline asm can still have MMX operands/results/clobbers, and can still put the processor in MMX mode. If code is mixing MMX intrinsics and inline asm, there could be a significant penalty to moving values across register files. And it’s not clear what we want to do with _mm_empty(): under full emulation, it should be a no-op, but if there’s MMX asm, we need to actually clear the register file.

2. The calling convention problem; your description covers this.

If we add an explicit opt-in, I don’t see any problem with emulation (even on non-x86 targets, if someone implements that).

If we’re going to make changes that could break people’s software, I’d prefer it to break with a compile-time error: require the user to affirmatively state that there aren’t any interactions with assembly code. For the MMX intrinsics in particular, code that’s using intrinsics is likely to also be using inline asm, so the interaction is important.

In terms of whether it’s okay to assume people don’t need support for MMX intrinsics on targets without SSE2, that’s probably fine. Wikipedia says Intel stopped manufacturing PC chips without SSE2 around 2007. And supported versions of Windows require SSE2.

-Eli
On 8/31/20 12:02 PM, Eli Friedman via llvm-dev wrote:
> In terms of whether it’s okay to assume people don’t need support for
> MMX intrinsics on targets without SSE2, that’s probably fine. Wikipedia
> says Intel stopped manufacturing PC chips without SSE2 around 2007. And
> supported versions of Windows require SSE2.

Similarly, Fedora 29, released October 30, 2018, made SSE2 mandatory.
https://fedoraproject.org/wiki/Changes/Update_i686_architectural_baseline_to_include_SSE2

Fedora 28, which did not require SSE2, then reached EOL on May 28, 2019.
James Y Knight via llvm-dev
2020-Aug-31  20:30 UTC
[llvm-dev] Proposal to remove MMX support.
On Mon, Aug 31, 2020 at 3:02 PM Eli Friedman <efriedma at quicinc.com> wrote:

> Broadly speaking, I see two problems with implicitly enabling MMX
> emulation on a target that has SSE2:
>
> 1. The interaction with inline asm. Inline asm can still have MMX
> operands/results/clobbers, and can still put the processor in MMX mode. If
> code is mixing MMX intrinsics and inline asm, there could be a significant
> penalty to moving values across register files. And it’s not clear what we
> want to do with _mm_empty(): under full emulation, it should be a no-op,
> but if there’s MMX asm, we need to actually clear the register file.

Moving data between the register files in order to call an inline asm is not a correctness issue, however, just a potential performance issue. The compiler will insert movdq2q and movq2dq instructions (introduced in SSE2) as needed to copy the data. If this is slow on current CPUs, then your code will be slow...but if such code is being used in a performance-critical location now, it really shouldn't still be using MMX, so I don't think this is a serious issue.

For _mm_empty, I think the best thing to do is to follow what GCC did, and make it a no-op only if MMX is disabled, and have it continue to emit EMMS otherwise -- even though that is usually a waste.

> 2. The calling convention problem; your description covers this.

Well, I covered it...but I didn't come to an actual conclusion there. :)

> If we add an explicit opt-in, I don’t see any problem with emulation (even
> on non-x86 targets, if someone implements that).
>
> If we’re going to make changes that could break people’s software, I’d prefer
> it to break with a compile-time error: require the user to affirmatively state
> that there aren’t any interactions with assembly code. For the MMX
> intrinsics in particular, code that’s using intrinsics is likely to also be
> using inline asm, so the interaction is important.

It is a compile-time error to use MMX ("y" constraint) operands and outputs for inline-asm with MMX disabled. So if _mm_empty() only becomes a no-op when MMX is disabled, that should address the vast majority of the issue.

There is a case where it can go wrong, still: it is not an error to use MMX instructions inside asm, even with MMX disabled. Thus it is possible that an inline-asm statement which has no MMX inputs or outputs, but which does use MMX registers, and which depends on a subsequent call to _mm_empty() to reset the state, could be broken at runtime if compiled with -mno-mmx. I think this is unlikely to occur in the wild, but it's possible. It's also possible that a standalone asm function might use MMX, and depend on its caller to clear the state (even though it should be clearing the MMX state itself).
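To make the _mm_empty() requirement discussed in this thread concrete, here is a minimal sketch of the usage pattern the programmer is expected to follow today: do the __m64 work, extract the result, and call _mm_empty() before returning, so that later x87 code (long double arithmetic on x86, for instance) sees a clean FPU state. The helper names add_low_lanes and half_of_low_lane are hypothetical, chosen only for this example. As noted above, this pattern is what the intrinsics contract asks for, but with the scheduling bugs listed in the first message it is not guaranteed to be sufficient on every compiler.

```c
#include <mmintrin.h>   /* __m64 intrinsics and _mm_empty() */

/* Hypothetical helper: add two pairs of 16-bit lanes with __m64 intrinsics
   and return the low 32 bits of the packed result. */
static int add_low_lanes(short a1, short a0, short b1, short b0)
{
    __m64 va = _mm_set_pi16(0, 0, a1, a0);   /* a0 is the lowest lane */
    __m64 vb = _mm_set_pi16(0, 0, b1, b0);
    int packed = _mm_cvtsi64_si32(_mm_add_pi16(va, vb));

    /* Required by the intrinsics contract: leave MMX state before any
       x87 instruction runs or this function returns. */
    _mm_empty();
    return packed;
}

/* Hypothetical caller that does long double arithmetic (typically x87 on
   x86) after the MMX work has finished. */
static long double half_of_low_lane(short a0, short b0)
{
    int packed = add_low_lanes(0, a0, 0, b0);
    return (long double)(packed & 0xFFFF) / 2.0L;
}
```

If the header implements these intrinsics with SSE2, the call to _mm_empty() is at worst wasted work; if real MMX instructions are emitted, omitting it is what leads to the NaN results described in the first message.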