Ristow, Warren via llvm-dev
2017-Sep-29 00:56 UTC
[llvm-dev] Trouble when suppressing a portion of fast-math-transformations
Hi all,

In a mailing-list post last November:

    http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

I raised some concerns that having the IR-level fast-math-flag 'fast' act as an "umbrella" that implicitly turns on all the lower-level fast-math-flags causes some fundamental problems. Those problems arise when a user wants to disable a portion of the fast-math behavior. For example, to enable all the fast-math transformations except for the reciprocal-math transformation, a user would expect a command like the following to work:

    clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

But that isn't what it's doing.

I believe this is a serious problem, but I also want to avoid over-stating the seriousness. To be explicit, the problems I'm describing here happen when '-ffast-math' is used with one or more of the underlying fast-math-related aspects _disabled_ (like the '-fno-reciprocal-math' example, above). Conversely, when '-ffast-math' is used "on its own", the situation is fine.

For terminology here, I'll refer to these underlying fast-math-related aspects (like reciprocal-math, associative-math, math-errno, and others) as "sub-fast-math" aspects.

I apologize for the length of this post. I'm putting the summary up front, so that anyone interested in fast-math issues can quickly get the big picture of the issues I'm describing here.

In Summary:

1. With the change of r297837, the driver now more cleanly handles '-ffast-math', and other sub-fast-math switches (like '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).

2. Prior to that change, the disabling of a sub-fast-math switch was often ineffective. So as an example, the following two commands often resulted in the same code-gen, even if there were fast-math-reciprocal-transformations that were done:

       clang++ -O2 -ffast-math -c foo.cpp
       clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

3. Since that change, the disabling of a sub-fast-math switch disables many more sub-fast-math transformations than just the one specified. So now, the following two commands often result in very similar (and sometimes identical) code-gen:

       clang++ -O2 -c foo.cpp
       clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

   That is, disabling a single sub-fast-math transformation in some (many?) cases now ends up disabling almost all the fast-math transformations. This causes a performance hit for people that have been doing this.

4. To fix this, I think that additional fast-math-flags are likely needed in the IR. Instead of the following set:

       'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

   something like this:

       'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

   would be more useful. Related to this, the current 'fast' flag which acts as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may not be needed. A discussion on this point was raised last November on the mailing list:

       http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

TL;DR

More details are in that thread from November, but the problem in its entirety involved both back-end LLVM issues and front-end Clang (driver) issues. The LLVM issues are related to the umbrella aspect of 'fast', along with other fast-math-flags implementation details (described below). The front-end aspects in Clang are related to the driver's handling of '-ffast-math' (which also had an "umbrella" aspect). The driver code has been refactored since that November post, fixing the umbrella aspect of the front-end. But I never got around to working on the related back-end issues (nor has anyone else), and the refactored front-end now results in the back-end issues manifesting differently, and arguably in a worse way (details on the "worse" aspect, below).
For reference, the refactored driver code was done in r297837:

    [Driver] Restructure handling of -ffast-math and similar options

To be clear, I'm not at all suggesting that the above change was incorrect. I think that refactoring of the driver code is the right thing to do. An aspect of this refactoring is that prior to it, when a user passed '-ffast-math' on the command-line, it was also passed to the cc1 process, even if a sub-fast-math component was disabled. With the refactoring, the driver only passes '-ffast-math' to cc1 when a specific set of sub-fast-math components are enabled.

More specifically, when a user specifies just '-ffast-math' on the command-line, the following 7 sub-fast-math switches:

    -fno-honor-infinities
    -fno-honor-nans
    -fno-math-errno
    -fassociative-math
    -freciprocal-math
    -fno-signed-zeros
    -fno-trapping-math

get passed to cc1 (this is true both with the old (pre r297837) and new (since r297837) compilers). Furthermore, the "umbrella" '-ffast-math' is also passed to cc1 in this case of the user specifying just '-ffast-math' on the command-line (again, in both the old and new compilers).

The difference related to this issue in the old/new behavior is that when a user turns on fast-math but disables one (or more) of the sub-fast-math switches, for example, as in:

    clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

then in the old mode '-ffast-math' was still passed to cc1 (acting as an umbrella, causing trouble), but in the new mode '-ffast-math' is no longer passed to cc1 in this case. (In both the old and new modes, '-freciprocal-math' is not passed to cc1 with this command-line, as you'd expect.)

What's happening is that in the old mode, it was the user passing '-ffast-math' on the command-line that resulted in passing the umbrella '-ffast-math' to cc1 (even if all 7 of the sub-fast-math switches were disabled by the user).
Whereas in the new mode, the '-ffast-math' switch is passed to cc1 iff all 7 of the underlying sub-fast-math switches are enabled.

I'd say that's an improvement in the handling of the switches, and also on the plus side, I think it makes dealing with the LLVM concerns I raised in November a little clearer, and so more manageable in some sense. But on the negative side, since the new behavior in LLVM is arguably worse, fixing the back-end issues is now a higher priority for my customers.

The behavior that is arguably worse is this: when a user enables fast-math, but attempts to disable one of the sub-fast-math aspects, the old behavior (pre r297837) was that the sub-fast-math aspect to be disabled generally (often?) remained enabled. The new behavior (since r297837) is that when disabling a sub-fast-math aspect, that aspect plus many more (possibly often the majority) of the fast-math transformations are disabled. So this results in a performance regression in these fast-math contexts when a sub-fast-math aspect is disabled, which is why it is a fairly high priority for us.

FTR, r297837 was made during llvm 5.0 development, so the new behavior has the effect of a performance regression in moving from 4.0 to 5.0. In describing things here, I'll compare llvm 4.0 with llvm 5.0 behavior. But more precisely, it's pre-r297837 with post-r297837 behavior.
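
To make the old/new difference in the driver concrete, here is a minimal C++ sketch of the rule described above. It is only a model of the behavior as stated in this post, not Clang's actual driver code; the struct and every function name in it are hypothetical.

```cpp
// Simplified model of the driver behavior described in this post.
// NOT Clang's real driver code; all names here are hypothetical.
struct SubFastMath {
    bool no_honor_infinities;
    bool no_honor_nans;
    bool no_math_errno;
    bool associative_math;
    bool reciprocal_math;
    bool no_signed_zeros;
    bool no_trapping_math;
};

// State after a plain '-ffast-math': all seven sub-switches are enabled.
constexpr SubFastMath all_enabled() {
    return {true, true, true, true, true, true, true};
}

// State after '-ffast-math -fno-reciprocal-math'.
constexpr SubFastMath without_reciprocal() {
    SubFastMath s = all_enabled();
    s.reciprocal_math = false;
    return s;
}

constexpr bool all_seven_on(const SubFastMath &s) {
    return s.no_honor_infinities && s.no_honor_nans && s.no_math_errno &&
           s.associative_math && s.reciprocal_math && s.no_signed_zeros &&
           s.no_trapping_math;
}

// Old mode (pre r297837): the umbrella '-ffast-math' reaches cc1 whenever
// the user wrote it, regardless of which sub-switches were disabled.
constexpr bool old_passes_umbrella(bool user_wrote_ffast_math,
                                   const SubFastMath &) {
    return user_wrote_ffast_math;
}

// New mode (since r297837): the umbrella reaches cc1 iff all seven
// sub-switches are enabled.
constexpr bool new_passes_umbrella(const SubFastMath &s) {
    return all_seven_on(s);
}
```

In this model, old_passes_umbrella(true, without_reciprocal()) is true while new_passes_umbrella(without_reciprocal()) is false, which is exactly the case where the two compilers diverge.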
Here is a tiny example, to illustrate it concretely:

    $ cat assoc.cpp
    //////////// "assoc.cpp" ////////////
    float foo(float a, float x)
    {
      return ((a + x) - x); // fastmath reassociation eliminates the arithmetic
    }
    /////////////////////////////////////
    $

When -ffast-math is specified, the reassociation enabled by it allows us to simply return the first argument (and that reassociation does happen with '-ffast-math', with both the old and new compilers):

    $ clang -c -O2 -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: f3 0f 58 c1   addss   %xmm1, %xmm0
           4: f3 0f 5c c1   subss   %xmm1, %xmm0
           8: c3            retq
    $ clang -c -O2 -ffast-math -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: c3            retq
    $

FTR, GCC also does the reassociation transformation here when '-ffast-math' is used, as expected.

But when using '-ffast-math' and disabling a sub-fast-math aspect of it (say via '-fno-reciprocal-math', '-fno-associative-math', or '-fmath-errno'), both the old and new compilers exhibit incorrect behavior in some cases. With the old compiler, the behavior was that using any of these switches did not disable the transformation. Those switches were mostly ineffective. (Only '-fno-associative-math' should disable the transformation in this example, so the fact that the other ones didn't disable it is correct/desired.)
Here is the old behavior for the above test-case, when some example sub-fast-math aspects are individually disabled:

    $ old/bin/clang --version | grep version
    clang version 4.0.0 (tags/RELEASE_400/final)
    $ old/bin/clang -c -O2 -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: f3 0f 58 c1   addss   %xmm1, %xmm0
           4: f3 0f 5c c1   subss   %xmm1, %xmm0
           8: c3            retq
    $ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: c3            retq
    $ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: c3            retq
    $ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp  # Error
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: c3            retq
    $ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: c3            retq
    $

So with the old compiler, the case marked 'Error' above is incorrect, in that the reassociation should be suppressed in that case, but it isn't.

Again FTR, the GCC behavior disables the re-association in the case marked 'Error' above.

Moving on to the new compiler, instead of '-fno-associative-math' being ineffective, the problem is that when disabling other sub-fast-math aspects (unrelated to reassociation), the transformation is suppressed, when it should not be.
Here is the new behavior with that same set of sub-fast-math aspects individually disabled:

    $ new/bin/clang --version | grep version
    clang version 5.0.0 (tags/RELEASE_500/final)
    $ new/bin/clang -c -O2 -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: f3 0f 58 c1   addss   %xmm1, %xmm0
           4: f3 0f 5c c1   subss   %xmm1, %xmm0
           8: c3            retq
    $ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: c3            retq
    $ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp  # Error
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: f3 0f 58 c1   addss   %xmm1, %xmm0
           4: f3 0f 5c c1   subss   %xmm1, %xmm0
           8: c3            retq
    $ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp  # Good
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: f3 0f 58 c1   addss   %xmm1, %xmm0
           4: f3 0f 5c c1   subss   %xmm1, %xmm0
           8: c3            retq
    $ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp  # Error
    $ llvm-objdump -d x.o | grep "^ .*: "
           0: f3 0f 58 c1   addss   %xmm1, %xmm0
           4: f3 0f 5c c1   subss   %xmm1, %xmm0
           8: c3            retq
    $

The two cases marked as 'Error' are incorrectly suppressing the re-association. The case marked as 'Good' is now doing the right thing for this test-case.

Again FTR, the GCC behavior allows the re-association in the cases marked 'Error' above to happen.

__________________________________________________________________

Note that the '-f[no-]associative-math' flag has other problems, reported in PR27372 (https://bugs.llvm.org/show_bug.cgi?id=27372). Those "other problems" are related to the fact that there isn't an LLVM IR fast-math-flag that explicitly indicates whether reassociation is enabled or disabled. As a consequence, the front-end essentially drops that flag on the floor. The back-end has no way of explicitly looking for that capability, and so the back-end implementation instead relies on the "umbrella" aspect of 'fast' implicitly turning on all the lower-level fast-math-flags. This is a key aspect of the problem.
Near the start of this post, I mentioned that the LLVM issues are related to the umbrella aspect of 'fast', along with other fast-math-flag implementation details. The fact that the back-end has no way of explicitly checking whether reassociation is enabled is what I meant by those other implementation details.

Going to a more general discussion of the problem, the documentation of the fast-math-flags at:

    http://llvm.org/docs/LangRef.html#fast-math-flags

can be described loosely as:

    nnan      Allow optimizations to assume the arguments and result are
              not NaN
    ninf      Allow optimizations to assume the arguments and result are
              not +/-Inf
    nsz       Allow optimizations to treat the sign of a zero argument or
              result as insignificant
    arcp      Allow optimizations to use the reciprocal of an argument
              rather than perform division
    contract  Allow floating-point contraction (e.g. fused multiply-and-add)

And the flag 'fast' is defined there as:

    fast      Fast - Allow algebraically equivalent transformations that
              may dramatically change results in floating point (e.g.
              reassociate). This flag implies all the others.

(Side point: Back in November, 'contract' was not an explicit fast-math-flag. This is a recent change, but it doesn't impact the issue I'm raising here.)

To summarize, and to relate this somewhat back to the November 2016 post:

    http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

as described in that older post, this means that 'fast' could be described as:

    Very loosely, 'fast' means "all the aggressive FP-transformations that
    are not controlled by one of the other 5, plus it implies all the other
    5".

If for terminology, we call those additional aggressive optimizations 'aggr', then we have:

    'fast' == 'aggr' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

But there isn't a specific flag for 'aggr' (it's just "on" when all the other flags are "on"). Reassociation is part of these additional 'aggr' transformations.
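
As a rough illustration of why the missing 'aggr' flag matters, the following C++ sketch models the flags as bits. The bit layout and function names are hypothetical and are not LLVM's real encoding or API; the 'Reassoc' bit is the flag proposed in the summary and does not exist in the IR described here.

```cpp
// Hypothetical bit layout, for illustration only (not LLVM's encoding).
constexpr unsigned NNan     = 1u << 0;
constexpr unsigned NInf     = 1u << 1;
constexpr unsigned NSZ      = 1u << 2;
constexpr unsigned ARcp     = 1u << 3;
constexpr unsigned Contract = 1u << 4;
// Proposed flag from the summary; assumed here purely for comparison.
constexpr unsigned Reassoc  = 1u << 5;

// 'fast' as described above: the umbrella that implies the other five.
constexpr unsigned Fast = NNan | NInf | NSZ | ARcp | Contract;

// Model of the situation described above: there is no bit for 'aggr', so
// reassociation can only be gated on the whole umbrella being present.
constexpr bool reassoc_allowed_today(unsigned flags) {
    return (flags & Fast) == Fast;
}

// Model of the proposed scheme: reassociation checks only its own bit.
constexpr bool reassoc_allowed_proposed(unsigned flags) {
    return (flags & Reassoc) != 0;
}
```

With 'arcp' cleared, reassoc_allowed_today(Fast & ~ARcp) is false even though reciprocal math has nothing to do with reassociation, whereas reassoc_allowed_proposed((Fast | Reassoc) & ~ARcp) stays true.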
Back in November, Hal pointed out that libm transformations are another part of these 'aggr' transformations. With that, one possible direction is to add two more sub-fast-math flags, say 'reassoc' and 'libm':

    'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

This would allow disabling (for example) 'arcp' without suppressing reassociation. Whether there would be a need for an "umbrella" flag 'fast' that implies all the others is somewhat orthogonal, although personally I feel it complicates the issue and doesn't provide any significant benefit. I can imagine that there is a benefit that I haven't thought of -- I don't claim to have a deep understanding of the implementation. So I'd like to hear what others think.

One important aspect of this is that it appears to me there are quite a few fast-math transformations that are enabled only when all the underlying sub-fast-math flags are on (that is, only when the 'fast' umbrella flag is set). That's a key part of the problem of PR27372.

In this context, the change in behavior from r297837 is that with the old behavior, the following two commands are almost equivalent (in many cases, they are equivalent):

    $ # Old behavior: The following two commands are nearly identical:
    $ clang -c -O2 -ffast-math foo.cpp
    $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
    $

Whereas with the new behavior (post-r297837), the following two commands are almost always equivalent:

    $ # New behavior: The following two commands are nearly identical:
    $ clang -c -O2 foo.cpp
    $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
    $

(Again, '-fno-reciprocal-math' is just an example of the suppression of a sub-fast-math aspect here. '-fno-associative-math' and '-fmath-errno' would also be good examples.)

Succinctly, if a '-ffast-math' user now disables a sub-fast-math aspect, they will be frustrated that they end up disabling almost the entire set of fast-math transformations.
Whereas previously, they would be frustrated that their attempt at disabling a specific sub-fast-math aspect was ineffective. So previously, they might try to "fix a numerical instability" by disabling a sub-fast-math aspect (and be frustrated by it not being effective), and now if they try to "fix that numerical instability", they will succeed, but they will lose nearly all the performance gain that '-ffast-math' was providing.

As an aside, on the PS4 with llvm 4.0 (and earlier) compilers, we've had a few customers frustrated that '-ffast-math -fno-reciprocal-math' was still doing reciprocal transformations. So we've had a private change to make '-fno-reciprocal-math' suppress the reciprocal optimization. With a vanilla llvm 5.0, those customers would see a performance hit (so we have a different private change to address that).

As a final point here, to give more weight to this, I took a random bit of code I found on github that has floating-point fast-math opportunities in it, and experimented with it. (I just searched for 'mandelbrot', and took the first thing I found.) Specifically:

    https://gist.github.com/andrejbauer/7919569

This test-case has a few divisions in it, but it doesn't contain any reciprocal-transformation opportunities (so '-f[no-]reciprocal-math' should essentially be a no-op). The old Clang behavior has the following two commands being nearly identical (they generate essentially equivalent code -- just some minor register change):

    $ # Old Clang behavior:
    $ # No significant difference when -fno-reciprocal-math is added (as desired)
    $ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
    $ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
    $ diff O2fm.s O2fm.no_arcp.s | wc
          4      10      56
    $

That is, as expected/desired, the '-fno-reciprocal-math' switch has essentially no impact on this, since there are no reciprocal transformations being done.
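
For readers less familiar with what a reciprocal transformation changes numerically, here is a small C++ sketch. It is an illustration only, with hypothetical function names, and it is not taken from the mandelbrot test-case: it contrasts an exact IEEE division with the "multiply by the reciprocal" form that 'arcp' permits.

```cpp
// Exact IEEE division: what the source code asks for.
constexpr double divide(double x, double y) {
    return x / y;
}

// What a reciprocal transformation may compute instead: the division
// x / y is replaced by a multiplication with 1 / y, which rounds twice.
constexpr double via_reciprocal(double x, double y) {
    return x * (1.0 / y);
}
```

In IEEE double precision, divide(49.0, 49.0) is exactly 1.0, while via_reciprocal(49.0, 49.0) is slightly less than 1.0 because 1.0 / 49.0 is rounded before the multiplication. Differences of this kind are why a user might want '-fno-reciprocal-math' while keeping the rest of '-ffast-math'.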
Also as expected, the difference between "plain -O2" and '-O2 -ffast-math' is more substantial:

    $ # Old Clang behavior:
    $ # '-O2' vs '-O2 -ffast-math' shows a significant difference (as desired)
    $ clang -S -O2 -o O2.s mandelbrot.c
    $ diff O2.s O2fm.s | wc
         43     184    1305
    $

That is, adding '-ffast-math' to '-O2' is transforming the code, presumably making it faster (at the cost of a potential loss in numerical accuracy).

With GCC for this example (I used version 4.8.4, which isn't particularly modern, but I happen to have it handy), I get similar behavior. For example, the following two commands produce identical assembly code:

    $ gcc -S -O2 -ffast-math -o O2fm.s mandelbrot.c
    $ gcc -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
    $ diff O2fm.s O2fm.no_arcp.s
    $

and that code is substantially different than the GCC "plain -O2" code:

    $ gcc -S -O2 -o O2.s mandelbrot.c
    $ diff O2.s O2fm.s | wc
         44     126     719
    $

But comparing this to the new Clang behavior, we see that '-fno-reciprocal-math' is now "disabling too much", as discussed in detail above for the simple "assoc.cpp" test-case. Specifically:

    $ # New Clang behavior:
    $ clang -S -O2 -o O2.s mandelbrot.c
    $ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
    $ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
    $
    $ # Adding -ffast-math to -O2 continues to show significant diffs (expected)
    $ diff O2.s O2fm.s | wc
         35     105     622
    $
    $ # too many differences -- should be nearly the same
    $ diff O2fm.s O2fm.no_arcp.s | wc
         29      89     526
    $

So with the new behavior, even though there are no reciprocal transformation opportunities, disabling that transformation via '-fno-reciprocal-math' disables many (most) of the fast-math features. In fact, comparing plain '-O2' with '-O2 -ffast-math -fno-reciprocal-math', it's clear that they are virtually identical with the new Clang behavior.
Specifically, we get only a minor difference (swapping two register operands in a comparison, and changing the sense of the associated branch) when comparing '-O2' with '-O2 -ffast-math -fno-reciprocal-math':

    $ # New Clang behavior:
    $ # nearly identical, but there should be many diffs
    $ diff O2.s O2fm.no_arcp.s
    188,189c188,189
    <       ucomisd %xmm5, %xmm6
    <       ja      .LBB0_7
    ---
    >       ucomisd %xmm6, %xmm5
    >       jb      .LBB0_7
    $

In full disclosure, for this "mandelbrot.c" test-case, I don't know if any of the changes in code-gen done by us or by GCC when '-ffast-math' is enabled are helpful (from a performance perspective) or dangerous (from a precise IEEE FP math perspective). All I know is that for both us and GCC at -O2, the switch '-ffast-math' changed the code-gen, and that '-ffast-math -fno-reciprocal-math' didn't suppress any of those changes for GCC, but it suppressed essentially all of the changes for us.

For continuity, I'm repeating the summary here (that I had near the beginning).

In Summary:

1. With the change of r297837, the driver now more cleanly handles '-ffast-math', and other sub-fast-math switches (like '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).

2. Prior to that change, the disabling of a sub-fast-math switch was often ineffective. So as an example, the following two commands often resulted in the same code-gen, even if there were fast-math-reciprocal-transformations that were done:

       clang++ -O2 -ffast-math -c foo.cpp
       clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

3. Since that change, the disabling of a sub-fast-math switch disables many more sub-fast-math transformations than just the one specified. So now, the following two commands often result in very similar (and sometimes identical) code-gen:

       clang++ -O2 -c foo.cpp
       clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

   That is, disabling a single sub-fast-math transformation in some (many?) cases now ends up disabling almost all the fast-math transformations. This causes a performance hit for people that have been doing this.

4. To fix this, I think that additional fast-math-flags are likely needed in the IR. Instead of the following set:

       'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

   something like this:

       'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

   would be more useful. Related to this, the current 'fast' flag which acts as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may not be needed. A discussion on this point was raised last November on the mailing list:

       http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

Thanks,
-Warren
Hal Finkel via llvm-dev
2017-Sep-29 01:36 UTC
[llvm-dev] Trouble when suppressing a portion of fast-math-transformations
Hi, Warren, Thanks for writing all of this up. In short, regarding your suggested solution:> 4. To fix this, I think that additional fast-math-flags are likely > needed in > > the IR. Instead of the following set: > > 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > something like this: > > 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > would be more useful. Related to this, the current 'fast' flag which acts > > as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + > 'contract') may > > not be needed. A discussion on this point was raised last November on the > > mailing list: > > http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.htmlI agree. I'm happy to help review the patches. It will be best to have only the finer-grained flags where there's no "fast" flag that implies all of the others. -Hal On 09/28/2017 07:56 PM, Ristow, Warren via llvm-dev wrote:> > Hi all, > > In a mailing-list post last November: > > http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html > > I raised some concerns that having the IR-level fast-math-flag 'fast' > act as an > > "umbrella" to implicitly turn on all the lower-level fast-math-flags, > causes > > some fundamental problems. Those fundamental problems are related to > > situations where a user wants to disable a portion of the fast-math > behavior. > > For example, to enable all the fast-math transformations except for the > > reciprocal-math transformation, a command like the following is what a > user > > would expect to work: > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > But that isn't what it's doing. > > I believe this is a serious problem, but I also want to avoid > over-stating the > > seriousness. To be explicit, the problems I'm describing here happen when > > '-ffast-math' is used with one or more of the underlying fast-math-related > > aspects _disabled_ (like the '-fno-reciprocal-math' example, above). 
> > Conversely, when '-ffast-math' is used "on its own", the situation is > fine. > > For terminology here, I'll refer to these underlying fast-math-related > aspects > > (like reciprocal-math, associative-math, math-errno, and others) as > > "sub-fast-math" aspects. > > I apologize for the length of this post. I'm putting the summary up > front, so > > that anyone interested in fast-math issues can quickly get the > big-picture of > > the issues I'm describing here. > > In Summary: > > 1. With the change of r297837, the driver now more cleanly handles > > '-ffast-math', and other sub-fast-math switches (like > > '-f[no]-reciprocal-math', '-f[no-]math-errno', and others). > > 2. Prior to that change, the disabling of a sub-fast-math switch was often > > ineffective. So as an example, the following two commands often resulted > > in the same code-gen, even if there were > > fast-math-reciprocal-transformations that were done: > > clang++ -O2 -ffast-math -c foo.cpp > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > 3. Since that change, the disabling of a sub-fast-math switch disables > many > > more sub-fast-math transformations than just the one specified. So now, > > the following two commands often result in very similar (and sometimes > > identical) code-gen: > > clang++ -O2 -c foo.cpp > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > That is, disabling a single sub-fast-math transformation in some (many?) > > cases now ends up disabling almost all the fast-math transformations. > > This causes a performance hit for people that have been doing this. > > 4. To fix this, I think that additional fast-math-flags are likely > needed in > > the IR. Instead of the following set: > > 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > something like this: > > 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > would be more useful. 
Related to this, the current 'fast' flag which acts > > as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + > 'contract') may > > not be needed. A discussion on this point was raised last November on the > > mailing list: > > http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html > > TL;DR > > More details are in that thread from November, but the problem in its > entirety > > involved both back-end LLVM issues, and front-end Clang (driver) > issues. The > > LLVM issues are related to the umbrella aspect of 'fast', along with other > > fast-math-flags implementation details (described below). The front-end > > aspects in Clang are related to the driver's handling of '-ffast-math' > (which > > also had an "umbrella" aspect). The driver code has been refactored > since that > > November post, fixing the umbrella aspect of the front-end. But I > never got > > around to working on the related back-end issues (nor has anyone > else), and the > > refactored front-end now results in the back-end issues manifesting > > differently, and arguably in a worse way (details on the "worse" aspect, > > below). > > For reference, the refactored driver code was done in r297837: > > [Driver] Restructure handling of -ffast-math and similar options > > To be clear, I'm not at all suggesting that the above change was > incorrect. I > > think that refactoring of the driver code is the right thing to do. > An aspect > > of this refactoring is that prior to it, when a user passed > '-ffast-math' on > > the command-line, it was also passed to the cc1 process, even if a > > sub-fast-math component was disabled. With the refactoring, the > driver only > > passes '-ffast-math' to cc1 when a specific set of sub-fast-math > components are > > enabled. 
> > More specifically, when a user specifies just '-ffast-math' on the > > command-line, the following 7 sub-fast-math switches: > > -fno-honor-infinities > > -fno-honor-nans > > -fno-math-errno > > -fassociative-math > > -freciprocal-math > > -fno-signed-zeros > > -fno-trapping-math > > get passed to cc1 (this is true both with the old (pre r297837) and > new (since > > r297837) compilers). Furthermore, the "umbrella" '-ffast-math' is > also passed > > to cc1 in this case of the user specifying just '-ffast-math' on the > > command-line (again, in both the old and new compilers). > > The difference related to this issue in the old/new behavior, is that > when a > > user turns on fast-math but disables one (or more) of the sub-fast-math > > switches, for example, as in: > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > then in the old mode '-ffast-math' was still passed to cc1 (acting as an > > umbrella, causing trouble), but in the new mode '-ffast-math' is no longer > > passed to cc1 in this case. (In both the old and new modes, > > '-freciprocal-math' is not passed to cc1 with this command-line, as you'd > > expect.) > > What's happening is that in the old mode, it was the user passing > '-ffast-math' > > on the command-line that resulted in passing the umbrella > '-ffast-math' to cc1 > > (even if all 7 of the sub-fast-math switches were disabled by the user). > > Whereas in the new mode, the '-ffast-math' switch is passed to cc1 iff > all 7 of > > the underlying sub-fast-math switches are enabled. > > I'd say that's an improvement in the handling of the switches, and > also on the > > plus side, I think it makes dealing with the concerns I raised in > November LLVM > > a little clearer, and so more manageable in some sense. But on the > negative > > side, since the new behavior in LLVM is arguably worse, fixing the > back-end > > issues is now a higher priority for my customers. 
> > The behavior that is arguably worse, is that when a user enables > fast-math, but > > attempts to disable one of the sub-fast-math aspects, the old behavior > (pre > > r297837) was that the sub-fast-math aspect to be disabled, generally > (often?) > > remained enabled. The new behavior (since r297837) is that when > disabling a > > sub-fast-math aspect, that aspect plus many more (possibly often the > majority) > > of the fast-math transformations are disabled. So this results in a > > performance regression in these fast-math contexts when a > sub-fast-math aspect > > is disabled, which is why it is a fairly high priority for us. > > FTR, r297837 was made during llvm 5.0 development, so the new behavior > has the > > effect of a performance regression in moving from 4.0 to 5.0. In > describing > > things here, I'll compare llvm 4.0 with llvm 5.0 behavior. But more > precisely, > > it's pre-r297837 with post-r297837 behavior. > > Here is a tiny example, to illustrate it concretely: > > $ cat assoc.cpp > > //////////// "assoc.cpp" //////////// > > float foo(float a, float x) > > { > > return ((a + x) - x); // fastmath reassociation eliminates the arithmetic > > } > > ///////////////////////////////////// > > $ > > When -ffast-math is specified, the reassociation enabled by it allows > us to > > simply return the first argument (and that reassociation does happen with > > '-ffast-math', with both the old and new compilers): > > $ clang -c -O2 -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: f3 0f 58 c1 addss %xmm1, %xmm0 > > 4: f3 0f 5c c1 subss %xmm1, %xmm0 > > 8: c3 retq > > $ clang -c -O2 -ffast-math -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ > > FTR, GCC also does the reassociation transformation here when > '-ffast-math' is > > used, as expected. 
> > But when using '-ffast-math' and disabling a sub-fast-math aspect of > it (say > > via '-fno-reciprocal-math', '-fno-associative-math', or > '-fmath-errno'), both > > the old and new compilers exhibit incorrect behavior in some cases. > With the > > old compiler, the behavior was that using any of these switches did > not disable > > the transformation. Those switches were mostly ineffective. (Only > > '-fno-associative-math' should disable the transformation in this > example, so > > the fact that the other ones didn't disable it is correct/desired.) > Here is > > the old behavior for the above test-case, when some example sub-fast-math > > aspects are individually disabled: > > $ old/bin/clang --version | grep version > > clang version 4.0.0 (tags/RELEASE_400/final) > > $ old/bin/clang -c -O2 -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: f3 0f 58 c1 addss %xmm1, %xmm0 > > 4: f3 0f 5c c1 subss %xmm1, %xmm0 > > 8: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o > assoc.cpp # Error > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ > > So with the old compiler, the case marked 'Error' above is incorrect, > in that > > the reassociation should be suppressed in that case, but it isn't. > > Again FTR, the GCC behavior disables the re-association in the case marked > > 'Error' above. > > Moving on to the new compiler, instead of '-fno-associative-math' being > > ineffective, the problem is that when disabling other sub-fast-math > aspects > > (unrelated to reassociation), the transformation is suppressed, when > it should > > not be. 
> > Here is the new behavior with that same set of sub-fast-math aspects
> > individually disabled:
> >
> >     $ new/bin/clang --version | grep version
> >     clang version 5.0.0 (tags/RELEASE_500/final)
> >     $ new/bin/clang -c -O2 -o x.o assoc.cpp
> >     $ llvm-objdump -d x.o | grep "^ .*: "
> >        0: f3 0f 58 c1   addss %xmm1, %xmm0
> >        4: f3 0f 5c c1   subss %xmm1, %xmm0
> >        8: c3            retq
> >     $ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
> >     $ llvm-objdump -d x.o | grep "^ .*: "
> >        0: c3            retq
> >     $ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp # Error
> >     $ llvm-objdump -d x.o | grep "^ .*: "
> >        0: f3 0f 58 c1   addss %xmm1, %xmm0
> >        4: f3 0f 5c c1   subss %xmm1, %xmm0
> >        8: c3            retq
> >     $ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Good
> >     $ llvm-objdump -d x.o | grep "^ .*: "
> >        0: f3 0f 58 c1   addss %xmm1, %xmm0
> >        4: f3 0f 5c c1   subss %xmm1, %xmm0
> >        8: c3            retq
> >     $ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp # Error
> >     $ llvm-objdump -d x.o | grep "^ .*: "
> >        0: f3 0f 58 c1   addss %xmm1, %xmm0
> >        4: f3 0f 5c c1   subss %xmm1, %xmm0
> >        8: c3            retq
> >     $
> >
> > The two cases marked as 'Error' are incorrectly suppressing the
> > re-association. The case marked as 'Good' is now doing the right thing
> > for this test-case.
> >
> > Again FTR, the GCC behavior allows the re-association in the cases
> > marked 'Error' above to happen.
> >
> > __________________________________________________________________
> >
> > Note that the '-f[no-]associative-math' flag has other problems,
> > reported in PR27372 (https://bugs.llvm.org/show_bug.cgi?id=27372). Those
> > "other problems" are related to the fact that there isn't an LLVM IR
> > fast-math-flag that explicitly indicates whether reassociation is
> > enabled or disabled. As a consequence, the front-end essentially drops
> > that flag on the floor.
> > The back-end has no way of explicitly looking for that capability, and
> > so the back-end implementation instead relies on the "umbrella" aspect
> > of 'fast' implicitly turning on all the lower-level fast-math-flags.
> > This is a key aspect of the problem. Near the start of this post, I
> > mentioned that the LLVM issues are related to the umbrella aspect of
> > 'fast', along with other fast-math-flag implementation details. The fact
> > that the back-end has no way of explicitly checking whether
> > reassociation is enabled is what I meant by those other implementation
> > details.
> >
> > Going to a more general discussion of the problem, the documentation of
> > the fast-math-flags at:
> >
> >     http://llvm.org/docs/LangRef.html#fast-math-flags
> >
> > can be described loosely as:
> >
> >     nnan      Allow optimizations to assume the arguments and result are
> >               not NaN
> >     ninf      Allow optimizations to assume the arguments and result are
> >               not +/-Inf
> >     nsz       Allow optimizations to treat the sign of a zero argument or
> >               result as insignificant
> >     arcp      Allow optimizations to use the reciprocal of an argument
> >               rather than perform division
> >     contract  Allow floating-point contraction (e.g. fused
> >               multiply-and-add)
> >
> > And the flag 'fast' is defined there as:
> >
> >     fast      Fast - Allow algebraically equivalent transformations that
> >               may dramatically change results in floating point (e.g.
> >               reassociate). This flag implies all the others.
> >
> > (Side point: Back in November, 'contract' was not an explicit
> > fast-math-flag. This is a recent change, but it doesn't impact the issue
> > I'm raising here.)
> >
> > To summarize, and to relate this somewhat back to the November 2016
> > post:
> >
> >     http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
> >
> > as described in that older post, this means that 'fast' could be
> > described as:
> >
> >     Very loosely, 'fast' means "all the aggressive FP-transformations
> >     that are not controlled by one of the other 5, plus it implies all
> >     the other 5".
> >     If for terminology, we call those additional aggressive
> >     optimizations 'aggr', then we have:
> >
> >     'fast' == 'aggr' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> > But there isn't a specific flag for 'aggr' (it's just "on" when all the
> > other flags are "on"). Reassociation is part of these additional 'aggr'
> > transformations. Back in November, Hal pointed out that libm
> > transformations are another part of these 'aggr' transformations. With
> > that, one possible direction is to add two more sub-fast-math flags, say
> > 'reassoc' and 'libm':
> >
> >     'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> > This would allow disabling (for example) 'arcp' without suppressing
> > reassociation. Whether there would be a need for an "umbrella" flag
> > 'fast' that implies all the others is somewhat orthogonal, although
> > personally I feel it complicates the issue and doesn't provide any
> > significant benefit. I can imagine that there is a benefit that I
> > haven't thought of -- I don't claim to have a deep understanding of the
> > implementation. So I'd like to hear what others think.
> >
> > One important aspect of this is that it appears to me there are quite a
> > few fast-math transformations that are enabled only when all the
> > underlying sub-fast-math flags are on (that is, only when the 'fast'
> > umbrella flag is set). That's a key part of the problem of PR27372.
> > In this context, the change in behavior from r297837 is that with the
> > old behavior, the following two commands are almost equivalent (in many
> > cases, they are equivalent):
> >
> >     $ # Old behavior: The following two commands are nearly identical:
> >     $ clang -c -O2 -ffast-math foo.cpp
> >     $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
> >     $
> >
> > Whereas with the new behavior (post-r297837), the following two commands
> > are almost always equivalent:
> >
> >     $ # New behavior: The following two commands are nearly identical:
> >     $ clang -c -O2 foo.cpp
> >     $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
> >     $
> >
> > (Again, '-fno-reciprocal-math' is just an example of the suppression of
> > a sub-fast-math aspect here. '-fno-associative-math' and '-fmath-errno'
> > would also be good examples.)
> >
> > Succinctly, if a '-ffast-math' user now disables a sub-fast-math aspect,
> > they will be frustrated that they end up disabling almost the entire set
> > of fast-math transformations. Whereas previously, they would be
> > frustrated that their attempt of disabling a specific sub-fast-math
> > aspect was ineffective. So previously, they might try to "fix a
> > numerical instability" by disabling a sub-fast-math aspect (and be
> > frustrated by it not being effective), and now if they try to "fix that
> > numerical instability", they will succeed, but they will see a
> > performance-hit of losing nearly all the performance gain that
> > '-ffast-math' was providing.
> >
> > As an aside, on the PS4 with llvm 4.0 (and earlier) compilers, we've had
> > a few customers frustrated that '-ffast-math -fno-reciprocal-math' was
> > still doing reciprocal transformations. So we've had a private change to
> > make '-fno-reciprocal-math' suppress the reciprocal optimization. With a
> > vanilla llvm 5.0, those customers would see a performance hit (so we
> > have a different private change to address that).
> >
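The difference between the umbrella semantics and the proposed 'reassoc'/'libm' split can be illustrated with a toy bitmask model. This is only a sketch under stated assumptions: the enum FMF, the bit values, and the helper names (umbrella_allows_reassoc, proposed_allows_reassoc, disable) are hypothetical and are not LLVM's actual data structures or API. It simply encodes the rule described in the post, namely that under the umbrella model reassociation is permitted only when every sub-flag is set.

```cpp
// Hypothetical bit assignments, for illustration only (not LLVM's values).
enum FMF : unsigned {
    NNan     = 1u << 0,
    NInf     = 1u << 1,
    NSZ      = 1u << 2,
    ARcp     = 1u << 3,
    Contract = 1u << 4,
    Reassoc  = 1u << 5,  // proposed flag
    Libm     = 1u << 6   // proposed flag
};

// The five sub-flags that exist alongside the umbrella 'fast'.
constexpr unsigned kCurrentSubFlags = NNan | NInf | NSZ | ARcp | Contract;
// The proposed set, with 'reassoc' and 'libm' as separate bits.
constexpr unsigned kProposedFlags = kCurrentSubFlags | Reassoc | Libm;

// Umbrella model: reassociation has no bit of its own, so it is allowed
// only when 'fast' holds, i.e. when all five sub-flags are set.
constexpr bool umbrella_allows_reassoc(unsigned flags) {
    return (flags & kCurrentSubFlags) == kCurrentSubFlags;
}

// Proposed model: reassociation is controlled by its own bit.
constexpr bool proposed_allows_reassoc(unsigned flags) {
    return (flags & Reassoc) != 0;
}

// Models '-ffast-math -fno-<aspect>': clear one bit from a flag set.
constexpr unsigned disable(unsigned flags, FMF aspect) {
    return flags & ~static_cast<unsigned>(aspect);
}
```

In this model, umbrella_allows_reassoc(disable(kCurrentSubFlags, ARcp)) is false, which mirrors '-ffast-math -fno-reciprocal-math' losing reassociation, whereas proposed_allows_reassoc(disable(kProposedFlags, ARcp)) stays true. Only disabling Reassoc itself turns reassociation off in the proposed model.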
> > As a final point here, to give more weight to this, I took a random bit
> > of code I found on github that has floating-point fast-math
> > opportunities in it, and experimented with it. (I just searched for
> > 'mandelbrot', and took the first thing I found.) Specifically:
> >
> >     https://gist.github.com/andrejbauer/7919569
> >
> > This test-case has a few divisions in it, but it doesn't contain any
> > reciprocal-transformation opportunities (so '-f[no-]reciprocal-math'
> > should essentially be a no-op).
> >
> > The old Clang behavior has the following two commands being nearly
> > identical (they generate essentially equivalent code -- just some minor
> > register change):
> >
> >     $ # Old Clang behavior:
> >     $ # No significant difference when -fno-reciprocal-math is added (as desired)
> >     $ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
> >     $ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
> >     $ diff O2fm.s O2fm.no_arcp.s | wc
> >           4      10      56
> >     $
> >
> > That is, as expected/desired, the '-fno-reciprocal-math' switch has
> > essentially no impact on this, since there are no reciprocal
> > transformations being done. Also as expected, the difference between
> > "plain -O2" and '-O2 -ffast-math' is more substantial:
> >
> >     $ # Old Clang behavior:
> >     $ # '-O2' vs '-O2 -ffast-math' shows a significant difference (as desired)
> >     $ clang -S -O2 -o O2.s mandelbrot.c
> >     $ diff O2.s O2fm.s | wc
> >          43     184    1305
> >     $
> >
> > That is, adding '-ffast-math' to '-O2' is transforming the code,
> > presumably making it faster (at the cost of a potential loss in
> > numerical accuracy).
> >
> > With GCC for this example (I used version 4.8.4, which isn't
> > particularly modern, but I happen to have it handy), I get similar
> > behavior.
> > For example, the following two commands produce identical assembly
> > code:
> >
> >     $ gcc -S -O2 -ffast-math -o O2fm.s mandelbrot.c
> >     $ gcc -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
> >     $ diff O2fm.s O2fm.no_arcp.s
> >     $
> >
> > and that code is substantially different than the GCC "plain -O2" code:
> >
> >     $ gcc -S -O2 -o O2.s mandelbrot.c
> >     $ diff O2.s O2fm.s | wc
> >          44     126     719
> >     $
> >
> > But comparing this to the new Clang behavior, we see that
> > '-fno-reciprocal-math' is now "disabling too much", as discussed in
> > detail above for the simple "assoc.cpp" test-case. Specifically:
> >
> >     $ # New Clang behavior:
> >     $ clang -S -O2 -o O2.s mandelbrot.c
> >     $ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
> >     $ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
> >     $
> >     $ # Adding -ffast-math to -O2 continues to show significant diffs (expected)
> >     $ diff O2.s O2fm.s | wc
> >          35     105     622
> >     $
> >     $ # too many differences -- should be nearly the same
> >     $ diff O2fm.s O2fm.no_arcp.s | wc
> >          29      89     526
> >     $
> >
> > So with the new behavior, even though there are no reciprocal
> > transformation opportunities, disabling that transformation via
> > '-fno-reciprocal-math' disables many (most) of the fast-math features.
> > In fact, comparing plain '-O2' with '-O2 -ffast-math
> > -fno-reciprocal-math', it's clear that they are virtually identical with
> > the new Clang behavior.
> > Specifically, we get only a minor difference (of swapping of two
> > register operands in a comparison, and changing the sense of the
> > associated branch) when comparing '-O2' with
> > '-O2 -ffast-math -fno-reciprocal-math':
> >
> >     $ # New Clang behavior:
> >     $ # nearly identical, but there should be many diffs
> >     $ diff O2.s O2fm.no_arcp.s
> >     188,189c188,189
> >     <       ucomisd %xmm5, %xmm6
> >     <       ja      .LBB0_7
> >     ---
> >     >       ucomisd %xmm6, %xmm5
> >     >       jb      .LBB0_7
> >     $
> >
> > In full disclosure, for this "mandelbrot.c" test-case, I don't know if
> > any of the changes in code-gen done by us or by GCC when '-ffast-math'
> > is enabled are helpful (from a performance perspective) or dangerous
> > (from a precise IEEE FP math perspective). All I know is that for both
> > us and GCC at -O2, the switch '-ffast-math' changed the code-gen, and
> > that '-ffast-math -fno-reciprocal-math' didn't suppress any of those
> > changes for GCC, but it suppressed essentially all of the changes for
> > us.
> >
> > For continuity, I'm repeating the summary here (that I had near the
> > beginning).
> >
> > In Summary:
> >
> > 1. With the change of r297837, the driver now more cleanly handles
> >    '-ffast-math', and other sub-fast-math switches (like
> >    '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).
> >
> > 2. Prior to that change, the disabling of a sub-fast-math switch was
> >    often ineffective. So as an example, the following two commands often
> >    resulted in the same code-gen, even if there were
> >    fast-math-reciprocal-transformations that were done:
> >
> >        clang++ -O2 -ffast-math -c foo.cpp
> >        clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
> >
> > 3. Since that change, the disabling of a sub-fast-math switch disables
> >    many more sub-fast-math transformations than just the one specified.
> >    So now, the following two commands often result in very similar (and
> >    sometimes identical) code-gen:
> >
> >        clang++ -O2 -c foo.cpp
> >        clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
> >
> >    That is, disabling a single sub-fast-math transformation in some
> >    (many?) cases now ends up disabling almost all the fast-math
> >    transformations. This causes a performance hit for people that have
> >    been doing this.
> >
> > 4. To fix this, I think that additional fast-math-flags are likely
> >    needed in the IR. Instead of the following set:
> >
> >        'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> >    something like this:
> >
> >        'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> >    would be more useful. Related to this, the current 'fast' flag which
> >    acts as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' +
> >    'contract') may not be needed. A discussion on this point was raised
> >    last November on the mailing list:
> >
> >        http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
> >
> > Thanks,
> > -Warren
> >
> > _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Ristow, Warren via llvm-dev
2017-Sep-30 02:16 UTC
[llvm-dev] Trouble when suppressing a portion of fast-math-transformations
Hi Hal,

>> 4. To fix this, I think that additional fast-math-flags are likely
>>    needed in the IR. Instead of the following set:
>>
>>        'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>>
>>    something like this:
>>
>>        'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>>
>>    would be more useful. Related to this, the current 'fast' flag which acts
>>    as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
>>    not be needed. A discussion on this point was raised last November on the
>>    mailing list:
>>
>>        http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
>
> I agree. I'm happy to help review the patches. It will be best to have
> only the finer-grained flags where there's no "fast" flag that implies
> all of the others.

Thanks for the quick response, and for the willingness to review. I won't
let this languish so long, like the post from last November.

Happy to hear that you feel it's best not to have the umbrella "fast" flag.

Thanks again,
-Warren