Ristow, Warren via llvm-dev
2017-Sep-29 00:56 UTC
[llvm-dev] Trouble when suppressing a portion of fast-math-transformations
Hi all,
In a mailing-list post last November:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
I raised some concerns that having the IR-level fast-math-flag 'fast' act as an
"umbrella" to implicitly turn on all the lower-level fast-math-flags, causes
some fundamental problems. Those fundamental problems are related to
situations where a user wants to disable a portion of the fast-math behavior.
For example, to enable all the fast-math transformations except for the
reciprocal-math transformation, a command like the following is what a user
would expect to work:
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
But that isn't what it's doing.
I believe this is a serious problem, but I also want to avoid over-stating the
seriousness. To be explicit, the problems I'm describing here happen when
'-ffast-math' is used with one or more of the underlying fast-math-related
aspects _disabled_ (like the '-fno-reciprocal-math' example, above).
Conversely, when '-ffast-math' is used "on its own", the situation is fine.
For terminology here, I'll refer to these underlying fast-math-related aspects
(like reciprocal-math, associative-math, math-errno, and others) as
"sub-fast-math" aspects.
I apologize for the length of this post. I'm putting the summary up front, so
that anyone interested in fast-math issues can quickly get the big-picture of
the issues I'm describing here.
In Summary:
1. With the change of r297837, the driver now more cleanly handles
'-ffast-math', and other sub-fast-math switches (like
'-f[no-]reciprocal-math', '-f[no-]math-errno', and others).
2. Prior to that change, the disabling of a sub-fast-math switch was often
ineffective. So as an example, the following two commands often resulted
in the same code-gen, even if there were
fast-math-reciprocal-transformations that were done:
clang++ -O2 -ffast-math -c foo.cpp
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
3. Since that change, the disabling of a sub-fast-math switch disables many
more sub-fast-math transformations than just the one specified. So now,
the following two commands often result in very similar (and sometimes
identical) code-gen:
clang++ -O2 -c foo.cpp
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
That is, disabling a single sub-fast-math transformation in some (many?)
cases now ends up disabling almost all the fast-math transformations.
This causes a performance hit for people that have been doing this.
4. To fix this, I think that additional fast-math-flags are likely needed in
the IR. Instead of the following set:
    'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
something like this:
    'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
would be more useful. Related to this, the current 'fast' flag which acts
as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
not be needed. A discussion on this point was raised last November on the
mailing list:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
TL;DR
More details are in that thread from November, but the problem in its entirety
involved both back-end LLVM issues, and front-end Clang (driver) issues. The
LLVM issues are related to the umbrella aspect of 'fast', along with other
fast-math-flags implementation details (described below). The front-end
aspects in Clang are related to the driver's handling of '-ffast-math' (which
also had an "umbrella" aspect). The driver code has been refactored since that
November post, fixing the umbrella aspect of the front-end. But I never got
around to working on the related back-end issues (nor has anyone else), and the
refactored front-end now results in the back-end issues manifesting
differently, and arguably in a worse way (details on the "worse" aspect,
below).
For reference, the refactored driver code was done in r297837:
[Driver] Restructure handling of -ffast-math and similar options
To be clear, I'm not at all suggesting that the above change was incorrect. I
think that refactoring of the driver code is the right thing to do. An aspect
of this refactoring is that prior to it, when a user passed '-ffast-math' on
the command-line, it was also passed to the cc1 process, even if a
sub-fast-math component was disabled. With the refactoring, the driver only
passes '-ffast-math' to cc1 when a specific set of sub-fast-math components are
enabled.
More specifically, when a user specifies just '-ffast-math' on the
command-line, the following 7 sub-fast-math switches:
-fno-honor-infinities
-fno-honor-nans
-fno-math-errno
-fassociative-math
-freciprocal-math
-fno-signed-zeros
-fno-trapping-math
get passed to cc1 (this is true both with the old (pre r297837) and new (since
r297837) compilers). Furthermore, the "umbrella" '-ffast-math' is also passed
to cc1 in this case of the user specifying just '-ffast-math' on the
command-line (again, in both the old and new compilers).
The difference related to this issue in the old/new behavior, is that when a
user turns on fast-math but disables one (or more) of the sub-fast-math
switches, for example, as in:
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
then in the old mode '-ffast-math' was still passed to cc1 (acting as an
umbrella, causing trouble), but in the new mode '-ffast-math' is no longer
passed to cc1 in this case. (In both the old and new modes,
'-freciprocal-math' is not passed to cc1 with this command-line, as you'd
expect.)
What's happening is that in the old mode, it was the user passing '-ffast-math'
on the command-line that resulted in passing the umbrella '-ffast-math' to cc1
(even if all 7 of the sub-fast-math switches were disabled by the user).
Whereas in the new mode, the '-ffast-math' switch is passed to cc1 iff all 7 of
the underlying sub-fast-math switches are enabled.
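To make the old/new difference concrete, here is a minimal C++ sketch of the
rule described above. It models only the decision of whether the umbrella
'-ffast-math' is forwarded to cc1. The struct, its field names and the function
names are made up for illustration; this is not the driver's actual code.

```cpp
// Hypothetical model of the seven sub-fast-math settings listed above.
// Each field is true when the corresponding switch is passed to cc1.
struct SubFastMath {
    bool no_honor_infinities;
    bool no_honor_nans;
    bool no_math_errno;
    bool associative_math;
    bool reciprocal_math;
    bool no_signed_zeros;
    bool no_trapping_math;

    constexpr bool all_enabled() const {
        return no_honor_infinities && no_honor_nans && no_math_errno &&
               associative_math && reciprocal_math && no_signed_zeros &&
               no_trapping_math;
    }
};

// What plain '-ffast-math' turns on: all seven.
constexpr SubFastMath fast_math_defaults() {
    return SubFastMath{true, true, true, true, true, true, true};
}

// '-ffast-math -fno-reciprocal-math': everything except reciprocal-math.
constexpr SubFastMath fast_math_minus_reciprocal() {
    return SubFastMath{true, true, true, true, false, true, true};
}

// Old driver (pre-r297837): the umbrella is forwarded whenever the user wrote
// '-ffast-math', regardless of which sub-switches were disabled afterwards.
constexpr bool old_driver_passes_umbrella(bool user_wrote_ffast_math,
                                          SubFastMath /*unused*/) {
    return user_wrote_ffast_math;
}

// New driver (since r297837): the umbrella is forwarded iff all seven
// sub-switches are enabled.
constexpr bool new_driver_passes_umbrella(SubFastMath s) {
    return s.all_enabled();
}

static_assert(old_driver_passes_umbrella(true, fast_math_minus_reciprocal()),
              "old: umbrella still forwarded after -fno-reciprocal-math");
static_assert(!new_driver_passes_umbrella(fast_math_minus_reciprocal()),
              "new: umbrella dropped after -fno-reciprocal-math");
```

In this model, the two rules agree for plain '-ffast-math' and differ only once
a sub-switch has been disabled, which is exactly the case this post is about.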
I'd say that's an improvement in the handling of the switches, and also on the
plus side, I think it makes dealing with the LLVM concerns I raised in November
a little clearer, and so more manageable in some sense. But on the negative
side, since the new behavior in LLVM is arguably worse, fixing the back-end
issues is now a higher priority for my customers.
The behavior that is arguably worse, is that when a user enables fast-math, but
attempts to disable one of the sub-fast-math aspects, the old behavior (pre
r297837) was that the sub-fast-math aspect to be disabled, generally (often?)
remained enabled. The new behavior (since r297837) is that when disabling a
sub-fast-math aspect, that aspect plus many more (possibly often the majority)
of the fast-math transformations are disabled. So this results in a
performance regression in these fast-math contexts when a sub-fast-math aspect
is disabled, which is why it is a fairly high priority for us.
FTR, r297837 was made during llvm 5.0 development, so the new behavior has the
effect of a performance regression in moving from 4.0 to 5.0. In describing
things here, I'll compare llvm 4.0 with llvm 5.0 behavior. But more precisely,
it's pre-r297837 with post-r297837 behavior.
Here is a tiny example, to illustrate it concretely:
$ cat assoc.cpp
//////////// "assoc.cpp" ////////////
float foo(float a, float x)
{
    return ((a + x) - x); // fastmath reassociation eliminates the arithmetic
}
/////////////////////////////////////
$
When -ffast-math is specified, the reassociation enabled by it allows us to
simply return the first argument (and that reassociation does happen with
'-ffast-math', with both the old and new compilers):
$ clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$
FTR, GCC also does the reassociation transformation here when '-ffast-math' is
used, as expected.
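As a reminder of why this reassociation is only legal under fast-math, the
following sketch (an illustration added here, not part of the test-case)
evaluates the expression as written and in its reassociated form. The function
names and the input values are assumptions, chosen so that the two forms
visibly disagree in IEEE single precision.

```cpp
#include <cassert>

// Evaluates (a + x) - x in the order written. The volatile temporary forces
// the sum to be rounded to float and keeps the compiler from folding the
// expression away.
float evaluate_as_written(float a, float x) {
    volatile float sum = a + x;
    return sum - x;
}

// What the reassociated form reduces to: (a + x) - x ==> a + (x - x) ==> a.
float evaluate_reassociated(float a, float /*x*/) {
    return a;
}

// Hypothetical inputs: 1.0e8f is exactly representable, but neighbouring
// floats near it are 8 apart, so 1.0e8f + 1.0f rounds back to 1.0e8f.
void check_example() {
    assert(evaluate_as_written(1.0f, 1.0e8f) == 0.0f);    // 'a' is lost
    assert(evaluate_reassociated(1.0f, 1.0e8f) == 1.0f);  // 'a' is preserved
}
```

For these inputs the reassociated form happens to be the more accurate one, but
it is not what strict IEEE evaluation of the source produces, which is why the
transformation has to be gated on a fast-math setting.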
But when using '-ffast-math' and disabling a sub-fast-math aspect of it (say
via '-fno-reciprocal-math', '-fno-associative-math', or '-fmath-errno'), both
the old and new compilers exhibit incorrect behavior in some cases. With the
old compiler, the behavior was that using any of these switches did not disable
the transformation. Those switches were mostly ineffective. (Only
'-fno-associative-math' should disable the transformation in this example, so
the fact that the other ones didn't disable it is correct/desired.) Here is
the old behavior for the above test-case, when some example sub-fast-math
aspects are individually disabled:
$ old/bin/clang --version | grep version
clang version 4.0.0 (tags/RELEASE_400/final)
$ old/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$
So with the old compiler, the case marked 'Error' above is incorrect, in that
the reassociation should be suppressed in that case, but it isn't.
Again FTR, the GCC behavior disables the re-association in the case marked
'Error' above.
Moving on to the new compiler, instead of '-fno-associative-math' being
ineffective, the problem is that when disabling other sub-fast-math aspects
(unrelated to reassociation), the transformation is suppressed, when it should
not be. Here is the new behavior with that same set of sub-fast-math aspects
individually disabled:
$ new/bin/clang --version | grep version
clang version 5.0.0 (tags/RELEASE_500/final)
$ new/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Good
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$
The two cases marked as 'Error' are incorrectly suppressing the re-association.
The case marked as 'Good' is now doing the right thing for this test-case.
Again FTR, the GCC behavior allows the re-association in the cases marked
'Error' above to happen.
__________________________________________________________________
Note that the '-f[no-]associative-math' flag has other problems, reported in
PR27372 (https://bugs.llvm.org/show_bug.cgi?id=27372). Those "other problems"
are related to the fact that there isn't an LLVM IR fast-math-flag that
explicitly indicates whether reassociation is enabled or disabled. As a
consequence, the front-end essentially drops that flag on the floor. The
back-end has no way of explicitly looking for that capability, and so the
back-end implementation instead relies on the "umbrella" aspect of 'fast'
implicitly turning on all the lower-level fast-math-flags. This is a key
aspect of the problem. Near the start of this post, I mentioned that the LLVM
issues are related to the umbrella aspect of 'fast', along with other
fast-math-flag implementation details. The fact that the back-end has no way
of explicitly checking whether reassociation is enabled is what I meant by
those other implementation details.
Going to a more general discussion of the problem, the documentation of the
fast-math-flags at:
http://llvm.org/docs/LangRef.html#fast-math-flags
can be described loosely as:
  nnan      Allow optimizations to assume the arguments and result are not NaN
  ninf      Allow optimizations to assume the arguments and result are not +/-Inf
  nsz       Allow optimizations to treat the sign of a zero argument or result
            as insignificant
  arcp      Allow optimizations to use the reciprocal of an argument rather than
            perform division
  contract  Allow floating-point contraction (e.g. fused multiply-and-add)
And the flag 'fast' is defined there as:
  fast      Fast - Allow algebraically equivalent transformations that may
            dramatically change results in floating point (e.g. reassociate).
            This flag implies all the others.
(Side point: Back in November, 'contract' was not an explicit fast-math-flag.
This is a recent change, but it doesn't impact the issue I'm raising here.)
To summarize, and to relate this somewhat back to the November 2016 post:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
as described in that older post, this means that 'fast' could be described as:
    Very loosely, 'fast' means "all the aggressive FP-transformations that
    are not controlled by one of the other 5, plus it implies all the other
    5". If for terminology, we call those additional aggressive
    optimizations 'aggr', then we have:
      'fast' == 'aggr' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
But there isn't a specific flag for 'aggr' (it's just "on" when all the other
flags are "on"). Reassociation is part of these additional 'aggr'
transformations. Back in November, Hal pointed out that libm transformations
are another part of these 'aggr' transformations. With that, one possible
direction is to add two more sub-fast-math flags, say 'reassoc' and 'libm':
    'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
This would allow disabling (for example) 'arcp' without suppressing
reassociation. Whether there would be a need for an "umbrella" flag 'fast'
that implies all the others is somewhat orthogonal, although personally I feel
it complicates the issue and doesn't provide any significant benefit. I can
imagine that there is a benefit that I haven't thought of -- I don't claim to
have a deep understanding of the implementation. So I'd like to hear what
others think.
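To illustrate the difference between the two schemes, here is a small C++
sketch. It is only a model under simplifying assumptions: the bit values, the
names, and the two query functions are hypothetical and do not reflect LLVM's
actual encoding or API. The "current" query mimics a back-end that can only ask
"is the umbrella on?", and the "proposed" query checks a dedicated 'reassoc'
bit.

```cpp
#include <cstdint>

// Hypothetical bit assignments, for illustration only.
enum Fmf : std::uint32_t {
    kNnan     = 1u << 0,
    kNinf     = 1u << 1,
    kNsz      = 1u << 2,
    kArcp     = 1u << 3,
    kContract = 1u << 4,
    kReassoc  = 1u << 5,  // proposed in this post
    kLibm     = 1u << 6,  // proposed in this post
};

constexpr std::uint32_t kCurrentSet =
    kNnan | kNinf | kNsz | kArcp | kContract;
constexpr std::uint32_t kProposedSet = kCurrentSet | kReassoc | kLibm;

// Current scheme (simplified): there is no reassociation bit, so
// reassociation is treated as allowed only when every flag is on, i.e. when
// the umbrella 'fast' would be set.
constexpr bool current_allows_reassoc(std::uint32_t enabled) {
    return (enabled & kCurrentSet) == kCurrentSet;
}

// Proposed scheme: reassociation has its own bit and is queried directly.
constexpr bool proposed_allows_reassoc(std::uint32_t enabled) {
    return (enabled & kReassoc) != 0;
}

// Disabling only 'arcp' kills reassociation today...
static_assert(!current_allows_reassoc(kCurrentSet & ~kArcp),
              "current: dropping arcp also drops reassociation");
// ...but would leave it intact with a dedicated flag.
static_assert(proposed_allows_reassoc(kProposedSet & ~kArcp),
              "proposed: dropping arcp keeps reassociation");
```

With a dedicated bit, '-fno-associative-math' could also be honored on its own
(clear 'reassoc', keep the rest), which is the PR27372 side of the problem.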
One important aspect of this is that it appears to me there are quite a few
fast-math transformations that are enabled only when all the underlying
sub-fast-math flags are on (that is, only when the 'fast' umbrella flag is
set). That's a key part of the problem of PR27372. In this context, the
change in behavior from r297837 is that with the old behavior, the following
two commands are almost equivalent (in many cases, they are equivalent):
$ # Old behavior: The following two commands are nearly identical:
$ clang -c -O2 -ffast-math foo.cpp
$ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
$
Whereas with the new behavior (post-r297837), the following two commands are
almost always equivalent:
$ # New behavior: The following two commands are nearly identical:
$ clang -c -O2 foo.cpp
$ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
$
(Again, '-fno-reciprocal-math' is just an example of the suppression of a
sub-fast-math aspect here. '-fno-associative-math' and '-fmath-errno' would
also be good examples.)
Succinctly, if a '-ffast-math' user now disables a sub-fast-math aspect, they
will be frustrated that they end up disabling almost the entire set of
fast-math transformations. Whereas previously, they would be frustrated that
their attempt of disabling a specific sub-fast-math aspect was ineffective. So
previously, they might try to "fix a numerical instability" by disabling a
sub-fast-math aspect (and be frustrated by it not being effective), and now if
they try to "fix that numerical instability", they will succeed, but they will
see a performance-hit of losing nearly all the performance gain that
'-ffast-math' was providing.
As an aside, on the PS4 with llvm 4.0 (and earlier) compilers, we've had a few
customers frustrated that '-ffast-math -fno-reciprocal-math' was still doing
reciprocal transformations. So we've had a private change to make
'-fno-reciprocal-math' suppress the reciprocal optimization. With a vanilla
llvm 5.0, those customers would see a performance hit (so we have a different
private change to address that).
As a final point here, to give more weight to this, I took a random bit of code
I found on github that has floating-point fast-math opportunities in it,
and experimented with it. (I just searched for 'mandelbrot', and took the
first thing I found.) Specifically:
https://gist.github.com/andrejbauer/7919569
This test-case has a few divisions in it, but it doesn't contain any
reciprocal-transformation opportunities (so '-f[no-]reciprocal-math' should
essentially be a no-op).
The old Clang behavior has the following two commands being nearly identical
(they generate essentially equivalent code -- just some minor register change):
$ # Old Clang behavior:
$ # No significant difference when -fno-reciprocal-math is added (as desired)
$ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
$ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
$ diff O2fm.s O2fm.no_arcp.s | wc
4 10 56
$
That is, as expected/desired, the '-fno-reciprocal-math' switch has essentially
no impact on this, since there are no reciprocal transformations being done.
Also as expected, the difference between "plain -O2" and '-O2 -ffast-math' is
more substantial:
$ # Old Clang behavior:
$ # '-O2' vs '-O2 -ffast-math' shows a significant difference (as desired)
$ clang -S -O2 -o O2.s mandelbrot.c
$ diff O2.s O2fm.s | wc
43 184 1305
$
That is, adding '-ffast-math' to '-O2' is transforming the code, presumably
making it faster (at the cost of a potential loss in numerical accuracy).
With GCC for this example (I used version 4.8.4, which isn't particularly
modern, but I happen to have it handy), I get similar behavior. For example,
the following two commands produce identical assembly code:
$ gcc -S -O2 -ffast-math -o O2fm.s mandelbrot.c
$ gcc -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
$ diff O2fm.s O2fm.no_arcp.s
$
and that code is substantially different than the GCC "plain -O2" code:
$ gcc -S -O2 -o O2.s mandelbrot.c
$ diff O2.s O2fm.s | wc
44 126 719
$
But comparing this to the new Clang behavior, we see that
'-fno-reciprocal-math' is now "disabling too much", as discussed in detail
above for the simple "assoc.cpp" test-case. Specifically:
$ # New Clang behavior:
$ clang -S -O2 -o O2.s mandelbrot.c
$ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
$ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
$
$ # Adding -ffast-math to -O2 continues to show significant diffs (expected)
$ diff O2.s O2fm.s | wc
35 105 622
$
$ # too many differences -- should be nearly the same
$ diff O2fm.s O2fm.no_arcp.s | wc
29 89 526
$
So with the new behavior, even though there are no reciprocal transformation
opportunities, disabling that transformation via '-fno-reciprocal-math'
disables many (most) of the fast-math features. In fact, comparing plain '-O2'
with '-O2 -ffast-math -fno-reciprocal-math', it's clear that they are virtually
identical with the new Clang behavior. Specifically, we get only a minor
difference (of swapping of two register operands in a comparison, and changing
the sense of the associated branch) when comparing '-O2' with
'-O2 -ffast-math -fno-reciprocal-math':
$ # New Clang behavior:
$ # nearly identical, but there should be many diffs
$ diff O2.s O2fm.no_arcp.s
188,189c188,189
< ucomisd %xmm5, %xmm6
< ja .LBB0_7
---
> ucomisd %xmm6, %xmm5
> jb .LBB0_7
$
In full disclosure, for this "mandelbrot.c" test-case, I don't know if any of
the changes in code-gen done by us or by GCC when '-ffast-math' is enabled are
helpful (from a performance perspective) or dangerous (from a precise IEEE FP
math perspective). All I know is that for both us and GCC at -O2, the switch
'-ffast-math' changed the code-gen, and that '-ffast-math -fno-reciprocal-math'
didn't suppress any of those changes for GCC, but it suppressed essentially all
of the changes for us.
For continuity, I'm repeating the summary here (that I had near the beginning).
In Summary:
1. With the change of r297837, the driver now more cleanly handles
'-ffast-math', and other sub-fast-math switches (like
'-f[no-]reciprocal-math', '-f[no-]math-errno', and others).
2. Prior to that change, the disabling of a sub-fast-math switch was often
ineffective. So as an example, the following two commands often resulted
in the same code-gen, even if there were
fast-math-reciprocal-transformations that were done:
clang++ -O2 -ffast-math -c foo.cpp
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
3. Since that change, the disabling of a sub-fast-math switch disables many
more sub-fast-math transformations than just the one specified. So now,
the following two commands often result in very similar (and sometimes
identical) code-gen:
clang++ -O2 -c foo.cpp
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
That is, disabling a single sub-fast-math transformation in some (many?)
cases now ends up disabling almost all the fast-math transformations.
This causes a performance hit for people that have been doing this.
4. To fix this, I think that additional fast-math-flags are likely needed in
the IR. Instead of the following set:
    'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
something like this:
    'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
would be more useful. Related to this, the current 'fast' flag which acts
as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
not be needed. A discussion on this point was raised last November on the
mailing list:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
Thanks,
-Warren
Hal Finkel via llvm-dev
2017-Sep-29 01:36 UTC
[llvm-dev] Trouble when suppressing a portion of fast-math-transformations
Hi, Warren, Thanks for writing all of this up. In short, regarding your suggested solution:> 4. To fix this, I think that additional fast-math-flags are likely > needed in > > the IR. Instead of the following set: > > 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > something like this: > > 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > would be more useful. Related to this, the current 'fast' flag which acts > > as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + > 'contract') may > > not be needed. A discussion on this point was raised last November on the > > mailing list: > > http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.htmlI agree. I'm happy to help review the patches. It will be best to have only the finer-grained flags where there's no "fast" flag that implies all of the others. -Hal On 09/28/2017 07:56 PM, Ristow, Warren via llvm-dev wrote:> > Hi all, > > In a mailing-list post last November: > > http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html > > I raised some concerns that having the IR-level fast-math-flag 'fast' > act as an > > "umbrella" to implicitly turn on all the lower-level fast-math-flags, > causes > > some fundamental problems. Those fundamental problems are related to > > situations where a user wants to disable a portion of the fast-math > behavior. > > For example, to enable all the fast-math transformations except for the > > reciprocal-math transformation, a command like the following is what a > user > > would expect to work: > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > But that isn't what it's doing. > > I believe this is a serious problem, but I also want to avoid > over-stating the > > seriousness. To be explicit, the problems I'm describing here happen when > > '-ffast-math' is used with one or more of the underlying fast-math-related > > aspects _disabled_ (like the '-fno-reciprocal-math' example, above). 
> > Conversely, when '-ffast-math' is used "on its own", the situation is > fine. > > For terminology here, I'll refer to these underlying fast-math-related > aspects > > (like reciprocal-math, associative-math, math-errno, and others) as > > "sub-fast-math" aspects. > > I apologize for the length of this post. I'm putting the summary up > front, so > > that anyone interested in fast-math issues can quickly get the > big-picture of > > the issues I'm describing here. > > In Summary: > > 1. With the change of r297837, the driver now more cleanly handles > > '-ffast-math', and other sub-fast-math switches (like > > '-f[no]-reciprocal-math', '-f[no-]math-errno', and others). > > 2. Prior to that change, the disabling of a sub-fast-math switch was often > > ineffective. So as an example, the following two commands often resulted > > in the same code-gen, even if there were > > fast-math-reciprocal-transformations that were done: > > clang++ -O2 -ffast-math -c foo.cpp > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > 3. Since that change, the disabling of a sub-fast-math switch disables > many > > more sub-fast-math transformations than just the one specified. So now, > > the following two commands often result in very similar (and sometimes > > identical) code-gen: > > clang++ -O2 -c foo.cpp > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > That is, disabling a single sub-fast-math transformation in some (many?) > > cases now ends up disabling almost all the fast-math transformations. > > This causes a performance hit for people that have been doing this. > > 4. To fix this, I think that additional fast-math-flags are likely > needed in > > the IR. Instead of the following set: > > 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > something like this: > > 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract' > > would be more useful. 
Related to this, the current 'fast' flag which acts > > as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + > 'contract') may > > not be needed. A discussion on this point was raised last November on the > > mailing list: > > http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html > > TL;DR > > More details are in that thread from November, but the problem in its > entirety > > involved both back-end LLVM issues, and front-end Clang (driver) > issues. The > > LLVM issues are related to the umbrella aspect of 'fast', along with other > > fast-math-flags implementation details (described below). The front-end > > aspects in Clang are related to the driver's handling of '-ffast-math' > (which > > also had an "umbrella" aspect). The driver code has been refactored > since that > > November post, fixing the umbrella aspect of the front-end. But I > never got > > around to working on the related back-end issues (nor has anyone > else), and the > > refactored front-end now results in the back-end issues manifesting > > differently, and arguably in a worse way (details on the "worse" aspect, > > below). > > For reference, the refactored driver code was done in r297837: > > [Driver] Restructure handling of -ffast-math and similar options > > To be clear, I'm not at all suggesting that the above change was > incorrect. I > > think that refactoring of the driver code is the right thing to do. > An aspect > > of this refactoring is that prior to it, when a user passed > '-ffast-math' on > > the command-line, it was also passed to the cc1 process, even if a > > sub-fast-math component was disabled. With the refactoring, the > driver only > > passes '-ffast-math' to cc1 when a specific set of sub-fast-math > components are > > enabled. 
> > More specifically, when a user specifies just '-ffast-math' on the > > command-line, the following 7 sub-fast-math switches: > > -fno-honor-infinities > > -fno-honor-nans > > -fno-math-errno > > -fassociative-math > > -freciprocal-math > > -fno-signed-zeros > > -fno-trapping-math > > get passed to cc1 (this is true both with the old (pre r297837) and > new (since > > r297837) compilers). Furthermore, the "umbrella" '-ffast-math' is > also passed > > to cc1 in this case of the user specifying just '-ffast-math' on the > > command-line (again, in both the old and new compilers). > > The difference related to this issue in the old/new behavior, is that > when a > > user turns on fast-math but disables one (or more) of the sub-fast-math > > switches, for example, as in: > > clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp > > then in the old mode '-ffast-math' was still passed to cc1 (acting as an > > umbrella, causing trouble), but in the new mode '-ffast-math' is no longer > > passed to cc1 in this case. (In both the old and new modes, > > '-freciprocal-math' is not passed to cc1 with this command-line, as you'd > > expect.) > > What's happening is that in the old mode, it was the user passing > '-ffast-math' > > on the command-line that resulted in passing the umbrella > '-ffast-math' to cc1 > > (even if all 7 of the sub-fast-math switches were disabled by the user). > > Whereas in the new mode, the '-ffast-math' switch is passed to cc1 iff > all 7 of > > the underlying sub-fast-math switches are enabled. > > I'd say that's an improvement in the handling of the switches, and > also on the > > plus side, I think it makes dealing with the concerns I raised in > November LLVM > > a little clearer, and so more manageable in some sense. But on the > negative > > side, since the new behavior in LLVM is arguably worse, fixing the > back-end > > issues is now a higher priority for my customers. 
> > The behavior that is arguably worse, is that when a user enables > fast-math, but > > attempts to disable one of the sub-fast-math aspects, the old behavior > (pre > > r297837) was that the sub-fast-math aspect to be disabled, generally > (often?) > > remained enabled. The new behavior (since r297837) is that when > disabling a > > sub-fast-math aspect, that aspect plus many more (possibly often the > majority) > > of the fast-math transformations are disabled. So this results in a > > performance regression in these fast-math contexts when a > sub-fast-math aspect > > is disabled, which is why it is a fairly high priority for us. > > FTR, r297837 was made during llvm 5.0 development, so the new behavior > has the > > effect of a performance regression in moving from 4.0 to 5.0. In > describing > > things here, I'll compare llvm 4.0 with llvm 5.0 behavior. But more > precisely, > > it's pre-r297837 with post-r297837 behavior. > > Here is a tiny example, to illustrate it concretely: > > $ cat assoc.cpp > > //////////// "assoc.cpp" //////////// > > float foo(float a, float x) > > { > > return ((a + x) - x); // fastmath reassociation eliminates the arithmetic > > } > > ///////////////////////////////////// > > $ > > When -ffast-math is specified, the reassociation enabled by it allows > us to > > simply return the first argument (and that reassociation does happen with > > '-ffast-math', with both the old and new compilers): > > $ clang -c -O2 -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: f3 0f 58 c1 addss %xmm1, %xmm0 > > 4: f3 0f 5c c1 subss %xmm1, %xmm0 > > 8: c3 retq > > $ clang -c -O2 -ffast-math -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ > > FTR, GCC also does the reassociation transformation here when > '-ffast-math' is > > used, as expected. 
> > But when using '-ffast-math' and disabling a sub-fast-math aspect of > it (say > > via '-fno-reciprocal-math', '-fno-associative-math', or > '-fmath-errno'), both > > the old and new compilers exhibit incorrect behavior in some cases. > With the > > old compiler, the behavior was that using any of these switches did > not disable > > the transformation. Those switches were mostly ineffective. (Only > > '-fno-associative-math' should disable the transformation in this > example, so > > the fact that the other ones didn't disable it is correct/desired.) > Here is > > the old behavior for the above test-case, when some example sub-fast-math > > aspects are individually disabled: > > $ old/bin/clang --version | grep version > > clang version 4.0.0 (tags/RELEASE_400/final) > > $ old/bin/clang -c -O2 -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: f3 0f 58 c1 addss %xmm1, %xmm0 > > 4: f3 0f 5c c1 subss %xmm1, %xmm0 > > 8: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o > assoc.cpp # Error > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp > > $ llvm-objdump -d x.o | grep "^ .*: " > > 0: c3 retq > > $ > > So with the old compiler, the case marked 'Error' above is incorrect, > in that > > the reassociation should be suppressed in that case, but it isn't. > > Again FTR, the GCC behavior disables the re-association in the case marked > > 'Error' above. > > Moving on to the new compiler, instead of '-fno-associative-math' being > > ineffective, the problem is that when disabling other sub-fast-math > aspects > > (unrelated to reassociation), the transformation is suppressed, when > it should > > not be. 
> > Here is the new behavior with that same set of sub-fast-math aspects
> > individually disabled:
> >
> >   $ new/bin/clang --version | grep version
> >   clang version 5.0.0 (tags/RELEASE_500/final)
> >   $ new/bin/clang -c -O2 -o x.o assoc.cpp
> >   $ llvm-objdump -d x.o | grep "^ .*: "
> >   0: f3 0f 58 c1 addss %xmm1, %xmm0
> >   4: f3 0f 5c c1 subss %xmm1, %xmm0
> >   8: c3 retq
> >   $ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
> >   $ llvm-objdump -d x.o | grep "^ .*: "
> >   0: c3 retq
> >   $ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp # Error
> >   $ llvm-objdump -d x.o | grep "^ .*: "
> >   0: f3 0f 58 c1 addss %xmm1, %xmm0
> >   4: f3 0f 5c c1 subss %xmm1, %xmm0
> >   8: c3 retq
> >   $ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Good
> >   $ llvm-objdump -d x.o | grep "^ .*: "
> >   0: f3 0f 58 c1 addss %xmm1, %xmm0
> >   4: f3 0f 5c c1 subss %xmm1, %xmm0
> >   8: c3 retq
> >   $ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp # Error
> >   $ llvm-objdump -d x.o | grep "^ .*: "
> >   0: f3 0f 58 c1 addss %xmm1, %xmm0
> >   4: f3 0f 5c c1 subss %xmm1, %xmm0
> >   8: c3 retq
> >   $
> >
> > The two cases marked as 'Error' are incorrectly suppressing the re-association.
> > The case marked as 'Good' is now doing the right thing for this test-case.
> >
> > Again FTR, the GCC behavior allows the re-association in the cases marked
> > 'Error' above to happen.
> >
> > __________________________________________________________________
> >
> > Note that the '-f[no-]associative-math' flag has other problems, reported in
> > PR27372 (https://bugs.llvm.org/show_bug.cgi?id=27372). Those "other problems"
> > are related to the fact that there isn't an LLVM IR fast-math-flag that
> > explicitly indicates whether reassociation is enabled or disabled. As a
> > consequence, the front-end essentially drops that flag on the floor.
> > The back-end has no way of explicitly looking for that capability, and so the
> > back-end implementation instead relies on the "umbrella" aspect of 'fast'
> > implicitly turning on all the lower-level fast-math-flags. This is a key
> > aspect of the problem. Near the start of this post, I mentioned that the LLVM
> > issues are related to the umbrella aspect of 'fast', along with other
> > fast-math-flag implementation details. The fact that the back-end has no way
> > of explicitly checking whether reassociation is enabled is what I meant by
> > those other implementation details.
> >
> > Going to a more general discussion of the problem, the documentation of the
> > fast-math-flags at:
> >
> >   http://llvm.org/docs/LangRef.html#fast-math-flags
> >
> > can be described loosely as:
> >
> >   nnan      Allow optimizations to assume the arguments and result are not NaN
> >   ninf      Allow optimizations to assume the arguments and result are not +/-Inf
> >   nsz       Allow optimizations to treat the sign of a zero argument or result
> >             as insignificant
> >   arcp      Allow optimizations to use the reciprocal of an argument rather than
> >             perform division
> >   contract  Allow floating-point contraction (e.g. fused multiply-and-add)
> >
> > And the flag 'fast' is defined there as:
> >
> >   fast      Fast - Allow algebraically equivalent transformations that may
> >             dramatically change results in floating point (e.g. reassociate).
> >             This flag implies all the others.
> >
> > (Side point: Back in November, 'contract' was not an explicit fast-math-flag.
> > This is a recent change, but it doesn't impact the issue I'm raising here.)
> >
> > To summarize, and to relate this somewhat back to the November 2016 post:
> >
> >   http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
> >
> > as described in that older post, this means that 'fast' could be described as:
> >
> >   Very loosely, 'fast' means "all the aggressive FP-transformations that
> >   are not controlled by one of the other 5, plus it implies all the other
> >   5".
> >   If for terminology, we call those additional aggressive
> >   optimizations 'aggr', then we have:
> >
> >     'fast' == 'aggr' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> > But there isn't a specific flag for 'aggr' (it's just "on" when all the other
> > flags are "on"). Reassociation is part of these additional 'aggr'
> > transformations. Back in November, Hal pointed out that libm transformations
> > are another part of these 'aggr' transformations. With that, one possible
> > direction is to add two more sub-fast-math flags, say 'reassoc' and 'libm':
> >
> >   'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> > This would allow disabling (for example) 'arcp' without suppressing
> > reassociation. Whether there would be a need for an "umbrella" flag 'fast'
> > that implies all the others is somewhat orthogonal, although personally I feel
> > it complicates the issue and doesn't provide any significant benefit. I can
> > imagine that there is a benefit that I haven't thought of -- I don't claim to
> > have a deep understanding of the implementation. So I'd like to hear what
> > others think.
> >
> > One important aspect of this is that it appears to me there are quite a few
> > fast-math transformations that are enabled only when all the underlying
> > sub-fast-math flags are on (that is, only when the 'fast' umbrella flag is
> > set). That's a key part of the problem of PR27372.
> > In this context, the change in behavior from r297837 is that with the old
> > behavior, the following two commands are almost equivalent (in many cases,
> > they are equivalent):
> >
> >   $ # Old behavior: The following two commands are nearly identical:
> >   $ clang -c -O2 -ffast-math foo.cpp
> >   $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
> >   $
> >
> > Whereas with the new behavior (post-r297837), the following two commands are
> > almost always equivalent:
> >
> >   $ # New behavior: The following two commands are nearly identical:
> >   $ clang -c -O2 foo.cpp
> >   $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
> >   $
> >
> > (Again, '-fno-reciprocal-math' is just an example of the suppression of a
> > sub-fast-math aspect here. '-fno-associative-math' and '-fmath-errno' would
> > also be good examples.)
> >
> > Succinctly, if a '-ffast-math' user now disables a sub-fast-math aspect, they
> > will be frustrated that they end up disabling almost the entire set of
> > fast-math transformations. Whereas previously, they would be frustrated that
> > their attempt of disabling a specific sub-fast-math aspect was ineffective. So
> > previously, they might try to "fix a numerical instability" by disabling a
> > sub-fast-math aspect (and be frustrated by it not being effective), and now if
> > they try to "fix that numerical instability", they will succeed, but they will
> > see a performance-hit of losing nearly all the performance gain that
> > '-ffast-math' was providing.
> >
> > As an aside, on the PS4 with llvm 4.0 (and earlier) compilers, we've had a few
> > customers frustrated that '-ffast-math -fno-reciprocal-math' was still doing
> > reciprocal transformations. So we've had a private change to make
> > '-fno-reciprocal-math' suppress the reciprocal optimization. With a vanilla
> > llvm 5.0, those customers would see a performance hit (so we have a different
> > private change to address that).
> >
> > As a final point here, to give more weight to this, I took a random bit of code
> > I found on github that has floating-point fast-math opportunities in it,
> > and experimented with it. (I just searched for 'mandelbrot', and took the
> > first thing I found.) Specifically:
> >
> >   https://gist.github.com/andrejbauer/7919569
> >
> > This test-case has a few divisions in it, but it doesn't contain any
> > reciprocal-transformation opportunities (so '-f[no-]reciprocal-math' should
> > essentially be a no-op).
> >
> > The old Clang behavior has the following two commands being nearly identical
> > (they generate essentially equivalent code -- just some minor register
> > change):
> >
> >   $ # Old Clang behavior:
> >   $ # No significant difference when -fno-reciprocal-math is added (as desired)
> >   $ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
> >   $ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
> >   $ diff O2fm.s O2fm.no_arcp.s | wc
> >   4 10 56
> >   $
> >
> > That is, as expected/desired, the '-fno-reciprocal-math' switch has essentially
> > no impact on this, since there are no reciprocal transformations being done.
> >
> > Also as expected, the difference between "plain -O2" and '-O2 -ffast-math' is
> > more substantial:
> >
> >   $ # Old Clang behavior:
> >   $ # '-O2' vs '-O2 -ffast-math' shows a significant difference (as desired)
> >   $ clang -S -O2 -o O2.s mandelbrot.c
> >   $ diff O2.s O2fm.s | wc
> >   43 184 1305
> >   $
> >
> > That is, adding '-ffast-math' to '-O2' is transforming the code, presumably
> > making it faster (at the cost of a potential loss in numerical accuracy).
> >
> > With GCC for this example (I used version 4.8.4, which isn't particularly
> > modern, but I happen to have it handy), I get similar behavior.
> > For example, the following two commands produce identical assembly code:
> >
> >   $ gcc -S -O2 -ffast-math -o O2fm.s mandelbrot.c
> >   $ gcc -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
> >   $ diff O2fm.s O2fm.no_arcp.s
> >   $
> >
> > and that code is substantially different than the GCC "plain -O2" code:
> >
> >   $ gcc -S -O2 -o O2.s mandelbrot.c
> >   $ diff O2.s O2fm.s | wc
> >   44 126 719
> >   $
> >
> > But comparing this to the new Clang behavior, we see that
> > '-fno-reciprocal-math' is now "disabling too much", as discussed in detail
> > above for the simple "assoc.cpp" test-case. Specifically:
> >
> >   $ # New Clang behavior:
> >   $ clang -S -O2 -o O2.s mandelbrot.c
> >   $ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
> >   $ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
> >   $
> >   $ # Adding -ffast-math to -O2 continues to show significant diffs (expected)
> >   $ diff O2.s O2fm.s | wc
> >   35 105 622
> >   $
> >   $ # too many differences -- should be nearly the same
> >   $ diff O2fm.s O2fm.no_arcp.s | wc
> >   29 89 526
> >   $
> >
> > So with the new behavior, even though there are no reciprocal transformation
> > opportunities, disabling that transformation via '-fno-reciprocal-math'
> > disables many (most) of the fast-math features. In fact, comparing plain '-O2'
> > with '-O2 -ffast-math -fno-reciprocal-math', it's clear that they are virtually
> > identical with the new Clang behavior.
> > Specifically, we get only a minor difference (of swapping of two register
> > operands in a comparison, and changing the sense of the associated branch)
> > when comparing '-O2' with '-O2 -ffast-math -fno-reciprocal-math':
> >
> >   $ # New Clang behavior:
> >   $ # nearly identical, but there should be many diffs
> >   $ diff O2.s O2fm.no_arcp.s
> >   188,189c188,189
> >   < ucomisd %xmm5, %xmm6
> >   < ja .LBB0_7
> >   ---
> >   > ucomisd %xmm6, %xmm5
> >   > jb .LBB0_7
> >   $
> >
> > In full disclosure, for this "mandelbrot.c" test-case, I don't know if any of
> > the changes in code-gen done by us or by GCC when '-ffast-math' is enabled are
> > helpful (from a performance perspective) or dangerous (from a precise IEEE FP
> > math perspective). All I know is that for both us and GCC at -O2, the switch
> > '-ffast-math' changed the code-gen, and that '-ffast-math -fno-reciprocal-math'
> > didn't suppress any of those changes for GCC, but it suppressed essentially all
> > of the changes for us.
> >
> > For continuity, I'm repeating the summary here (that I had near the beginning).
> >
> > In Summary:
> >
> > 1. With the change of r297837, the driver now more cleanly handles
> >    '-ffast-math', and other sub-fast-math switches (like
> >    '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).
> >
> > 2. Prior to that change, the disabling of a sub-fast-math switch was often
> >    ineffective. So as an example, the following two commands often resulted
> >    in the same code-gen, even if there were
> >    fast-math-reciprocal-transformations that were done:
> >
> >      clang++ -O2 -ffast-math -c foo.cpp
> >      clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
> >
> > 3. Since that change, the disabling of a sub-fast-math switch disables many
> >    more sub-fast-math transformations than just the one specified.
> >    So now, the following two commands often result in very similar (and
> >    sometimes identical) code-gen:
> >
> >      clang++ -O2 -c foo.cpp
> >      clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
> >
> >    That is, disabling a single sub-fast-math transformation in some (many?)
> >    cases now ends up disabling almost all the fast-math transformations.
> >    This causes a performance hit for people that have been doing this.
> >
> > 4. To fix this, I think that additional fast-math-flags are likely needed in
> >    the IR. Instead of the following set:
> >
> >      'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> >    something like this:
> >
> >      'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
> >
> >    would be more useful. Related to this, the current 'fast' flag which acts
> >    as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
> >    not be needed. A discussion on this point was raised last November on the
> >    mailing list:
> >
> >      http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
> >
> > Thanks,
> > -Warren
> >
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Ristow, Warren via llvm-dev
2017-Sep-30 02:16 UTC
[llvm-dev] Trouble when suppressing a portion of fast-math-transformations
Hi Hal,

>> 4. To fix this, I think that additional fast-math-flags are likely
>> needed in the IR. Instead of the following set:
>>
>> 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>>
>> something like this:
>>
>> 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>>
>> would be more useful. Related to this, the current 'fast' flag which acts
>> as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
>> not be needed. A discussion on this point was raised last November on the
>> mailing list:
>>
>> http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
>
> I agree. I'm happy to help review the patches. It will be best to have
> only the finer-grained flags where there's no "fast" flag that implies
> all of the others.

Thanks for the quick response, and for the willingness to review. I won't let
this languish so long, like the post from last November.

Happy to hear that you feel it's best not to have the umbrella "fast" flag.

Thanks again,
-Warren