Kaylor, Andrew via llvm-dev
2019-Sep-16 23:57 UTC
[llvm-dev] Handling of FP denormal values
Hi all,

While reviewing a recent clang documentation change, I became aware of an issue with the way that clang is handling FP denormals. There is currently some support for variations in the way denormals are handled, but it isn't consistent across architectures and generally feels kind of half-baked. I'd like to discuss possible solutions to this problem.

First, there is a clang command line option:

    -fdenormal-fp-math=<arg>

    Select which denormal numbers the code is permitted to require.

    Valid values are: ieee, preserve-sign, and positive-zero, which
    correspond to IEEE 754 denormal numbers, the sign of a flushed-to-zero
    number is preserved in the sign of 0, denormals are flushed to positive
    zero, respectively.

A quick survey of the code leads me to believe this has no effect for targets other than ARM. For X86 targets we may want different options. I'll say more about that below. The wording of the documentation is sufficiently ambiguous that I'm not entirely certain whether it is intended to control the target hardware or just the optimizer.

In addition, when either -Ofast or -ffast-math is used, we attempt to link 'crtfastmath.o' if it can be found. For X86 targets, this object file adds a static constructor that sets the DAZ and FTZ bits of the MXCSR register. I expect that it has analogous behavior for other architectures when it is available. This object file is typically available on Linux systems, possibly also with things like MinGW. If it isn't found, the denormal control flags will be left in their default state.

There is also a CUDA-specific option, -f[no-]cuda-flush-denormals-to-zero. I don't know how this is implemented, but the documentation says it is specific to CUDA device mode.

Finally, there is an OpenCL-specific option, -cl-denorms-are-zero. Again, I don't know how it is implemented.

So.... I'd like to talk about how we can corral all of this into some interface that is consistent (or at least consistently sensible) across architectures.
The problems I see are:

1. -fdenormal-fp-math needs to handle all scenarios needed by all architectures (or needs to be limited to a common subset).

2. -fdenormal-fp-math needs to be reconciled with -ffast-math and its variants.

3. -fdenormal-fp-math needs to be consistent about whether or not it imposes hardware changes when applicable.

I can only really speak to X86, so I'll say a few words about that to start the discussion.

The current choices for -fdenormal-fp-math are: ieee, preserve-sign, and positive-zero. With X86, you get ieee behavior if neither DAZ nor FTZ is set. If FTZ is set you get 'preserve sign' behavior -- i.e. denormal results are flushed to zero and the sign of the result is kept. There is no way to get 'positive zero' behavior with X86. At the hardware level, modern X86 processors have separate controls for ftz (results are flushed to zero) and daz (inputs are flushed to zero before calculations), but I doubt that they are used independently often enough to distinguish them at the command line option level.

Also, any X87 instructions that happen to be generated (such as if the code contains 'long double' data on Linux) will ignore the ftz and daz settings. There are some early Pentium 4 processors that don't support 'daz' but I hope we can safely ignore that fact.

Linking in crtfastmath.o when -Ofast or -ffast-math is used is consistent with GCC's behavior. However, it implicitly ignores -fdenormal-fp-math, which GCC doesn't have. In most cases if a user sets a fast math option they probably also want DAZ and FTZ, but there might be some reason why an advanced user would want to treat them separately. This can be done with intrinsics, of course, but if we have an option to control it, we should respect that option. Also, it is possible to construct fast math behavior cafeteria-style (i.e. setting some fast math flags and not others) so we should probably have a way to add ftz behaviors a la carte.
FWIW, ICC sets the FTZ and DAZ flags from a function call that is inserted into main depending on the options used to compile the file containing main.

Trying to go back to the general case, I'd like to solicit information about whether other targets have/need different denormal options than are described above. Further, I'd suggest that for any architecture that supports FTZ behavior, a well-documented default be automatically set when fast math is enabled via -Ofast, -ffast-math, or -funsafe-math-optimizations, unless that option is turned off by a subsequent -fno-fast-math/-fno-unsafe-math-optimizations option or overridden by a subsequent -fdenormal-fp-math option, and that if -fdenormal-fp-math is used, some code be emitted to set the relevant hardware controls.

I don't have a strong opinion on whether it is better to emit a static constructor or to inject a call into main. The latter seems more predictable. I'd like to avoid a dependency on crtfastmath.o either way.

Do we need an ftz fast-math flag?

Are there any other facets to this problem that I've overlooked?

Thanks,
Andy
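To make the three -fdenormal-fp-math values concrete, here is a minimal software model in C of what each one does to a single float. It is only a sketch: the enum and function names are hypothetical (they are not clang or LLVM identifiers), and it models the value semantics described in the message above rather than any hardware control register.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical names mirroring the three -fdenormal-fp-math values. */
enum denormal_mode {
    DENORMAL_IEEE,
    DENORMAL_PRESERVE_SIGN,
    DENORMAL_POSITIVE_ZERO
};

/* A float is denormal if it is nonzero and smaller in magnitude than
   FLT_MIN, the smallest normal value. NaN fails every comparison, so it
   is never classified as denormal here. */
static int is_denormal(float x)
{
    return x != 0.0f && x > -FLT_MIN && x < FLT_MIN;
}

/* Software model: returns the value a computation would observe for x
   under the given mode. */
float apply_denormal_mode(float x, enum denormal_mode mode)
{
    assert(mode == DENORMAL_IEEE || mode == DENORMAL_PRESERVE_SIGN ||
           mode == DENORMAL_POSITIVE_ZERO);
    if (mode == DENORMAL_IEEE || !is_denormal(x))
        return x;                        /* value passes through unchanged */
    if (mode == DENORMAL_PRESERVE_SIGN)
        return x < 0.0f ? -0.0f : 0.0f;  /* zero with the sign of x */
    return 0.0f;                         /* always +0.0 */
}
```

Under this model a negative denormal such as -1e-40f is unchanged in ieee mode, becomes -0.0 in preserve-sign mode, and becomes +0.0 in positive-zero mode. As noted above, X86 hardware can provide the first two behaviors but not the third.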
Cameron McInally via llvm-dev
2019-Sep-17 00:58 UTC
[llvm-dev] [cfe-dev] Handling of FP denormal values
On Mon, Sep 16, 2019 at 7:58 PM Kaylor, Andrew via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> [...]
>
> I don't have a strong opinion on whether it is better to emit a static
> constructor or to inject a call into main. The latter seems more
> predictable. I’d like to avoid a dependency on crtfastmath.o either way.

I would like to see it called from .init_array (or equivalent) with the highest init_priority. That way, dynamic initializers get the benefit too. If we're requesting DAZ+FTZ on the command line, there's no need for a slow start-up.

Digressing a bit, but I don't like how some implementations of crtfastmath.o clear all the flags while setting the DAZ+FTZ flags (e.g. AArch64). Seems unnecessary and makes its position on the link line significant.

> Do we need an ftz fast-math flag?
>
> Are there any other facets to this problem that I've overlooked?
>
> Thanks,
> Andy
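The two points in this reply (run early, and do not clobber unrelated flags) can be sketched in C for X86. This is an illustration under stated assumptions, not the contents of any real crtfastmath.o: it uses the GCC/Clang `constructor` attribute with priority 101 (the highest priority open to user code, since lower values are reserved) as a stand-in for "highest init_priority", and the function names are hypothetical. The MXCSR bit positions (DAZ is bit 6, FTZ is bit 15) are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>
#define HAVE_MXCSR 1
#endif

#define MXCSR_DAZ UINT32_C(0x0040) /* bit 6: inputs treated as zero */
#define MXCSR_FTZ UINT32_C(0x8000) /* bit 15: results flushed to zero */

/* Read-modify-write: only the two denormal bits change, so rounding
   mode and exception masks set by earlier code survive. */
uint32_t mxcsr_enable_ftz_daz(uint32_t csr)
{
    return csr | MXCSR_FTZ | MXCSR_DAZ;
}

/* Runs from .init_array before default-priority constructors, so
   dynamic initializers also see the flushed mode. On targets without
   MXCSR this sketch does nothing. */
__attribute__((constructor(101)))
static void enable_ftz_daz_early(void)
{
#ifdef HAVE_MXCSR
    _mm_setcsr(mxcsr_enable_ftz_daz(_mm_getcsr()));
    assert((_mm_getcsr() & (MXCSR_FTZ | MXCSR_DAZ)) ==
           (MXCSR_FTZ | MXCSR_DAZ));
#endif
}
```

Because the update ORs bits into the current register value instead of writing a fixed constant, the result does not depend on where this object sits on the link line relative to other code that adjusts the same register.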
Matt Arsenault via llvm-dev
2019-Sep-17 01:43 UTC
[llvm-dev] Handling of FP denormal values
> On Sep 16, 2019, at 19:57, Kaylor, Andrew via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Do we need an ftz fast-math flag?

This would be useful for matching a handful of AMDGPU instructions (an fmad that always flushes being the most important). We have a dedicated intrinsic to allow flushing in this case when denormals are enabled.

> Are there any other facets to this problem that I've overlooked?

For AMDGPU we need to split -denormal-fp-math into per-FP type flags (and the corresponding IR attribute). The denorm mode register has separate fields for f32, and f64+f16. The default for each of these is different depending on the subtarget/language combination. Mostly we want f64+f16 to always be on, and only change the f32 mode. The current naming implies changing all of the modes.

The different sign-of-zero modes that exist now aren’t available. There are however separate flags for enabling flushing on input and output. This isn’t particularly important, and currently we just set both bits at the same time, but it might be something to think about if this is being expanded.

-Matt
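One way to picture the per-type split described above is a small C data structure with one setting for f32 and one shared by f64 and f16, each with independent input and output flush flags. Everything here is hypothetical and for illustration only: the type and function names are invented, and the layout is not the actual AMDGPU mode register encoding or an existing LLVM attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-type setting: inputs and outputs flush independently. */
struct denorm_setting {
    bool flush_input;
    bool flush_output;
};

/* Hypothetical mode: one field for f32, one shared by f64 and f16,
   matching the grouping described in the message. */
struct denorm_mode {
    struct denorm_setting f32;
    struct denorm_setting f64_f16;
};

enum fp_type { TYPE_F16, TYPE_F32, TYPE_F64 };

/* Look up the setting that governs operations on a given FP type. */
struct denorm_setting denorm_setting_for(const struct denorm_mode *m,
                                         enum fp_type t)
{
    assert(m != NULL);
    return t == TYPE_F32 ? m->f32 : m->f64_f16;
}

/* The common configuration mentioned above: denormals stay enabled for
   f64 and f16, and only f32 flushes (both bits set together). */
struct denorm_mode flush_f32_only(void)
{
    struct denorm_mode m = {
        .f32     = { .flush_input = true,  .flush_output = true  },
        .f64_f16 = { .flush_input = false, .flush_output = false },
    };
    return m;
}
```

A single command-line value such as preserve-sign cannot express this configuration, which is the naming problem the message points out.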
Cameron McInally via llvm-dev
2019-Sep-17 15:07 UTC
[llvm-dev] [cfe-dev] Handling of FP denormal values
On Mon, Sep 16, 2019 at 9:43 PM Matt Arsenault via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> > On Sep 16, 2019, at 19:57, Kaylor, Andrew via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > Do we need an ftz fast-math flag?
>
> This would be useful for matching a handful of AMDGPU instructions (a fmad
> that only always flushes being the most important). We have a dedicated
> intrinsic to allow flushing in this case when denormals are enabled

+1

For FTZ/DAZ, we're currently getting cases like this incorrect:

    %add = fadd nnan ninf nsz float %a, 0.000000e+00

That cannot be safely optimized to 'a' with FTZ/DAZ enabled. Admittedly, there's only a small chance of problems, since a following FP operation would normalize it, but here be dragons.

> > Are there any other facets to this problem that I've overlooked?
>
> For AMDGPU we need to split -denormal-fp-math into per-FP type flags (and
> the corresponding IR attribute). The denorm mode register has separate
> fields for f32, and f64+f16. The default for each of these is different
> depending on the subtarget/language combination. Mostly we want f64+f16 to
> always be on, and only change the f32 mode. The current naming implies
> changing all of the modes.
>
> The different sign of 0 modes as exist now aren’t available. There are
> however separate flags for enabling flushing on input and output. This
> isn’t particular important, and currently we just set both bits at the same
> time but it might be something to think about if this is being expanded.

At the command-line level, I don't see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop.
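The fadd case above can be checked with a small software model in C. This is a sketch under assumptions: `flush` and `fadd_ftz_daz` are hypothetical helpers that emulate DAZ (inputs flushed) and FTZ (result flushed) in plain IEEE arithmetic, so the example does not depend on the hardware mode of the machine running it.

```c
#include <assert.h>
#include <float.h>

/* Sign-preserving flush of denormals, modeling FTZ/DAZ. */
static float flush(float x)
{
    if (x != 0.0f && x > -FLT_MIN && x < FLT_MIN)
        return x < 0.0f ? -0.0f : 0.0f;
    return x;
}

/* Model of 'fadd a, b' with DAZ applied to the inputs and FTZ applied
   to the result. */
float fadd_ftz_daz(float a, float b)
{
    return flush(flush(a) + flush(b));
}

/* Counterexample to folding 'fadd a, 0.0' to 'a': for a denormal
   input the unfolded add yields zero, while the fold yields a. */
void check_fold_counterexample(void)
{
    float a = 1e-40f; /* below FLT_MIN, so denormal */
    assert(fadd_ftz_daz(a, 0.0f) == 0.0f);
    assert(fadd_ftz_daz(a, 0.0f) != a);
    /* For a normal input the fold is harmless. */
    assert(fadd_ftz_daz(1.5f, 0.0f) == 1.5f);
}
```

The fold changes the observable value only when 'a' is denormal and the result is consumed without passing through another flushing FP operation (for example, stored or bit-cast), which is why the chance of problems is small but not zero.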