Hello,

at the recent EuroLLVM developer meeting in Bristol I held a BoF session on the topic "Towards implementing #pragma STDC FENV_ACCESS". I've also had a number of follow-on discussions since, both on-site in Bristol and online. This post is intended as a summary of my current understanding of the set of requirements and implementation details covering the overall topic.

I'm posting this here in the hope that it can serve as a basis for the various more detailed discussions that are still ongoing (e.g. in various Phabricator proposals right now). Any comments are welcome!


Semantics of #pragma STDC FENV_ACCESS
=====================================

To provide a baseline for the implementation discussion, here is first an overview of the features required to handle the strict floating-point mode defined by the C and IEEE standards:

1. Floating-point rounding modes
2. Default floating-point exception handling
3. Trapping floating-point exception handling

Each of these features imposes different constraints on the optimizations that LLVM may perform involving FP expressions:

1. Floating-point rounding modes

Outside of FENV_ACCESS regions, all FP operations are supposed to be performed in the "default" rounding mode.

But inside FENV_ACCESS regions, FP operations implicitly depend on a "current" rounding mode setting, which may be changed by certain C library calls (plus some platform-specific intrinsics). In addition, those calls may be performed within subroutines (as long as those are also within FENV_ACCESS), so *any* function call within a FENV_ACCESS region must be considered as potentially changing the rounding mode.

In effect, this means the compiler may not move or combine FP operations across function call sites.

2. Default floating-point exception handling

Inside FENV_ACCESS regions, every floating-point operation that causes an exception must be considered to set a "status flag" associated with this exception type.
Those flags can be queried using C library calls (plus some platform-specific intrinsics), and there are other such calls to explicitly set or clear those flags as well. As with the rounding modes, those calls may be performed in subroutines as well, so any function call within a FENV_ACCESS region must be considered as potentially *using* and changing the floating-point exception status flags.

The values of the status flags on entry to a FENV_ACCESS region are to be considered undefined according to the C standard.

Compiler optimizations are supposed to preserve the values of all exception status bits at any point where they can (potentially) be inspected by the program, i.e. at all call sites within FENV_ACCESS regions. This still allows a number of optimizations, e.g. reordering FP operations or combining two identical operations within a region uninterrupted by calls. But other optimizations should be avoided: e.g. optimizing away an unused FP operation may leave an exception flag unset that would otherwise have been set. The same applies to floating-point constant folding.

3. Trapping floating-point exception handling

Within a FENV_ACCESS region, library calls may be used to switch exception handling semantics to a "trapping" mode by setting the corresponding mask bits. Any subsequent FP instruction that raises an exception with the associated mask bit set will cause a trap. Usually, this will be a hardware trap that is translated by the operating system into some form of software exception that can be handled by the application; on Linux systems this takes the form of a SIGFPE signal.

As above, those mask bits can be set and reset via (operating-system specific) library calls and/or platform-specific intrinsics, all of which may also be done within subroutine calls.

In effect, this requires the compiler to treat any floating-point operation within a FENV_ACCESS region as potentially trapping, which means the same restrictions apply as with e.g. memory accesses (they cannot be speculated, etc.). However, according to the C standard, the implementation is not required to preserve the *number* of different traps, so identical operations may still be combined (unless there is an intervening function call).

The C standard requires all user code to explicitly switch back to non-trapping mode for all exceptions whenever leaving a FENV_ACCESS region (both by "falling off the end" of the region and by calling a subroutine defined outside of FENV_ACCESS).


Implementation requirements on parts of the compiler
====================================================

A. clang front end

The front end needs to determine which instructions are part of FENV_ACCESS regions and which are not. This takes into account both the semantics of the #pragma as defined by the standard, and the implementation-defined default rules that apply to code outside of any #pragma. GCC currently has the following two related command-line options:

  -frounding-math: Do not assume default rounding mode
  -ftrapping-math: Assume FP operations may trap

clang accepts but (basically) ignores those options. As a first step, it might make sense to have the FENV_ACCESS default behavior triggered by these options, even while the front end does not yet support the actual #pragma.

The front end then needs to transmit the information about FENV_ACCESS regions to later passes. However, I believe that we do not actually have to implement "regions" as such at the IR level. Instead, it would be sufficient to track the following information:

- For each FP operation, whether it is within a FENV_ACCESS region.
- For each call site, whether it is within a FENV_ACCESS region.

The former requires new IR support; the approach currently under investigation uses the experimental "constrained FP" intrinsics instead of traditional floating-point operations for this. The latter can be done simply by annotating those call sites with an attribute.
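To make the two per-instruction bits above concrete, here is a minimal sketch, in C, of the legality rules an optimizer could derive from them, following the semantics described earlier. The mini-IR (`inst`, `inst_kind`) and the predicates `may_reorder`, `may_delete_if_unused` and `may_speculate` are hypothetical names invented for illustration, not LLVM APIs. The sketch models only FP-environment dependencies and ignores ordinary data and memory dependencies; the rule that a strict FP operation also stays on its side of a non-strict call is a conservative assumption of the sketch, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mini-IR, for illustration only (not an LLVM API). Each
 * instruction carries the single bit the front end would record: whether
 * it lies inside a FENV_ACCESS region ("strict"). */
typedef enum { INST_FP_OP, INST_CALL } inst_kind;

typedef struct {
    inst_kind kind;
    bool strict; /* FP op or call site inside a FENV_ACCESS region */
} inst;

static bool is_fp(const inst *i) { return i->kind == INST_FP_OP; }
static bool is_call(const inst *i) { return i->kind == INST_CALL; }
static bool is_strict_call(const inst *i) { return is_call(i) && i->strict; }

/* May two adjacent instructions a and b be swapped, considering only
 * FP-environment dependencies? */
static bool may_reorder(const inst *a, const inst *b) {
    assert(a != NULL && b != NULL);
    /* Calls are never reordered with each other in this sketch. */
    if (is_call(a) && is_call(b))
        return false;
    /* No FP op, strict or standard, may cross a strict call site: the call
     * may change the rounding mode, read or clear status flags, or
     * unmask traps. */
    if ((is_fp(a) && is_strict_call(b)) || (is_fp(b) && is_strict_call(a)))
        return false;
    /* Conservative assumption of this sketch: a strict FP op does not cross
     * a non-strict call either (e.g. after code is mixed via inlining). */
    if ((is_fp(a) && a->strict && is_call(b)) ||
        (is_fp(b) && b->strict && is_call(a)))
        return false;
    /* Everything else may be reordered, including two FP ops with no call
     * in between (status flags are sticky and the number of traps need not
     * be preserved) and a standard FP op next to a non-strict call. */
    return true;
}

/* May an unused instruction be deleted (or constant-folded away)? A strict
 * FP op may set a status flag that a later call inspects. */
static bool may_delete_if_unused(const inst *i) {
    return is_fp(i) && !i->strict;
}

/* May the instruction be executed speculatively (e.g. via a select)? A
 * strict FP op may trap once an exception has been unmasked. */
static bool may_speculate(const inst *i) {
    return is_fp(i) && !i->strict;
}
```

For example, `may_reorder` returns false for a standard FP operation next to a strict call site, which is the "new" type of dependency discussed under B below. Keeping the information as two bits per instruction, rather than as explicit regions, mirrors the idea that regions need not exist as such at the IR level.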
In addition to that, the front end itself needs to disable any early optimizations that do not preserve strict FP semantics; in particular, it must not speculate FP operations if they may trap. (Currently, the front end transforms "? :" on floating-point types into a select IR statement; for trapping FP operations, an explicit branch must be used instead.)


B. LLVM IR and LLVM common optimizations

As mentioned in the previous section, we need some IR to annotate FP instructions and call sites within FENV_ACCESS regions. All common optimizations then need to respect the strict FP semantics associated with those regions.

The current approach uses experimental intrinsics. This has the advantage that most optimizations never trigger, since they don't even recognize those new intrinsics. Also, the intrinsics can be marked as having side effects and/or being non-speculatable.

The overall effect is that more optimizations are suppressed than would be strictly necessary. But this may still be a good first step, since the result is now safe but maybe not optimal; this can be improved upon over time by teaching optimization passes the specific semantics of those intrinsics.

However, some open questions remain. If at some point we want to model the constrained FP semantics more precisely than just as "unmodeled side effects", this may have to be reflected at the IR level directly. For example, to model rounding mode behavior, at some point we might require explicit tracking of data dependencies on the rounding mode by representing the rounding mode as SSA values defined by function calls and used by FP intrinsics. Similarly, to track exception status flags, they might be modeled as SSA values set by FP intrinsics and used by function calls.

(There is a possibly related question of how to optimally model the property of many math library routines that they may access the "errno" variable but no other memory ... It might also be possible to model e.g. exception status as a thread-local "memory" location that is modified by FP operations, just like errno.)

Another currently unresolved issue is that at the moment nothing prevents *standard* floating-point operations from being moved *inside* FENV_ACCESS regions. This may also be invalid, since those operations may now cause unexpected traps etc. (More specifically, what is invalid is moving any standard FP operation across a *call site* within a FENV_ACCESS region.) Note that this is an issue even if we only support changing the default (and no actual #pragma), if multiple object files using different default settings are linked together using LTO.

This last issue could in theory be solved by having all optimization passes respect the requirement that floating-point operations may not be moved across call sites marked with the strict FP attribute. But that does not appear to be straightforward, since it would introduce a "new" type of dependency that would have to be added throughout LLVM code. If this must be avoided, we'd have to find a way to explicitly track dependencies at the IR level. In the extreme, this could end up equivalent to just always using the constrained intrinsics for everything ...


C. Code generation

In the back end, effects of strict FP mode have to be passed through to lower-level representations, including SelectionDAG and MI.

Currently, the "unmodeled side effect" logic of the constrained intrinsics is modeled by putting them on the chain in the SelectionDAG. (If we ever model semantics more precisely at the IR level, that would need to be reflected in the SelectionDAG accordingly.)

At the MI level, there is no representation at all. One option to fix this would be to model target-specific registers that implement the IEEE semantics.
Most platforms have registers (or parts of registers) that hold:
- the current rounding mode
- the exception status flags
- the exception masks (which enable traps)
Marking FP instructions as using and/or defining these registers would enforce ordering requirements. It may be too strict in some cases (e.g. two instructions setting exception status flags may still be reordered). On the other hand, I believe that if instructions may actually *trap*, we need the hasSideEffects flag even if register dependencies are modeled.

If we do need hasSideEffects, there is a separate discussion on whether this can be implemented without each back end having to duplicate all FP instruction patterns (one with hasSideEffects and one without), e.g. by a new feature that allows describing the side-effect status using an MI operand.


Next steps
==========

I believe it is important to break up the full amount of work into incremental steps that provide some useful benefits on their own. At first, we should be able to get to a state where clang can be used to build programs that use some (maybe not all) strict FP features, where the generated code is always correct but may not always be optimal. To get there, I think we need at a minimum:

- Implement clang support for the default flags, e.g. GCC's -frounding-math and -ftrapping-math, and then always generate the constrained intrinsics. clang should also mark all call sites (as mentioned above).

- For now, add the requirement that LTO is not supported if this would cause mixing of strict and non-strict FP code. Alternatively, have the LTO pass automatically transform any floating-point operation into a constrained intrinsic if *any* (other) module already uses the latter.

- At the IR level, complete the set of supported constrained FP intrinsics (there are still some missing, see e.g. https://reviews.llvm.org/D43515). Also, it seems not all variants (e.g. for vector types) are supported correctly through codegen (see e.g. https://reviews.llvm.org/D46967).

- Allow targets to correctly reflect constrained intrinsics semantics at the MI level and in final machine code generation (see e.g. https://reviews.llvm.org/D45576).

- Review all optimization and codegen passes to verify they fully respect strict FP semantics.

Once this is done, we can improve on the solution by:

- Supporting mixing strict and non-strict FP operations (this would lift the LTO restriction). (Note: there still seems to be some "invention required" here, see above.)

- Actually implementing the #pragma, supporting different regions within a compilation unit (prereq: support for mixing strict and non-strict FP operations).

- Adding more optimization of constrained FP intrinsics in common optimizers and/or target back ends.

Does this look reasonable? Please let me know if there's anything I overlooked, or if you have any additional comments or questions.


Best Regards

Ulrich Weigand

-- 
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU/Linux compilers and toolchain
IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz | Management: Dirk Wittkopp
Registered office: Böblingen | Register court: Amtsgericht Stuttgart, HRB 243294
Hi Ulrich,

I am interested in knowing if the current proposals also take into account the FP_CONTRACT pragma and the ability to implement options that imply a specific value for the FLT_EVAL_METHOD macro.

Additionally, I am not aware of the IR being able to represent the potentially deferred loss of precision that the C language semantics provide; in particular, applying such semantics to the existing IR would run into the issue that the limits of such deferment would need an agreed representation.

As for the mixing of strict and non-strict modes, I would be interested in where LLVM is in its handling of non-SSA (pseudo-memory?) dependencies. I have a vague impression that it is very coarse-grained in that respect, but I admit to not being particularly informed in that space. If there is a good model for such dependencies, then I think it could be used to handle the strict/non-strict mixing.

-- Hubert Tong, IBM

PS A nitpick on wording: The idea of being inside or outside of FENV_ACCESS regions is instead expressed in terms of the state of the FENV_ACCESS pragma within the C Standard.

On Wed, May 23, 2018 at 10:48 AM, Ulrich Weigand via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
On 05/23/2018 11:06 AM, Hubert Tong via llvm-dev wrote:
> Hi Ulrich,
>
> I am interested in knowing if the current proposals also take into
> account the FP_CONTRACT pragma

We should already do this (we turn relevant operations into calls to the @llvm.fmuladd intrinsic when FP_CONTRACT is set to on during IR generation).

> and the ability to implement options that imply a specific value for
> the FLT_EVAL_METHOD macro.

What do you mean by this?

 -Hal

> [...]
> > In addition to that, the front-end itself needs to disable any > early optimizations that do not preserve strict FP semantics, > in particular it must not speculate FP operations if they may > trap. (Currently, the front end transforms "? :" on floating- > point types into a select IR statement; for trapping FP > operations, an explicit branch must be used instead.) > > > B. LLVM IR and LLVM common optimizations > > As mentioned in the previous section, we need some IR to annotate > FP instructions and call sites within FENV_ACCESS regions. All > common optimizations then need to respect the strict FP semantics > associated with those regions. > > The current approach uses experimental intrinsics. This has the > advantage that most optimizations never trigger since they don't > even recognize those new intrinsics. Also, the intrinsics can > be marked as having side-effects and/or being non-speculatable. > > The overall effect is that more optimizations are suppressed > than would be strictly necessary. But this may still be a good > first step, since the result is now safe but maybe not optimal > -- which can be improved upon over time by teaching the specific > semantics of those intrinsics to optimization passes. > > However, some open questions remain. If at some point we want > to model the constrained FP semantics more precisely than just > as "unmodeled side effects", this may have to be reflected at > the IR level directly. For example, to model rounding mode > behavior, at some point we might require explicit tracking of > data dependencies on the rounding mode by representing the > rounding mode as SSA values defined by function calls and used > by FP intrinsics. Similarly, to track exception status flags, > they might be modeled as SSA values set by FP intrinsics and > used by function calls. 
> > (There is a possibly related question of how to optimally model > the property of many math library routines that they may access > the "errno" variable but no other memory ... It might also be > possible to model e.g. exception status as a thread-local "memory" > location that is modified by FP operations, just like errno.) > > Another currently unresolved issue is that at the moment nothing > prevents *standard* floating-point operations from being moved > *inside* FENV_ACCESS regions. This may also be invalid, since > those operations now may cause unexpected traps etc. (More > specifically, what is invalid is moving any standard FP operation > across a *call site* within a FENV_ACCESS region.) Note that > this is even an issue if we only support changing the default > (and no actual #pragma) if mutiple object files using different > default settings are being linked together using LTO. > > This last issue could in theory be solved by having all optimization > passes respect the requirement that floating-point operations may > not be moved across call sites marked with the strict FP attribute. > But that does not appear to be straightforward since it would > introduce a "new" type of dependeny that would have to be added > throughout LLVM code. If this must be avoided, we'd have to > find a way to explicity track dependencies at the IR level. In > the extreme, this could end up equivalent to just always using > the constrained intrinsics for everything ... > > > C. Code generation > > In the back end, effects of strict FP mode have to passed through > to lower-level representations including SelectionDAG and MI. > > Currently, the "unmodeled side effect" logic of the constrained > intrinsics is modeled by putting them on the chain during > SelectionDAG. > (If we ever model semantics more precisely at the IR level, that > would need to be reflected on SelectionDAG accordingly.) > > At the MI level, there is no representation at all. 
One option to > fix this would be to model target-specific registers that implement > the IEEE semantics. Most platforms have registers (or parts of > registers) that hold: > - the current rounding mode > - the exception status flags > - the exception masks (which enable traps) > Marking FP instructions as using and/or defining these registers > would enforce ordering requirements. It may be too strict in some > cases (e.g. two instructions setting exception status flags may > still be reordered). On the other hand, I believe if instructions > may actually *trap*, we actually need the hasSideEffects flag even > if register dependencies are modeled. > > If we do need hasSideEffects, there is a separate discussion on > whether this can be implemented without each back end having to > duplicate all FP instruction patterns (one with hasSideEffects > and one without), e.g. by having a new feature that allows to > describe the side-effect status using an MI operand. > > > Next steps > =========> > I believe it is important to break up the full amount of work > into incremental steps that provide some useful benefits on their > own. At first, we should be able to get to a state where clang > can be used to build programs that use some (maybe not all) strict > FP features, where the generated code is always correct but may > not always be optimal. To get there, I think we need at a > minimum: > > - Implement clang support for the default flags, e.g. GCC's > -frounding-math and -ftrapping-math, and generate always > the constrained intrinsics. clang should also mark all > call sites then (as mentioned above). > > - For now, add the requirement that LTO is not supported if > this would cause mixing of strict and non-strict FP code. > In the alternative, have the LTO pass automatically transform > and floating-point operation into a constrained intrinsic > if *any* (other) module already uses the latter. 
> > - At the IR level, complete the set of supported constrained > FP intrinsics (there are still some missing, see e.g > https://reviews.llvm.org/D43515 <https://reviews.llvm.org/D43515>). > Also, it seems not all variants (e.g. for vector types) are > supported correctly through codegen (see e.g. > https://reviews.llvm.org/D46967 <https://reviews.llvm.org/D46967>). > > - Allow targets to correctly reflect constrained intrinsics > semantics at the MI level and final machine code generation > (see e.g. https://reviews.llvm.org/D45576 > <https://reviews.llvm.org/D45576>). > > - Review all optimization and codegen passes to verify they > fully respect strict FP semantics. > > Once this is done, we can improve on the solution by: > > - Supporting mixing strict and non-strict FP operations > (would lift the LTO restriction). (Note: there seems > to be still some "invention required" here, see above.) > > - Actually implementing the #pragma supporting different > regions within a compilation unit (prereq: support for > mixing strict and non-strict FP operations). > > - Add more optimization of constrained FP intrinsics in > common optimizers and/or target back ends. > > Does this look reasonable? Please let me know if there's > anything I overlooked, or you have any additional comments > or questions. > > > > Mit freundlichen Gruessen / Best Regards > > Ulrich Weigand > > -- > Dr. 
Ulrich Weigand | Phone: +49-7031/16-3727 > STSM, GNU/Linux compilers and toolchain > IBM Deutschland Research & Development GmbH > Vorsitzende des Aufsichtsrats: Martina Koederitz | > Geschäftsführung: Dirk Wittkopp > Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht > Stuttgart, HRB 243294 > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > > > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180523/430e68b2/attachment-0001.html>
Hi Hubert,
I had not really thought about FP_CONTRACT.  As Hal mentioned, LLVM uses
FP_CONTRACT only to allow use of floating-point multiply-and-add, and this
would remain valid even when FENV_ACCESS is on.
However, it seems that having FP_CONTRACT on would also allow constant
folding to be performed even when FENV_ACCESS is on.  This is what you're
referring to, right?  This should certainly be considered as an option when
implementing clang front-end support for FENV_ACCESS.
I have thought even less about using different values for FLT_EVAL_METHOD.
However, one concern we found with this in the past (on s390x GCC) is that
changing that value on an existing target may have ABI impact: glibc header
files choose the definition of float_t and double_t based on the current
value of FLT_EVAL_METHOD, and changing those types could result in an ABI
break for applications using them in interfaces.
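To illustrate the ABI concern, here is a minimal C sketch (the struct and function names are hypothetical). C99 ties `float_t` and `double_t` from `<math.h>` to FLT_EVAL_METHOD: for 0 they are `float` and `double`, for 1 both are `double`, for 2 both are `long double`, and other values are implementation-defined. Any interface type built from them therefore changes size when the macro's value changes.

```c
#include <float.h>
#include <math.h>

/* Hypothetical interface struct: its layout depends on FLT_EVAL_METHOD
   because the sizes of float_t and double_t do. */
struct sample_interface {
    float_t x;
    double_t y;
};

/* Returns 1 if float_t and double_t have the types C99 prescribes for the
   current FLT_EVAL_METHOD, or if the value is one that C99 leaves
   implementation-defined (nothing to check). Returns 0 otherwise. */
static int eval_types_match_method(void)
{
#if FLT_EVAL_METHOD == 0
    return sizeof(float_t) == sizeof(float) &&
           sizeof(double_t) == sizeof(double);
#elif FLT_EVAL_METHOD == 1
    return sizeof(float_t) == sizeof(double) &&
           sizeof(double_t) == sizeof(double);
#elif FLT_EVAL_METHOD == 2
    return sizeof(float_t) == sizeof(long double) &&
           sizeof(double_t) == sizeof(long double);
#else
    return 1;
#endif
}
```

On x86-64 with SSE math FLT_EVAL_METHOD is typically 0, while 32-bit x86 builds using the x87 unit typically report 2, so `struct sample_interface` would have a different layout in the two configurations.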
Finding a good model for non-SSA dependencies is indeed the main problem
here.  See e.g. the recent discussion here https://reviews.llvm.org/D45576
about possibly modeling FP status flags as "memory" at the MI level.  I
guess it might be possible to do the same at the IR level, but I'm less
familiar with that part of LLVM.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
  Dr. Ulrich Weigand | Phone: +49-7031/16-3727
  STSM, GNU/Linux compilers and toolchain
  IBM Deutschland Research & Development GmbH
  Chair of the Supervisory Board: Martina Koederitz | Management: Dirk Wittkopp
  Registered office: Böblingen | Commercial register: Amtsgericht Stuttgart, HRB 243294
From:	Hubert Tong <hubert.reinterpretcast at gmail.com>
To:	Ulrich Weigand <Ulrich.Weigand at de.ibm.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>, Ahmed Bougacha
            <abougacha at apple.com>, Kevin Neal <Kevin.Neal at sas.com>
Date:	23.05.2018 18:09
Subject:	Re: [llvm-dev] Update on strict FP status
Hi Ulrich,
I am interested in knowing if the current proposals also take into account
the FP_CONTRACT pragma and the ability to implement options that imply a
specific value for the FLT_EVAL_METHOD macro.
Additionally, I am not aware of the IR being able to represent the
potentially deferred loss of precision that the C language semantics
provide; in particular, applying such semantics to the existing IR would
run into the issue that the limits of such deferment would need an
agreed-upon representation.
As for the mixing of strict and non-strict modes, I would be interested in
where LLVM is in its handling of non-SSA (pseudo-memory?) dependencies. I
have a vague impression that it is very coarse-grained in that respect, but
I admit to not being particularly informed in that space. If there is a
good model for such dependencies, then I think it could be used to handle
the strict/non-strict mixing.
-- Hubert Tong, IBM
PS A nitpick on wording: within the C Standard, the idea of being inside or
outside of FENV_ACCESS regions is instead expressed in terms of the state of
the FENV_ACCESS pragma.
On Wed, May 23, 2018 at 10:48 AM, Ulrich Weigand via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
  Hello,
  at the recent EuroLLVM developer meeting in Bristol I held a BoF
  session on the topic "Towards implementing #pragma STDC FENV_ACCESS".
  I've also had a number of follow-on discussions both on-site in
  Bristol and online since. This post is intended as a summary of
  my current understanding of the set of requirements and implementation
  details covering the overall topic.
  I'm posting this here in the hope this can serve as a basis for
  the various more detailed discussions that are still ongoing
  (e.g. in various Phabricator proposals right now). Any comments
  are welcome!
  Semantics of #pragma STDC FENV_ACCESS
  ====================================
  To provide a baseline for the implementation discussion, first an
  overview of the features required to handle the strict floating-point
  mode defined by the C and IEEE standards:
  1. Floating-point rounding modes
  2. Default floating-point exception handling
  3. Trapping floating-point exception handling
  Each of these separate features imposes different constraints on the
  optimizations that LLVM may perform involving FP expressions:
  1. Floating-point rounding modes
  Outside of FENV_ACCESS regions, all FP operations are supposed to be
  performed in the "default" rounding mode.
  But inside FENV_ACCESS regions, FP operations implicitly depend on
  a "current" rounding mode setting, which may be changed by certain
  C library calls (plus some platform-specific intrinsics). In addition,
  those calls may be performed within subroutines (as long as those are
  also within FENV_ACCESS), so *any* function call within a FENV_ACCESS
  must be considered as potentially changing the rounding mode.
  In effect, this means the compiler may not move or combine FP
  operations across function call sites.
  2. Default floating-point exception handling
  Inside FENV_ACCESS regions, every floating-point operation that
  causes an exception must be considered to set a "status flag"
  associated with this exception type. Those flags can be queried
  using C library calls (plus some platform-specific intrinsics),
  and there are other such calls to explicitly set or clear those
  flags as well. As with the rounding modes, those calls may be
  performed in subroutines as well, so any function call within a
  FENV_ACCESS region must be considered as potentially *using* and
  changing the floating-point exception status flags.
  The values of the status flags on entry to a FENV_ACCESS region are
  to be considered unspecified according to the C standard.
  Compiler optimizations are supposed to preserve the values of
  all exception status bits at any point where they can be
  (potentially) inspected by the program, i.e. at all call sites
  within FENV_ACCESS regions. This still allows a number of
  optimizations, e.g. to reorder FP operations or combine two
  identical operations within a region uninterrupted by calls.
  But other optimizations should be avoided, e.g. optimizing
  away an unused FP operation may result in an exception flag
  now being unset that would otherwise have been set. The same
  applies to floating-point constant folding.
  3. Trapping floating-point exception handling
  Within a FENV_ACCESS region, library calls may be used to switch
  exception handling semantics to a "trapping" mode by setting
  corresponding mask bits. Any subsequent FP instruction that
  raises an exception with the associated mask bit set will cause
  a trap. Usually, this will be a hardware trap that is translated
  by the operating system into some form of software exception that
  can be handled by the application; on Linux systems this takes the
  form of a SIGFPE signal.
  As above, those mask bits can be set and reset via
  (operating-system-specific) library calls and/or platform-specific intrinsics,
  all of which may also be done within subroutine calls.
  In effect, this requires the compiler to treat any floating-point
  operation within a FENV_ACCESS region as potentially trapping,
  which means the same restrictions apply as with e.g. memory accesses
  (cannot be speculated, etc.). However, according to the C standard,
  the implementation is not required to preserve the *number* of
  different traps, so identical operations may still be combined
  (unless there is an intervening function call).
  The C standard requires all user code to explicitly switch back
  to non-trapping mode for all exceptions whenever leaving a
  FENV_ACCESS region (both by "falling off the end" of the region
  and by calling a subroutine defined outside of FENV_ACCESS).
  Implementation requirements on parts of the compiler
  ===================================================
  A. clang front end
  The front end needs to determine which instructions are part of
  FENV_ACCESS regions and which are not. This takes into account
  both the semantics of the #pragma as defined by the standard,
  and the implementation-defined default rules that apply to code
  outside of any #pragma. GCC currently has the following two
  related command-line options:
  -frounding-math: Do not assume default rounding mode
  -ftrapping-math: Assume FP operations may trap
  clang accepts but (basically) ignores those options. As a first
  step, it might make sense to have the FENV_ACCESS default
  behavior triggered by these options, even while the front end
  does not yet support the actual #pragma.
  The front end then needs to transmit the information about
  FENV_ACCESS regions to later passes. However, I believe that
  we do not actually have to implement "regions" as such at the
  IR level. Instead, it would be sufficient to track the following
  information:
  - For each FP operation, whether it is within a FENV_ACCESS region.
  - For each call site, whether it is within a FENV_ACCESS region.
  The former requires new IR support; the approach currently under
  investigation uses the experimental "constrained FP" intrinsics
  instead of traditional floating-point operations for this. The
  latter can be done simply by annotating those call sites with an
  attribute.
  In addition to that, the front-end itself needs to disable any
  early optimizations that do not preserve strict FP semantics,
  in particular it must not speculate FP operations if they may
  trap. (Currently, the front end transforms "? :" on floating-point
  types into a select IR instruction; for trapping FP
  operations, an explicit branch must be used instead.)
  B. LLVM IR and LLVM common optimizations
  As mentioned in the previous section, we need some IR to annotate
  FP instructions and call sites within FENV_ACCESS regions. All
  common optimizations then need to respect the strict FP semantics
  associated with those regions.
  The current approach uses experimental intrinsics. This has the
  advantage that most optimizations never trigger since they don't
  even recognize those new intrinsics. Also, the intrinsics can
  be marked as having side-effects and/or being non-speculatable.
  The overall effect is that more optimizations are suppressed
  than would be strictly necessary. But this may still be a good
  first step, since the result is now safe but maybe not optimal
  -- which can be improved upon over time by teaching the specific
  semantics of those intrinsics to optimization passes.
  However, some open questions remain. If at some point we want
  to model the constrained FP semantics more precisely than just
  as "unmodeled side effects", this may have to be reflected at
  the IR level directly. For example, to model rounding mode
  behavior, at some point we might require explicit tracking of
  data dependencies on the rounding mode by representing the
  rounding mode as SSA values defined by function calls and used
  by FP intrinsics. Similarly, to track exception status flags,
  they might be modeled as SSA values set by FP intrinsics and
  used by function calls.
  (There is a possibly related question of how to optimally model
  the property of many math library routines that they may access
  the "errno" variable but no other memory ... It might also be
  possible to model e.g. exception status as a thread-local "memory"
  location that is modified by FP operations, just like errno.)
  Another currently unresolved issue is that at the moment nothing
  prevents *standard* floating-point operations from being moved
  *inside* FENV_ACCESS regions. This may also be invalid, since
  those operations now may cause unexpected traps etc. (More
  specifically, what is invalid is moving any standard FP operation
  across a *call site* within a FENV_ACCESS region.) Note that
  this is an issue even if we only support changing the default
  (and not the actual #pragma), namely when multiple object files using
  different default settings are linked together using LTO.
  This last issue could in theory be solved by having all optimization
  passes respect the requirement that floating-point operations may
  not be moved across call sites marked with the strict FP attribute.
  But that does not appear to be straightforward since it would
  introduce a "new" type of dependency that would have to be added
  throughout LLVM code. If this must be avoided, we'd have to
  find a way to explicitly track dependencies at the IR level. In
  the extreme, this could end up equivalent to just always using
  the constrained intrinsics for everything ...
  C. Code generation
  In the back end, effects of strict FP mode have to be passed through
  to lower-level representations including SelectionDAG and MI.
  Currently, the "unmodeled side effect" logic of the constrained
  intrinsics is modeled by putting them on the chain during SelectionDAG.
  (If we ever model semantics more precisely at the IR level, that
  would need to be reflected on SelectionDAG accordingly.)
  At the MI level, there is no representation at all. One option to
  fix this would be to model target-specific registers that implement
  the IEEE semantics. Most platforms have registers (or parts of
  registers) that hold:
  - the current rounding mode
  - the exception status flags
  - the exception masks (which enable traps)
  Marking FP instructions as using and/or defining these registers
  would enforce ordering requirements. It may be too strict in some
  cases (e.g. two instructions that only set exception status flags could
  still legitimately be reordered). On the other hand, I believe that if
  instructions may actually *trap*, we need the hasSideEffects flag even
  if register dependencies are modeled.
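As one concrete instance, x86 SSE keeps all three pieces of state in the 32-bit MXCSR register: bits 0-5 are the sticky status flags, bits 7-12 the exception masks (a set bit suppresses the trap), and bits 13-14 the rounding control. The decoder below is a portable sketch of that layout with hypothetical struct and function names; on x86 the live value could be read with `_mm_getcsr()` from `<xmmintrin.h>`, but the sketch takes the value as a parameter so it runs anywhere.

```c
#include <stdint.h>

/* Fields of the x86 MXCSR register relevant to strict FP semantics. */
typedef struct {
    unsigned status_flags;     /* bits 0-5: IE DE ZE OE UE PE (sticky)     */
    unsigned exception_masks;  /* bits 7-12: same order, set = no trap     */
    unsigned rounding;         /* bits 13-14: 0 nearest, 1 down, 2 up,
                                  3 toward zero                            */
} mxcsr_fields;

static mxcsr_fields decode_mxcsr(uint32_t csr)
{
    mxcsr_fields f;
    f.status_flags    = csr & 0x3Fu;
    f.exception_masks = (csr >> 7) & 0x3Fu;
    f.rounding        = (csr >> 13) & 0x3u;
    return f;
}

/* An exception of kind flag_bit (0 = invalid ... 5 = precision) traps
   only if the corresponding mask bit is clear. */
static int would_trap(uint32_t csr, unsigned flag_bit)
{
    mxcsr_fields f = decode_mxcsr(csr);
    return ((f.exception_masks >> flag_bit) & 1u) == 0;
}
```

Under this model an FP instruction reads the rounding-control bits, writes the status bits, and consults the mask bits; the last point is what makes it potentially trapping.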
  If we do need hasSideEffects, there is a separate discussion on
  whether this can be implemented without each back end having to
  duplicate all FP instruction patterns (one with hasSideEffects
  and one without), e.g. by having a new feature that allows
  describing the side-effect status using an MI operand.
  Next steps
  =========
  I believe it is important to break up the full amount of work
  into incremental steps that provide some useful benefits on their
  own. At first, we should be able to get to a state where clang
  can be used to build programs that use some (maybe not all) strict
  FP features, where the generated code is always correct but may
  not always be optimal. To get there, I think we need at a
  minimum:
  - Implement clang support for the default flags, e.g. GCC's
  -frounding-math and -ftrapping-math, and always generate
  the constrained intrinsics. clang should then also mark all
  call sites (as mentioned above).
  - For now, add the requirement that LTO is not supported if
  this would cause mixing of strict and non-strict FP code.
  Alternatively, have the LTO pass automatically transform
  any floating-point operation into a constrained intrinsic
  if *any* (other) module already uses the latter.
  - At the IR level, complete the set of supported constrained
  FP intrinsics (there are still some missing, see e.g.
  https://reviews.llvm.org/D43515).
  Also, it seems not all variants (e.g. for vector types) are
  supported correctly through codegen (see e.g.
  https://reviews.llvm.org/D46967).
  - Allow targets to correctly reflect constrained intrinsics
  semantics at the MI level and final machine code generation
  (see e.g. https://reviews.llvm.org/D45576).
  - Review all optimization and codegen passes to verify they
  fully respect strict FP semantics.
  Once this is done, we can improve on the solution by:
  - Supporting mixing strict and non-strict FP operations
  (would lift the LTO restriction). (Note: there seems
  to be still some "invention required" here, see above.)
  - Actually implementing the #pragma supporting different
  regions within a compilation unit (prereq: support for
  mixing strict and non-strict FP operations).
  - Adding more optimizations of constrained FP intrinsics in
  common optimizers and/or target back ends.
  Does this look reasonable? Please let me know if there's
  anything I overlooked, or you have any additional comments
  or questions.
  Mit freundlichen Gruessen / Best Regards
  Ulrich Weigand
  _______________________________________________
  LLVM Developers mailing list
  llvm-dev at lists.llvm.org
  http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev