Sjoerd Meijer via llvm-dev
2017-Dec-04 14:44 UTC
[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
Hi,

I am working on C/C++ language support for the Armv8.2-A half-precision instructions. I've added support for _Float16 as a new source language type to Clang. _Float16 is a C11 extension type for which arithmetic is well defined, as opposed to e.g. __fp16, which is a storage-only type. I then fixed up the AArch64 backend, which was mostly straightforward: this involved making operations on f16 legal when FullFP16 is supported, thus avoiding promotions to f32. This enables generation of AArch64 FP16 instructions from C/C++.

For AArch64, this work is finished and does not show problems in our testing; Solid Sands provided us with beta versions of their FP16 extension to SuperTest, their C/C++ language conformance test suite. However, as more testing can always be done, and there are not a lot of code bases using _Float16, I would be interested in more testing/feedback.

This RFC is thus a quick status update on the AArch64 implementation, but is mainly about the AArch32 implementation in the ARM backend, which is a lot more interesting than AArch64 for a number of reasons. Most importantly, there is no soft-float ABI for AArch64 and it has half-precision H-registers, which is all very different for AArch32. So it's the different combinations (soft float, softfp with FP support but argument passing in integer registers, hard float, hard float with FP16, and hard float with FullFP16) that make things interesting.

My AArch32 implementation in the ARM backend is nearly complete and I am working on fixing a handful of regression tests (the WIP diff can be found here: https://reviews.llvm.org/D38315). My approach to handling f16 types should not lead to any codegen differences for existing tests, but the way half types are handled and legalized is totally different in some cases, and from that point of view the changes could be considered intrusive.
Thus, this is a heads up, and below I will discuss the approach and some implementation decisions, for which feedback is welcome of course.

Half-Precision RegisterClass
-----------------------------------------

Half-precision values sit in the least-significant 16 bits of the single-precision registers. I.e., each instruction that generates an FP16 result writes it to the bottom 16 bits of the associated 32-bit floating-point register, and the top 16 bits of that register are written to 0. I've added a new HPR half-precision register class. This new HPR register class is an exact copy of SPR, but it avoids adding f16 and f32 type information to the existing rules, which would be necessary if we added f16 to the SPR register class.

Calling Conventions
-----------------------------

For the soft float case,
- half-precision values are returned in the least significant 16 bits of r0,
- half-precision arguments are widened to 4 bytes, as if each had been copied to the least significant bits of a 32-bit register and the remaining bits filled with unspecified values.

That's why for CC_ARM_AAPCS, f16 arguments and return values are bitconverted to i16 types. I then had to make some changes to the lowering of the formal arguments and create an i16 truncate, followed by an f16 bitcast, in order not to interpret the high 16 bits of the i32 argument values.

For the hard float case and CC_ARM_AAPCS_VFP this is straightforward: like f32 values, f16 values are passed in the S-registers and no further changes are required.

f16 Type Legalization
------------------------------

The HPR register class and f16 type are added as a legal type when:
- FullFP16 is enabled, which means support for the Armv8.2-A FP16 instructions,
- FP16 is enabled, which means support for the f16 <-> f32 conversion instructions, which are a VFP3 extension and part of VFP4.

It's obvious why f16 is legal for the former case, but the latter is perhaps the more interesting/intrusive change.
Making f16 legal when only FP16 is supported results in f16 LOADs/STOREs, for which we don't have instructions. So the approach is to custom lower f16 LOAD/STORE nodes (see next section). The reasons to make f16 legal when only FP16 is supported are:
- It avoids very early legalization/combining of f16 arguments to f32 types, which would again interpret the higher 16 bits in 32-bit registers, and that would be wrong. Instead of trying to undo this early legalization/combining, I found this approach easier and cleaner.
- As a consequence, the isel DAGs are in a more 'normal form'. I.e., they rely less on the funny nodes FP_TO_FP16 and FP16_TO_FP, which are funny because they perform float up/down converts and produce i32 values by moving from/to integer and float registers. Instead, FP_EXTEND and FP_ROUND nodes will be introduced, so this is more a clean up rather than e.g. addressing a correctness issue. Unfortunately I found that I can't completely get rid of the FP16_TO_FP nodes, see 'Custom Lowering' below.
- When these FP_EXTEND and FP_ROUND nodes are introduced by the legalizer, and we don't have the FP16 conversion instructions available, they will be custom lowered to the EABI calls h2f and f2h.

Custom Lowering
-------------------------

Making f16 legal and not having native load/store instructions available (no FullFP16 support) means custom lowering loads/stores:
1) Since we don't have FP16 load/store instructions available, we create integer half-word loads. I unfortunately need the FP16_TO_FP node here, because that "models" creating an integer value, which is what we need to create a "truncating i16" integer load instruction. Instead of using FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation of stack loads/stores, which I don't want.
2) Custom lowering f16 stores is very similar, and creates truncating half-word integer stores.

Relation with __fp16
-----------------------------

Usage of __fp16 results in IR that e.g. loads half-type values as i16 integer types, and uses "llvm.convert.from.fp16.f32" to promote them to single-precision values before any data processing is done. Now that f16 is legal, the i16 loads and converts are combined/legalized into f16 operations, and FP_EXTENDs and FP_ROUNDs are introduced when necessary, thus avoiding FP16_TO_FP and FP_TO_FP16. Lowering works as described above. So LOADs and STOREs are custom lowered, which is perhaps what you expect given that __fp16 is a storage-only type.
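The soft-float argument handling described in the message above (an i16 truncate followed by an f16 bitcast, so that the unspecified upper 16 bits of the 32-bit register are never interpreted) can be illustrated with a small software model. This is only a sketch in portable C++, not code from the ARM backend or from D38315: the function names halfBitsToFloat, truncateArgToHalfBits and receiveHalfArg are hypothetical, and the conversion is a plain IEEE 754 binary16 decode standing in for an f16 -> f32 extend.

```cpp
#include <cmath>
#include <cstdint>

// Decode IEEE 754 binary16 bits into a float. This is a software stand-in
// for an f16 -> f32 extend; it is not how the backend performs the convert.
float halfBitsToFloat(std::uint16_t h) {
    const std::uint32_t sign = (h >> 15) & 0x1u;
    const std::uint32_t exp  = (h >> 10) & 0x1Fu;
    const std::uint32_t frac = h & 0x3FFu;
    float value;
    if (exp == 0) {
        // Zero or subnormal: frac * 2^-24.
        value = std::ldexp(static_cast<float>(frac), -24);
    } else if (exp == 31) {
        // Infinity or NaN.
        value = frac ? NAN : INFINITY;
    } else {
        // Normal number: (1024 + frac) * 2^(exp - 25).
        value = std::ldexp(static_cast<float>(frac + 1024u),
                           static_cast<int>(exp) - 25);
    }
    return sign ? -value : value;
}

// Hypothetical model of the "i16 truncate" step: keep only the low 16 bits
// of the 32-bit argument register, dropping the unspecified upper half.
std::uint16_t truncateArgToHalfBits(std::uint32_t reg) {
    return static_cast<std::uint16_t>(reg);
}

// Hypothetical model of a soft-float callee reading an f16 argument:
// truncate first, then reinterpret the 16 bits as a half-precision value.
float receiveHalfArg(std::uint32_t reg) {
    return halfBitsToFloat(truncateArgToHalfBits(reg));
}
```

In this model, receiveHalfArg(0x00003C00) and receiveHalfArg(0xDEAD3C00) both yield 1.0f, because whatever the caller left in the upper half of the register is discarded before the bits are interpreted as a half-precision value.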
Friedman, Eli via llvm-dev
2017-Dec-04 20:20 UTC
[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:
>
> Custom Lowering
> -------------------------
>
> Making f16 legal and not having native load/stores instructions available,
> (no FullFP16 support) means custom lowering loads/stores:
> 1) Since we don't have FP16 load/store instructions available, we create
> integer half-word loads. I unfortunately need the FP16_TO_FP node here,
> because that "models" creating an integer value, which is what we need
> to create a "truncating i16" integer load instructions. Instead, of using
> FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
> to stack loads/stores which I don't want.
> 2) Custom lowering f16 stores is very similar, and creates truncating
> half-word integer stores.

Technically, there are no f16 load/store instructions, yes, but we can use NEON vld1 and vst1 to get something roughly equivalent, right?

You probably want to custom-lower BITCAST instructions; the generic sequence emitted by the legalizer is pretty inefficient in most cases.

---

Overall, I think your approach makes sense.

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Sjoerd Meijer via llvm-dev
2017-Dec-06 08:32 UTC
[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
Thanks a lot for the suggestions!

I will look into using vld1/vst1, sounds good.

I am custom lowering the bitcasts; to avoid inefficient code generation, that is now the only place where FP_TO_FP16 and FP16_TO_FP nodes are created. I will double check whether I can achieve the same without using these nodes (because I really would like to get rid of them completely).

Cheers,
Sjoerd.

>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:
>>
>> Custom Lowering
>> -------------------------
>>
>> Making f16 legal and not having native load/stores instructions available,
>> (no FullFP16 support) means custom lowering loads/stores:
>> 1) Since we don't have FP16 load/store instructions available, we create
>> integer half-word loads. I unfortunately need the FP16_TO_FP node here,
>> because that "models" creating an integer value, which is what we need
>> to create a "truncating i16" integer load instructions. Instead, of using
>> FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
>> to stack loads/stores which I don't want.
>> 2) Custom lowering f16 stores is very similar, and creates truncating
>> half-word integer stores.
>
>Technically, there are no f16 load/store instructions, yes, but we can
>use NEON vld1 and vst1 to get something roughly equivalent, right?
>
>You probably want to custom-lower BITCAST instructions; the generic
>sequence emitted by the legalizer is pretty inefficient in most cases.
>
>---
>
>Overall, I think your approach makes sense.
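The half-word integer loads and truncating half-word stores discussed in this thread can be modelled at the bit level as follows. This is a rough sketch under stated assumptions, not the lowering code itself: loadHalfWord and storeHalfWord are hypothetical names, a 32-bit integer stands in for a core register, and the model says nothing about how the bits subsequently move to or from a floating-point register.

```cpp
#include <cstdint>
#include <cstring>

// Model of an integer half-word load: read exactly 16 bits from memory and
// zero-extend them into a 32-bit "register". The loaded bits are the raw
// f16 encoding; no floating-point conversion happens here.
std::uint32_t loadHalfWord(const unsigned char *addr) {
    std::uint16_t bits;
    std::memcpy(&bits, addr, sizeof bits);
    return bits;
}

// Model of a truncating half-word store: only the low 16 bits of the
// 32-bit "register" are written, so neighbouring memory stays untouched.
void storeHalfWord(unsigned char *addr, std::uint32_t reg) {
    const std::uint16_t bits = static_cast<std::uint16_t>(reg);
    std::memcpy(addr, &bits, sizeof bits);
}
```

Storing a register that holds 0xDEAD3C00 and loading it back gives 0x3C00 in this model, and the two bytes that follow the stored half-word keep their previous contents.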