Sjoerd Meijer via llvm-dev
2017-Dec-06 08:32 UTC
[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good. I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16 and FP16_TO_FP nodes are created to avoid inefficient code generation. I will double check if I can't achieve the same without using these nodes (because I really would like to get completely rid of them). Cheers, Sjoerd.>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote: >> >> Custom Lowering >> ------------------------- >> >> Making f16 legal and not having native load/stores instructions available, >> (no FullFP16 support) means custom lowering loads/stores: >> 1) Since we don't have FP16 load/store instructions available, we create >> integer half-word loads. I unfortunately need the FP16_TO_FP node here, >> because that "models" creating an integer value, which is what we need >> to create a "truncating i16" integer load instructions. Instead, of >> using >> FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation >> to stack loads/stores which I don't want. >> 2) Custom lowering f16 stores is very similar, and creates truncating >> half-word integer stores. > >Technically, there are no f16 load/store instructions, yes, but we can >use NEON vdl1 and vst1 to get something roughly equivalent, right? > >You probably want to custom-lower BITCAST instructions; the generic >sequence emitted by the legalizer is pretty inefficient in most cases. > >--- > >Overall, I think your approach makes sense.IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20171206/7155a07c/attachment.html>
Sjoerd Meijer via llvm-dev
2018-Jan-18 13:58 UTC
[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
I would like to revive this thread, as I am struggling a lot with the FP16 implementation in the ARM backend. My implementation in https://reviews.llvm.org/D38315 is finished (except one case), but a more robust alternative implementation was suggested. One can indeed argue that my current implementation is a bit fragile, because it involves manually patching up the isel dags for a few cases. The suggestion was to look into CCState and adjusting of the calling convention lowering, inspired by a recent discussion on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html. The benefit of this approach is that I would get most of legalization for free, which is the fragile bit in my approach at the moment. Anyway, it's become a long story, but (almost) in chronological order this is what I've considered. Any hints, tips, suggestions welcome. Goal of my exercise: -------------------- 1) Add match rules for some Armv8.2-A FP16 instructions, e.g.: fsub, fadd, and also some conversion instructions. 2) Don't regress the "storage-only cases", i.e., when we only have the conversion instructions available. The Approach: ------------- 1) First, we make f16 legal only when we have FullFP16 support (i.e. when Armv8.2-A FP16 is supported), so in ARMISelLowering.cpp we add: if (Subtarget->hasFullFP16()) { addRegisterClass(MVT::f16, &ARM::HPRRegClass); } 2) This is the first implementation decision, I introduce a new register class HPR, which is an exact copy of the single-precision register class SPR, except that it holds f16 types: def HPR : RegisterClass<"ARM", [f16], 32, (sequence "S%u", 0, 31)> { ... I think this makes sense because half-types sit in the lower 16 bits of the S-registers, and the reason to create a separate HPR register class is to avoid typing of all rules that currently use SPR, because if we would allow both f16 and f32 in SPR, then there's a choice and rules need to be typed. 3) Next, add match rules for some Armv8.2-A FP16 instructions. Instruction VADDH, which has already a tablegen description but not yet a match rules, simply becomes: def VADDH : AHbI<0b11100, 0b11, 0, 0, (outs HPR:$Sd), (ins HPR:$Sn, HPR:$Sm), IIC_fpALU16, "vadd", ".f16\t$Sd, $Sn, $Sm", [(set HPR:$Sd, (fadd HPR:$Sn, HPR:$Sm))]>, // <~~~ new match rule using HPR This is straightforward business so far, but I already learned the hard way that the conversion are the tricky ones, so I repeat this for an f16 -> f32 upconvert: def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins HPR:$Sm), IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm", [(set SPR:$Sd, (fpextend HPR:$Sm))]>, // <~~~~ new match rule using HPR and SPR Requires<[HasFP16]>, Sched<[WriteFPCVT]>; This new rule matches and fp_extend that consumes an HPR and produces a result in SPR. And that's essentially it and all the code changes I want to make. With these changes, I want to see an fadd with two f16 opernands being codegen'd, and also I want to keep the existing tests passing of course. The problem: ------------ The problem appears to be that introducing HPR looks like a good idea, but it gives problems when FullFP16 is not supported. This is best illustrated with this existing test which is a simple upconvert of f16 to f32: define float @test_extend32(half* %addr) { %val16 = load half, half* %addr %val32 = fpext half %val16 to float ret float %val32 } It should generate this code:: ldrh r0, [r0] ; integer half word load vmov s0, r0 vcvtb.f32.f16 s0, s0 vmov r0, s0 bx lr when we don't have the Armv8.2-A FP16 instructions available, and thus only have the conversion instructions. The problem is in the conversion rules, some rewrite rules to be more specific, and I think this is one of the culprits: def : Pat<(f16_to_fp GPR:$a), (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>; This rewrite rule is supposed to first move the GPR reg in to a S-registers: vmov s0, r0 and then to the conversion: vcvtb.f32.f16 s0, s0 This rewrite rule gets triggered because the ISEL DAG has indeed this funny node f16_to_fp (which models this register move and the float-convert): t0: ch = EntryToken t2: i32,ch = CopyFromReg t0, Register:i32 %0 t16: i32,ch = load<LD2[%addr], zext from i16> t0, t2, undef:i32 t12: f32 = fp16_to_fp t16 <~~~~~~~~ FUNNY NODE HERE t7: i32 = bitcast t12 t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7 t10: ch = ARMISD::RET_FLAG t9, Register:i32 %r0, t9:1 And yes, to make it even funnier, this node has an i32 operand, and that's because we do the half-float load with an integer load instruction. And after this rewrite, we end up with this DAG: t0: ch = EntryToken t2: i32,ch = CopyFromReg t0, Register:i32 %0 t16: i32,ch = t2LDRHi12<Mem:LD2[%addr]> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0 t20: f16 = COPY_TO_REGCLASS t16, TargetConstant:i32<1> <~~~~~~~~~~~~~ PROBLEM HERE t12: f32 = VCVTBHS t20, TargetConstant:i32<14>, Register:i32 %noreg t7: i32 = VMOVRS t12, TargetConstant:i32<14>, Register:i32 %noreg t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7 t10: ch = tBX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %r0, t9, t9:1 It goes all wrong from here, because a f16 value is produced and it is not a legal type etc. The reason that this happens must be because VCVTBHS that is used in the f16_to_fp rewrite rule, is specified to consume a f16 value, so the CopyToReg moves from GPR to an S-Registers: def : Pat<(f16_to_fp GPR:$a), (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>; Conclusion: ----------- This approach is not going to work: - because there's rewrite rule X which generates instructions Y that use both the SPR and HPR descriptions. - Y can be instruction selected when FullFP16 is enabled, this works and no problems (and we don't need X). - but X is used when FullFP16 is not supported, and generates Y that uses HPRs which is not legal. Introducing a new HPR register class is a possibility though, but then special care needs to be taken in a few cases. This is the first alternative I started exploring. Alternative 1 -------------- This is the approach I took in D38315, which made f16 also legal for the storage-only case, but which I am ready to abandon: - Define the HPR register class, - Use this and make f16 legal also in the storage-only case, -- This works for the storage-only case, because data processing instruction that have f16 operands will get promoted and fp_extend and fp_round nodes will be introduced. Then I manually patch up the f16 loads and stores and a few other corner cases by manually lowering them. This however is a fragile approach, which was rightfully commented by Oli, also after being inspired by a recent discussion about some backends doing their own CCState and function argument lowering. This is Alternative 3. -- Another crucial aspect and benefit of this approach, is that this fixes AAPCS compliance issues: while half-precision arguments sit in single-precision registers, callees should read (and also write) only the lower 16 bits. I got this for free by making f16 legal. Alternative 2 ------------- This alternative aims at getting something ready soon, it relies on exploring creating one register class and a Clang that generates the required up and down converts for the storage-only case: - We have one SPR registerclass that contains both f16 and f32: def SPR : RegisterClass<"ARM", [f16, f32], 32, (sequence "S%u", 0, 31)> { ... - Then I will have to type all rules that e.g. use SPR:$a and change it in (f32 SPR:$a): -- this will be a massive patch, -- but hopefully will not have a problem in the rewrite rules. - Then, I also need to modify a clang workaround, which should inserts the proper up and down converts for half types that are generated from the usage of _Float16, similarly like we already do for __fp16. This workaround can then later be replaced by the CCState implementation. Alternative 3: ---------------- This is the best approach, where I start looking into the AAPCS compliance issues first: - Fix the AAPCS compliance issues by implementing our own CCState and function call lowering for f16 arguments. -- With this we achieve that f16 does not need to be a legal type when in fact it shouldn't be, so then the "normal" promotions kick in for illegal f16 operands and thus I don't need custom lowering for loads/stores. but after that I am unsure how to do the match rules: one or two register classes. It might be that the difference between Alternative 2 and this one, is the CCState or Clang workaround. I will investigate a bit more. ________________________________ From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org> Sent: 06 December 2017 08:32 To: Friedman, Eli; llvm-dev at lists.llvm.org Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good. I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16 and FP16_TO_FP nodes are created to avoid inefficient code generation. I will double check if I can't achieve the same without using these nodes (because I really would like to get completely rid of them). Cheers, Sjoerd.>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote: >> >> Custom Lowering >> ------------------------- >> >> Making f16 legal and not having native load/stores instructions available, >> (no FullFP16 support) means custom lowering loads/stores: >> 1) Since we don't have FP16 load/store instructions available, we create >> integer half-word loads. I unfortunately need the FP16_TO_FP node here, >> because that "models" creating an integer value, which is what we need >> to create a "truncating i16" integer load instructions. Instead, of >> using >> FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation >> to stack loads/stores which I don't want. >> 2) Custom lowering f16 stores is very similar, and creates truncating >> half-word integer stores. > >Technically, there are no f16 load/store instructions, yes, but we can >use NEON vdl1 and vst1 to get something roughly equivalent, right? > >You probably want to custom-lower BITCAST instructions; the generic >sequence emitted by the legalizer is pretty inefficient in most cases. > >--- > >Overall, I think your approach makes sense.-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180118/930958ce/attachment-0001.html>
Sam Parker via llvm-dev
2018-Jan-18 14:28 UTC
[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
Hi Sjoerd, For ISel, I think having a separate register class will give you less headache. I wondering if you could get away with not touching the instructions descriptions at all, instead defining external pattens for the FullFP16 case, like so: def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins SPR:$Sm), IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm", []>, Requires<[HasFP16]>, Sched<[WriteFPCVT]>; def : FP16Pat<(f16_to_fp GPR:$a), (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>; def : FullFP16Pat<(f32 (fpextend HPR:$Sm)), (VCVTBHS (COPY_TO_REGLASS HPR:$Sm, SPR)>; I'm not sure of the COPY_TO_REGLASS semantics, but I would (dangerously) assume that it when it comes to copying the values between registers, it will be noticed that HPR and SPR actually alias each other and so no copy is needed. I hope this approach would allow for a clean separation of the FP16 and FullFP16 implementations and negate the need to manually type cast each register access. cheers, sam Sam Parker Compilation Tools Engineer | Arm . . . . . . . . . . . . . . . . . . . . . . . . . . . Arm.com ________________________________ From: Sjoerd Meijer Sent: 18 January 2018 13:58:29 To: Friedman, Eli; Sjoerd Meijer; Sam Parker; Oliver Stannard; llvm-dev at lists.llvm.org Cc: nd Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends I would like to revive this thread, as I am struggling a lot with the FP16 implementation in the ARM backend. My implementation in https://reviews.llvm.org/D38315 is finished (except one case), but a more robust alternative implementation was suggested. One can indeed argue that my current implementation is a bit fragile, because it involves manually patching up the isel dags for a few cases. The suggestion was to look into CCState and adjusting of the calling convention lowering, inspired by a recent discussion on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html. The benefit of this approach is that I would get most of legalization for free, which is the fragile bit in my approach at the moment. Anyway, it's become a long story, but (almost) in chronological order this is what I've considered. Any hints, tips, suggestions welcome. Goal of my exercise: -------------------- 1) Add match rules for some Armv8.2-A FP16 instructions, e.g.: fsub, fadd, and also some conversion instructions. 2) Don't regress the "storage-only cases", i.e., when we only have the conversion instructions available. The Approach: ------------- 1) First, we make f16 legal only when we have FullFP16 support (i.e. when Armv8.2-A FP16 is supported), so in ARMISelLowering.cpp we add: if (Subtarget->hasFullFP16()) { addRegisterClass(MVT::f16, &ARM::HPRRegClass); } 2) This is the first implementation decision, I introduce a new register class HPR, which is an exact copy of the single-precision register class SPR, except that it holds f16 types: def HPR : RegisterClass<"ARM", [f16], 32, (sequence "S%u", 0, 31)> { ... I think this makes sense because half-types sit in the lower 16 bits of the S-registers, and the reason to create a separate HPR register class is to avoid typing of all rules that currently use SPR, because if we would allow both f16 and f32 in SPR, then there's a choice and rules need to be typed. 3) Next, add match rules for some Armv8.2-A FP16 instructions. Instruction VADDH, which has already a tablegen description but not yet a match rules, simply becomes: def VADDH : AHbI<0b11100, 0b11, 0, 0, (outs HPR:$Sd), (ins HPR:$Sn, HPR:$Sm), IIC_fpALU16, "vadd", ".f16\t$Sd, $Sn, $Sm", [(set HPR:$Sd, (fadd HPR:$Sn, HPR:$Sm))]>, // <~~~ new match rule using HPR This is straightforward business so far, but I already learned the hard way that the conversion are the tricky ones, so I repeat this for an f16 -> f32 upconvert: def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins HPR:$Sm), IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm", [(set SPR:$Sd, (fpextend HPR:$Sm))]>, // <~~~~ new match rule using HPR and SPR Requires<[HasFP16]>, Sched<[WriteFPCVT]>; This new rule matches and fp_extend that consumes an HPR and produces a result in SPR. And that's essentially it and all the code changes I want to make. With these changes, I want to see an fadd with two f16 opernands being codegen'd, and also I want to keep the existing tests passing of course. The problem: ------------ The problem appears to be that introducing HPR looks like a good idea, but it gives problems when FullFP16 is not supported. This is best illustrated with this existing test which is a simple upconvert of f16 to f32: define float @test_extend32(half* %addr) { %val16 = load half, half* %addr %val32 = fpext half %val16 to float ret float %val32 } It should generate this code:: ldrh r0, [r0] ; integer half word load vmov s0, r0 vcvtb.f32.f16 s0, s0 vmov r0, s0 bx lr when we don't have the Armv8.2-A FP16 instructions available, and thus only have the conversion instructions. The problem is in the conversion rules, some rewrite rules to be more specific, and I think this is one of the culprits: def : Pat<(f16_to_fp GPR:$a), (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>; This rewrite rule is supposed to first move the GPR reg in to a S-registers: vmov s0, r0 and then to the conversion: vcvtb.f32.f16 s0, s0 This rewrite rule gets triggered because the ISEL DAG has indeed this funny node f16_to_fp (which models this register move and the float-convert): t0: ch = EntryToken t2: i32,ch = CopyFromReg t0, Register:i32 %0 t16: i32,ch = load<LD2[%addr], zext from i16> t0, t2, undef:i32 t12: f32 = fp16_to_fp t16 <~~~~~~~~ FUNNY NODE HERE t7: i32 = bitcast t12 t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7 t10: ch = ARMISD::RET_FLAG t9, Register:i32 %r0, t9:1 And yes, to make it even funnier, this node has an i32 operand, and that's because we do the half-float load with an integer load instruction. And after this rewrite, we end up with this DAG: t0: ch = EntryToken t2: i32,ch = CopyFromReg t0, Register:i32 %0 t16: i32,ch = t2LDRHi12<Mem:LD2[%addr]> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0 t20: f16 = COPY_TO_REGCLASS t16, TargetConstant:i32<1> <~~~~~~~~~~~~~ PROBLEM HERE t12: f32 = VCVTBHS t20, TargetConstant:i32<14>, Register:i32 %noreg t7: i32 = VMOVRS t12, TargetConstant:i32<14>, Register:i32 %noreg t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7 t10: ch = tBX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %r0, t9, t9:1 It goes all wrong from here, because a f16 value is produced and it is not a legal type etc. The reason that this happens must be because VCVTBHS that is used in the f16_to_fp rewrite rule, is specified to consume a f16 value, so the CopyToReg moves from GPR to an S-Registers: def : Pat<(f16_to_fp GPR:$a), (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>; Conclusion: ----------- This approach is not going to work: - because there's rewrite rule X which generates instructions Y that use both the SPR and HPR descriptions. - Y can be instruction selected when FullFP16 is enabled, this works and no problems (and we don't need X). - but X is used when FullFP16 is not supported, and generates Y that uses HPRs which is not legal. Introducing a new HPR register class is a possibility though, but then special care needs to be taken in a few cases. This is the first alternative I started exploring. Alternative 1 -------------- This is the approach I took in D38315, which made f16 also legal for the storage-only case, but which I am ready to abandon: - Define the HPR register class, - Use this and make f16 legal also in the storage-only case, -- This works for the storage-only case, because data processing instruction that have f16 operands will get promoted and fp_extend and fp_round nodes will be introduced. Then I manually patch up the f16 loads and stores and a few other corner cases by manually lowering them. This however is a fragile approach, which was rightfully commented by Oli, also after being inspired by a recent discussion about some backends doing their own CCState and function argument lowering. This is Alternative 3. -- Another crucial aspect and benefit of this approach, is that this fixes AAPCS compliance issues: while half-precision arguments sit in single-precision registers, callees should read (and also write) only the lower 16 bits. I got this for free by making f16 legal. Alternative 2 ------------- This alternative aims at getting something ready soon, it relies on exploring creating one register class and a Clang that generates the required up and down converts for the storage-only case: - We have one SPR registerclass that contains both f16 and f32: def SPR : RegisterClass<"ARM", [f16, f32], 32, (sequence "S%u", 0, 31)> { ... - Then I will have to type all rules that e.g. use SPR:$a and change it in (f32 SPR:$a): -- this will be a massive patch, -- but hopefully will not have a problem in the rewrite rules. - Then, I also need to modify a clang workaround, which should inserts the proper up and down converts for half types that are generated from the usage of _Float16, similarly like we already do for __fp16. This workaround can then later be replaced by the CCState implementation. Alternative 3: ---------------- This is the best approach, where I start looking into the AAPCS compliance issues first: - Fix the AAPCS compliance issues by implementing our own CCState and function call lowering for f16 arguments. -- With this we achieve that f16 does not need to be a legal type when in fact it shouldn't be, so then the "normal" promotions kick in for illegal f16 operands and thus I don't need custom lowering for loads/stores. but after that I am unsure how to do the match rules: one or two register classes. It might be that the difference between Alternative 2 and this one, is the CCState or Clang workaround. I will investigate a bit more. ________________________________ From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org> Sent: 06 December 2017 08:32 To: Friedman, Eli; llvm-dev at lists.llvm.org Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good. I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16 and FP16_TO_FP nodes are created to avoid inefficient code generation. I will double check if I can't achieve the same without using these nodes (because I really would like to get completely rid of them). Cheers, Sjoerd.>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote: >> >> Custom Lowering >> ------------------------- >> >> Making f16 legal and not having native load/stores instructions available, >> (no FullFP16 support) means custom lowering loads/stores: >> 1) Since we don't have FP16 load/store instructions available, we create >> integer half-word loads. I unfortunately need the FP16_TO_FP node here, >> because that "models" creating an integer value, which is what we need >> to create a "truncating i16" integer load instructions. Instead, of >> using >> FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation >> to stack loads/stores which I don't want. >> 2) Custom lowering f16 stores is very similar, and creates truncating >> half-word integer stores. > >Technically, there are no f16 load/store instructions, yes, but we can >use NEON vdl1 and vst1 to get something roughly equivalent, right? > >You probably want to custom-lower BITCAST instructions; the generic >sequence emitted by the legalizer is pretty inefficient in most cases. > >--- > >Overall, I think your approach makes sense.-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180118/308b31ba/attachment.html>
Seemingly Similar Threads
- [RFC] Half-Precision Support in the Arm Backends
- [RFC] Half-Precision Support in the Arm Backends
- [RFC] Half-Precision Support in the Arm Backends
- TypePromoteFloat loses intermediate rounding operations
- [LLVMdev] RFC: Do we still need @llvm.convert.to.fp16 and the reverse?