Craig Topper via llvm-dev
2019-Dec-10 20:18 UTC
[llvm-dev] TypePromoteFloat loses intermediate rounding operations
For the following C code
__fp16 x, y, z, w;
void foo() {
x = y + z;
x = x + w;
}
clang produces IR that extends each operand to float and then truncates to
half before assigning to x. Like this
define dso_local void @foo() #0 !dbg !18 {
%1 = load half, half* @y, align 2, !dbg !21
%2 = fpext half %1 to float, !dbg !21
%3 = load half, half* @z, align 2, !dbg !22
%4 = fpext half %3 to float, !dbg !22
%5 = fadd float %2, %4, !dbg !23
%6 = fptrunc float %5 to half, !dbg !21
store half %6, half* @x, align 2, !dbg !24
%7 = load half, half* @x, align 2, !dbg !25
%8 = fpext half %7 to float, !dbg !25
%9 = load half, half* @w, align 2, !dbg !26
%10 = fpext half %9 to float, !dbg !26
%11 = fadd float %8, %10, !dbg !27
%12 = fptrunc float %11 to half, !dbg !25
store half %12, half* @x, align 2, !dbg !28
ret void, !dbg !29
}
InstCombine then comes along and gets rid of all of the fpext and fptrunc.
Leaving
define dso_local void @foo() local_unnamed_addr #0 !dbg !18 {
%1 = load half, half* @y, align 2, !dbg !21, !tbaa !22
%2 = load half, half* @z, align 2, !dbg !26, !tbaa !22
%3 = fadd half %1, %2, !dbg !21
%4 = load half, half* @w, align 2, !dbg !27, !tbaa !22
%5 = fadd half %3, %4, !dbg !28
store half %5, half* @x, align 2, !dbg !29, !tbaa !22
ret void, !dbg !30
}
Then SelectionDAG type legalization comes along and creates this as the
final assembly
pushq %rax
.cfi_def_cfa_offset 16
movzwl y(%rip), %edi
callq __gnu_h2f_ieee
movss %xmm0, 4(%rsp) # 4-byte Spill
movzwl z(%rip), %edi
callq __gnu_h2f_ieee
addss 4(%rsp), %xmm0 # 4-byte Folded Reload
movss %xmm0, 4(%rsp) # 4-byte Spill
movzwl w(%rip), %edi
callq __gnu_h2f_ieee
addss 4(%rsp), %xmm0 # 4-byte Folded Reload
callq __gnu_f2h_ieee
movw %ax, x(%rip)
popq %rax
I assumed SelectionDAG should produce something equivalent to the original
clang code with 4 total extends to f32 and 2 truncates. Instead we got 3
extends and 1 truncate. So we lost the intermediate rounding between the 2
adds that was in the original clang IR.
I believe this occurs because the TypePromoteFloat legalization converts
all arithmetic operations to their f32 equivalents, but does not place
conversions to/from half around them. Instead fp_to_f16 and f16_to_fp nodes
are only generated at loads, stores, bitcasts, and a probably a few other
places. Basically only the place where the 16-bit size is needed to make
the operation possible. Basically what we have is a very similar
implementation to promoting integers, but that doesn't work for FP because
we lose out on intermediate rounding.
It seems like what we should instead do is insert fp16_to_fp and fp_to_fp16
in the libcall and arithmetic op handling. And use i16 to connect the
legalized pieces together. Similar to how we use integer types when
softening operations. I'm not sure if there would still be rounding issues
with this, but it seems closer to matching the IR.
Unfortunately, I think this would have the side effect of changing half
arguments and return types to i16 instead of float, which would be an ABI
change. At least on some targets __fp16 can't be used as an argument or
return type so maybe that won't be a real problem?
Anyone else have any thoughts on this?
~Craig
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191210/f962716b/attachment.html>
Eli Friedman via llvm-dev
2019-Dec-10 23:18 UTC
[llvm-dev] TypePromoteFloat loses intermediate rounding operations
We could fix the legalization without touching the other handling by just
inserting an fp_to_f16/f16_to_fp pair after each arithmetic operation that
requires it. One advantage to that approach is that it’s easier to take the
obvious shortcut for fast-math.
The “promote-to-larger” strategy doesn’t really round correctly in general, but
it works for specific pairs of operator/operation. For example, for f16 fadd in
the default rounding mode, “a+b” is exactly equivalent to
“(_Float16)((float)a+(float)b)”. Not sure if this works for all f16 operations,
and not sure how much we care if it doesn’t.
There aren’t any calling convention implications here for ARM targets; not sure
about other targets. On 32-bit ARM, clang explicitly coerces half values to a
legal type. And half is always legal on AArch64 (unless you force soft-float,
but at that point we don’t care).
-Eli
From: Craig Topper <craig.topper at gmail.com>
Sent: Tuesday, December 10, 2019 12:18 PM
To: llvm-dev <llvm-dev at lists.llvm.org>; Eli Friedman <efriedma at
quicinc.com>; Tim Northover <t.p.northover at gmail.com>
Subject: [EXT] TypePromoteFloat loses intermediate rounding operations
For the following C code
__fp16 x, y, z, w;
void foo() {
x = y + z;
x = x + w;
}
clang produces IR that extends each operand to float and then truncates to half
before assigning to x. Like this
define dso_local void @foo() #0 !dbg !18 {
%1 = load half, half* @y, align 2, !dbg !21
%2 = fpext half %1 to float, !dbg !21
%3 = load half, half* @z, align 2, !dbg !22
%4 = fpext half %3 to float, !dbg !22
%5 = fadd float %2, %4, !dbg !23
%6 = fptrunc float %5 to half, !dbg !21
store half %6, half* @x, align 2, !dbg !24
%7 = load half, half* @x, align 2, !dbg !25
%8 = fpext half %7 to float, !dbg !25
%9 = load half, half* @w, align 2, !dbg !26
%10 = fpext half %9 to float, !dbg !26
%11 = fadd float %8, %10, !dbg !27
%12 = fptrunc float %11 to half, !dbg !25
store half %12, half* @x, align 2, !dbg !28
ret void, !dbg !29
}
InstCombine then comes along and gets rid of all of the fpext and fptrunc.
Leaving
define dso_local void @foo() local_unnamed_addr #0 !dbg !18 {
%1 = load half, half* @y, align 2, !dbg !21, !tbaa !22
%2 = load half, half* @z, align 2, !dbg !26, !tbaa !22
%3 = fadd half %1, %2, !dbg !21
%4 = load half, half* @w, align 2, !dbg !27, !tbaa !22
%5 = fadd half %3, %4, !dbg !28
store half %5, half* @x, align 2, !dbg !29, !tbaa !22
ret void, !dbg !30
}
Then SelectionDAG type legalization comes along and creates this as the final
assembly
pushq %rax
.cfi_def_cfa_offset 16
movzwl y(%rip), %edi
callq __gnu_h2f_ieee
movss %xmm0, 4(%rsp) # 4-byte Spill
movzwl z(%rip), %edi
callq __gnu_h2f_ieee
addss 4(%rsp), %xmm0 # 4-byte Folded Reload
movss %xmm0, 4(%rsp) # 4-byte Spill
movzwl w(%rip), %edi
callq __gnu_h2f_ieee
addss 4(%rsp), %xmm0 # 4-byte Folded Reload
callq __gnu_f2h_ieee
movw %ax, x(%rip)
popq %rax
I assumed SelectionDAG should produce something equivalent to the original clang
code with 4 total extends to f32 and 2 truncates. Instead we got 3 extends and 1
truncate. So we lost the intermediate rounding between the 2 adds that was in
the original clang IR.
I believe this occurs because the TypePromoteFloat legalization converts all
arithmetic operations to their f32 equivalents, but does not place conversions
to/from half around them. Instead fp_to_f16 and f16_to_fp nodes are only
generated at loads, stores, bitcasts, and a probably a few other places.
Basically only the place where the 16-bit size is needed to make the operation
possible. Basically what we have is a very similar implementation to promoting
integers, but that doesn't work for FP because we lose out on intermediate
rounding.
It seems like what we should instead do is insert fp16_to_fp and fp_to_fp16 in
the libcall and arithmetic op handling. And use i16 to connect the legalized
pieces together. Similar to how we use integer types when softening operations.
I'm not sure if there would still be rounding issues with this, but it seems
closer to matching the IR.
Unfortunately, I think this would have the side effect of changing half
arguments and return types to i16 instead of float, which would be an ABI
change. At least on some targets __fp16 can't be used as an argument or
return type so maybe that won't be a real problem?
Anyone else have any thoughts on this?
~Craig
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191210/93170da6/attachment-0001.html>
Craig Topper via llvm-dev
2019-Dec-10 23:34 UTC
[llvm-dev] TypePromoteFloat loses intermediate rounding operations
Thanks Eli. I forgot to bring up the strict FP questions which I was working on when I found this. If we're in a strict FP function, do the fp_to_f16/f16_to_fp emitted by promoting load/store/bitcast need to be strict versions of fp_to_f16/f16_to_fp. And if so where do we get the chain, especially for the bitcast case which isn't a chained node. ~Craig On Tue, Dec 10, 2019 at 3:18 PM Eli Friedman <efriedma at quicinc.com> wrote:> We could fix the legalization without touching the other handling by just > inserting an fp_to_f16/f16_to_fp pair after each arithmetic operation that > requires it. One advantage to that approach is that it’s easier to take > the obvious shortcut for fast-math. > > > > The “promote-to-larger” strategy doesn’t really round correctly in > general, but it works for specific pairs of operator/operation. For > example, for f16 fadd in the default rounding mode, “a+b” is exactly > equivalent to “(_Float16)((float)a+(float)b)”. Not sure if this works for > all f16 operations, and not sure how much we care if it doesn’t. > > There aren’t any calling convention implications here for ARM targets; not > sure about other targets. On 32-bit ARM, clang explicitly coerces half > values to a legal type. And half is always legal on AArch64 (unless you > force soft-float, but at that point we don’t care). > > > > -Eli > > > > *From:* Craig Topper <craig.topper at gmail.com> > *Sent:* Tuesday, December 10, 2019 12:18 PM > *To:* llvm-dev <llvm-dev at lists.llvm.org>; Eli Friedman < > efriedma at quicinc.com>; Tim Northover <t.p.northover at gmail.com> > *Subject:* [EXT] TypePromoteFloat loses intermediate rounding operations > > > > For the following C code > > > > __fp16 x, y, z, w; > > > > void foo() { > > x = y + z; > > x = x + w; > > } > > > > clang produces IR that extends each operand to float and then truncates to > half before assigning to x. Like this > > > > define dso_local void @foo() #0 !dbg !18 { > > %1 = load half, half* @y, align 2, !dbg !21 > > %2 = fpext half %1 to float, !dbg !21 > > %3 = load half, half* @z, align 2, !dbg !22 > > %4 = fpext half %3 to float, !dbg !22 > > %5 = fadd float %2, %4, !dbg !23 > > %6 = fptrunc float %5 to half, !dbg !21 > > store half %6, half* @x, align 2, !dbg !24 > > %7 = load half, half* @x, align 2, !dbg !25 > > %8 = fpext half %7 to float, !dbg !25 > > %9 = load half, half* @w, align 2, !dbg !26 > > %10 = fpext half %9 to float, !dbg !26 > > %11 = fadd float %8, %10, !dbg !27 > > %12 = fptrunc float %11 to half, !dbg !25 > > store half %12, half* @x, align 2, !dbg !28 > > ret void, !dbg !29 > > } > > > > InstCombine then comes along and gets rid of all of the fpext and fptrunc. > Leaving > > > > define dso_local void @foo() local_unnamed_addr #0 !dbg !18 { > > %1 = load half, half* @y, align 2, !dbg !21, !tbaa !22 > > %2 = load half, half* @z, align 2, !dbg !26, !tbaa !22 > > %3 = fadd half %1, %2, !dbg !21 > > %4 = load half, half* @w, align 2, !dbg !27, !tbaa !22 > > %5 = fadd half %3, %4, !dbg !28 > > store half %5, half* @x, align 2, !dbg !29, !tbaa !22 > > ret void, !dbg !30 > > } > > > > > > Then SelectionDAG type legalization comes along and creates this as the > final assembly > > > > pushq %rax > > .cfi_def_cfa_offset 16 > > movzwl y(%rip), %edi > > callq __gnu_h2f_ieee > > movss %xmm0, 4(%rsp) # 4-byte Spill > > movzwl z(%rip), %edi > > callq __gnu_h2f_ieee > > addss 4(%rsp), %xmm0 # 4-byte Folded Reload > > movss %xmm0, 4(%rsp) # 4-byte Spill > > movzwl w(%rip), %edi > > callq __gnu_h2f_ieee > > addss 4(%rsp), %xmm0 # 4-byte Folded Reload > > callq __gnu_f2h_ieee > > movw %ax, x(%rip) > > popq %rax > > > > > > I assumed SelectionDAG should produce something equivalent to the original > clang code with 4 total extends to f32 and 2 truncates. Instead we got 3 > extends and 1 truncate. So we lost the intermediate rounding between the 2 > adds that was in the original clang IR. > > > > I believe this occurs because the TypePromoteFloat legalization converts > all arithmetic operations to their f32 equivalents, but does not place > conversions to/from half around them. Instead fp_to_f16 and f16_to_fp nodes > are only generated at loads, stores, bitcasts, and a probably a few other > places. Basically only the place where the 16-bit size is needed to make > the operation possible. Basically what we have is a very similar > implementation to promoting integers, but that doesn't work for FP because > we lose out on intermediate rounding. > > > > It seems like what we should instead do is insert fp16_to_fp and > fp_to_fp16 in the libcall and arithmetic op handling. And use i16 to > connect the legalized pieces together. Similar to how we use integer types > when softening operations. I'm not sure if there would still be rounding > issues with this, but it seems closer to matching the IR. > > > > Unfortunately, I think this would have the side effect of changing half > arguments and return types to i16 instead of float, which would be an ABI > change. At least on some targets __fp16 can't be used as an argument or > return type so maybe that won't be a real problem? > > > > Anyone else have any thoughts on this? > > > > ~Craig >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191210/38a1f013/attachment.html>
Apparently Analagous Threads
- TypePromoteFloat loses intermediate rounding operations
- [LLVMdev] RFC: Do we still need @llvm.convert.to.fp16 and the reverse?
- [LLVMdev] RFC: Do we still need @llvm.convert.to.fp16 and the reverse?
- [LLVMdev] Help!!!!Help!!!! " LLVM ERROR: Cannot select: 0x9fc9680: i32 = fp32_to_fp16 0x9fc0750 [ID=16] " problem!!!!!!!!!!!!!!!!!!
- [RFC] Half-Precision Support in the Arm Backends