Sanjay Patel via llvm-dev
2018-Jan-17 22:50 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
Example: define i8 @narrow_add(i8 %x, i8 %y) { %x32 = zext i8 %x to i32 %y32 = zext i8 %y to i32 %add = add nsw i32 %x32, %y32 %tr = trunc i32 %add to i8 ret i8 %tr } With no data-layout or with an x86 target where 8-bit integer is in the data-layout, we reduce to: $ ./opt -instcombine narrowadd.ll -S define i8 @narrow_add(i8 %x, i8 %y) { %add = add i8 %x, %y ret i8 %add } But on a target that has 32-bit registers without explicit subregister ops, we don't do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::shouldChangeType(). Should we make an exception to allow narrowing for the common cases of i8 and i16? In the motivating example from PR35875 ( https://bugs.llvm.org/show_bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions: declare void @use4(i8, i8, i8, i8) define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) { %nx = xor i8 %x, -1 %ny = xor i8 %y, -1 %nz = xor i8 %z, -1 %zx = zext i8 %nx to i32 %zy = zext i8 %ny to i32 %zz = zext i8 %nz to i32 %cmpxz = icmp ult i32 %zx, %zz %minxz = select i1 %cmpxz, i32 %zx, i32 %zz %cmpyz = icmp ult i32 %zy, %zz %minyz = select i1 %cmpyz, i32 %zy, i32 %zz %cmpyx = icmp ult i8 %y, %x %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz %tr_minxyz = trunc i32 %minxyz to i8 %new_zx = sub nsw i32 %zx, %minxyz %new_zy = sub nsw i32 %zy, %minxyz %new_zz = sub nsw i32 %zz, %minxyz %new_x = trunc i32 %new_zx to i8 %new_y = trunc i32 %new_zy to i8 %new_z = trunc i32 %new_zz to i8 call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z) ret void } ...but x86 gets to shrink the subs which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse: define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) { %nx = xor i8 %x, -1 %ny = xor i8 %y, -1 %nz = xor i8 %z, -1 %cmpxz = icmp ult i8 %nx, %nz %minxz = select i1 %cmpxz, i8 %nx, i8 %nz %1 = icmp ult i8 %minxz, %ny %minxyz = select i1 %1, i8 %minxz, i8 %ny %new_x = sub i8 %nx, %minxyz %new_y = sub i8 %ny, %minxyz %new_z = sub i8 %nz, %minxyz call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z) ret void } -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180117/e03f433e/attachment.html>
David Green via llvm-dev
2018-Jan-22 10:10 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
Hello Thanks for looking into this. I can't be very confident what the knock on result of a change like that would be, especially on architectures that are not Arm. What I can do though, is run some benchmarks and look at that results. Using this patch: --- a/lib/Transforms/InstCombine/InstructionCombining.cpp +++ b/lib/Transforms/InstCombine/InstructionCombining.cpp @@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned FromWidth, bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth); bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth); + if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16)) + return true; + // If this is a legal integer from type, and the result would be an illegal // type, don't do the transformation. if (FromLegal && !ToLegal) Running on a little A core, in the llvm test suite I am seeing these changes: MultiSource/Benchmarks/BitBench/uudecode/uudecode 3.38% SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -35.04% MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -17.92% SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant -8.57% External/SPEC/CINT2000/253.perlbmk/253.perlbmk -3.43% MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm -3.36% MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl -1.34% +ve for these is bad, -ve is good. So overall looks like a good change, especially in simple_types_constant_folding. There may be some alignment issues that can causing wilder swings than they should, but the results here look good. The list for aarch64 is roughly the same, just a slightly longer list of minor improvements. On our internal cortex-m tests we are seeing more regressions but it's still a net positive in most cases. I would say that at least for these results, it looks like a profitable idea. Like I said I can't be sure about other architectures though. Dave ________________________________________ From: Sanjay Patel <spatel at rotateright.com> Sent: 17 January 2018 22:50 To: llvm-dev Cc: David Green Subject: always allow canonicalizing to 8- and 16-bit ops? Example: define i8 @narrow_add(i8 %x, i8 %y) { %x32 = zext i8 %x to i32 %y32 = zext i8 %y to i32 %add = add nsw i32 %x32, %y32 %tr = trunc i32 %add to i8 ret i8 %tr } With no data-layout or with an x86 target where 8-bit integer is in the data-layout, we reduce to: $ ./opt -instcombine narrowadd.ll -S define i8 @narrow_add(i8 %x, i8 %y) { %add = add i8 %x, %y ret i8 %add } But on a target that has 32-bit registers without explicit subregister ops, we don't do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::shouldChangeType(). Should we make an exception to allow narrowing for the common cases of i8 and i16? In the motivating example from PR35875 ( https://bugs.llvm.org/show_bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions: declare void @use4(i8, i8, i8, i8) define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) { %nx = xor i8 %x, -1 %ny = xor i8 %y, -1 %nz = xor i8 %z, -1 %zx = zext i8 %nx to i32 %zy = zext i8 %ny to i32 %zz = zext i8 %nz to i32 %cmpxz = icmp ult i32 %zx, %zz %minxz = select i1 %cmpxz, i32 %zx, i32 %zz %cmpyz = icmp ult i32 %zy, %zz %minyz = select i1 %cmpyz, i32 %zy, i32 %zz %cmpyx = icmp ult i8 %y, %x %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz %tr_minxyz = trunc i32 %minxyz to i8 %new_zx = sub nsw i32 %zx, %minxyz %new_zy = sub nsw i32 %zy, %minxyz %new_zz = sub nsw i32 %zz, %minxyz %new_x = trunc i32 %new_zx to i8 %new_y = trunc i32 %new_zy to i8 %new_z = trunc i32 %new_zz to i8 call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z) ret void } ...but x86 gets to shrink the subs which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse: define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) { %nx = xor i8 %x, -1 %ny = xor i8 %y, -1 %nz = xor i8 %z, -1 %cmpxz = icmp ult i8 %nx, %nz %minxz = select i1 %cmpxz, i8 %nx, i8 %nz %1 = icmp ult i8 %minxz, %ny %minxyz = select i1 %1, i8 %minxz, i8 %ny %new_x = sub i8 %nx, %minxyz %new_y = sub i8 %ny, %minxyz %new_z = sub i8 %nz, %minxyz call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z) ret void }
Sanjay Patel via llvm-dev
2018-Jan-22 18:00 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
Thanks for the perf testing. I assume that DAG legalization is equipped to handle these cases fairly well, or someone would've complained by now... FWIW (and at least some of this can be blamed on me), instcombine already does the narrowing transforms without checking shouldChangeType() for binops like and/or/xor/udiv. The justification was that narrower ops are always better for bit-tracking. If we can eliminate casts, then that improves analysis even more and allows more follow-on transforms. If there are no objections here, I think you can post your patch for review on Phabricator. If we can squash any of the regressions before that goes in, that would be even better. On Mon, Jan 22, 2018 at 3:10 AM, David Green <David.Green at arm.com> wrote:> Hello > > Thanks for looking into this. > > I can't be very confident what the knock on result of a change like that > would be, > especially on architectures that are not Arm. What I can do though, is run > some > benchmarks and look at that results. > > Using this patch: > > --- a/lib/Transforms/InstCombine/InstructionCombining.cpp > +++ b/lib/Transforms/InstCombine/InstructionCombining.cpp > @@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned > FromWidth, > bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth); > bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth); > > + if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16)) > + return true; > + > // If this is a legal integer from type, and the result would be an > illegal > // type, don't do the transformation. > if (FromLegal && !ToLegal) > > > Running on a little A core, in the llvm test suite I am seeing these > changes: > > MultiSource/Benchmarks/BitBench/uudecode/uudecode > 3.38% > SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding > -35.04% > MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 > -17.92% > SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant > -8.57% > External/SPEC/CINT2000/253.perlbmk/253.perlbmk > -3.43% > MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm > -3.36% > MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl > -1.34% > > +ve for these is bad, -ve is good. So overall looks like a good change, > especially in > simple_types_constant_folding. There may be some alignment issues that can > causing wilder swings than they should, but the results here look good. > The list for > aarch64 is roughly the same, just a slightly longer list of minor > improvements. > > On our internal cortex-m tests we are seeing more regressions but it's > still a net > positive in most cases. > > I would say that at least for these results, it looks like a profitable > idea. Like I said > I can't be sure about other architectures though. > Dave > > ________________________________________ > From: Sanjay Patel <spatel at rotateright.com> > Sent: 17 January 2018 22:50 > To: llvm-dev > Cc: David Green > Subject: always allow canonicalizing to 8- and 16-bit ops? > > Example: > define i8 @narrow_add(i8 %x, i8 %y) { > %x32 = zext i8 %x to i32 > %y32 = zext i8 %y to i32 > %add = add nsw i32 %x32, %y32 > %tr = trunc i32 %add to i8 > ret i8 %tr > } > > With no data-layout or with an x86 target where 8-bit integer is in the > data-layout, we reduce to: > > $ ./opt -instcombine narrowadd.ll -S > define i8 @narrow_add(i8 %x, i8 %y) { > %add = add i8 %x, %y > ret i8 %add > } > > But on a target that has 32-bit registers without explicit subregister > ops, we don't do that transform because we avoid changing operations from a > legal (as specified in the data-layout) width to an illegal width - see > InstCombiner::shouldChangeType(). > > Should we make an exception to allow narrowing for the common cases of i8 > and i16? > > In the motivating example from PR35875 ( https://bugs.llvm.org/show_ > bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions: > > declare void @use4(i8, i8, i8, i8) > define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) { > %nx = xor i8 %x, -1 > %ny = xor i8 %y, -1 > %nz = xor i8 %z, -1 > %zx = zext i8 %nx to i32 > %zy = zext i8 %ny to i32 > %zz = zext i8 %nz to i32 > > %cmpxz = icmp ult i32 %zx, %zz > %minxz = select i1 %cmpxz, i32 %zx, i32 %zz > %cmpyz = icmp ult i32 %zy, %zz > %minyz = select i1 %cmpyz, i32 %zy, i32 %zz > %cmpyx = icmp ult i8 %y, %x > %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz > %tr_minxyz = trunc i32 %minxyz to i8 > > %new_zx = sub nsw i32 %zx, %minxyz > %new_zy = sub nsw i32 %zy, %minxyz > %new_zz = sub nsw i32 %zz, %minxyz > %new_x = trunc i32 %new_zx to i8 > %new_y = trunc i32 %new_zy to i8 > %new_z = trunc i32 %new_zz to i8 > > call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z) > ret void > } > > ...but x86 gets to shrink the subs which leads to a bunch of other > transforms, and we grind this down to 10 instructions between instcombine > and early-cse: > > define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) { > %nx = xor i8 %x, -1 > %ny = xor i8 %y, -1 > %nz = xor i8 %z, -1 > %cmpxz = icmp ult i8 %nx, %nz > %minxz = select i1 %cmpxz, i8 %nx, i8 %nz > %1 = icmp ult i8 %minxz, %ny > %minxyz = select i1 %1, i8 %minxz, i8 %ny > %new_x = sub i8 %nx, %minxyz > %new_y = sub i8 %ny, %minxyz > %new_z = sub i8 %nz, %minxyz > > call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z) > ret void > } > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180122/622c0ac9/attachment.html>
Alex Bradbury via llvm-dev
2018-Jan-23 09:37 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
On 17 January 2018 at 22:50, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:> Example: > define i8 @narrow_add(i8 %x, i8 %y) { > %x32 = zext i8 %x to i32 > %y32 = zext i8 %y to i32 > %add = add nsw i32 %x32, %y32 > %tr = trunc i32 %add to i8 > ret i8 %tr > } > > With no data-layout or with an x86 target where 8-bit integer is in the > data-layout, we reduce to: > > $ ./opt -instcombine narrowadd.ll -S > define i8 @narrow_add(i8 %x, i8 %y) { > %add = add i8 %x, %y > ret i8 %add > } > > But on a target that has 32-bit registers without explicit subregister ops, > we don't do that transform because we avoid changing operations from a legal > (as specified in the data-layout) width to an illegal width - see > InstCombiner::shouldChangeType(). > > Should we make an exception to allow narrowing for the common cases of i8 > and i16?The current implementation of 64-bit code generation for RISC-V has i64 as the only legal type - with Krzysztof's variable-size register class support, simply switching GPRs to i64 was the path of least resistance for getting working codegen. In that case, it would be interesting to explore allowing narrowing for i32 as well. I should note: RV64I codegen isn't yet upstreamed primarily as I want to do more investigation and thinking about the pros/cons of this approach and discuss on llvm-dev. I don't want to derail this thread with that discussion. Best, Alex
Possibly Parallel Threads
- always allow canonicalizing to 8- and 16-bit ops?
- always allow canonicalizing to 8- and 16-bit ops?
- MachineVerifier and undef
- Folding zext from i1 into PHI nodes with only zwo incoming values.
- Folding zext from i1 into PHI nodes with only zwo incoming values.