Sanjay Patel via llvm-dev
2018-Jan-17 22:50 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
Example:

  define i8 @narrow_add(i8 %x, i8 %y) {
    %x32 = zext i8 %x to i32
    %y32 = zext i8 %y to i32
    %add = add nsw i32 %x32, %y32
    %tr = trunc i32 %add to i8
    ret i8 %tr
  }

With no data-layout, or with an x86 target where 8-bit integers are in the data-layout, we reduce to:

  $ ./opt -instcombine narrowadd.ll -S
  define i8 @narrow_add(i8 %x, i8 %y) {
    %add = add i8 %x, %y
    ret i8 %add
  }

But on a target that has 32-bit registers without explicit subregister ops, we don't do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::shouldChangeType().

Should we make an exception to allow narrowing for the common cases of i8 and i16?

In the motivating example from PR35875 ( https://bugs.llvm.org/show_bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions:

  declare void @use4(i8, i8, i8, i8)

  define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
    %nx = xor i8 %x, -1
    %ny = xor i8 %y, -1
    %nz = xor i8 %z, -1
    %zx = zext i8 %nx to i32
    %zy = zext i8 %ny to i32
    %zz = zext i8 %nz to i32

    %cmpxz = icmp ult i32 %zx, %zz
    %minxz = select i1 %cmpxz, i32 %zx, i32 %zz
    %cmpyz = icmp ult i32 %zy, %zz
    %minyz = select i1 %cmpyz, i32 %zy, i32 %zz
    %cmpyx = icmp ult i8 %y, %x
    %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz
    %tr_minxyz = trunc i32 %minxyz to i8

    %new_zx = sub nsw i32 %zx, %minxyz
    %new_zy = sub nsw i32 %zy, %minxyz
    %new_zz = sub nsw i32 %zz, %minxyz
    %new_x = trunc i32 %new_zx to i8
    %new_y = trunc i32 %new_zy to i8
    %new_z = trunc i32 %new_zz to i8

    call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
    ret void
  }

...but x86 gets to shrink the subs, which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse:

  define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
    %nx = xor i8 %x, -1
    %ny = xor i8 %y, -1
    %nz = xor i8 %z, -1
    %cmpxz = icmp ult i8 %nx, %nz
    %minxz = select i1 %cmpxz, i8 %nx, i8 %nz
    %1 = icmp ult i8 %minxz, %ny
    %minxyz = select i1 %1, i8 %minxz, i8 %ny
    %new_x = sub i8 %nx, %minxyz
    %new_y = sub i8 %ny, %minxyz
    %new_z = sub i8 %nz, %minxyz

    call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
    ret void
  }
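The narrowing in @narrow_add is sound because only the low 8 bits of the wide add reach the trunc. A small Python model of the IR semantics (not LLVM code; the 0xFF masks stand in for zext/trunc at i8) checks the equivalence exhaustively:

```python
def narrow_add_wide(x, y):
    # zext i8 -> i32, add at i32, trunc i32 -> i8
    return ((x & 0xFF) + (y & 0xFF)) & 0xFF

def narrow_add_narrow(x, y):
    # add directly at i8 (wraps modulo 256)
    return (x + y) & 0xFF

# Exhaustive check over all i8 inputs: the narrowed form is equivalent.
assert all(narrow_add_wide(x, y) == narrow_add_narrow(x, y)
           for x in range(256) for y in range(256))
```

The same argument applies to any operation whose low output bits depend only on the low input bits (add, sub, mul, and the bitwise ops), which is what makes i8/i16 narrowing a candidate for unconditional canonicalization.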
David Green via llvm-dev
2018-Jan-22 10:10 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
Hello

Thanks for looking into this. I can't be very confident what the knock-on result of a change like that would be, especially on architectures that are not Arm. What I can do, though, is run some benchmarks and look at the results.

Using this patch:

  --- a/lib/Transforms/InstCombine/InstructionCombining.cpp
  +++ b/lib/Transforms/InstCombine/InstructionCombining.cpp
  @@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned FromWidth,
     bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);
     bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);

  +  if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16))
  +    return true;
  +
     // If this is a legal integer from type, and the result would be an illegal
     // type, don't do the transformation.
     if (FromLegal && !ToLegal)

Running on a little A core, in the llvm test suite I am seeing these changes:

  MultiSource/Benchmarks/BitBench/uudecode/uudecode                            3.38%
  SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding            -35.04%
  MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1                            -17.92%
  SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant               -8.57%
  External/SPEC/CINT2000/253.perlbmk/253.perlbmk                              -3.43%
  MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm                    -3.36%
  MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl   -1.34%

+ve for these is bad, -ve is good. So overall this looks like a good change, especially in simple_types_constant_folding. There may be some alignment issues that are causing wilder swings than they should, but the results here look good. The list for aarch64 is roughly the same, just a slightly longer list of minor improvements.

On our internal cortex-m tests we are seeing more regressions, but it's still a net positive in most cases.

I would say that, at least for these results, it looks like a profitable idea. Like I said, I can't be sure about other architectures though.
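For readers following along, the effect of the patch on shouldChangeType() can be modeled in a few lines of Python. This is a sketch, not the real function: is_legal stands in for DataLayout::isLegalInteger, legal_widths={32} mimics a 32-bit-only target, and the real function has further rules after the shown checks.

```python
def should_change_type(from_width, to_width, legal_widths=frozenset({32})):
    """Simplified model of InstCombiner::shouldChangeType with the patch."""
    def is_legal(width):
        return width == 1 or width in legal_widths

    from_legal = is_legal(from_width)
    to_legal = is_legal(to_width)

    # Proposed exception: always allow narrowing a legal type to i8/i16.
    if from_legal and to_width < from_width and to_width in (8, 16):
        return True

    # Original rule: don't convert a legal type to an illegal one.
    if from_legal and not to_legal:
        return False
    return True

# On a 32-bit-only target, i32 -> i8 narrowing is now permitted...
assert should_change_type(32, 8)
# ...while converting i32 to another illegal width is still rejected.
assert not should_change_type(32, 64)
```

The early return fires only for shrinking to 8 or 16 bits from an already-legal width, so widening behavior and other illegal-width conversions are unchanged.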
Dave

________________________________________
From: Sanjay Patel <spatel at rotateright.com>
Sent: 17 January 2018 22:50
To: llvm-dev
Cc: David Green
Subject: always allow canonicalizing to 8- and 16-bit ops?
Sanjay Patel via llvm-dev
2018-Jan-22 18:00 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
Thanks for the perf testing. I assume that DAG legalization is equipped to handle these cases fairly well, or someone would've complained by now...

FWIW (and at least some of this can be blamed on me), instcombine already does the narrowing transforms without checking shouldChangeType() for binops like and/or/xor/udiv. The justification was that narrower ops are always better for bit-tracking. If we can eliminate casts, then that improves analysis even more and allows more follow-on transforms.

If there are no objections here, I think you can post your patch for review on Phabricator. If we can squash any of the regressions before that goes in, that would be even better.
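The "narrower ops are always better for bit-tracking" justification rests on bitwise ops acting on each bit independently, so truncation commutes with the operation. A quick Python check of that claim (again a model of the IR semantics, not LLVM code):

```python
import itertools
import operator

def wide_then_trunc(op, x, y):
    # zext i8 -> i32, apply the bitwise op at i32, trunc back to i8
    return op(x & 0xFF, y & 0xFF) & 0xFF

def narrow(op, x, y):
    # apply the op directly at i8
    return op(x, y) & 0xFF

# and/or/xor commute with truncation: exhaustive over all i8 pairs.
for op in (operator.and_, operator.or_, operator.xor):
    assert all(wide_then_trunc(op, x, y) == narrow(op, x, y)
               for x, y in itertools.product(range(256), repeat=2))
```

udiv is the odd one out in the list: it is narrowable for zero-extended operands not bit-by-bit but because the zext guarantees no high bits are set in either operand, so the wide quotient already fits in the narrow type.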
Alex Bradbury via llvm-dev
2018-Jan-23 09:37 UTC
[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?
On 17 January 2018 at 22:50, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Should we make an exception to allow narrowing for the common cases of i8
> and i16?

The current implementation of 64-bit code generation for RISC-V has i64 as the only legal type - with Krzysztof's variable-size register class support, simply switching GPRs to i64 was the path of least resistance for getting working codegen. In that case, it would be interesting to explore allowing narrowing for i32 as well.

I should note: RV64I codegen isn't yet upstreamed, primarily as I want to do more investigation and thinking about the pros/cons of this approach and discuss on llvm-dev. I don't want to derail this thread with that discussion.

Best,
Alex