thr3ads.net - llvm dev - [LLVMdev] instcombine does silly things with vector x+x [Oct 2011]

andrew adams

2011-Oct-28 21:13 UTC

[LLVMdev] instcombine does silly things with vector x+x

Consider the following function which doubles a <16 x i8> vector:

define <16 x i8> @test(<16 x i8> %a) {
       %b = add <16 x i8> %a, %a
       ret <16 x i8> %b
}

If I compile it for x86 with llc like so:

llc paddb.ll -filetype=asm -o=/dev/stdout

I get a two-op function that just does paddb %xmm0, %xmm0 and then
returns. llc does this regardless of the optimization level. Great!

If I let the instcombine pass touch it like so:

opt -instcombine paddb.ll |  llc -filetype=asm -o=/dev/stdout

or like so:

opt -O3 paddb.ll |  llc -filetype=asm -o=/dev/stdout

then the add gets converted to a vector left shift by 1, which then
lowers to a much slower function with about a hundred ops. No amount
of optimization after the fact will simplify it back to paddb.

I'm actually generating these ops in a JIT context, and I want to use
instcombine, as it seems like a useful pass. Any idea how I can
reliably generate the 128-bit sse version of paddb? I thought I might
be able to force the issue with an intrinsic, but there only seems to
be an intrinsic for the 64 bit version (llvm.x86.mmx.padd.b), and the
saturating 128 bit version (llvm.x86.sse2.padds.b). I would just give
up and use inline assembly, but it seems I can't JIT that.

I'm using the latest llvm 3.1 from svn. I get similar behavior at
llvm.org/demo using the following equivalent C code:

#include <emmintrin.h>
__m128i f(__m128i a) {
  return _mm_add_epi8(a, a);
}

The no-optimization compilation of this is better than the optimized version.

Any ideas? Should I just not use this pass?

- Andrew

Chris Lattner

2011-Oct-28 23:04 UTC

[LLVMdev] instcombine does silly things with vector x+x

On Oct 28, 2011, at 2:13 PM, andrew adams wrote:
> Consider the following function which doubles a <16 x i8> vector:
> 
> define <16 x i8> @test(<16 x i8> %a) {
>       %b = add <16 x i8> %a, %a
>       ret <16 x i8> %b
> }
> 
> If I compile it for x86 with llc like so:
> 
> llc paddb.ll -filetype=asm -o=/dev/stdout
> 
> I get a two-op function that just does paddb %xmm0, %xmm0 and then
> returns. llc does this regardless of the optimization level. Great!
> 
> If I let the instcombine pass touch it like so:
> 
> opt -instcombine paddb.ll |  llc -filetype=asm -o=/dev/stdout
> 
> or like so:
> 
> opt -O3 paddb.ll |  llc -filetype=asm -o=/dev/stdout
> 
> then the add gets converted to a vector left shift by 1, which then
> lowers to a much slower function with about a hundred ops. No amount
> of optimization after the fact will simplify it back to paddb.

This sounds like a really serious X86 backend performance bug. Canonicalizing
"x+x" to a shift is the "right thing to do"; the backend should match it.

-Chris
> 
> I'm actually generating these ops in a JIT context, and I want to use
> instcombine, as it seems like a useful pass. Any idea how I can
> reliably generate the 128-bit sse version of paddb? I thought I might
> be able to force the issue with an intrinsic, but there only seems to
> be an intrinsic for the 64 bit version (llvm.x86.mmx.padd.b), and the
> saturating 128 bit version (llvm.x86.sse2.padds.b). I would just give
> up and use inline assembly, but it seems I can't JIT that.
> 
> I'm using the latest llvm 3.1 from svn. I get similar behavior at
> llvm.org/demo using the following equivalent C code:
> 
> #include <emmintrin.h>
> __m128i f(__m128i a) {
>  return _mm_add_epi8(a, a);
> }
> 
> The no-optimization compilation of this is better than the optimized
> version.
> 
> Any ideas? Should I just not use this pass?
> 
> - Andrew
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Rotem, Nadav

2011-Oct-30 07:12 UTC

[LLVMdev] instcombine does silly things with vector x+x

Opened pr11266. I will try to make time to work on it.


