thr3ads.net - llvm dev - [LLVMdev] <8 x i1> legalization overhead on X86 with AVX2 [Sep 2014]

Lizunov, Andrey E
2014-Sep-12 12:53 UTC
[LLVMdev] <8 x i1> legalization overhead on X86 with AVX2

Hello everyone,

I'm facing a problem with inefficient vector code generation for X86 with
AVX2.
The problem in the example below is the unnecessary conversion of <8 x i32>
to <8 x i16> and back when a boolean vector crosses a BB boundary. I suppose
this happens because <8 x i1> is legalized to <8 x i16>. As far as I
understand, there is no such issue if all i1 vectors are computed inside a
single BB, because the conversions are then optimized out by Machine Code
Optimizations.

Here is a short reproducer I've written by hand (based on a real LLVM IR
sequence produced by a vectorizer I'm working with).
bash$ ./llc -mcpu=core-avx2 -filetype=asm -o legalization.overhead.s legalization.overhead.ll -asm-verbose

LLVM IR (legalization.overhead.ll):

define <8 x i32> @foo(<8 x i32> %arg0, <8 x i32> %arg1,
                      <8 x i32> %arg2, i1 %switch) {
BB0:
  %boolv0 = icmp sgt <8 x i32> %arg0, zeroinitializer
  br i1 %switch, label %BB1, label %BB2
BB1:
  %boolv1 = icmp sgt <8 x i32> %arg1, zeroinitializer
  br label %BB2
BB2:
  %boolvx = phi <8 x i1> [ %boolv0, %BB0 ], [ %boolv1, %BB1 ]
  %boolv2 = icmp sgt <8 x i32> %arg2, zeroinitializer
  %merge = and <8 x i1> %boolvx, %boolv2
  %res = sext <8 x i1> %merge to <8 x i32>
  ret <8 x i32> %res
}

Generated assembly (legalization.overhead.s):

foo:                                    # @foo
        .cfi_startproc
# BB#0:                                 # %BB0
        pushq   %rbp
.Ltmp2:
        .cfi_def_cfa_offset 16
.Ltmp3:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
.Ltmp4:
        .cfi_def_cfa_register %rbp
        vpxor   %ymm3, %ymm3, %ymm3
        vpcmpgtd        %ymm3, %ymm0, %ymm0
        vpshufb .LCPI0_0(%rip), %ymm0, %ymm0
        # ymm0 = ymm0[0,1,4,5,8,9,12,13,NULL,NULL,NULL,NULL,16,17,20,21,24,25,28,29,NULL...]
        vpermq  $8, %ymm0, %ymm0        # ymm0 = ymm0[0,2,0,0]
        testb   $1, %dil
        je      .LBB0_2
# BB#1:                                 # %BB1
        vpcmpgtd        %ymm3, %ymm1, %ymm0
        vpshufb .LCPI0_0(%rip), %ymm0, %ymm0
        vpermq  $8, %ymm0, %ymm0        # ymm0 = ymm0[0,2,0,0]
.LBB0_2:                                # %BB2
        vpcmpgtd        %ymm3, %ymm2, %ymm1
        vpshufb .LCPI0_0(%rip), %ymm1, %ymm1
        vpermq  $8, %ymm1, %ymm1        # ymm1 = ymm1[0,2,0,0]
        vpand   %xmm1, %xmm0, %xmm0
        vpmovzxwd       %xmm0, %ymm0
        vpslld  $31, %ymm0, %ymm0
        vpsrad  $31, %ymm0, %ymm0
        popq    %rbp
        ret

Instances of the problematic ASM sequences:

1.      <8 x i32> to <8 x i16>
vpshufb .LCPI0_0(%rip), %ymm0, %ymm0
vpermq  $8, %ymm0, %ymm0        # ymm0 = ymm0[0,2,0,0]


2.      <8 x i16> to <8 x i32>
vpmovzxwd       %xmm0, %ymm0
vpslld  $31, %ymm0, %ymm0
vpsrad  $31, %ymm0, %ymm0
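
To illustrate why these two sequences look redundant, here is a rough
lane-by-lane model in Python. It is only a sketch: the function names are
hypothetical, each vector is modeled as a list of 8 integers, and the shuffle
pair (vpshufb + vpermq) is assumed to do nothing more than keep the low 16
bits of every 32-bit lane. Under those assumptions, computing the result
through the <8 x i16> round trip gives the same lanes as staying in
<8 x i32> the whole time.

```python
MASK32 = 0xFFFFFFFF


def icmp_sgt_zero(lanes):
    """Model vpcmpgtd against zero: all-ones for true lanes, zero otherwise."""
    return [MASK32 if x > 0 else 0 for x in lanes]


def narrow_to_i16(mask_lanes):
    """Model sequence 1 (vpshufb + vpermq): keep the low 16 bits of each lane."""
    return [m & 0xFFFF for m in mask_lanes]


def widen_to_i32(lanes16):
    """Model sequence 2 (vpmovzxwd, vpslld $31, vpsrad $31) on each lane."""
    out = []
    for v in lanes16:
        shifted = (v << 31) & MASK32  # vpslld $31: bit 0 becomes the sign bit
        # vpsrad $31: arithmetic shift replicates the sign bit into every bit
        out.append(MASK32 if shifted & 0x80000000 else 0)
    return out


def foo_with_round_trip(arg0, arg1, arg2, switch):
    """Follow the generated code: masks cross the BB boundary as i16 lanes."""
    boolvx = narrow_to_i16(icmp_sgt_zero(arg1 if switch else arg0))
    boolv2 = narrow_to_i16(icmp_sgt_zero(arg2))
    merged = [a & b for a, b in zip(boolvx, boolv2)]  # vpand on i16 lanes
    return widen_to_i32(merged)


def foo_direct(arg0, arg1, arg2, switch):
    """Same computation with the masks kept as full 32-bit lanes."""
    boolvx = icmp_sgt_zero(arg1 if switch else arg0)
    boolv2 = icmp_sgt_zero(arg2)
    return [a & b for a, b in zip(boolvx, boolv2)]
```

In this model the narrowing and widening steps cancel out for any input,
because every mask lane is either all zeros or all ones. It says nothing
about how the backend should be changed to avoid emitting them.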

Currently I'm trying to figure out the best way to completely remove those
unnecessary conversions and would appreciate any help from the community.

Thank you!
Andrey Lizunov

--------------------------------------------------------------------
Closed Joint Stock Company Intel A/O
Registered legal address: Krylatsky Hills Business Park, 
17 Krylatskaya Str., Bldg 4, Moscow 121614, 
Russian Federation
