Lizunov, Andrey E
2014-Sep-12 12:53 UTC
[LLVMdev] <8 x i1> legalization overhead on X86 with AVX2
Hello everyone,

I'm facing a problem with inefficient vector code generation for X86 with AVX2. In the example below, a boolean vector that crosses a basic block (BB) boundary is converted from <8 x i32> to <8 x i16> and back, which is unnecessary. I suppose this happens because <8 x i1> is legalized to <8 x i16>. As far as I understand, there is no such issue when all i1 vectors are computed inside a single BB, because the conversions are then optimized out by the machine code optimizations.

Here is a short reproducer I've written by hand (based on a real LLVM IR sequence produced by a vectorizer I'm working with):

    bash$ ./llc -mcpu=core-avx2 -filetype=asm -o legalization.overhead.s legalization.overhead.ll -asm-verbose

Input IR (legalization.overhead.ll):

    define <8 x i32> @foo(<8 x i32> %arg0, <8 x i32> %arg1,
                          <8 x i32> %arg2, i1 %switch) {
    BB0:
      %boolv0 = icmp sgt <8 x i32> %arg0, zeroinitializer
      br i1 %switch, label %BB1, label %BB2
    BB1:
      %boolv1 = icmp sgt <8 x i32> %arg1, zeroinitializer
      br label %BB2
    BB2:
      %boolvx = phi <8 x i1> [ %boolv0, %BB0 ], [ %boolv1, %BB1 ]
      %boolv2 = icmp sgt <8 x i32> %arg2, zeroinitializer
      %merge = and <8 x i1> %boolvx, %boolv2
      %res = sext <8 x i1> %merge to <8 x i32>
      ret <8 x i32> %res
    }

Generated assembly (legalization.overhead.s):

    foo:                                    # @foo
            .cfi_startproc
    # BB#0:                                 # %BB0
            pushq   %rbp
    .Ltmp2:
            .cfi_def_cfa_offset 16
    .Ltmp3:
            .cfi_offset %rbp, -16
            movq    %rsp, %rbp
    .Ltmp4:
            .cfi_def_cfa_register %rbp
            vpxor   %ymm3, %ymm3, %ymm3
            vpcmpgtd %ymm3, %ymm0, %ymm0
            vpshufb .LCPI0_0(%rip), %ymm0, %ymm0
            # ymm0 = ymm0[0,1,4,5,8,9,12,13,NULL,NULL,NULL,NULL,16,17,20,21,24,25,28,29,NULL...]
            vpermq  $8, %ymm0, %ymm0        # ymm0 = ymm0[0,2,0,0]
            testb   $1, %dil
            je      .LBB0_2
    # BB#1:                                 # %BB1
            vpcmpgtd %ymm3, %ymm1, %ymm0
            vpshufb .LCPI0_0(%rip), %ymm0, %ymm0
            vpermq  $8, %ymm0, %ymm0        # ymm0 = ymm0[0,2,0,0]
    .LBB0_2:                                # %BB2
            vpcmpgtd %ymm3, %ymm2, %ymm1
            vpshufb .LCPI0_0(%rip), %ymm1, %ymm1
            vpermq  $8, %ymm1, %ymm1        # ymm1 = ymm1[0,2,0,0]
            vpand   %xmm1, %xmm0, %xmm0
            vpmovzxwd %xmm0, %ymm0
            vpslld  $31, %ymm0, %ymm0
            vpsrad  $31, %ymm0, %ymm0
            popq    %rbp
            ret

Instances of the problematic ASM sequences:

1. <8 x i32> to <8 x i16>

            vpshufb .LCPI0_0(%rip), %ymm0, %ymm0
            vpermq  $8, %ymm0, %ymm0        # ymm0 = ymm0[0,2,0,0]

2. <8 x i16> to <8 x i32>

            vpmovzxwd %xmm0, %ymm0
            vpslld  $31, %ymm0, %ymm0
            vpsrad  $31, %ymm0, %ymm0

I'm currently trying to figure out the best way to remove those unnecessary conversions completely, and would appreciate any help from the community.

Thank you!
Andrey Lizunov
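To illustrate why the two flagged sequences are redundant for this reproducer, here is a minimal scalar model of the per-lane arithmetic in Python. It is only a sketch: every function name is hypothetical, lanes are modeled as unsigned 32-bit patterns (0xFFFFFFFF stands for the i32 value -1), and the shuffle pair is assumed to do nothing more than keep the low 16 bits of each 32-bit lane. It does not model LLVM's legalizer itself.

```python
# Scalar model of the lane-level work in the reproducer's assembly.
# All names are hypothetical; this is not LLVM or intrinsics code.

MASK32 = 0xFFFFFFFF
MASK16 = 0xFFFF

def icmp_sgt_zero(lanes):
    """Model vpcmpgtd against zero: all-ones lane if the value is > 0, else 0."""
    return [MASK32 if x > 0 else 0 for x in lanes]

def narrow_to_i16(lanes32):
    """Model vpshufb + vpermq: keep the low 16 bits of each 32-bit lane."""
    return [x & MASK16 for x in lanes32]

def widen_to_mask32(lanes16):
    """Model vpmovzxwd, vpslld $31, vpsrad $31 on each lane."""
    out = []
    for x in lanes16:
        shifted = (x << 31) & MASK32  # vpslld $31: bit 0 becomes the sign bit
        # vpsrad $31: arithmetic shift replicates the sign bit across the lane
        out.append(MASK32 if shifted & 0x80000000 else 0)
    return out

def foo_with_round_trip(arg0, arg1, arg2, switch):
    """Follow the generated code: narrow both masks, AND, then widen again."""
    phi = narrow_to_i16(icmp_sgt_zero(arg1 if switch else arg0))
    m2 = narrow_to_i16(icmp_sgt_zero(arg2))
    merged = [a & b for a, b in zip(phi, m2)]  # vpand on the xmm halves
    return widen_to_mask32(merged)

def foo_direct(arg0, arg1, arg2, switch):
    """Same computation with the masks kept as 32-bit lanes throughout."""
    phi = icmp_sgt_zero(arg1 if switch else arg0)
    m2 = icmp_sgt_zero(arg2)
    return [a & b for a, b in zip(phi, m2)]
```

In this model both functions return identical lanes for any input, which is the sense in which the narrowing and widening steps add work without changing the result.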