search for: pshufd

Displaying 20 results from an estimated 36 matches for "pshufd".

2016 Aug 12
4
Invoke loop vectorizer
...= 0; i < SIZE; ++i) > > A[i] += B[i] + K; > > } > > [dannyb at dannyb-macbookpro3 11:37:20] ~ :) $ clang -O3 test.c -c > -save-temps > [dannyb at dannyb-macbookpro3 11:38:28] ~ :) $ pcregrep -i "^\s*p" > test.s|less > pushq %rbp > pshufd $68, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,0,1] > pslldq $8, %xmm1 ## xmm1 = > zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7] > pshufd $68, %xmm3, %xmm3 ## xmm3 = xmm3[0,1,0,1] > paddq %xmm1, %xmm3 > pshufd $78,...
2016 Aug 12
2
Invoke loop vectorizer
Hi Daniel, I increased the size of your test to 128, but -stats still shows no loop optimized... Xiaochu On Aug 12, 2016 11:11 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote: > It's not possible to know that A and B don't alias in this example. It's > almost certainly not profitable to add a runtime check given the size of > the loop. > > >
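The alias issue Daniel points out can be sidestepped at the source level. A minimal sketch (function and array names are illustrative, not from the thread): with `restrict`, the compiler may assume A and B never overlap, so the loop vectorizer needs neither a proof of non-aliasing nor a runtime overlap check.

```c
#include <assert.h>

#define SIZE 128

/* 'restrict' promises the compiler that A and B never overlap, so the
 * loop can be vectorized without emitting a runtime alias check. */
void add_arrays(int *restrict A, const int *restrict B, int K) {
    for (int i = 0; i < SIZE; ++i)
        A[i] += B[i] + K;
}
```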
2009 Dec 18
2
[LLVMdev] AVX Shuffles & PatLeaf Help Needed
...r thinking a bit more. However, I think in this case we only want to lower to vNi32 since there are no immediate-mask shuffles in X86 that operate on smaller element types. Doing it at the byte level would just be more confusing, I think. PSHUFB is really a completely different instruction than PSHUFD, for example. -Dave
2009 Dec 18
0
[LLVMdev] AVX Shuffles & PatLeaf Help Needed
...owever, I think in this > case we only want to lower to vNi32 since there are no immediate-mask shuffles > in X86 that operate on smaller element types. Doing it at the byte level > would just be more confusing, I think. > > PSHUFB is really a completely different instruction than PSHUFD, for example. Aside from consuming one of its inputs, which is a regalloc problem, it isn't really different. It's just a one-input immediate shuffle, where the immediate is not encoded in the instruction. From the perspective of the shuffle instruction, all the x86 shuffles are just var...
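The contrast being debated can be written out as scalar models (a sketch of the documented SSE semantics, not compiler code): PSHUFD selects dwords through a 2-bit-per-lane immediate encoded in the instruction, while PSHUFB selects bytes through a second register operand, with a set high bit in a mask byte zeroing the destination byte.

```c
#include <stdint.h>

/* Scalar model of PSHUFD: each destination dword selects a source
 * dword via a 2-bit field of the 8-bit immediate. */
void pshufd_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm) {
    for (int i = 0; i < 4; ++i)
        dst[i] = src[(imm >> (2 * i)) & 3];
}

/* Scalar model of PSHUFB: each destination byte selects a source byte
 * via the low nibble of the mask byte; a set high bit zeroes it. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                  const uint8_t mask[16]) {
    for (int i = 0; i < 16; ++i)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}
```

With `imm = 78` (0x4E), the dword model reproduces the `xmm0 = xmm1[2,3,0,1]` pattern that appears in the listings elsewhere on this page.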
2013 Feb 19
2
[LLVMdev] Is it a bug or am I missing something ?
....align 16, 0x90 .type sample_test,@function sample_test: # @sample_test # BB#0: # %L.entry movl 4(%esp), %eax movss 304(%eax), %xmm0 xorps %xmm1, %xmm1 movl 8(%esp), %eax movups %xmm1, 624(%eax) pshufd $65, %xmm0, %xmm0 # xmm0 = xmm0[1,0,0,1] movdqu %xmm0, 608(%eax) ret .Ltmp0: .size sample_test, .Ltmp0-sample_test .section ".note.GNU-stack","",@progbits It seems to me that this sequence of instructions is building the vector: <float...
2015 Nov 19
5
[RFC] Introducing a vector reduction add instruction.
...psadbw %xmm2, %xmm3 paddd %xmm3, %xmm0 movd b+1028(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero movd a+1028(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero psadbw %xmm2, %xmm3 paddd %xmm3, %xmm1 addq $8, %rax jne .LBB0_1 # BB#2: # %middle.block paddd %xmm0, %xmm1 pshufd $78, %xmm1, %xmm0 # xmm0 = xmm1[2,3,0,1] paddd %xmm1, %xmm0 pshufd $229, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3] paddd %xmm0, %xmm1 movd %xmm1, %eax retq Note that due to smaller VF we are using now (currently 4), we could not explore the most benefit of psadbw. The patch in http://reviews...
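The reduction above leans on PSADBW. A scalar model of one of its 64-bit lanes (a sketch of the documented semantics): it sums the absolute differences of eight unsigned bytes, which is exactly the inner step of the byte-difference reduction the RFC wants the vectorizer to form.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of PSADBW: the sum of absolute
 * differences of eight unsigned bytes, accumulated into 16 bits. */
uint16_t psadbw_lane(const uint8_t a[8], const uint8_t b[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (uint16_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    return sum;
}
```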
2016 Aug 05
3
enabling interleaved access loop vectorization
...# %vector.body # =>This Inner Loop Header: Depth=1 movdqu (%rdi,%rcx), %xmm0 movdqu 4(%rdi,%rcx), %xmm1 movdqu 8(%rdi,%rcx), %xmm2 paddd %xmm0, %xmm1 paddd %xmm2, %xmm1 movdqu (%rdi,%rcx,2), %xmm0 movdqu 16(%rdi,%rcx,2), %xmm2 pshufd $132, %xmm2, %xmm2 # xmm2 = xmm2[0,1,0,2] pshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3] pblendw $240, %xmm2, %xmm0 # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7] paddd %xmm1, %xmm0 movdqu %xmm0, (%rsi,%rcx) cmpq $992, %rcx # imm = 0x3E0 jne .LBB0_7 The performance I see out of...
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet. We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutation (shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer
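An illustrative stride-2 loop of the kind this cost analysis has to handle (the function is made up for the sketch, not taken from the thread): vectorizing it requires wide loads plus de-interleaving shuffles such as the pshufd/pblendw pairs shown in the listings on this page, and whether that beats scalar code depends on how many shuffles the access pattern implies.

```c
/* Stride-2 (interleaved) access: a vectorizer must load wide and then
 * shuffle even and odd elements apart before performing the add. */
void sum_pairs(int *out, const int *in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[2 * i] + in[2 * i + 1];
}
```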
2013 Feb 19
0
[LLVMdev] Is it a bug or am I missing something ?
....align 16, 0x90 .type sample_test,@function sample_test: # @sample_test # BB#0: # %L.entry movl 4(%esp), %eax movss 304(%eax), %xmm0 xorps %xmm1, %xmm1 movl 8(%esp), %eax movups %xmm1, 624(%eax) pshufd $65, %xmm0, %xmm0 # xmm0 = xmm0[1,0,0,1] movdqu %xmm0, 608(%eax) ret .Ltmp0: .size sample_test, .Ltmp0-sample_test .section ".note.GNU-stack","",@progbits It seems to me that this sequence of instructions is building the vector: <float...
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
...mm2 = mem[0],zero,zero,zero >> movd a+1028(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero >> psadbw %xmm2, %xmm3 >> paddd %xmm3, %xmm1 >> addq $8, %rax >> jne .LBB0_1 >> # BB#2: # %middle.block >> paddd %xmm0, %xmm1 >> pshufd $78, %xmm1, %xmm0 # xmm0 = xmm1[2,3,0,1] >> paddd %xmm1, %xmm0 >> pshufd $229, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3] >> paddd %xmm0, %xmm1 >> movd %xmm1, %eax >> retq >> >> >> Note that due to smaller VF we are using now (currently 4), we cou...
2010 May 11
0
[LLVMdev] How does SSEDomainFix work?
...moved to the int domain because the add forced them. > Please tell me if something would be wrong for me. You should measure whether LLVM's code is actually slower than the code you want. If it is, I would like to hear. Our weakness is the shufflevector instruction. It is selected into shufps/pshufd/palign/... only by looking at patterns. The instruction selector does not consider execution domains. This can be a problem because these instructions cannot be freely interchanged by the SSE execution domain pass. > foo.ll: > define <4 x i32> @foo(<4 x i32> %x, <4 x i32>...
2016 Aug 05
2
enabling interleaved access loop vectorization
...# =>This Inner Loop Header: Depth=1 > > movdqu (%rdi,%rcx), %xmm0 > > movdqu 4(%rdi,%rcx), %xmm1 > > movdqu 8(%rdi,%rcx), %xmm2 > > paddd %xmm0, %xmm1 > > paddd %xmm2, %xmm1 > > movdqu (%rdi,%rcx,2), %xmm0 > > movdqu 16(%rdi,%rcx,2), %xmm2 > > pshufd $132, %xmm2, %xmm2 # xmm2 = xmm2[0,1,0,2] > > pshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3] > > pblendw $240, %xmm2, %xmm0 # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7] > > paddd %xmm1, %xmm0 > > movdqu %xmm0, (%rsi,%rcx) > > cmpq $992, %rcx # imm...
2014 Jul 23
4
[LLVMdev] the clang 3.5 loop optimizer seems to jump in unintentional for simple loops
...%r8, %rdi movq %rax, %rdx jmp .LBB0_5 .LBB0_1: pxor %xmm1, %xmm1 .LBB0_5: # %middle.block paddd %xmm1, %xmm0 movdqa %xmm0, %xmm1 movhlps %xmm1, %xmm1 # xmm1 = xmm1[1,1] paddd %xmm0, %xmm1 pshufd $1, %xmm1, %xmm0 # xmm0 = xmm1[1,0,0,0] paddd %xmm1, %xmm0 movd %xmm0, %eax cmpq %rdx, %rsi je .LBB0_7 .align 16, 0x90 .LBB0_6: # %scalar.ph # =>This Inner Loop Header:...
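The middle.block tail above is a log2-step horizontal reduction: fold the high pair of lanes onto the low pair (movhlps + paddd), then fold the remaining two lanes together (pshufd + paddd). A scalar model of the same dataflow:

```c
#include <stdint.h>

/* Scalar model of the 4-lane horizontal add in the middle.block:
 * step 1 folds lanes {2,3} onto {0,1}; step 2 folds lane 1 onto 0. */
int32_t hadd4(const int32_t v[4]) {
    int32_t lo0 = v[0] + v[2];  /* movhlps + paddd */
    int32_t lo1 = v[1] + v[3];
    return lo0 + lo1;           /* pshufd $1 + paddd, then movd */
}
```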
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote: > Is there a compile-time and/or potential runtime cost that makes > enableInterleavedAccessVectorization() default to 'false'? > > I notice that this is set to true for ARM, AArch64, and PPC. > > In particular, I'm wondering if there's a reason it's not enabled for
2010 Sep 22
1
[LLVMdev] LLVM 2.8 and MMX
...ger be selected for certain cases. > > I've attached a potential fix for the 2.8 branch. > > The real problem is that the code above it which checks for > isUNPCK[L|H]_v_undef_Mask cases is only for when OptForSize is true. It > assumes that otherwise things can get lowered to PSHUFD (which is true for > v4i32 and v4f32 but nothing else - in particular MMX operations). > > I'll file a bug now... > > Nicolas > > > -----Original Message----- > From: Dale Johannesen [mailto:dalej at apple.com] > Sent: Wednesday, September 22, 2010 2:37 > To: Bi...
2013 Jul 28
2
[LLVMdev] Enabling the SLP-vectorizer by default for -O3
...r-read/writes to the same addresses as the vector store. Maybe the processors can’t prune multiple stores to the same address with different sizes (Section 2.2.4 in the optimization guide has some info on this). Another possibility (less likely) is that we increase the critical path by adding a new pshufd instruction before the last vector store and that affects the store-buffer somehow. In any case, there is not much we can do at the IR-level to predict this. Performance Regressions - Compile Time Δ Previous Current σ MultiSource/Benchmarks/VersaBench/beamformer/beamformer 18.98% 0.0722 0.0859...
2010 May 11
2
[LLVMdev] How does SSEDomainFix work?
Hello. This is my 1st post. I have tried the SSE execution domain fixup pass, but I am not able to see any improvements. I expect the example below to use MOVDQA, PAND, etc. (On Nehalem, ANDPS is much slower than PAND.) Please tell me if I am doing something wrong. Thank you. Takumi Host: i386-mingw32 Build: trunk at 103373 foo.ll: define <4 x i32> @foo(<4 x i32> %x,
2014 Apr 11
16
[LLVMdev] LLVM 3.4.1 - Testing Phase
Hi, I have just tagged the first release candidate for the 3.4.1 release, so testers may begin testing. Please refer to http://llvm.org/docs/ReleaseProcess.html for information on how to validate a release. If you have any questions or need something clarified, just email the list. For the 3.4.1 release we want to compare test results against 3.4-final. I have added support to the
2016 Apr 01
2
RFC: A proposal for vectorizing loops with calls to math functions using SVML
...g !6 %5 = icmp eq i64 %index.next, 1000, !dbg !6 br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15 .LBB0_1: # %vector.body # =>This Inner Loop Header: Depth=1 movd %ebx, %xmm0 pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0] paddd .LCPI0_0(%rip), %xmm0 cvtdq2ps %xmm0, %xmm0 movaps %xmm0, 16(%rsp) # 16-byte Spill shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3] callq sinf movaps %xmm0, (%rsp)...
2016 Apr 04
2
RFC: A proposal for vectorizing loops with calls to math functions using SVML
...g !6 %5 = icmp eq i64 %index.next, 1000, !dbg !6 br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15 .LBB0_1: # %vector.body # =>This Inner Loop Header: Depth=1 movd %ebx, %xmm0 pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0] paddd .LCPI0_0(%rip), %xmm0 cvtdq2ps %xmm0, %xmm0 movaps %xmm0, 16(%rsp) # 16-byte Spill shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3] callq sinf movaps %xmm0, (%rsp)...