thr3ads.net - search: "pblendw"

Displaying 5 results from an estimated 5 matches for "pblendw".

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

2014 Sep 10

On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Awesome, thanks for all the information!
>
> See below:
>
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>>
>> You have already mentioned how the new shuffle lowering is missing
>> some features; for example, you explicitly

enabling interleaved access loop vectorization

2016 Aug 05

...Header: Depth=1
        movdqu  (%rdi,%rcx), %xmm0
        movdqu  4(%rdi,%rcx), %xmm1
        movdqu  8(%rdi,%rcx), %xmm2
        paddd   %xmm0, %xmm1
        paddd   %xmm2, %xmm1
        movdqu  (%rdi,%rcx,2), %xmm0
        movdqu  16(%rdi,%rcx,2), %xmm2
        pshufd  $132, %xmm2, %xmm2      # xmm2 = xmm2[0,1,0,2]
        pshufd  $232, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3]
        pblendw $240, %xmm2, %xmm0      # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]
        paddd   %xmm1, %xmm0
        movdqu  %xmm0, (%rsi,%rcx)
        cmpq    $992, %rcx              # imm = 0x3E0
        jne     .LBB0_7

The performance I see out of the 3 versions (with a 500K-iteration outer loop):
Scalar: 0m10.320s
Vector (Non-interleaved): 0m8.054s
Vec...
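As a reading aid for the immediates in the listing above, here is a minimal Python sketch of how the pshufd and pblendw immediates decode into the lane selections shown in the assembly comments. It models registers as plain Python lists; the function names and the list-based register model are assumptions made for illustration and are not taken from the thread or from LLVM.

```python
# Illustrative model only: registers are Python lists of lanes.
# The helper names below are hypothetical, chosen to mirror the mnemonics.

def pshufd(src, imm8):
    """Model pshufd on four 32-bit lanes.

    Result lane i is the source lane selected by bits [2i+1:2i] of imm8.
    """
    assert len(src) == 4
    return [src[(imm8 >> (2 * i)) & 0b11] for i in range(4)]


def pblendw(dst, src, imm8):
    """Model pblendw on eight 16-bit words.

    If bit i of imm8 is set, word i comes from src; otherwise dst is kept.
    """
    assert len(dst) == 8 and len(src) == 8
    return [src[i] if (imm8 >> i) & 1 else dst[i] for i in range(8)]


# Feeding lane indices through pshufd reproduces the comments in the listing.
print(pshufd([0, 1, 2, 3], 132))  # [0, 1, 0, 2]
print(pshufd([0, 1, 2, 3], 232))  # [0, 2, 2, 3]

# $240 is 0xF0: words 0-3 stay from the destination (xmm0),
# words 4-7 are taken from the source (xmm2).
xmm0 = [f"xmm0[{i}]" for i in range(8)]
xmm2 = [f"xmm2[{i}]" for i in range(8)]
print(pblendw(xmm0, xmm2, 240))
```

In the AT&T syntax used in the listing, the last register operand is the destination, so `pblendw $240, %xmm2, %xmm0` corresponds to `pblendw(xmm0, xmm2, 240)` in this sketch.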

enabling interleaved access loop vectorization

2016 May 26

Interleaved access is not enabled on X86 yet. We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutation (shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer

enabling interleaved access loop vectorization

2016 Aug 05

...ovdqu 8(%rdi,%rcx), %xmm2
> > paddd %xmm0, %xmm1
> > paddd %xmm2, %xmm1
> > movdqu (%rdi,%rcx,2), %xmm0
> > movdqu 16(%rdi,%rcx,2), %xmm2
> > pshufd $132, %xmm2, %xmm2 # xmm2 = xmm2[0,1,0,2]
> > pshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3]
> > pblendw $240, %xmm2, %xmm0 # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]
> > paddd %xmm1, %xmm0
> > movdqu %xmm0, (%rsi,%rcx)
> > cmpq $992, %rcx # imm = 0x3E0
> > jne .LBB0_7
> >
> > The performance I see out of the 3 versions (with a 500K-iteration outer
>...

enabling interleaved access loop vectorization

2016 May 26

On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
