Displaying 6 results from an estimated 6 matches for "paddq".
2016 Aug 05 | 3 | enabling interleaved access loop vectorization
...et:
.LBB0_3: # %vector.body
# =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rax,4), %xmm3
movd %xmm0, %rcx
movdqu 4(%rdi,%rcx,4), %xmm4
paddd %xmm3, %xmm4
movdqu 8(%rdi,%rcx,4), %xmm3
paddd %xmm4, %xmm3
movdqa %xmm1, %xmm4
paddq %xmm4, %xmm4
movdqa %xmm0, %xmm5
paddq %xmm5, %xmm5
movd %xmm5, %rcx
pextrq $1, %xmm5, %rdx
movd %xmm4, %r8
pextrq $1, %xmm4, %r9
movd (%rdi,%rcx,4), %xmm4 # xmm4 = mem[0],zero,zero,zero
pinsrd $1, (%rdi,%rdx,4), %xmm4
pinsrd $2, (%rdi,%r8,4), %xmm4
pinsrd $3, (%rdi,%r9,4), %xmm4
paddd %xmm3, %x...
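The source loop for this code is truncated out of the excerpt above. Purely as a hypothetical illustration (not the test case from the thread), a C++ loop that mixes consecutive-but-offset reads with a strided read exercises the same codegen: the offset reads become the unaligned movdqu loads, while the strided in[2*i] access, with no interleaved-access lowering, is scalarized into the doubled 64-bit indices (paddq) and the element-by-element movd/pinsrd inserts seen above.

// Hypothetical test case, not the one from the thread (that part of the message is truncated).
void foo(const int *in, int *out, long long n) {
  for (long long i = 0; i < n; ++i)
    // in[i], in[i+1], in[i+2]: consecutive accesses -> vector loads (movdqu).
    // in[2*i]: stride-2 access -> scalarized gather (movd/pinsrd) when
    // interleaved-access vectorization is not enabled for the target.
    out[i] = in[i] + in[i + 1] + in[i + 2] + in[2 * i];
}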
2016 May 26 | 2 | enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutations (the shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer
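For context, a minimal sketch of what "interleaved access" means here (illustrative only, not code from the thread): a group of strided accesses that could be covered by wide consecutive loads plus shuffles that de-interleave the lanes. The number of shuffles needed depends on the permutation, which is exactly the cost the message above says the X86 backend has to model before the feature can be turned on.

// Illustrative stride-2 (interleaved) access group.
// Vectorizing it as "loads + shuffles" means loading pairs[2*i .. 2*i+7] with two
// consecutive vector loads and then shuffling even lanes into one vector (re)
// and odd lanes into another (im); the shuffle count depends on the mask.
void deinterleave(const float *pairs, float *re, float *im, long long n) {
  for (long long i = 0; i < n; ++i) {
    re[i] = pairs[2 * i];     // even elements: stride-2 load
    im[i] = pairs[2 * i + 1]; // odd elements:  stride-2 load
  }
}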
2016 Aug 05 | 2 | enabling interleaved access loop vectorization
...# =>This Inner Loop Header: Depth=1
> movdqu (%rdi,%rax,4), %xmm3
> movd %xmm0, %rcx
> movdqu 4(%rdi,%rcx,4), %xmm4
> paddd %xmm3, %xmm4
> movdqu 8(%rdi,%rcx,4), %xmm3
> paddd %xmm4, %xmm3
> movdqa %xmm1, %xmm4
> paddq %xmm4, %xmm4
> movdqa %xmm0, %xmm5
> paddq %xmm5, %xmm5
> movd %xmm5, %rcx
> pextrq $1, %xmm5, %rdx
> movd %xmm4, %r8
> pextrq $1, %xmm4, %r9
> movd (%rdi,%rcx,4), %xmm4 # xmm4 = mem[0],zero,zero,zero
> pinsrd $1, (%rdi,%rdx,4), %xm...
2016 May 26 | 0 | enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
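For background, a rough sketch of how that switch is wired up, modeled on how the ARM/AArch64 targets opt in (the class name and layout below are assumptions and vary across LLVM versions): the loop vectorizer queries a TargetTransformInfo hook, which defaults to false, and a target enables the feature by overriding it to return true.

// Rough sketch only; modeled on the ARM/AArch64 TTI implementations, details may differ.
class HypotheticalX86TTIImpl /* : public BasicTTIImplBase<...> */ {
public:
  // The loop vectorizer checks this hook before forming interleaved access groups.
  // Returning true only allows the transformation; profitability still depends on
  // the target's cost for the interleaved memory operations.
  bool enableInterleavedAccessVectorization() { return true; }
};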
2016 Aug 12 | 4 | Invoke loop vectorizer
...> pushq %rbp
> pshufd $68, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,0,1]
> pslldq $8, %xmm1 ## xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
> pshufd $68, %xmm3, %xmm3 ## xmm3 = xmm3[0,1,0,1]
> paddq %xmm1, %xmm3
> pshufd $78, %xmm3, %xmm4 ## xmm4 = xmm3[2,3,0,1]
> punpckldq %xmm5, %xmm4 ## xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1]
> pshufd $212, %xmm4, %xmm4 ## xmm4 = xmm4[0,1,1,3]
>
> Note:
> It also vectorizes at S...
2016 Aug 12 | 2 | Invoke loop vectorizer
Hi Daniel,
I increased the size of your test to 128, but -stats still shows no loop optimized...
Xiaochu
On Aug 12, 2016 11:11 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote:
> It's not possible to know that A and B don't alias in this example. It's
> almost certainly not profitable to add a runtime check given the size of
> the loop.
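To illustrate the point about aliasing (a hedged sketch; the actual test case from the thread is not shown here): when the compiler cannot prove that A and B do not overlap, vectorizing requires a runtime overlap check, and for a short loop that check can cost more than vectorization saves. Telling the compiler the pointers do not alias, e.g. with __restrict__, removes the obstacle at the source level.

// Hedged illustration, not the test case from the thread.
// With plain int* parameters, A and B may overlap, so the vectorizer would have to
// emit a runtime no-overlap check, which is rarely worth it for a small trip count.
// __restrict__ promises that A and B do not alias, so no check is needed.
void add(int *__restrict__ A, const int *__restrict__ B, int n) {
  for (int i = 0; i < n; ++i)
    A[i] += B[i];
}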