Displaying 20 results from an estimated 75 matches for "xmm4".
2015 Jun 26
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
...mp ne i128 %BCS54_D, 0
br i1 %mskS54_D, label %middle.block, label %vector.ph
Now the assembly for the above IR code is:
# BB#4: # %for.cond.preheader
vmovdqa 144(%rsp), %xmm0 # 16-byte Reload
vpmuludq %xmm7, %xmm0, %xmm2
vpsrlq $32, %xmm7, %xmm4
vpmuludq %xmm4, %xmm0, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpsrlq $32, %xmm0, %xmm4
vpmuludq %xmm7, %xmm4, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpextrq $1, %xmm2, %rax
cltq
vmovq %rax, %xmm4
vmovq...
2013 Jul 19
0
[LLVMdev] llvm.x86.sse2.sqrt.pd not using sqrtpd, calling a function that modifies ECX
...mmword ptr [esp+0B0h],xmm7
002E012F movapd xmm3,xmm1
002E0133 movlpd qword ptr [esp+0F0h],xmm3
002E013C movhpd qword ptr [esp+0E0h],xmm3
002E0145 movlpd qword ptr [esp+100h],xmm7
002E014E pshufd xmm0,xmm7,44h
002E0153 movdqa xmm5,xmm0
002E0157 xorpd xmm4,xmm4
002E015B mulpd xmm5,xmm4
002E015F pshufd xmm2,xmm3,44h
002E0164 movdqa xmm1,xmm2
002E0168 mulpd xmm1,xmm4
002E016C xorpd xmm7,xmm7
002E0170 movapd xmm4,xmmword ptr [esp+70h]
002E0176 subpd xmm4,xmm1
002E017A pshufd xmm3,xmm3,0EEh
002...
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
...shufps xmm0, xmm0, 0x00
+ shufps xmm1, xmm1, 0x00
+
+ movaps xmm2, [eax+4]
+ movaps xmm3, [ebx+4]
+ mulps xmm2, xmm0
+ mulps xmm3, xmm1
+ movaps xmm4, [eax+20]
+ mulps xmm4, xmm0
+ addps xmm2, [ecx+4]
+ movaps xmm5, [ebx+20]
+ mulps xmm5, xmm1
+ addps xmm4, [ecx+20]
+ subps xmm2, xmm3
+ mo...
2015 Jun 24
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
Hi,
Is LLVM able to generate code for the following?
%mul = mul <2 x i32> %1, %2, where %1 and %2 are <2 x i32> type.
I am running it on a Haswell processor with LLVM-3.4.2. It seems that it
generates really complicated code with vpaddq, vpmuludq, vpsllq, and
vpsrlq.
Thanks,
Zhi
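For reference, a minimal self-contained IR reproducer for the multiply in question might look like the sketch below; the function name, file name, and llc invocation are illustrative assumptions, not taken from the original post.

; widen_mul.ll -- hypothetical reproducer; try e.g.:
;   llc -mcpu=core-avx2 widen_mul.ll -o -
define <2 x i32> @widen_mul(<2 x i32> %a, <2 x i32> %b) {
entry:
  ; the <2 x i32> multiply the question asks about
  %mul = mul <2 x i32> %a, %b
  ret <2 x i32> %mul
}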
2016 Aug 12
4
Invoke loop vectorizer
...mm0, %xmm0 ## xmm0 = xmm0[0,1,0,1]
> pslldq $8, %xmm1 ## xmm1 =
> zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
> pshufd $68, %xmm3, %xmm3 ## xmm3 = xmm3[0,1,0,1]
> paddq %xmm1, %xmm3
> pshufd $78, %xmm3, %xmm4 ## xmm4 = xmm3[2,3,0,1]
> punpckldq %xmm5, %xmm4 ## xmm4 =
> xmm4[0],xmm5[0],xmm4[1],xmm5[1]
> pshufd $212, %xmm4, %xmm4 ## xmm4 = xmm4[0,1,1,3]
>
>
>
> Note:
> It also vectorizes at SIZE=8.
>
> Not sure what the exact translation o...
2013 Jul 19
4
[LLVMdev] SIMD instructions and memory alignment on X86
Hmm, I'm not able to get those .ll files to compile if I disable SSE, and I
end up with SSE instructions (including sqrtpd) if I don't disable it.
On Thu, Jul 18, 2013 at 10:53 PM, Peter Newman <peter at uformia.com> wrote:
> Is there something specifically required to enable SSE? If it's not
> detected as available (based from the target triple?) then I don't think
2016 Aug 05
3
enabling interleaved access loop vectorization
...he vectorized code is actually fairly decent - e.g. forcing vectorization, with SSE4.2, we get:
.LBB0_3: # %vector.body
# =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rax,4), %xmm3
movd %xmm0, %rcx
movdqu 4(%rdi,%rcx,4), %xmm4
paddd %xmm3, %xmm4
movdqu 8(%rdi,%rcx,4), %xmm3
paddd %xmm4, %xmm3
movdqa %xmm1, %xmm4
paddq %xmm4, %xmm4
movdqa %xmm0, %xmm5
paddq %xmm5, %xmm5
movd %xmm5, %rcx
pextrq $1, %xmm5, %rdx
movd %xmm4, %r8
pextrq $1, %xmm4, %r9
movd (%rdi,%rcx,4), %xmm4 # xmm4 = mem[0],zero,zero,zero
pinsrd $1, (%rdi...
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutations (shuffle masks). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer
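To make the cost concern concrete, here is an illustrative IR sketch (not from the thread; function and value names are hypothetical) of a stride-2 interleaved access modeled as one wide load plus two shufflevector de-interleaves. The number of such shuffles, and whether they fold into surrounding code, is what the cost model has to estimate.

; deinterleave.ll -- hypothetical example of the "loads + shuffles" shape
define void @deinterleave(<8 x i32>* %p, <4 x i32>* %even.out, <4 x i32>* %odd.out) {
entry:
  ; one wide load covering two interleaved <4 x i32> sequences
  %wide = load <8 x i32>, <8 x i32>* %p, align 4
  ; shuffle out the even and odd lanes (stride-2 de-interleave)
  %even = shufflevector <8 x i32> %wide, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <8 x i32> %wide, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  store <4 x i32> %even, <4 x i32>* %even.out, align 4
  store <4 x i32> %odd,  <4 x i32>* %odd.out, align 4
  ret void
}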
2016 Aug 12
2
Invoke loop vectorizer
Hi Daniel,
I increased the size of your test to be 128 but -stats still shows no loop
optimized...
Xiaochu
On Aug 12, 2016 11:11 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote:
> It's not possible to know that A and B don't alias in this example. It's
> almost certainly not profitable to add a runtime check given the size of
> the loop.
2016 Aug 05
2
enabling interleaved access loop vectorization
...vectorization, with SSE4.2, we get:
> .LBB0_3: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movdqu (%rdi,%rax,4), %xmm3
> movd %xmm0, %rcx
> movdqu 4(%rdi,%rcx,4), %xmm4
> paddd %xmm3, %xmm4
> movdqu 8(%rdi,%rcx,4), %xmm3
> paddd %xmm4, %xmm3
> movdqa %xmm1, %xmm4
> paddq %xmm4, %xmm4
> movdqa %xmm0, %xmm5
> paddq %xmm5, %xmm5
> movd %xmm5, %rcx
> pextrq $1, %xmm5, %rdx
> movd %xmm4, %r...
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
2014 Sep 05
3
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com>
wrote:
> Unfortunately, another team, while doing internal testing has seen the
> new path generating illegal insertps masks. A sample here:
>
> vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
> vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
> vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
> vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 =
> xmm4[0,1],xmm1[2],xmm4[3]
> vinsertps $...
2014 Sep 04
2
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Greetings all,
As you may have noticed, there is a new vector shuffle lowering path in the
X86 backend. You can try it out with the
'-x86-experimental-vector-shuffle-lowering' flag to llc, or '-mllvm
-x86-experimental-vector-shuffle-lowering' to clang. Please test it out!
There may be some correctness bugs; I'm still fuzz testing it to shake them
out. But I expect fairly few
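As a usage note, the flags quoted above can be exercised on any small shuffle test; the IR below is an illustrative sketch (the file and function names are assumptions, only the flag itself comes from the announcement).

; shuffle_test.ll -- hypothetical test case; try e.g.:
;   llc -x86-experimental-vector-shuffle-lowering shuffle_test.ll -o -
define <4 x float> @swap_halves(<4 x float> %v) {
entry:
  ; swap the low and high 64-bit halves of the vector
  %s = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x float> %s
}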
2013 Aug 22
2
New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
...3dnow
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32_mmx
@@ -596,7 +597,7 @@
movss xmm3, xmm2
movss xmm2, xmm0
- ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm3:xmm3:xmm2
+ ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm4:xmm3:xmm2
movaps xmm1, xmm0
mulps xmm1, xmm2
addps xmm5, xmm1
@@ -619,6 +620,95 @@
ret
ALIGN 16
+cident FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
+ ;[ebp + 20] == autoc[]
+ ;[ebp + 16] == lag
+ ;[ebp + 12] == data_len
+ ;[ebp + 8] == data[]
+ ;[esp] == __m128
+ ;[esp +...
2009 Jan 31
2
[LLVMdev] Optimized code analysis problems
...elow.
I get the function call names as llvm.x86.* intrinsics instead of the
original function names (e.g. _mm_cvtsi32_si128)
#include <pmmintrin.h>
#include <sys/time.h>
#include <iostream>
void foo_opt(unsigned char output[64], int Yc[64], int S_BITS)
{
__m128i XMM1, XMM2, XMM3, XMM4;
__m128i *xmm1 = (__m128i*)Yc;
__m128i XMM5 = _mm_cvtsi32_si128(S_BITS + 3) ;
XMM2 = _mm_set1_epi32(S_BITS + 2);
for (int l = 0; l < 8; l++) {
XMM1 = _mm_loadu_si128(xmm1++);
XMM3 = _mm_add_epi32(XMM1, XMM2);
XMM1 = _mm_cmplt_epi32(XMM1, _mm_setzero...
2015 Jan 29
2
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...= xmm2[3,0],xmm0[0,0]
> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
> vmovaps -0xXX(%rbp), %xmm3
> ...
> vinsertps $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
> becomes:
> vpermilps $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
> ...
> vinsertps $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>
> Note that the second version does the shuffle in-place, in xmm2....
2014 Sep 05
2
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
...Robert Lougher <rob.lougher at gmail.com>
>> wrote:
>>>
>>> Unfortunately, another team, while doing internal testing has seen the
>>> new path generating illegal insertps masks. A sample here:
>>>
>>> vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>>> vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>>> vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>>> vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 =
>>> xmm4[0,1],xmm1[2],xm...
2014 Sep 19
4
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
...ible.
3. When zero-extending 2 packed 32-bit integers, we should try to
emit a vpmovzxdq
Example:
vmovq 20(%rbx), %xmm0
vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
Before:
vpmovzxdq 20(%rbx), %xmm0
4. We no longer emit a simpler 'vmovq' in the following case:
vxorpd %xmm4, %xmm4, %xmm4
vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
Before, we used to generate:
vmovq %xmm2, %xmm4
Before, the vmovq implicitly zero-extended the quadword in %xmm2 to 128 bits. Now we always do this with a vxorpd+vblendps.
As I said, I will try to create smaller rep...
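For point 3 above, an IR-level sketch (not from the thread; names are hypothetical) of the zero-extension pattern that ideally lowers to a single vpmovzxdq load:

; zext_2xi32.ll -- hypothetical example for the vpmovzxdq case
define <2 x i64> @zext_2xi32(<2 x i32>* %p) {
entry:
  ; load two packed 32-bit integers and zero-extend them to 64 bits
  %v = load <2 x i32>, <2 x i32>* %p, align 4
  %z = zext <2 x i32> %v to <2 x i64>
  ret <2 x i64> %z
}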
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...ns from a
>> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>>
>> [...]
>> movaps 32(%rdi), %xmm3
>> movaps 48(%rdi), %xmm2
>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>> addps %xmm0, %xmm1
>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>> [...]
>>
>> xmm3 loaded, duplicated into 2 regi...
2015 Jan 30
4
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...mm0 ## xmm0 =
>>> xmm3[0,2],xmm0[1,2]
>>>
>>>
>>> Also, I see differences when some loads are shuffled, that I'm a bit
>>> conflicted about:
>>> vmovaps -0xXX(%rbp), %xmm3
>>> ...
>>> vinsertps $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 =
>>> xmm4[3],xmm3[1,2,3]
>>> becomes:
>>> vpermilps $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>>> ...
>>> vinsertps $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 =
>>> xmm4[3],xmm2[1,2,3]
>>>
>...