search for: xmm3

Displaying 20 results from an estimated 101 matches for "xmm3".

2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...8-point >> complex FFT, but from 16-point upwards, icc or gcc generates much >> better code. Here is an example of a sequence of instructions from a >> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE: >> >> [...] >> movaps 32(%rdi), %xmm3 >> movaps 48(%rdi), %xmm2 >> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 >> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 >> addps %xmm0, %xmm1 >> movaps %xmm1, -16(%rbp) ## 16-byte Spill ...
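
A rough way to reproduce the kind of spilling discussed in this thread (a sketch only, not the poster's FFT codelet; the function name and layout are made up): keep more than sixteen 128-bit values live at once, since x86_64 with SSE has only xmm0-xmm15.

    #include <xmmintrin.h>

    /* Sketch: with ~20 vectors live across the second loop, the register
     * allocator has to park some of them on the stack, producing movaps
     * to/from stack slots much like the "16-byte Spill" comments above. */
    void many_live_vectors(float *out, const float *in)
    {
        __m128 v[20];
        for (int i = 0; i < 20; i++)
            v[i] = _mm_loadu_ps(in + 4 * i);
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < 20; i++)
            acc = _mm_add_ps(acc, _mm_mul_ps(v[i], v[(i + 7) % 20]));
        _mm_storeu_ps(out, acc);
    }
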
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...ony Blake <amb33 at cs.waikato.ac.nz> wrote: > On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote: >> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote: >>> [...] >>> movaps 32(%rdi), %xmm3 >>> movaps 48(%rdi), %xmm2 >>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 >>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 >>> addps %xmm0, %xmm1 >>> movaps %xmm1, -16(%rbp) ##...
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
...] + addss xmm1, xmm0 + + mov edx, in2 + movss [edx], xmm1 + + shufps xmm0, xmm0, 0x00 + shufps xmm1, xmm1, 0x00 + + movaps xmm2, [eax+4] + movaps xmm3, [ebx+4] + mulps xmm2, xmm0 + mulps xmm3, xmm1 + movaps xmm4, [eax+20] + mulps xmm4, xmm0 + addps xmm2, [ecx+4] + movaps xmm5, [ebx+20] + mul...
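
The assembly above follows a common pattern: broadcast one scalar across a register (shufps with immediate 0x00), then multiply-accumulate it against rows of coefficients. A hedged intrinsics equivalent of that pattern (names are illustrative, not taken from the Speex patch):

    #include <xmmintrin.h>

    /* acc[0..3] += row[0..3] * x, with x splatted across a vector
     * (_mm_set1_ps compiles to the shufps $0x00 splat seen above). */
    void splat_mul_add(float *acc, const float *row, float x)
    {
        __m128 vx = _mm_set1_ps(x);
        __m128 vr = _mm_loadu_ps(row);
        __m128 va = _mm_loadu_ps(acc);
        va = _mm_add_ps(va, _mm_mul_ps(vr, vx));  /* mulps + addps */
        _mm_storeu_ps(acc, va);
    }
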
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...r a function that computes an 8-point > complex FFT, but from 16-point upwards, icc or gcc generates much > better code. Here is an example of a sequence of instructions from a > 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE: > > [...] > movaps 32(%rdi), %xmm3 > movaps 48(%rdi), %xmm2 > movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 > movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 > addps %xmm0, %xmm1 > movaps %xmm1, -16(%rbp) ## 16-byte Spill > movaps 144(%rdi), %xmm3 ### <-- new data mov...
2010 Aug 02
0
[LLVMdev] Register Allocation ERROR! Ran out of registers during register allocation!
...ment is Clang-2.8-svn on Linux-x86. When I build ffmpeg-0.6 using Clang, error output: CC libavcodec/x86/mpegvideo_mmx.o fatal error: error in backend: Ran out of registers during register allocation! Please check your inline asm statement for invalid constraints: INLINEASM <es:movd %eax, %xmm3 pshuflw $$0, %xmm3, %xmm3 punpcklwd %xmm3, %xmm3 pxor %xmm7, %xmm7 pxor %xmm4, %xmm4 movdqa ($2), %xmm5 pxor %xmm6, %xmm6 psubw ($3), %xmm6 mov $$-128, %eax .align 1 << 4 1: movdqa ($1, %eax), %xmm0...
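
The error usually means the inline asm statement's operands plus its fixed register uses leave the allocator nothing to work with. A minimal GCC-style illustration of the idiom in the quoted statement (hypothetical code, not the actual ffmpeg asm; compile with -msse2): every xmm register the template touches must be declared, and each one declared as a clobber is taken away from the operand constraints.

    #include <stdint.h>

    static inline void splat_word(int16_t w, const int16_t *tbl)
    {
        /* Assumes tbl is 16-byte aligned, since movdqa is used. */
        __asm__ volatile(
            "movd      %0, %%xmm3         \n\t"
            "pshuflw   $0, %%xmm3, %%xmm3 \n\t"
            "punpcklwd %%xmm3, %%xmm3     \n\t"
            "movdqa    (%1), %%xmm5       \n\t"
            : /* no outputs */
            : "r"((int)w), "r"(tbl)
            : "xmm3", "xmm5", "memory");  /* clobbered regs are off-limits for %0/%1 */
    }
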
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...LLVM generates good code for a function that computes an 8-point complex FFT, but from 16-point upwards, icc or gcc generates much better code. Here is an example of a sequence of instructions from a 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE: [...] movaps 32(%rdi), %xmm3 movaps 48(%rdi), %xmm2 movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 addps %xmm0, %xmm1 movaps %xmm1, -16(%rbp) ## 16-byte Spill movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3 [...]...
2013 Jul 19
0
[LLVMdev] llvm.x86.sse2.sqrt.pd not using sqrtpd, calling a function that modifies ECX
...mword ptr [esp+60h],xmm0 002E0108 xorpd xmm0,xmm0 002E010C movapd xmmword ptr [esp+0C0h],xmm0 002E0115 xorpd xmm1,xmm1 002E0119 xorpd xmm7,xmm7 002E011D movapd xmmword ptr [esp+0A0h],xmm1 002E0126 movapd xmmword ptr [esp+0B0h],xmm7 002E012F movapd xmm3,xmm1 002E0133 movlpd qword ptr [esp+0F0h],xmm3 002E013C movhpd qword ptr [esp+0E0h],xmm3 002E0145 movlpd qword ptr [esp+100h],xmm7 002E014E pshufd xmm0,xmm7,44h 002E0153 movdqa xmm5,xmm0 002E0157 xorpd xmm4,xmm4 002E015B mulpd xmm5,xmm4 002E015F...
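
For reference, the usual route to that intrinsic from C is _mm_sqrt_pd, which clang of that era lowered through llvm.x86.sse2.sqrt.pd and which normally selects to a single sqrtpd rather than a library call (a sketch, not the reporter's JIT-compiled code):

    #include <emmintrin.h>

    __m128d sqrt_pair(__m128d v)
    {
        return _mm_sqrt_pd(v);   /* expected codegen: sqrtpd %xmm0, %xmm0 */
    }
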
2012 Jul 26
1
[LLVMdev] X86 FMA4
...ut there is a significant scalar performance issue following the GCC intrinsics. Let's look at the VFMADDSD pattern. We're operating on scalars with undefineds as the remaining vector elements of the operands. This sounds okay, but when one looks closer... vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647 vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647 The spill here is 16-bytes. But, we're only using the low 8-bytes of xmm3. Changing the intrinsics and patterns to accept scalar ope...
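
Stripped of the vector container, the scalar operation under discussion is just d = a*b + c on doubles, i.e. 8 bytes of live data; the complaint is that modelling it as a two-element vector with an undef upper half forces a 16-byte spill. A scalar C sketch (illustrative only):

    #include <math.h>

    double fused_madd(double a, double b, double c)
    {
        /* With FMA4 enabled (-mfma4) this can select to vfmaddsd, and only
         * the low 8 bytes of the xmm register ever hold meaningful data. */
        return fma(a, b, c);
    }
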
2015 Jun 26
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
...%xmm4 vpmuludq %xmm7, %xmm4, %xmm4 vpsllq $32, %xmm4, %xmm4 vpaddq %xmm4, %xmm2, %xmm2 vpextrq $1, %xmm2, %rax cltq vmovq %rax, %xmm4 vmovq %xmm2, %rax cltq vmovq %rax, %xmm5 vpunpcklqdq %xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0] vpcmpgtq %xmm3, %xmm4, %xmm3 vptest %xmm3, %xmm3 je .LBB10_66 # BB#5: # %for.body.preheader vpaddq %xmm15, %xmm2, %xmm3 vpand %xmm15, %xmm3, %xmm3 vpaddq .LCPI10_1(%rip), %xmm3, %xmm8 vpand .LCPI10_5(%rip), %xmm8, %xmm5 vpxor %xmm4, %xmm4, %xmm...
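
A hypothetical loop of roughly this shape (not the original source): 32-bit values feeding 64-bit arithmetic or comparisons, which is where the per-lane vpextrq / cltq / vmovq sign-extension churn above tends to come from when the target has no cheap <2 x i32> to <2 x i64> widening.

    void widen_and_add(long long *out, const int *a, const int *b, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = (long long)a[i] * b[i] + 1;  /* 32x32 -> 64-bit product */
    }
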
2015 Jan 29
2
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...m I'm seeing is that in some cases we can't fold memory > anymore: > vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2] > vblendps $0x1, %xmm2, %xmm0, %xmm0 > becomes: > vmovaps -0xXX(%rdx), %xmm2 > vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0] > vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2] > > > Also, I see differences when some loads are shuffled, that I'm a bit > conflicted about: > vmovaps -0xXX(%rbp), %xmm3 > ... > vinsertps...
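
The folded form above is a lane permute of a memory operand blended into element 0 of another register. A hedged intrinsics rendering of that pattern (purely illustrative, not the reporter's test case; blendps needs SSE4.1):

    #include <immintrin.h>

    __m128 rotate_and_blend(__m128 a, __m128 b)
    {
        /* b permuted to b[3,0,1,2] (the vpermilps $-0x6d above), then its
         * element 0 replaces a's element 0 (the vblendps $0x1). */
        __m128 rot = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 1, 0, 3));
        return _mm_blend_ps(a, rot, 0x1);
    }
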
2009 Jan 31
2
[LLVMdev] Optimized code analysis problems
...code below. I get the function call names as llvm.x86 something instead of getting function names (e.g. _mm_cvtsi32_si128) #include <pmmintrin.h> #include<sys/time.h> #include<iostream> void foo_opt(unsigned char output[64], int Yc[64], int S_BITS) { __m128i XMM1, XMM2, XMM3, XMM4; __m128i *xmm1 = (__m128i*)Yc; __m128i XMM5 = _mm_cvtsi32_si128(S_BITS + 3) ; XMM2 = _mm_set1_epi32(S_BITS + 2); for (int l = 0; l < 8; l++) { XMM1 = _mm_loadu_si128(xmm1++); XMM3 = _mm_add_epi32(XMM1, XMM2); XMM1 = _mm_cmplt_epi32(XMM1, _mm_s...
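
The _mm_* names don't survive into optimized IR because they are inline wrappers over compiler builtins; after inlining, the module shows llvm.x86.* intrinsics or plain vector IR instead. A self-contained sketch of the same kind of code for inspecting that (hypothetical helper, not the poster's full function; try clang -O2 -emit-llvm -S):

    #include <emmintrin.h>

    void shift_and_bias(int s_bits, const int *yc, int *out)
    {
        __m128i shift = _mm_cvtsi32_si128(s_bits + 3);   /* shift count in an xmm reg */
        __m128i bias  = _mm_set1_epi32(s_bits + 2);
        __m128i x     = _mm_loadu_si128((const __m128i *)yc);
        x = _mm_add_epi32(x, bias);
        x = _mm_sra_epi32(x, shift);                     /* psrad by xmm count */
        _mm_storeu_si128((__m128i *)out, x);
    }
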
2016 Aug 12
4
Invoke loop vectorizer
...] ~ :) $ pcregrep -i "^\s*p" > test.s|less > pushq %rbp > pshufd $68, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,0,1] > pslldq $8, %xmm1 ## xmm1 = > zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7] > pshufd $68, %xmm3, %xmm3 ## xmm3 = xmm3[0,1,0,1] > paddq %xmm1, %xmm3 > pshufd $78, %xmm3, %xmm4 ## xmm4 = xmm3[2,3,0,1] > punpckldq %xmm5, %xmm4 ## xmm4 = > xmm4[0],xmm5[0],xmm4[1],xmm5[1] > pshufd $212, %xmm4, %xmm4 ## xmm4 = xmm4[0,1,1,3...
2012 Jul 25
6
[LLVMdev] X86 FMA4
We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns. Why is VFMADDSD4 defined with vector types? Is this simply because the gcc intrinsic uses vector types? It's quite unnatural if you have a compiler that generates FMAs as opposed to requiring user intrinsics. -Dave
2010 May 11
2
[LLVMdev] How does SSEDomainFix work?
...i64> %2 } $ llc -mcpu=nehalem -debug-pass=Structure foo.bc -o foo.s (snip) Code Placement Optimizater SSE execution domain fixup Machine Natural Loop Construction X86 AT&T-Style Assembly Printer Delete Garbage Collector Information foo.s: (edited) _foo: movaps %xmm0, %xmm3 andps %xmm2, %xmm3 andnps %xmm1, %xmm2 movaps %xmm2, %xmm0 xorps %xmm3, %xmm0 ret _bar: movaps %xmm0, %xmm3 andps %xmm2, %xmm3 andnps %xmm1, %xmm2 movaps %xmm2, %xmm0 xorps %xmm3, %xmm0 ret
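
The _foo/_bar bodies above are the classic branchless select on float vectors, result = (a & mask) ^ (b & ~mask); the two halves have disjoint bits, so xor and or are interchangeable. A sketch in intrinsics; whether it comes out in the floating-point domain (andps/andnps/xorps) or the integer domain is exactly what the SSE execution domain fixup pass decides:

    #include <xmmintrin.h>

    __m128 bit_select(__m128 a, __m128 b, __m128 mask)
    {
        __m128 t = _mm_and_ps(a, mask);      /* andps             */
        __m128 u = _mm_andnot_ps(mask, b);   /* andnps: ~mask & b */
        return _mm_xor_ps(t, u);             /* xorps             */
    }
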
2013 Aug 22
2
New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
...e_lag_12 +cglobal FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16 cglobal FLAC__lpc_compute_autocorrelation_asm_ia32_3dnow cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32 cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32_mmx @@ -596,7 +597,7 @@ movss xmm3, xmm2 movss xmm2, xmm0 - ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm3:xmm3:xmm2 + ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm4:xmm3:xmm2 movaps xmm1, xmm0 mulps xmm1, xmm2 addps xmm5, xmm1 @@ -619,6 +620,95 @@ ret ALIGN 16 +cident FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16...
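
For orientation, a plain-C reference of what an autocorrelation routine of this shape computes (a sketch, not the FLAC source; the SSE version carries several lags at once in xmm accumulators, which is what the xmm7:xmm6:xmm5 comment refers to):

    void autocorr_ref(const float *data, unsigned len, unsigned lag, float *autoc)
    {
        for (unsigned l = 0; l < lag; l++) {
            double sum = 0.0;
            for (unsigned i = l; i < len; i++)
                sum += (double)data[i] * (double)data[i - l];
            autoc[l] = (float)sum;
        }
    }
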
2012 Jul 26
0
[LLVMdev] X86 FMA4
...wing the GCC intrinsics. > > > > > >Let's look at the VFMADDSD pattern. We're operating on scalars with > undefineds as the remaining vector elements of the operands. This sounds > okay, but when one looks closer... > > > > vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647 > > vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill > > vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647 > > > > > >The spill here is 16-bytes. But, we're only using the low 8-bytes of > xmm3. Changi...
2015 Jan 30
4
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...can't fold memory >>> anymore: >>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2] >>> vblendps $0x1, %xmm2, %xmm0, %xmm0 >>> becomes: >>> vmovaps -0xXX(%rdx), %xmm2 >>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0] >>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = >>> xmm3[0,2],xmm0[1,2] >>> >>> >>> Also, I see differences when some loads are shuffled, that I'm a bit >>> conflicted about: >>> vmovaps...
2015 Jan 29
0
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...in some cases we can't fold memory >> anymore: >> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2] >> vblendps $0x1, %xmm2, %xmm0, %xmm0 >> becomes: >> vmovaps -0xXX(%rdx), %xmm2 >> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0] >> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = >> xmm3[0,2],xmm0[1,2] >> >> >> Also, I see differences when some loads are shuffled, that I'm a bit >> conflicted about: >> vmovaps -0xXX(%rbp), %xmm3 ...
2015 Jan 30
0
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...>> anymore: >>>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2] >>>> vblendps $0x1, %xmm2, %xmm0, %xmm0 >>>> becomes: >>>> vmovaps -0xXX(%rdx), %xmm2 >>>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = >>>> xmm2[3,0],xmm0[0,0] >>>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = >>>> xmm3[0,2],xmm0[1,2] >>>> >>>> >>>> Also, I see differences when some loads are shuffled, that I'm a bit >>>> c...
2016 Aug 12
2
Invoke loop vectorizer
Hi Daniel, I increased the size of your test to be 128 but -stats still shows no loop optimized... Xiaochu On Aug 12, 2016 11:11 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote: > It's not possible to know that A and B don't alias in this example. It's > almost certainly not profitable to add a runtime check given the size of > the loop. > > >
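
If the arrays are known not to overlap, restrict makes that promise to the compiler, so the vectorizer no longer needs the runtime alias check that makes tiny trip counts unprofitable (a hypothetical loop in the spirit of the thread, not the actual test):

    void axpy(float *restrict a, const float *restrict b, float s, int n)
    {
        /* With the restrict qualifiers the loop vectorizer may assume a and
         * b do not alias; without them it must either emit a runtime overlap
         * check or give up when the check is not worth the cost. */
        for (int i = 0; i < n; i++)
            a[i] += s * b[i];
    }
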