Displaying 20 results from an estimated 101 matches for "xmm3".
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...8-point
>> complex FFT, but from 16-point upwards, icc or gcc generates much
>> better code. Here is an example of a sequence of instructions from a
>> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>>
>> [...]
>> movaps 32(%rdi), %xmm3
>> movaps 48(%rdi), %xmm2
>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>> addps %xmm0, %xmm1
>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>>...
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...ony Blake <amb33 at cs.waikato.ac.nz> wrote:
> On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>>> [...]
>>> movaps 32(%rdi), %xmm3
>>> movaps 48(%rdi), %xmm2
>>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>>> addps %xmm0, %xmm1
>>> movaps %xmm1, -16(%rbp) ##...
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
...]
+ addss xmm1, xmm0
+
+ mov edx, in2
+ movss [edx], xmm1
+
+ shufps xmm0, xmm0, 0x00
+ shufps xmm1, xmm1, 0x00
+
+ movaps xmm2, [eax+4]
+ movaps xmm3, [ebx+4]
+ mulps xmm2, xmm0
+ mulps xmm3, xmm1
+ movaps xmm4, [eax+20]
+ mulps xmm4, xmm0
+ addps xmm2, [ecx+4]
+ movaps xmm5, [ebx+20]
+ mul...
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...r a function that computes an 8-point
> complex FFT, but from 16-point upwards, icc or gcc generates much
> better code. Here is an example of a sequence of instructions from a
> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>
> [...]
> movaps 32(%rdi), %xmm3
> movaps 48(%rdi), %xmm2
> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
> addps %xmm0, %xmm1
> movaps %xmm1, -16(%rbp) ## 16-byte Spill
> movaps 144(%rdi), %xmm3 ### <-- new data mov...
2010 Aug 02
0
[LLVMdev] Register Allocation ERROR! Ran out of registers during register allocation!
...ment is Clang-2.8-svn on Linux-x86. When I build
ffmpeg-0.6 using Clang, error output:
CC libavcodec/x86/mpegvideo_mmx.o
fatal error: error in backend: Ran out of registers during register
allocation!
Please check your inline asm statement for invalid constraints:
INLINEASM <es:movd %eax, %xmm3
pshuflw $$0, %xmm3, %xmm3
punpcklwd %xmm3, %xmm3
pxor %xmm7, %xmm7
pxor %xmm4, %xmm4
movdqa ($2), %xmm5
pxor %xmm6, %xmm6
psubw ($3), %xmm6
mov $$-128, %eax
.align 1 << 4
1:
movdqa ($1, %eax), %xmm0...
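The failure mode is easier to see on a toy case. Below is a hypothetical minimal reproducer in C (an assumption for illustration, not the ffmpeg asm): 32-bit x86 with a frame pointer leaves roughly six allocatable general-purpose registers (eax, ecx, edx, ebx, esi, edi), so an inline asm statement demanding more "r" operands than that cannot be satisfied.

    /* Hypothetical reproducer (invented, not from the thread): seven
       "r" outputs on 32-bit x86 can exhaust the allocatable GPRs and
       make the register allocator give up. */
    void too_many_operands(void)
    {
        int a, b, c, d, e, f, g;
        __asm__ volatile("nop"
                         : "=r"(a), "=r"(b), "=r"(c), "=r"(d),
                           "=r"(e), "=r"(f), "=r"(g));
        (void)a; (void)b; (void)c; (void)d; (void)e; (void)f; (void)g;
    }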
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...LLVM generates good code for a function that computes an 8-point
complex FFT, but from 16-point upwards, icc or gcc generates much
better code. Here is an example of a sequence of instructions from a
32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
[...]
movaps 32(%rdi), %xmm3
movaps 48(%rdi), %xmm2
movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
addps %xmm0, %xmm1
movaps %xmm1, -16(%rbp) ## 16-byte Spill
movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
[...]...
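For context: codelets like this spill because each butterfly keeps both of its inputs live until the sum and the difference are computed, and x86_64 with SSE has only 16 xmm registers. A hedged sketch of one unrolled stage (an illustration with invented names, not FFTW's generated code):

    #include <xmmintrin.h>

    /* One unrolled butterfly stage over 32 floats; d is assumed
       16-byte aligned. Eight vectors are live at once here; a full
       32-point codelet keeps far more alive simultaneously, and
       anything beyond the 16 xmm registers must be spilled to the
       stack, as in the dump above. */
    void butterfly_stage(float *d)
    {
        __m128 a0 = _mm_load_ps(d +  0), a1 = _mm_load_ps(d +  4);
        __m128 a2 = _mm_load_ps(d +  8), a3 = _mm_load_ps(d + 12);
        __m128 b0 = _mm_load_ps(d + 16), b1 = _mm_load_ps(d + 20);
        __m128 b2 = _mm_load_ps(d + 24), b3 = _mm_load_ps(d + 28);
        _mm_store_ps(d +  0, _mm_add_ps(a0, b0));
        _mm_store_ps(d + 16, _mm_sub_ps(a0, b0));
        _mm_store_ps(d +  4, _mm_add_ps(a1, b1));
        _mm_store_ps(d + 20, _mm_sub_ps(a1, b1));
        _mm_store_ps(d +  8, _mm_add_ps(a2, b2));
        _mm_store_ps(d + 24, _mm_sub_ps(a2, b2));
        _mm_store_ps(d + 12, _mm_add_ps(a3, b3));
        _mm_store_ps(d + 28, _mm_sub_ps(a3, b3));
    }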
2013 Jul 19
0
[LLVMdev] llvm.x86.sse2.sqrt.pd not using sqrtpd, calling a function that modifies ECX
...mword ptr [esp+60h],xmm0
002E0108 xorpd xmm0,xmm0
002E010C movapd xmmword ptr [esp+0C0h],xmm0
002E0115 xorpd xmm1,xmm1
002E0119 xorpd xmm7,xmm7
002E011D movapd xmmword ptr [esp+0A0h],xmm1
002E0126 movapd xmmword ptr [esp+0B0h],xmm7
002E012F movapd xmm3,xmm1
002E0133 movlpd qword ptr [esp+0F0h],xmm3
002E013C movhpd qword ptr [esp+0E0h],xmm3
002E0145 movlpd qword ptr [esp+100h],xmm7
002E014E pshufd xmm0,xmm7,44h
002E0153 movdqa xmm5,xmm0
002E0157 xorpd xmm4,xmm4
002E015B mulpd xmm5,xmm4
002E015F...
2012 Jul 26
1
[LLVMdev] X86 FMA4
...ut there is a significant scalar performance issue
following the GCC intrinsics.
Let's look at the VFMADDSD pattern. We're operating on scalars with
undefineds as the remaining vector elements of the operands. This sounds
okay, but when one looks closer...
vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647
vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill
vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647
The spill here is 16-bytes. But, we're only using the low 8-bytes of
xmm3. Changing the intrinsics and patterns to accept scalar ope...
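To make the width issue concrete, here is an illustrative analogue using the FMA3 scalar intrinsic (the thread discusses the FMA4 VFMADDSD pattern; this function is an invented example, not code from the thread). With a genuinely scalar value, a spill needs only an 8-byte movsd rather than the 16-byte movaps shown above.

    #include <immintrin.h>

    /* Compile with FMA enabled (e.g. -mfma). Only the low double of
       each __m128d is meaningful, which is exactly why spilling the
       full 16-byte register wastes half the stack slot. */
    double fmadd_scalar(double a, double b, double c)
    {
        __m128d va = _mm_set_sd(a);
        __m128d vb = _mm_set_sd(b);
        __m128d vc = _mm_set_sd(c);
        __m128d r  = _mm_fmadd_sd(va, vb, vc);   /* low lane: a*b + c */
        return _mm_cvtsd_f64(r);
    }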
2015 Jun 26
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
...%xmm4
vpmuludq %xmm7, %xmm4, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpextrq $1, %xmm2, %rax
cltq
vmovq %rax, %xmm4
vmovq %xmm2, %rax
cltq
vmovq %rax, %xmm5
vpunpcklqdq %xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0]
vpcmpgtq %xmm3, %xmm4, %xmm3
vptest %xmm3, %xmm3
je .LBB10_66
# BB#5: # %for.body.preheader
vpaddq %xmm15, %xmm2, %xmm3
vpand %xmm15, %xmm3, %xmm3
vpaddq .LCPI10_1(%rip), %xmm3, %xmm8
vpand .LCPI10_5(%rip), %xmm8, %xmm5
vpxor %xmm4, %xmm4, %xmm...
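For reference, the vpmuludq/vpsllq/vpaddq run in this dump is how LLVM synthesizes a packed 64-bit multiply, since SSE/AVX before AVX-512DQ has no 64x64-bit vector multiply instruction. A hedged sketch of a loop shape that produces such a sequence (an invented example, not the original test case):

    #include <stddef.h>
    #include <stdint.h>

    /* Each 64-bit product is assembled from 32-bit pieces: the low
       halves via pmuludq, plus cross terms shifted left by 32 and
       added in, matching the instruction mix above. */
    void mul64(int64_t *restrict out, const int64_t *a,
               const int64_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }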
2015 Jan 29
2
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...m I'm seeing is that in some cases we can't fold memory
> anymore:
> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
> vblendps $0x1, %xmm2, %xmm0, %xmm0
> becomes:
> vmovaps -0xXX(%rdx), %xmm2
> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
> vmovaps -0xXX(%rbp), %xmm3
> ...
> vinsertps...
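The pattern being compared can be written directly with intrinsics. A minimal sketch, assuming SSE4.1 and invented operand names (an illustration of the blend-of-a-rotated-load, not code from the thread):

    #include <smmintrin.h>   /* SSE4.1: _mm_blend_ps */

    /* r is m rotated to m[3,0,1,2]; the result takes lane 0 from r
       and lanes 1..3 from x. The old lowering folds the rotate into
       a memory-operand vpermilps; the new one shown above reloads
       and uses two vshufps instead. */
    __m128 blend_rotated(__m128 x, const float *p)
    {
        __m128 m = _mm_loadu_ps(p);
        __m128 r = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 1, 0, 3));
        return _mm_blend_ps(x, r, 0x1);
    }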
2009 Jan 31
2
[LLVMdev] Optimized code analysis problems
...code below.
I get the function call names as llvm.x86.* something instead of the
original function names (e.g. _mm_cvtsi32_si128)
#include <pmmintrin.h>
#include <sys/time.h>
#include <iostream>
void foo_opt(unsigned char output[64], int Yc[64], int S_BITS)
{
  __m128i XMM1, XMM2, XMM3, XMM4;
  __m128i *xmm1 = (__m128i*)Yc;
  __m128i XMM5 = _mm_cvtsi32_si128(S_BITS + 3);
  XMM2 = _mm_set1_epi32(S_BITS + 2);
  for (int l = 0; l < 8; l++) {
    XMM1 = _mm_loadu_si128(xmm1++);
    XMM3 = _mm_add_epi32(XMM1, XMM2);
    XMM1 = _mm_cmplt_epi32(XMM1, _mm_s...
2016 Aug 12
4
Invoke loop vectorizer
...] ~ :) $ pcregrep -i "^\s*p"
> test.s|less
> pushq %rbp
> pshufd $68, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,0,1]
> pslldq $8, %xmm1 ## xmm1 =
> zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
> pshufd $68, %xmm3, %xmm3 ## xmm3 = xmm3[0,1,0,1]
> paddq %xmm1, %xmm3
> pshufd $78, %xmm3, %xmm4 ## xmm4 = xmm3[2,3,0,1]
> punpckldq %xmm5, %xmm4 ## xmm4 =
> xmm4[0],xmm5[0],xmm4[1],xmm5[1]
> pshufd $212, %xmm4, %xmm4 ## xmm4 = xmm4[0,1,1,3...
2012 Jul 25
6
[LLVMdev] X86 FMA4
We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
Why is VFMADDSD4 defined with vector types? Is this simply because the
gcc intrinsic uses vector types? It's quite unnatural if you have a
compiler that generates FMAs as opposed to requiring user intrinsics.
-Dave
2010 May 11
2
[LLVMdev] How does SSEDomainFix work?
...i64> %2
}
$ llc -mcpu=nehalem -debug-pass=Structure foo.bc -o foo.s
(snip)
Code Placement Optimizer
SSE execution domain fixup
Machine Natural Loop Construction
X86 AT&T-Style Assembly Printer
Delete Garbage Collector Information
foo.s: (edited)
_foo:
movaps %xmm0, %xmm3
andps %xmm2, %xmm3
andnps %xmm1, %xmm2
movaps %xmm2, %xmm0
xorps %xmm3, %xmm0
ret
_bar:
movaps %xmm0, %xmm3
andps %xmm2, %xmm3
andnps %xmm1, %xmm2
movaps %xmm2, %xmm0
xorps %xmm3, %xmm0
ret
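What the generated code computes is a bitwise select, r = (a & mask) ^ (~mask & b). A minimal sketch in C (an illustration; the original foo.bc is not shown in full above): the source operations are integer-domain, and on Nehalem the "SSE execution domain fixup" pass may rewrite them into the float-domain andps/andnps/xorps forms seen in foo.s to avoid domain-crossing penalties.

    #include <emmintrin.h>

    /* Bitwise select: bits where mask is 1 come from a, others from b. */
    __m128i bitselect(__m128i a, __m128i b, __m128i mask)
    {
        __m128i t = _mm_and_si128(a, mask);     /*  a & mask  */
        __m128i u = _mm_andnot_si128(mask, b);  /* ~mask & b  */
        return _mm_xor_si128(t, u);             /* combine    */
    }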
2013 Aug 22
2
New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
...e_lag_12
+cglobal FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
cglobal FLAC__lpc_compute_autocorrelation_asm_ia32_3dnow
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32_mmx
@@ -596,7 +597,7 @@
movss xmm3, xmm2
movss xmm2, xmm0
- ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm3:xmm3:xmm2
+ ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm4:xmm3:xmm2
movaps xmm1, xmm0
mulps xmm1, xmm2
addps xmm5, xmm1
@@ -619,6 +620,95 @@
ret
ALIGN 16
+cident FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16...
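For readers who do not want to decode the assembly: the routine computes an autocorrelation, here up to lag 16. A scalar sketch of the computation (signature modeled loosely on the FLAC API, not copied from it):

    /* autoc[l] accumulates data[i] * data[i - l] for each lag l.
       The SSE routine keeps the running sums for several lags in
       xmm registers (the xmm7:xmm6:xmm5 accumulators in the patch). */
    void autocorrelation(const float *data, unsigned len,
                         unsigned max_lag, float *autoc)
    {
        for (unsigned l = 0; l < max_lag; l++) {
            float sum = 0.0f;
            for (unsigned i = l; i < len; i++)
                sum += data[i] * data[i - l];
            autoc[l] = sum;
        }
    }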
2012 Jul 26
0
[LLVMdev] X86 FMA4
...wing the GCC intrinsics.
> >
> >
> >Let's look at the VFMADDSD pattern. We're operating on scalars with
> undefineds as the remaining vector elements of the operands. This sounds
> okay, but when one looks closer...
> >
> > vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647
> > vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill
> > vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647
> >
> >
> >The spill here is 16-bytes. But, we're only using the low 8-bytes of
> xmm3. Changi...
2015 Jan 30
4
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...can't fold memory
>>> anymore:
>>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>> vblendps $0x1, %xmm2, %xmm0, %xmm0
>>> becomes:
>>> vmovaps -0xXX(%rdx), %xmm2
>>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 =
>>> xmm3[0,2],xmm0[1,2]
>>>
>>>
>>> Also, I see differences when some loads are shuffled, that I'm a bit
>>> conflicted about:
>>> vmovaps...
2015 Jan 29
0
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...in some cases we can't fold memory
>> anymore:
>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>> vblendps $0x1, %xmm2, %xmm0, %xmm0
>> becomes:
>> vmovaps -0xXX(%rdx), %xmm2
>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 =
>> xmm3[0,2],xmm0[1,2]
>>
>>
>> Also, I see differences when some loads are shuffled, that I'm a bit
>> conflicted about:
>> vmovaps -0xXX(%rbp), %xmm3
>...
2015 Jan 30
0
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...>> anymore:
>>>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>>> vblendps $0x1, %xmm2, %xmm0, %xmm0
>>>> becomes:
>>>> vmovaps -0xXX(%rdx), %xmm2
>>>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 =
>>>> xmm2[3,0],xmm0[0,0]
>>>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 =
>>>> xmm3[0,2],xmm0[1,2]
>>>>
>>>>
>>>> Also, I see differences when some loads are shuffled, that I'm a bit
>>>> c...
2016 Aug 12
2
Invoke loop vectorizer
Hi Daniel,
I increased the size of your test to 128, but -stats still shows no loops
optimized...
Xiaochu
On Aug 12, 2016 11:11 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote:
> It's not possible to know that A and B don't alias in this example. It's
> almost certainly not profitable to add a runtime check given the size of
> the loop.
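A minimal sketch of the aliasing point (loop and names invented): without some promise that A and B do not overlap, the vectorizer must either bail out or guard the loop with a runtime overlap check, and for a tiny trip count the check is not worth it. C's restrict makes the promise at the source level.

    #include <stddef.h>

    /* restrict tells the compiler A and B never overlap, so the loop
       can be vectorized without a runtime alias check. */
    void axpy(float *restrict A, const float *restrict B,
              float k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            A[i] += k * B[i];
    }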