Displaying 20 results from an estimated 75 matches for "xmm4".
2015 Jun 26
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
...mp ne i128 %BCS54_D, 0
br i1 %mskS54_D, label %middle.block, label %vector.ph
Now the assembly for the above IR code is:
# BB#4: # %for.cond.preheader
vmovdqa 144(%rsp), %xmm0 # 16-byte Reload
vpmuludq %xmm7, %xmm0, %xmm2
vpsrlq $32, %xmm7, %xmm4
vpmuludq %xmm4, %xmm0, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpsrlq $32, %xmm0, %xmm4
vpmuludq %xmm7, %xmm4, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpextrq $1, %xmm2, %rax
cltq
vmovq %rax, %xmm4
vmovq...
2013 Jul 19
0
[LLVMdev] llvm.x86.sse2.sqrt.pd not using sqrtpd, calling a function that modifies ECX
...mmword ptr [esp+0B0h],xmm7
002E012F movapd xmm3,xmm1
002E0133 movlpd qword ptr [esp+0F0h],xmm3
002E013C movhpd qword ptr [esp+0E0h],xmm3
002E0145 movlpd qword ptr [esp+100h],xmm7
002E014E pshufd xmm0,xmm7,44h
002E0153 movdqa xmm5,xmm0
002E0157 xorpd xmm4,xmm4
002E015B mulpd xmm5,xmm4
002E015F pshufd xmm2,xmm3,44h
002E0164 movdqa xmm1,xmm2
002E0168 mulpd xmm1,xmm4
002E016C xorpd xmm7,xmm7
002E0170 movapd xmm4,xmmword ptr [esp+70h]
002E0176 subpd xmm4,xmm1
002E017A pshufd xmm3,xmm3,0EEh
002...
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
...shufps xmm0, xmm0, 0x00
+ shufps xmm1, xmm1, 0x00
+
+ movaps xmm2, [eax+4]
+ movaps xmm3, [ebx+4]
+ mulps xmm2, xmm0
+ mulps xmm3, xmm1
+ movaps xmm4, [eax+20]
+ mulps xmm4, xmm0
+ addps xmm2, [ecx+4]
+ movaps xmm5, [ebx+20]
+ mulps xmm5, xmm1
+ addps xmm4, [ecx+20]
+ subps xmm2, xmm3
+ mo...
2015 Jun 24
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
Hi,
Is LLVM able to generate code for the following?
%mul = mul <2 x i32> %1, %2, where %1 and %2 are <2 x i32> type.
I am running it on a Haswell processor with LLVM-3.4.2. It seems that it
generates really complicated code with vpaddq, vpmuludq, vpsllq, and
vpsrlq.
Thanks,
Zhi
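For reference, a minimal self-contained IR reproducer for the multiply in question might look like the sketch below; the function name, file name, and llc invocation are illustrative assumptions, not taken from the original post.

; widen_mul.ll -- hypothetical reproducer; try e.g.:
;   llc -mcpu=core-avx2 widen_mul.ll -o -
define <2 x i32> @widen_mul(<2 x i32> %a, <2 x i32> %b) {
entry:
  ; the <2 x i32> multiply the question asks about
  %mul = mul <2 x i32> %a, %b
  ret <2 x i32> %mul
}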
2016 Aug 12
4
Invoke loop vectorizer
...mm0, %xmm0 ## xmm0 = xmm0[0,1,0,1]
> pslldq $8, %xmm1 ## xmm1 =
> zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
> pshufd $68, %xmm3, %xmm3 ## xmm3 = xmm3[0,1,0,1]
> paddq %xmm1, %xmm3
> pshufd $78, %xmm3, %xmm4 ## xmm4 = xmm3[2,3,0,1]
> punpckldq %xmm5, %xmm4 ## xmm4 =
> xmm4[0],xmm5[0],xmm4[1],xmm5[1]
> pshufd $212, %xmm4, %xmm4 ## xmm4 = xmm4[0,1,1,3]
>
>
>
> Note:
> It also vectorizes at SIZE=8.
>
> Not sure what the exact translation o...
2013 Jul 19
4
[LLVMdev] SIMD instructions and memory alignment on X86
Hmm, I'm not able to get those .ll files to compile if I disable SSE, and I
end up with SSE instructions (including sqrtpd) if I don't disable it.
On Thu, Jul 18, 2013 at 10:53 PM, Peter Newman <peter at uformia.com> wrote:
> Is there something specifically required to enable SSE? If it's not
> detected as available (based from the target triple?) then I don't think
2016 Aug 05
3
enabling interleaved access loop vectorization
...he vectorized code is actually fairly decent - e.g. forcing vectorization, with SSE4.2, we get:
.LBB0_3: # %vector.body
# =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rax,4), %xmm3
movd %xmm0, %rcx
movdqu 4(%rdi,%rcx,4), %xmm4
paddd %xmm3, %xmm4
movdqu 8(%rdi,%rcx,4), %xmm3
paddd %xmm4, %xmm3
movdqa %xmm1, %xmm4
paddq %xmm4, %xmm4
movdqa %xmm0, %xmm5
paddq %xmm5, %xmm5
movd %xmm5, %rcx
pextrq $1, %xmm5, %rdx
movd %xmm4, %r8
pextrq $1, %xmm4, %r9
movd (%rdi,%rcx,4), %xmm4 # xmm4 = mem[0],zero,zero,zero
pinsrd $1, (%rdi...
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutations (shuffle masks). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer
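To make the cost concern concrete, here is an illustrative IR sketch (not from the thread; function and value names are hypothetical) of a stride-2 interleaved access modeled as one wide load plus two shufflevector de-interleaves. The number of such shuffles, and whether they fold into surrounding code, is what the cost model has to estimate.

; deinterleave.ll -- hypothetical example of the "loads + shuffles" shape
define void @deinterleave(<8 x i32>* %p, <4 x i32>* %even.out, <4 x i32>* %odd.out) {
entry:
  ; one wide load covering two interleaved <4 x i32> sequences
  %wide = load <8 x i32>, <8 x i32>* %p, align 4
  ; shuffle out the even and odd lanes (stride-2 de-interleave)
  %even = shufflevector <8 x i32> %wide, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <8 x i32> %wide, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  store <4 x i32> %even, <4 x i32>* %even.out, align 4
  store <4 x i32> %odd,  <4 x i32>* %odd.out, align 4
  ret void
}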
2016 Aug 12
2
Invoke loop vectorizer
Hi Daniel,
I increased the size of your test to be 128 but -stats still shows no loop
optimized...
Xiaochu
On Aug 12, 2016 11:11 AM, "Daniel Berlin" <dberlin at dberlin.org> wrote:
> It's not possible to know that A and B don't alias in this example. It's
> almost certainly not profitable to add a runtime check given the size of
> the loop.
2016 Aug 05
2
enabling interleaved access loop vectorization
...vectorization, with SSE4.2, we get:
> .LBB0_3: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movdqu (%rdi,%rax,4), %xmm3
> movd %xmm0, %rcx
> movdqu 4(%rdi,%rcx,4), %xmm4
> paddd %xmm3, %xmm4
> movdqu 8(%rdi,%rcx,4), %xmm3
> paddd %xmm4, %xmm3
> movdqa %xmm1, %xmm4
> paddq %xmm4, %xmm4
> movdqa %xmm0, %xmm5
> paddq %xmm5, %xmm5
> movd %xmm5, %rcx
> pextrq $1, %xmm5, %rdx
> movd %xmm4, %r...
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
2014 Sep 05
3
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com>
wrote:
> Unfortunately, another team, while doing internal testing has seen the
> new path generating illegal insertps masks. A sample here:
>
> vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
> vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
> vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
> vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 =
> xmm4[0,1],xmm1[2],xmm4[3]
> vinsertps $...
2014 Sep 04
2
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Greetings all,
As you may have noticed, there is a new vector shuffle lowering path in the
X86 backend. You can try it out with the
'-x86-experimental-vector-shuffle-lowering' flag to llc, or '-mllvm
-x86-experimental-vector-shuffle-lowering' to clang. Please test it out!
There may be some correctness bugs; I'm still fuzz testing it to shake them
out. But I expect fairly few
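As a usage note, the flags quoted above can be exercised on any small shuffle test; the IR below is an illustrative sketch (the file and function names are assumptions, only the flag itself comes from the announcement).

; shuffle_test.ll -- hypothetical test case; try e.g.:
;   llc -x86-experimental-vector-shuffle-lowering shuffle_test.ll -o -
define <4 x float> @swap_halves(<4 x float> %v) {
entry:
  ; swap the low and high 64-bit halves of the vector
  %s = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x float> %s
}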
2013 Aug 22
2
New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
...3dnow
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32_mmx
@@ -596,7 +597,7 @@
movss xmm3, xmm2
movss xmm2, xmm0
- ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm3:xmm3:xmm2
+ ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm4:xmm3:xmm2
movaps xmm1, xmm0
mulps xmm1, xmm2
addps xmm5, xmm1
@@ -619,6 +620,95 @@
ret
ALIGN 16
+cident FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
+ ;[ebp + 20] == autoc[]
+ ;[ebp + 16] == lag
+ ;[ebp + 12] == data_len
+ ;[ebp + 8] == data[]
+ ;[esp] == __m128
+ ;[esp +...
2009 Jan 31
2
[LLVMdev] Optimized code analysis problems
...elow.
I get the function call names as llvm.x86.* intrinsics instead of the
original function names (e.g. _mm_cvtsi32_si128)
#include <pmmintrin.h>
#include <sys/time.h>
#include <iostream>
void foo_opt(unsigned char output[64], int Yc[64], int S_BITS)
{
__m128i XMM1, XMM2, XMM3, XMM4;
__m128i *xmm1 = (__m128i*)Yc;
__m128i XMM5 = _mm_cvtsi32_si128(S_BITS + 3) ;
XMM2 = _mm_set1_epi32(S_BITS + 2);
for (int l = 0; l < 8; l++) {
XMM1 = _mm_loadu_si128(xmm1++);
XMM3 = _mm_add_epi32(XMM1, XMM2);
XMM1 = _mm_cmplt_epi32(XMM1, _mm_setzero...
2015 Jan 29
2
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...= xmm2[3,0],xmm0[0,0]
> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
> vmovaps -0xXX(%rbp), %xmm3
> ...
> vinsertps $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
> becomes:
> vpermilps $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
> ...
> vinsertps $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>
> Note that the second version does the shuffle in-place, in xmm2....
2014 Sep 05
2
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
...Robert Lougher <rob.lougher at gmail.com>
>> wrote:
>>>
>>> Unfortunately, another team, while doing internal testing has seen the
>>> new path generating illegal insertps masks. A sample here:
>>>
>>> vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>>> vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>>> vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>>> vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 =
>>> xmm4[0,1],xmm1[2],xm...
2014 Sep 19
4
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
...ible.
3. When zero-extending 2 packed 32-bit integers, we should try to
emit a vpmovzxdq
Example:
vmovq 20(%rbx), %xmm0
vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
Before:
vpmovzxdq 20(%rbx), %xmm0
4. We no longer emit a simpler 'vmovq' in the following case:
vxorpd %xmm4, %xmm4, %xmm4
vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
Before, we used to generate:
vmovq %xmm2, %xmm4
Before, the vmovq implicitly zero-extended the quadword in %xmm2 to 128 bits. Now we always do this with a vxorpd+vblendps.
As I said, I will try to create smaller rep...
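For point 3 above, an IR-level sketch (not from the thread; names are hypothetical) of the zero-extension pattern that ideally lowers to a single vpmovzxdq load:

; zext_2xi32.ll -- hypothetical example for the vpmovzxdq case
define <2 x i64> @zext_2xi32(<2 x i32>* %p) {
entry:
  ; load two packed 32-bit integers and zero-extend them to 64 bits
  %v = load <2 x i32>, <2 x i32>* %p, align 4
  %z = zext <2 x i32> %v to <2 x i64>
  ret <2 x i64> %z
}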
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...ns from a
>> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>>
>> [...]
>> movaps 32(%rdi), %xmm3
>> movaps 48(%rdi), %xmm2
>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>> addps %xmm0, %xmm1
>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>> [...]
>>
>> xmm3 loaded, duplicated into 2 regi...
2015 Jan 30
4
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...mm0 ## xmm0 =
>>> xmm3[0,2],xmm0[1,2]
>>>
>>>
>>> Also, I see differences when some loads are shuffled, that I'm a bit
>>> conflicted about:
>>> vmovaps -0xXX(%rbp), %xmm3
>>> ...
>>> vinsertps $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 =
>>> xmm4[3],xmm3[1,2,3]
>>> becomes:
>>> vpermilps $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>>> ...
>>> vinsertps $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 =
>>> xmm4[3],xmm2[1,2,3]
>>>
>...