Displaying 20 results from an estimated 41 matches for "ymm1".
2017 Feb 18 (2): Vector trunc code generation difference between llvm-3.9 and 4.0
...later survives through the backend and
> produces worse code even for x86 with AVX2:
> before:
> vmovd %edi, %xmm1
> vpmovzxwq %xmm1, %xmm1
> vpsraw %xmm1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1, %ymm1
> vpmovzxwd %xmm1, %ymm1
> vpmovsxwd %xmm0, %ymm0
> vpsravd %ymm1, %ymm0, %ymm0
> vpshufb %ymm2, %ymm0, %ymm0
> vpermq $232, %ymm0...
2018 Jun 29 (2): [RFC][VECLIB] how should we legalize VECLIB calls?
...<8 x double> __svml_sin8(<8 x double>) after the vectorizer.
This is 8-element SVML sin() called with 8-element argument. On the surface, this looks very good.
Later on, standard vector type legalization kicks-in but only the argument and return data are legalized.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin8
vmovups %ymm1, 32(%r15,%r12,8)
vmovups %ymm0, (%r15,%r12,8)
Unfortunately, __svml_sin8() doesn't use this form of input/output. It takes...
2017 Feb 17 (2): Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
return a >> n;
}
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
2017 Mar 08 (2): Vector trunc code generation difference between llvm-3.9 and 4.0
...r x86 with AVX2:
>>> before:
>>> vmovd %edi, %xmm1
>>> vpmovzxwq %xmm1, %xmm1
>>> vpsraw %xmm1, %xmm0, %xmm0
>>> retq
>>>
>>> after:
>>> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>> vpshufb %ymm2, %ymm1, %ymm1
>>> vpermq $232, %ymm1, %ymm1
>>> vpmovzxwd %xmm1, %ymm1
>>> vpmovsxwd %xmm0, %ymm0
>>> vpsravd %ymm1, %ymm0, %ymm0
>>> vpshu...
2012 Jan 10 (0): [LLVMdev] Calling conventions for YMM registers on AVX
...pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
vmovaps %xmm7, -32(%rbp) # 16-byte Spill
vmovaps %xmm6, -16(%rbp) # 16-byte Spill
vmovaps %ymm3, %ymm6
vmovaps %ymm2, %ymm7
vaddps %ymm7, %ymm0, %ymm0
vaddps %ymm6, %ymm1, %ymm1
callq foo
vsubps %ymm7, %ymm0, %ymm0
vsubps %ymm6, %ymm1, %ymm1
vmovaps -16(%rbp), %xmm6 # 16-byte Reload
vmovaps -32(%rbp), %xmm7 # 16-byte Reload
addq $64, %rsp
popq %rbp
ret
ymm6,ymm7 are not saved ac...
2012 Jan 09 (3): [LLVMdev] Calling conventions for YMM registers on AVX
...This thread has lots of interesting information: http://software.intel.com/en-us/forums/showthread.php?t=59291
I wasn't able to find a formal Win64 ABI spec, but according to http://www.agner.org/optimize/calling_conventions.pdf, xmm6-xmm15 are callee-saved on win64, but the high bits in ymm6-ymm15 are not.
That's not currently correctly modelled in LLVM. To fix it, create a pseudo-register YMMHI_CLOBBER that aliases ymm6-ymm15. Then add YMMHI_CLOBBER to the registers clobbered by WINCALL64*.
/jakob
2016 May 06 (3): Unnecessary spill/fill issue
...nstant vectors immediately to stack,
then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4KB of stack use just
for these constants. I think a better approach could be to load the
constant vector pointers as needed:
movabsq $.LCPI0_212...
2018 Jul 02 (2): [RFC][VECLIB] how should we legalize VECLIB calls?
...>>
>> This is 8-element SVML sin() called with 8-element argument. On the
>> surface, this looks very good.
>>
>> Later on, standard vector type legalization kicks-in but only the
>> argument and return data are legalized.
>>
>> vmovaps %ymm0, %ymm1
>> vcvtdq2pd %xmm1, %ymm0
>> vextractf128 $1, %ymm1, %xmm1
>> vcvtdq2pd %xmm1, %ymm1
>> callq __svml_sin8
>> vmovups %ymm1, 32(%r15,%r12,8)
>> v...
2018 Jul 02 (2): [RFC][VECLIB] how should we legalize VECLIB calls?
...ble>) after the vectorizer.
>
> This is 8-element SVML sin() called with 8-element argument. On the
> surface, this looks very good.
>
> Later on, standard vector type legalization kicks-in but only the argument
> and return data are legalized.
>
> vmovaps %ymm0, %ymm1
> vcvtdq2pd %xmm1, %ymm0
> vextractf128 $1, %ymm1, %xmm1
> vcvtdq2pd %xmm1, %ymm1
> callq __svml_sin8
> vmovups %ymm1, 32(%r15,%r12,8)
> vmovups %ymm0, (%r15,%r12,8)
>
> Unfortunat...
2011 Nov 30 (2): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
...ces) {
%pointer = getelementptr float* @lut, <8 x i32> %indices
%values = load <8 x float*> %pointer
ret <8 x float> %values;
}
And the final AVX2 code I'd expect would consist of a single VGATHERDPS, both on 64bits and 32bits addressing mode:
foo:
VPCMPEQB ymm1, ymm1, ymm1 ; generate all ones
VGATHERDPS ymm0, DWORD PTR [ymm0 * 4 + lut], ymm1
RET
Jose
----- Original Message -----
> Hi Jose,
>
> The proposed IR change does not contribute nor hinder the usecase you
> mentioned. The case of a base + vector-index shoul...
2018 Jul 02 (8): [RFC][VECLIB] how should we legalize VECLIB calls?
...8-element SVML sin() called with 8-element
> argument. On the surface, this looks very good.
>
> Later on, standard vector type legalization kicks-in but
> only the argument and return data are legalized.
>
> vmovaps %ymm0, %ymm1
> vcvtdq2pd %xmm1, %ymm0
> vextractf128 $1, %ymm1, %xmm1
> vcvtdq2pd %xmm1, %ymm1
> callq __svml_sin8
> vmovups %ymm1, 32(%r15,%r12,8)
> ...
2020 Sep 01 (2): Vector evolution?
...it to:
00000000000001e0 <_Z4fct7Pf>:
1e0: 31 c0 xor %eax,%eax
1e2: c4 e2 7d 18 05 00 00 vbroadcastss 0x0(%rip),%ymm0 # 1eb <_Z4fct7Pf+0xb>
1e9: 00 00
1eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1f0: c5 fc 59 0c 87 vmulps (%rdi,%rax,4),%ymm0,%ymm1
1f5: c5 fc 59 54 87 20 vmulps 0x20(%rdi,%rax,4),%ymm0,%ymm2
1fb: c5 fc 59 5c 87 40 vmulps 0x40(%rdi,%rax,4),%ymm0,%ymm3
201: c5 fc 59 64 87 60 vmulps 0x60(%rdi,%rax,4),%ymm0,%ymm4
207: c5 fc 11 0c 87 vmovups %ymm1,(%rdi,%rax,4)
20c: c5 fc 11 54 87 20 vmovups %ymm2,0x20(%rdi,...
2011 Nov 29 (0): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
Hi Jose,
The proposed IR change does not contribute nor hinder the usecase you mentioned. The case of a base + vector-index should be easily addressed by an intrinsic. The pointer-vector proposal comes to support full scatter/gather instructions (such as the AVX2 gather instructions).
Nadav
-----Original Message-----
From: Jose Fonseca [mailto:jfonseca at vmware.com]
Sent: Tuesday, November
2011 Nov 30 (0): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
...ces) {
%pointer = getelementptr float* @lut, <8 x i32> %indices
%values = load <8 x float*> %pointer
ret <8 x float> %values;
}
And the final AVX2 code I'd expect would consist of a single VGATHERDPS, both on 64bits and 32bits addressing mode:
foo:
VPCMPEQB ymm1, ymm1, ymm1 ; generate all ones
VGATHERDPS ymm0, DWORD PTR [ymm0 * 4 + lut], ymm1
RET
Jose
----- Original Message -----
> Hi Jose,
>
> The proposed IR change does not contribute nor hinder the usecase you
> mentioned. The case of a base + vector-index shoul...
2011 Nov 29 (4): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
----- Original Message -----
> "Rotem, Nadav" <nadav.rotem at intel.com> writes:
>
> > David,
> >
> > Thanks for the support! I sent a detailed email with the overall
> > plan. But just to reiterate, the GEP would look like this:
> >
> > %PV = getelementptr <4 x i32*> %base, <4 x i32> <i32 1, i32 2, i32
> > 3, i32
2013 Dec 11 (2): [LLVMdev] AVX code gen
...post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such instructions (using the 3.3 release or either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking the tools correctly but given that -fvectorize and -fslp-vectorize are on by default at 3.4 I would have tho...
2017 Aug 17 (4): unable to emit vectorized code in LLVM IR
I assume the compiler knows that you only have 2 input values that you just added together 1000 times.
Despite the fact that you stored to a[i] and b[i] here, nothing reads them
other than the addition in the same loop iteration. So the compiler easily
removed the a and b arrays. Same with 'c', it's not read outside the loop
so it doesn't need to exist. So the compiler turned your
2012 Mar 01 (3): [LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm |
grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm0...
2013 Dec 12 (0): [LLVMdev] AVX code gen
...post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such instructions (using the 3.3 release or either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking the tools correctly but given that -fvectorize and -fslp-vectorize are on by default at 3.4 I would have tho...