Displaying 20 results from an estimated 41 matches for "ymm1".
2017 Feb 18 (2): Vector trunc code generation difference between llvm-3.9 and 4.0
...later survives through the backend and
> produces worse code even for x86 with AVX2:
> before:
> vmovd %edi, %xmm1
> vpmovzxwq %xmm1, %xmm1
> vpsraw %xmm1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1, %ymm1
> vpmovzxwd %xmm1, %ymm1
> vpmovsxwd %xmm0, %ymm0
> vpsravd %ymm1, %ymm0, %ymm0
> vpshufb %ymm2, %ymm0, %ymm0
> vpermq $232, %ymm0...
2018 Jun 29 (2): [RFC][VECLIB] how should we legalize VECLIB calls?
...<8 x double> __svml_sin8(<8 x double>) after the vectorizer.
This is 8-element SVML sin() called with 8-element argument. On the surface, this looks very good.
Later on, standard vector type legalization kicks-in but only the argument and return data are legalized.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin8
vmovups %ymm1, 32(%r15,%r12,8)
vmovups %ymm0, (%r15,%r12,8)
Unfortunately, __svml_sin8() doesn't use this form of input/output. It takes...
2017 Feb 17 (2): Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
return a >> n;
}
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
2017 Mar 08 (2): Vector trunc code generation difference between llvm-3.9 and 4.0
...r x86 with AVX2:
>>> before:
>>> vmovd %edi, %xmm1
>>> vpmovzxwq %xmm1, %xmm1
>>> vpsraw %xmm1, %xmm0, %xmm0
>>> retq
>>>
>>> after:
>>> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>> vpshufb %ymm2, %ymm1, %ymm1
>>> vpermq $232, %ymm1, %ymm1
>>> vpmovzxwd %xmm1, %ymm1
>>> vpmovsxwd %xmm0, %ymm0
>>> vpsravd %ymm1, %ymm0, %ymm0
>>> vpshu...
2012 Jan 10 (0): [LLVMdev] Calling conventions for YMM registers on AVX
...pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
vmovaps %xmm7, -32(%rbp) # 16-byte Spill
vmovaps %xmm6, -16(%rbp) # 16-byte Spill
vmovaps %ymm3, %ymm6
vmovaps %ymm2, %ymm7
vaddps %ymm7, %ymm0, %ymm0
vaddps %ymm6, %ymm1, %ymm1
callq foo
vsubps %ymm7, %ymm0, %ymm0
vsubps %ymm6, %ymm1, %ymm1
vmovaps -16(%rbp), %xmm6 # 16-byte Reload
vmovaps -32(%rbp), %xmm7 # 16-byte Reload
addq $64, %rsp
popq %rbp
ret
ymm6,ymm7 are not saved ac...
2012 Jan 09 (3): [LLVMdev] Calling conventions for YMM registers on AVX
...This thread has lots of interesting information: http://software.intel.com/en-us/forums/showthread.php?t=59291
I wasn't able to find a formal Win64 ABI spec, but according to http://www.agner.org/optimize/calling_conventions.pdf, xmm6-xmm15 are callee-saved on win64, but the high bits in ymm6-ymm15 are not.
That's not currently correctly modelled in LLVM. To fix it, create a pseudo-register YMMHI_CLOBBER that aliases ymm6-ymm15. Then add YMMHI_CLOBBER to the registers clobbered by WINCALL64*.
/jakob
2016 May 06 (3): Unnecessary spill/fill issue
...nstant vectors immediately to stack,
then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4KB of stack use just
for these constants. I think a better approach could be to load the
constant vector pointers as needed:
movabsq $.LCPI0_212...
2018 Jul 02 (2): [RFC][VECLIB] how should we legalize VECLIB calls?
...>>
>> This is 8-element SVML sin() called with 8-element argument. On the
>> surface, this looks very good.
>>
>> Later on, standard vector type legalization kicks-in but only the
>> argument and return data are legalized.
>>
>> vmovaps %ymm0, %ymm1
>> vcvtdq2pd %xmm1, %ymm0
>> vextractf128 $1, %ymm1, %xmm1
>> vcvtdq2pd %xmm1, %ymm1
>> callq __svml_sin8
>> vmovups %ymm1, 32(%r15,%r12,8)
>> v...
2018 Jul 02 (2): [RFC][VECLIB] how should we legalize VECLIB calls?
...ble>) after the vectorizer.
>
> This is 8-element SVML sin() called with 8-element argument. On the
> surface, this looks very good.
>
> Later on, standard vector type legalization kicks-in but only the argument
> and return data are legalized.
>
> vmovaps %ymm0, %ymm1
> vcvtdq2pd %xmm1, %ymm0
> vextractf128 $1, %ymm1, %xmm1
> vcvtdq2pd %xmm1, %ymm1
> callq __svml_sin8
> vmovups %ymm1, 32(%r15,%r12,8)
> vmovups %ymm0, (%r15,%r12,8)
>
> Unfortunat...
2011 Nov 30 (2): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
...ces) {
%pointer = getelementptr float* @lut, <8 x i32> %indices
%values = load <8 x float*> %pointer
ret <8 x float> %values;
}
And the final AVX2 code I'd expect would consist of a single VGATHERDPS, both on 64bits and 32bits addressing mode:
foo:
VPCMPEQB ymm1, ymm1, ymm1 ; generate all ones
VGATHERDPS ymm0, DWORD PTR [ymm0 * 4 + lut], ymm1
RET
Jose
----- Original Message -----
> Hi Jose,
>
> The proposed IR change does not contribute nor hinder the usecase you
> mentioned. The case of a base + vector-index shoul...
2018 Jul 02 (8): [RFC][VECLIB] how should we legalize VECLIB calls?
...8-element SVML sin() called with 8-element
> argument. On the surface, this looks very good.
>
> Later on, standard vector type legalization kicks-in but
> only the argument and return data are legalized.
>
> vmovaps %ymm0, %ymm1
> vcvtdq2pd %xmm1, %ymm0
> vextractf128 $1, %ymm1, %xmm1
> vcvtdq2pd %xmm1, %ymm1
> callq __svml_sin8
> vmovups %ymm1, 32(%r15,%r12,8)
> ...
2020 Sep 01 (2): Vector evolution?
...it to:
00000000000001e0 <_Z4fct7Pf>:
1e0: 31 c0 xor %eax,%eax
1e2: c4 e2 7d 18 05 00 00 vbroadcastss 0x0(%rip),%ymm0 # 1eb <_Z4fct7Pf+0xb>
1e9: 00 00
1eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1f0: c5 fc 59 0c 87 vmulps (%rdi,%rax,4),%ymm0,%ymm1
1f5: c5 fc 59 54 87 20 vmulps 0x20(%rdi,%rax,4),%ymm0,%ymm2
1fb: c5 fc 59 5c 87 40 vmulps 0x40(%rdi,%rax,4),%ymm0,%ymm3
201: c5 fc 59 64 87 60 vmulps 0x60(%rdi,%rax,4),%ymm0,%ymm4
207: c5 fc 11 0c 87 vmovups %ymm1,(%rdi,%rax,4)
20c: c5 fc 11 54 87 20 vmovups %ymm2,0x20(%rdi,...
2011 Nov 29 (0): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
Hi Jose,
The proposed IR change does not contribute nor hinder the usecase you mentioned. The case of a base + vector-index should be easily addressed by an intrinsic. The pointer-vector proposal comes to support full scatter/gather instructions (such as the AVX2 gather instructions).
Nadav
-----Original Message-----
From: Jose Fonseca [mailto:jfonseca at vmware.com]
Sent: Tuesday, November
2011 Nov 30 (0): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
...ces) {
%pointer = getelementptr float* @lut, <8 x i32> %indices
%values = load <8 x float*> %pointer
ret <8 x float> %values;
}
And the final AVX2 code I'd expect would consist of a single VGATHERDPS, both on 64bits and 32bits addressing mode:
foo:
VPCMPEQB ymm1, ymm1, ymm1 ; generate all ones
VGATHERDPS ymm0, DWORD PTR [ymm0 * 4 + lut], ymm1
RET
Jose
----- Original Message -----
> Hi Jose,
>
> The proposed IR change does not contribute nor hinder the usecase you
> mentioned. The case of a base + vector-index shoul...
2011 Nov 29 (4): [LLVMdev] [llvm-commits] Vectors of Pointers and Vector-GEP
----- Original Message -----
> "Rotem, Nadav" <nadav.rotem at intel.com> writes:
>
> > David,
> >
> > Thanks for the support! I sent a detailed email with the overall
> > plan. But just to reiterate, the GEP would look like this:
> >
> > %PV = getelementptr <4 x i32*> %base, <4 x i32> <i32 1, i32 2, i32
> > 3, i32
2013 Dec 11 (2): [LLVMdev] AVX code gen
...post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such instructions (using the 3.3 release or either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking the tools correctly but given that -fvectorize and -fslp-vectorize are on by default at 3.4 I would have tho...
2017 Aug 17 (4): unable to emit vectorized code in LLVM IR
I assume the compiler knows that you only have 2 input values that you just added together 1000 times.
Despite the fact that you stored to a[i] and b[i] here, nothing reads them
other than the addition in the same loop iteration. So the compiler easily
removed the a and b arrays. Same with 'c', it's not read outside the loop
so it doesn't need to exist. So the compiler turned your
2012 Mar 01 (3): [LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm |
grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm0...
2013 Dec 12 (0): [LLVMdev] AVX code gen
...post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such instructions (using the 3.3 release or either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking the tools correctly but given that -fvectorize and -fslp-vectorize are on by default at 3.4 I would have tho...