search for: ymm0

Displaying 20 results from an estimated 74 matches for "ymm0".

2019 Sep 02
3
AVX2 codegen - question reg. FMA generation
...generator (with the CPU set to Haswell or later) turning it into AVX2 FMA instructions. Here's the snippet in the output it generates:

$ llc -O3 -mcpu=skylake
---------------------
.LBB0_2:                        # =>This Inner Loop Header: Depth=1
        vbroadcastss (%rsi,%rdx,4), %ymm0
        vmulps  (%rdi,%rcx), %ymm0, %ymm0
        vaddps  (%rax,%rcx), %ymm0, %ymm0
        vmovups %ymm0, (%rax,%rcx)
        incq    %rdx
        addq    $32, %rcx
        cmpq    $15, %rdx
        jle     .LBB0_2
-----------------------
$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Hos...
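A minimal sketch of the kind of source loop that produces the vmulps/vaddps pair above (function and parameter names are assumptions, not from the post). Note that whether LLVM contracts the multiply and add into a single vfmadd instruction depends on the FP-contraction setting (e.g. clang's -ffp-contract=fast or a fast-math mode), not only on targeting an FMA-capable CPU such as Skylake:

/* Hypothetical reconstruction: a multiply-add loop that vectorizes to
   vbroadcastss + vmulps + vaddps; with contraction enabled the backend
   can emit vfmadd231ps instead of the separate multiply and add. */
void saxpy_like(float *restrict out, const float *restrict a,
                const float *restrict b, float s, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = s * a[i] + b[i];
}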
2012 May 24
4
[LLVMdev] use AVX automatically if present
...# @_fun1
        .cfi_startproc
# BB#0:                         # %_L1
        pushq   %rbp
.Ltmp2:
        .cfi_def_cfa_offset 16
.Ltmp3:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
.Ltmp4:
        .cfi_def_cfa_register %rbp
        vmovaps (%rdi), %ymm0
        vaddps  (%rsi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
        popq    %rbp
        vzeroupper
        ret
.Ltmp5:
        .size   _fun1, .Ltmp5-_fun1
        .cfi_endproc

        .section ".note.GNU-stack","",@progbits

I guess your answer is...
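For context, a sketch of a function with _fun1's shape, written with the GCC/clang vector extension (the typedef and name are assumptions). Built with -mavx, or with -march=native on an AVX host, it compiles to the vmovaps/vaddps sequence quoted above; without AVX enabled the backend stays on 128-bit SSE:

typedef float v8f32 __attribute__((vector_size(32)));

/* 8-float in-place vector add: vmovaps (%rdi), %ymm0;
   vaddps (%rsi), %ymm0, %ymm0; vmovaps %ymm0, (%rdi). */
void fun1(v8f32 *a, const v8f32 *b) {
    *a += *b;
}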
2012 Mar 01
3
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
        vmovaps -176(%rbp), %ymm14
        vmovaps -144(%rbp), %ymm11
        vmovaps -240(%rbp), %ymm13
        vmovaps -208(%rbp), %ymm9
        vmovaps -272(%rbp), %ymm7
        vmovaps -304(%rbp), %ymm0
        vmovaps -112(%rbp), %ymm0
        vmovaps -80(%rbp), %ymm1
        vmovaps -112(%rbp), %ymm0
        vmovaps -80(%rbp), %ymm0
        vmovaps -176(%rbp), %ymm15
        vmovaps -144(%rbp), %ymm0
        vmovaps -240(%rbp), %ymm0
        vmovaps -208(%rbp), %ymm0
        vmovaps -272(%rbp), %ymm0...
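The core of the complaint: vmovaps with a ymm operand requires a 32-byte-aligned address, so spilling ymm registers to rbp-relative slots in a frame that is only 16-byte aligned (the -stack-alignment=16 above) is incorrect unless the frame is realigned or vmovups is used. A minimal sketch of the constraint, assuming immintrin.h intrinsics (names illustrative):

#include <immintrin.h>

/* vmovaps (the aligned move) may only touch 32-byte-aligned memory;
   an under-aligned spill slot must use vmovups, or the frame must be
   realigned to 32 bytes first. */
__m256 spill_reload(__m256 a, __m256 b) {
    __attribute__((aligned(32))) float slot[8];   /* valid vmovaps target */
    _mm256_store_ps(slot, a);
    return _mm256_add_ps(_mm256_load_ps(slot), b);
}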
2013 Sep 05
1
[LLVMdev] AVX calling convention?
...tions. The symptom is: a function is called, and the upper half of the function argument (which is short16) is zero. This happens only when I compile code with pocl, but not when I use clang and/or llc manually. I tracked this down to the following. The call site looks like

        vmovdqa 24064(%rsp), %ymm0
        vmovdqa %ymm0, (%rsp)
        vzeroupper
        callq   __Z14convert_char16Dv16_s

which passes the argument on the stack. The callee, however, begins with

__Z14convert_char16Dv16_s:      ## @_Z14convert_char16Dv16_s
        .cfi_startproc
## BB#0:                        ## %entry
        pushq   %rbp
Ltmp2:
....
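A sketch of the function's shape, reconstructed from the mangled name _Z14convert_char16Dv16_s (the typedefs and body are assumptions). The point is ABI-related: a 256-bit vector argument travels in %ymm0 only when both caller and callee are compiled with AVX enabled; if the two translation units disagree on target features, one side passes the short16 on the stack while the other reads the register, and the upper 128 bits show up as zero, which matches the symptom described:

typedef short short16 __attribute__((vector_size(32)));
typedef char  char16v __attribute__((vector_size(16)));

/* Caller and callee must agree on -mavx for the 32-byte argument to be
   passed in %ymm0; a mismatch loses the upper half. */
char16v convert_char16(short16 v) {
    char16v r;
    for (int i = 0; i < 16; ++i)
        r[i] = (char)v[i];
    return r;
}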
2012 May 24
0
[LLVMdev] use AVX automatically if present
...
> .cfi_startproc
> # BB#0:                       # %_L1
>         pushq   %rbp
> .Ltmp2:
>         .cfi_def_cfa_offset 16
> .Ltmp3:
>         .cfi_offset %rbp, -16
>         movq    %rsp, %rbp
> .Ltmp4:
>         .cfi_def_cfa_register %rbp
>         vmovaps (%rdi), %ymm0
>         vaddps  (%rsi), %ymm0, %ymm0
>         vmovaps %ymm0, (%rdi)
>         popq    %rbp
>         vzeroupper
>         ret
> .Ltmp5:
>         .size   _fun1, .Ltmp5-_fun1
>         .cfi_endproc
>
>         .section ".note.GNU-stack","",@...
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...m1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1, %ymm1
> vpmovzxwd %xmm1, %ymm1
> vpmovsxwd %xmm0, %ymm0
> vpsravd %ymm1, %ymm0, %ymm0
> vpshufb %ymm2, %ymm0, %ymm0
> vpermq $232, %ymm0, %ymm0
> vzeroupper
>
> So this example may have won the bug lottery by exposing front-end,
> middle-end, and back-end bugs. :)
>
> On Fri, Feb 17, 2017 a...
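Reading the "after" sequence: the v8i16 shift is promoted to 32-bit lanes (vpmovsxwd/vpmovzxwd), shifted with the per-element vpsravd, then packed back down (vpshufb/vpermq). vpsravd is the right tool for a genuinely variable 32-bit shift; a sketch of that case, using the GCC/clang vector extension (names assumed):

typedef int v8i32 __attribute__((vector_size(32)));

/* Per-element variable arithmetic shift: with -mavx2 this lowers to a
   single vpsravd. The regression in the thread applied this 32-bit
   path, plus the widen/pack shuffles, to a uniform 16-bit shift that a
   lone vpsraw could handle. */
v8i32 var_shift(v8i32 a, v8i32 n) {
    return a >> n;
}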
2020 Sep 01
2
Vector evolution?
...I do the following loop:

void fct7(float *x) {
  #pragma clang loop vectorize(enable)
  for (int i = 0; i < 4 * 256; ++i)
    x[i] = 7 * x[i];
}

It compiles to:

00000000000001e0 <_Z4fct7Pf>:
 1e0: 31 c0                   xor    %eax,%eax
 1e2: c4 e2 7d 18 05 00 00    vbroadcastss 0x0(%rip),%ymm0   # 1eb <_Z4fct7Pf+0xb>
 1e9: 00 00
 1eb: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
 1f0: c5 fc 59 0c 87          vmulps (%rdi,%rax,4),%ymm0,%ymm1
 1f5: c5 fc 59 54 87 20       vmulps 0x20(%rdi,%rax,4),%ymm0,%ymm2
 1fb: c5 fc 59 5c 87 40       vmulps 0x40(%rdi,%rax,4),%ymm0,%ymm3
 201: c5 fc...
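The disassembly shows the vectorizer's usual shape for this loop: the constant 7.0f is broadcast once into %ymm0, then the body is unrolled into several 8-wide vmulps per iteration. One optional refinement, assuming the caller really can guarantee it (the variant below is illustrative, not from the post): asserting 32-byte alignment of x, which permits aligned vector accesses:

void fct7_aligned(float *x) {
    /* Promise 32-byte alignment so the backend may use aligned loads
       and stores for the 8-wide vector operations. */
    float *xa = (float *)__builtin_assume_aligned(x, 32);
    #pragma clang loop vectorize(enable)
    for (int i = 0; i < 4 * 256; ++i)
        xa[i] = 7 * xa[i];
}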
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...l to <8 x double> __svml_sin8(<8 x double>) after the vectorizer. This is the 8-element SVML sin() called with an 8-element argument. On the surface, this looks very good. Later on, standard vector type legalization kicks in, but only the argument and return data are legalized:

        vmovaps %ymm0, %ymm1
        vcvtdq2pd %xmm1, %ymm0
        vextractf128 $1, %ymm1, %xmm1
        vcvtdq2pd %xmm1, %ymm1
        callq   __svml_sin8
        vmovups %ymm1, 32(%r15,%r12,8)
        vmovups %ymm0, (%r15,%r12,8)

Unfortunately, __svml_sin8() doesn't use this form of input/output. I...
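The mismatch, restated: after legalization for a 256-bit target the <8 x double> value lives in two ymm halves, but the call still names the 8-element entry point, which expects a single 512-bit operand. What the legalized code would actually need is one call per legal piece, along the lines of this sketch (the __svml_sin4 prototype is an assumption, shown only to illustrate the shape):

typedef double v4f64 __attribute__((vector_size(32)));

extern v4f64 __svml_sin4(v4f64);   /* assumed SVML-style 4-element entry */

/* One call per legal 256-bit half instead of one 8-element call whose
   operand has been split across ymm0/ymm1. */
void sin8_legalized(v4f64 *lo, v4f64 *hi) {
    *lo = __svml_sin4(*lo);
    *hi = __svml_sin4(*hi);
}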
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
        vmovaps -176(%rbp), %ymm14
        vmovaps -144(%rbp), %ymm11
        vmovaps -240(%rbp), %ymm13
        vmovaps -208(%rbp), %ymm9
        vmovaps -272(%rbp), %ymm7
        vmovaps -304(%rbp), %ymm0
        vmovaps -112(%rbp), %ymm0
        vmovaps -80(%rbp), %ymm1
        vmovaps -112(%rbp), %ymm0
        vmovaps -80(%rbp), %ymm0
        vmovaps -176(%rbp), %ymm15
        vmovaps -144(%rbp), %ymm0
        vmovaps -240(%rbp), %ymm0
        vmovaps -208(%rbp), %ymm0
        vmovaps -272(%rbp),...
2012 Mar 01
2
[LLVMdev] Stack alignment in kernel
I'm running in AVX mode, but the stack before the call to the kernel is aligned to 16 bytes. Could you please tell me where this should be specified? Thank you.

- Elena
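A sketch of the standard way to make a kernel safe when its caller's stack may be only 16-byte aligned (the function shape is illustrative): give the frame a local with 32-byte alignment, which forces the compiler to realign the stack in the prologue, after which ymm spills and aligned 32-byte accesses are valid:

/* The over-aligned local obliges the compiler to realign %rsp to
   32 bytes on entry, regardless of the caller's alignment. */
void kernel_entry(const float *in, float *out) {
    __attribute__((aligned(32))) float buf[8];
    for (int i = 0; i < 8; ++i) buf[i] = 2.0f * in[i];
    for (int i = 0; i < 8; ++i) out[i] = buf[i];
}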
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:

typedef signed short v8i16_t __attribute__((ext_vector_type(8)));

v8i16_t foo (v8i16_t a, int n) {
  return a >> n;
}

Best regards
Saurabh

On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma@movidius.com> wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
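For reference, the same operation with the splat made explicit (function name illustrative; ext_vector_type is a clang extension). This is the form the backend should recognize as a uniform shift and lower to a single vpsraw, rather than the widen/vpsravd/pack sequence quoted elsewhere in the thread:

typedef signed short v8i16_t __attribute__((ext_vector_type(8)));

v8i16_t foo_splat(v8i16_t a, int n) {
    short s = (short)n;
    v8i16_t shift = { s, s, s, s, s, s, s, s };   /* uniform shift amount */
    return a >> shift;
}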
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...
>>> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>> vpshufb %ymm2, %ymm1, %ymm1
>>> vpermq $232, %ymm1, %ymm1
>>> vpmovzxwd %xmm1, %ymm1
>>> vpmovsxwd %xmm0, %ymm0
>>> vpsravd %ymm1, %ymm0, %ymm0
>>> vpshufb %ymm2, %ymm0, %ymm0
>>> vpermq $232, %ymm0, %ymm0
>>> vzeroupper
>>>
>>> So this example may have won the bug lottery by exposing all of front-,
>>> middl...
2013 Dec 12
0
[LLVMdev] AVX code gen
...t %rbp, -16
        movq    %rsp, %rbp
Ltmp4:
        .cfi_def_cfa_register %rbp
        xorl    %eax, %eax
        .align  4, 0x90
LBB0_1:                         ## %vector.body
                                ## =>This Inner Loop Header: Depth=1
        vmovups (%rdx,%rax,4), %ymm0
        vmulps  (%rsi,%rax,4), %ymm0, %ymm0
        vaddps  (%rdi,%rax,4), %ymm0, %ymm0
        vmovups %ymm0, (%rdi,%rax,4)
        addq    $8, %rax
        cmpq    $256, %rax      ## imm = 0x100
        jne     LBB0_1
## BB#2:                        ## %for.end
        popq    %rb...
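A sketch of the kind of source that yields this loop (names are assumptions; the element count is inferred from the asm: 256 iterations stepping 8 floats each is 2048 elements). Built with clang -O3 and an AVX target such as -mavx, the loop vectorizer emits the 8-wide vmovups/vmulps/vaddps body shown above:

void mul_add(float *restrict a, const float *restrict b,
             const float *restrict c) {
    /* a[i] += b[i] * c[i], eight floats per vector iteration */
    for (int i = 0; i < 2048; ++i)
        a[i] += b[i] * c[i];
}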
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
...c.ll | grep movaps | grep
>> ymm | grep rbp
>>      vmovaps -176(%rbp), %ymm14
>>      vmovaps -144(%rbp), %ymm11
>>      vmovaps -240(%rbp), %ymm13
>>      vmovaps -208(%rbp), %ymm9
>>      vmovaps -272(%rbp), %ymm7
>>      vmovaps -304(%rbp), %ymm0
>>      vmovaps -112(%rbp), %ymm0
>>      vmovaps -80(%rbp), %ymm1
>>      vmovaps -112(%rbp), %ymm0
>>      vmovaps -80(%rbp), %ymm0
>>      vmovaps -176(%rbp), %ymm15
>>      vmovaps -144(%rbp), %ymm0
>>      vmovaps -240(%rbp), %ymm0
>...
2013 Dec 11
2
[LLVMdev] AVX code gen
Hello - I found this post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang/llvm is capable of generating packed AVX instructions and utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2, for example) but I cannot get clang/llvm to generate such
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...l to <8 x double> __svml_sin8(<8 x double>) after the vectorizer. This is the 8-element SVML sin() called with an 8-element argument. On the surface, this looks very good. Later on, standard vector type legalization kicks in, but only the argument and return data are legalized:

        vmovaps %ymm0, %ymm1
        vcvtdq2pd %xmm1, %ymm0
        vextractf128 $1, %ymm1, %xmm1
        vcvtdq2pd %xmm1, %ymm1
        callq   __svml_sin8
        vmovups %ymm1, 32(%r15,%r12,8)
        vmovups %ymm0, (%r15,%r12,8)

Unfortunately, __svml_sin8() doesn't use this form of input/output. I...
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
...# %entry
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        vmovaps %xmm7, -32(%rbp)        # 16-byte Spill
        vmovaps %xmm6, -16(%rbp)        # 16-byte Spill
        vmovaps %ymm3, %ymm6
        vmovaps %ymm2, %ymm7
        vaddps  %ymm7, %ymm0, %ymm0
        vaddps  %ymm6, %ymm1, %ymm1
        callq   foo
        vsubps  %ymm7, %ymm0, %ymm0
        vsubps  %ymm6, %ymm1, %ymm1
        vmovaps -16(%rbp), %xmm6        # 16-byte Reload
        vmovaps -32(%rbp), %xmm7        # 16-byte Reload
        addq    $64, %rsp
        popq    %rbp...
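What the excerpt shows: two 256-bit values are parked in ymm6/ymm7 across the call, yet only their xmm halves are spilled and reloaded (the "16-byte Spill" lines), on the assumption that xmm6/xmm7 are callee-saved. Even on ABIs where those xmm halves are preserved, the ymm upper halves are call-clobbered, so the upper 128 bits of the live values can be lost. A sketch of the same pattern at the source level (names assumed):

#include <immintrin.h>

extern void foo(void);

/* c and d must survive the call; preserving them requires full 32-byte
   spills (or rematerialization), not 16-byte xmm saves. */
__m256 across_call(__m256 a, __m256 b, __m256 c, __m256 d) {
    __m256 x = _mm256_add_ps(a, c);   /* the vaddps before the call */
    __m256 y = _mm256_add_ps(b, d);
    foo();
    x = _mm256_sub_ps(x, c);          /* the vsubps after the call */
    y = _mm256_sub_ps(y, d);
    return _mm256_add_ps(x, y);
}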
2013 Jul 10
4
[LLVMdev] unaligned AVX store gets split into two instructions
...ure_instructions
        .globl  _vstore
        .align  4, 0x90
_vstore:                        ## @vstore
        .cfi_startproc
## BB#0:                        ## %entry
        pushq   %rbp
Ltmp2:
        .cfi_def_cfa_offset 16
Ltmp3:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
Ltmp4:
        .cfi_def_cfa_register %rbp
        vmovups (%rdi), %ymm0
        popq    %rbp
        ret
        .cfi_endproc
----------------------------------------------------------------
Running llvm-33/bin/llc vstore.ll creates:

        .section __TEXT,__text,regular,pure_instructions
        .globl  _main
        .align  4, 0x90
_main:                          ## @main...
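A minimal reproduction of the pattern in question (function name illustrative): an unaligned 256-bit access. On CPUs whose scheduling model marks 32-byte unaligned memory operations as slow (the Sandy Bridge class), LLVM may split such a store into two 16-byte halves (vmovups %xmm plus vextractf128) rather than emit one vmovups %ymm; on later CPUs the single wide instruction is used:

#include <immintrin.h>

void vstore_demo(const float *src, float *dst) {
    __m256 v = _mm256_loadu_ps(src);   /* unaligned 32-byte load */
    _mm256_storeu_ps(dst, v);          /* unaligned 32-byte store: may be
                                          split on slow-unaligned targets */
}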
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote:
>
> On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote:
>
>> I'll explain what we see in the code.
>> 1. The caller saves XMM registers across the call if needed (according to the DEFS definition).
>> YMMs are not in the set, so the caller does not take care of them.
>
> This is not how the register allocator
2016 May 06
3
Unnecessary spill/fill issue
...at compile time. Each vector has a single use. In the final asm, I see a series of spills at the top of the function of all the constant vectors immediately to the stack; then each use references the stack pointer directly.

Lots of these at the top of the function:

        movabsq $.LCPI0_212, %rbx
        vmovaps (%rbx), %ymm0
        vmovaps %ymm0, 2816(%rsp)       # 32-byte Spill

Later on, each use references the stack pointer:

        vpaddd  2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload

It seems the spill to stack is unnecessary. In one particularly bad kernel, I have 128 8-wide constant vectors, and so there is 4KB of stack us...
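For contrast, a sketch of the codegen the poster expects (the constant and function names are illustrative): each constant vector folded straight from the constant pool into its single use as a RIP-relative memory operand, with no stack round-trip:

#include <immintrin.h>

static const int kConst[8] __attribute__((aligned(32))) =
    { 1, 2, 3, 4, 5, 6, 7, 8 };

/* Ideal form: vpaddd kConst(%rip), %ymm0, %ymm0 -- the constant load is
   folded into the add, so no spill slot is needed. */
__m256i add_const(__m256i x) {
    return _mm256_add_epi32(x, _mm256_load_si256((const __m256i *)kConst));
}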