Displaying 20 results from an estimated 74 matches for "ymm0".
Did you mean:
xmm0
2019 Sep 02
3
AVX2 codegen - question reg. FMA generation
...generator (with cpu set to haswell or later types) turning it into an
AVX2 FMA instructions. Here's the snippet in the output it generates:
$ llc -O3 -mcpu=skylake
---------------------
.LBB0_2: # =>This Inner Loop Header: Depth=1
vbroadcastss (%rsi,%rdx,4), %ymm0
vmulps (%rdi,%rcx), %ymm0, %ymm0
vaddps (%rax,%rcx), %ymm0, %ymm0
vmovups %ymm0, (%rax,%rcx)
incq %rdx
addq $32, %rcx
cmpq $15, %rdx
jle .LBB0_2
-----------------------
$ llc --version
LLVM (http://llvm.org/):
LLVM version 8.0.0
Optimized build.
Default target: x86_64-unknown-linux-gnu
Hos...
2012 May 24
4
[LLVMdev] use AVX automatically if present
...# @_fun1
.cfi_startproc
# BB#0: # %_L1
pushq %rbp
.Ltmp2:
.cfi_def_cfa_offset 16
.Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
.Ltmp4:
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %ymm0
vaddps (%rsi), %ymm0, %ymm0
vmovaps %ymm0, (%rdi)
popq %rbp
vzeroupper
ret
.Ltmp5:
.size _fun1, .Ltmp5-_fun1
.cfi_endproc
.section ".note.GNU-stack","", at progbits
I guess your answer is...
2012 Mar 01
3
[LLVMdev] Stack alignment on X86 AVX seems incorrect
...ows-1252"
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm |
grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm0
vmovaps -176(%rbp), %ymm15
vmovaps -144(%rbp), %ymm0
vmovaps -240(%rbp), %ymm0
vmovaps -208(%rbp), %ymm0
vmovaps -272(%rbp), %ymm0...
2013 Sep 05
1
[LLVMdev] AVX calling convention?
...tions. The symptom is: a function is called, and the upper half of the function argument (which is short16) is zero. This happens only when I compile code with pocl, but not when I use clang and/or llc manually.
I tracked this down to the following. The call site looks like
vmovdqa 24064(%rsp), %ymm0
vmovdqa %ymm0, (%rsp)
vzeroupper
callq __Z14convert_char16Dv16_s
which passes the argument on the stack. The callee, however, begins with
__Z14convert_char16Dv16_s: ## @_Z14convert_char16Dv16_s
.cfi_startproc
## BB#0: ## %entry
pushq %rbp
Ltmp2:
....
2012 May 24
0
[LLVMdev] use AVX automatically if present
....cfi_startproc
> # BB#0: # %_L1
> pushq %rbp
> .Ltmp2:
> .cfi_def_cfa_offset 16
> .Ltmp3:
> .cfi_offset %rbp, -16
> movq %rsp, %rbp
> .Ltmp4:
> .cfi_def_cfa_register %rbp
> vmovaps (%rdi), %ymm0
> vaddps (%rsi), %ymm0, %ymm0
> vmovaps %ymm0, (%rdi)
> popq %rbp
> vzeroupper
> ret
> .Ltmp5:
> .size _fun1, .Ltmp5-_fun1
> .cfi_endproc
>
>
> .section ".note.GNU-stack","", at...
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...m1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1, %ymm1
> vpmovzxwd %xmm1, %ymm1
> vpmovsxwd %xmm0, %ymm0
> vpsravd %ymm1, %ymm0, %ymm0
> vpshufb %ymm2, %ymm0, %ymm0
> vpermq $232, %ymm0, %ymm0
> vzeroupper
>
>
> So this example may have won the bug lottery by exposing all of front-,
> middle-, back-end bugs. :)
>
>
>
> On Fri, Feb 17, 2017 a...
2020 Sep 01
2
Vector evolution?
...I do the following loop:
void fct7(float *x)
{
#pragma clang loop vectorize(enable)
for (int i = 0; i < 4 * 256; ++i)
x[i] = 7 * x[i];
}
It compiles it to:
00000000000001e0 <_Z4fct7Pf>:
1e0: 31 c0 xor %eax,%eax
1e2: c4 e2 7d 18 05 00 00 vbroadcastss 0x0(%rip),%ymm0 # 1eb
<_Z4fct7Pf+0xb>
1e9: 00 00
1eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1f0: c5 fc 59 0c 87 vmulps (%rdi,%rax,4),%ymm0,%ymm1
1f5: c5 fc 59 54 87 20 vmulps 0x20(%rdi,%rax,4),%ymm0,%ymm2
1fb: c5 fc 59 5c 87 40 vmulps 0x40(%rdi,%rax,4),%ymm0,%ymm3
201: c5 fc...
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...l to <8 x double> __svml_sin8(<8 x double>) after the vectorizer.
This is 8-element SVML sin() called with 8-element argument. On the surface, this looks very good.
Later on, standard vector type legalization kicks-in but only the argument and return data are legalized.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin8
vmovups %ymm1, 32(%r15,%r12,8)
vmovups %ymm0, (%r15,%r12,8)
Unfortunately, __svml_sin8() doesn't use this form of input/output. I...
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm0
vmovaps -176(%rbp), %ymm15
vmovaps -144(%rbp), %ymm0
vmovaps -240(%rbp), %ymm0
vmovaps -208(%rbp), %ymm0
vmovaps -272(%rbp),...
2012 Mar 01
2
[LLVMdev] Stack alignment in kernel
I'm running in AVX mode, but the stack before call to kernel is aligned to 16 bit.
Could you, please, tell me where it should be specified?
Thank you.
- Elena
---------------------------------------------------------------------
Intel Israel (74) Limited
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
return a >> n;
}
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>> vpshufb %ymm2, %ymm1, %ymm1
>>> vpermq $232, %ymm1, %ymm1
>>> vpmovzxwd %xmm1, %ymm1
>>> vpmovsxwd %xmm0, %ymm0
>>> vpsravd %ymm1, %ymm0, %ymm0
>>> vpshufb %ymm2, %ymm0, %ymm0
>>> vpermq $232, %ymm0, %ymm0
>>> vzeroupper
>>>
>>>
>>> So this example may have won the bug lottery by exposing all of front-,
>>> middl...
2013 Dec 12
0
[LLVMdev] AVX code gen
...t %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
xorl %eax, %eax
.align 4, 0x90
LBB0_1: ## %vector.body
## =>This Inner Loop Header: Depth=1
vmovups (%rdx,%rax,4), %ymm0
vmulps (%rsi,%rax,4), %ymm0, %ymm0
vaddps (%rdi,%rax,4), %ymm0, %ymm0
vmovups %ymm0, (%rdi,%rax,4)
addq $8, %rax
cmpq $256, %rax ## imm = 0x100
jne LBB0_1
## BB#2: ## %for.end
popq %rb...
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
...c.ll | grep movaps | grep
>> ymm | grep rbp
>> vmovaps -176(%rbp), %ymm14
>> vmovaps -144(%rbp), %ymm11
>> vmovaps -240(%rbp), %ymm13
>> vmovaps -208(%rbp), %ymm9
>> vmovaps -272(%rbp), %ymm7
>> vmovaps -304(%rbp), %ymm0
>> vmovaps -112(%rbp), %ymm0
>> vmovaps -80(%rbp), %ymm1
>> vmovaps -112(%rbp), %ymm0
>> vmovaps -80(%rbp), %ymm0
>> vmovaps -176(%rbp), %ymm15
>> vmovaps -144(%rbp), %ymm0
>> vmovaps -240(%rbp), %ymm0
>&g...
2013 Dec 11
2
[LLVMdev] AVX code gen
Hello -
I found this post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I can not get clang/llvm to generate such
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...l to <8 x double> __svml_sin8(<8 x double>) after the vectorizer.
This is 8-element SVML sin() called with 8-element argument. On the surface, this looks very good.
Later on, standard vector type legalization kicks-in but only the argument and return data are legalized.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin8
vmovups %ymm1, 32(%r15,%r12,8)
vmovups %ymm0, (%r15,%r12,8)
Unfortunately, __svml_sin8() doesn't use this form of input/output. I...
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
...# %entry
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
vmovaps %xmm7, -32(%rbp) # 16-byte Spill
vmovaps %xmm6, -16(%rbp) # 16-byte Spill
vmovaps %ymm3, %ymm6
vmovaps %ymm2, %ymm7
vaddps %ymm7, %ymm0, %ymm0
vaddps %ymm6, %ymm1, %ymm1
callq foo
vsubps %ymm7, %ymm0, %ymm0
vsubps %ymm6, %ymm1, %ymm1
vmovaps -16(%rbp), %xmm6 # 16-byte Reload
vmovaps -32(%rbp), %xmm7 # 16-byte Reload
addq $64, %rsp
popq %rbp...
2013 Jul 10
4
[LLVMdev] unaligned AVX store gets split into two instructions
...ure_instructions
.globl _vstore
.align 4, 0x90
_vstore: ## @vstore
.cfi_startproc
## BB#0: ## %entry
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
vmovups (%rdi), %ymm0
popq %rbp
ret
.cfi_endproc
----------------------------------------------------------------
Running llvm-33/bin/llc vstore.ll creates:
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main: ## @main...
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote:
>
> On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote:
>
>> I'll explain what we see in the code.
>> 1. The caller saves XMM registers across the call if needed (according to DEFS definition).
>> YMMs are not in the set, so caller does not take care.
>
> This is not how the register allocator
2016 May 06
3
Unnecessary spill/fill issue
...at compile time.
Each vector has a single use. In the final asm, I see a series of spills at
the top of the function of all the constant vectors immediately to stack,
then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4KB of stack us...