Displaying 20 results from an estimated 59 matches for "vmovaps".
2012 Mar 01
3
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0
vmovaps...
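basic.ll itself is not included in this result; a minimal sketch along these lines (hypothetical, not the poster's file) forces the same kind of ymm spill around a call when compiled with the command above:

declare void @clobber()

define <8 x float> @spill_ymm(<8 x float> %a, <8 x float> %b) {
entry:
  ; %a and %b are 256-bit values that must survive the call; with AVX they
  ; are caller-saved, and the thread reports that llc spills them with
  ; 32-byte-aligned vmovaps even though -stack-alignment=16 only
  ; guarantees 16-byte slots.
  call void @clobber()
  %sum = fadd <8 x float> %a, %b
  ret <8 x float> %sum
}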
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0...
2012 Mar 01
2
[LLVMdev] Stack alignment in kernel
I'm running in AVX mode, but the stack before the call to the kernel is aligned to 16 bytes.
Could you please tell me where it should be specified?
Thank you.
- Elena
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
>> ./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
>> vmovaps -176(%rbp), %ymm14
>> vmovaps -144(%rbp), %ymm11
>> vmovaps -240(%rbp), %ymm13
>> vmovaps -208(%rbp), %ymm9
>> vmovaps -272(%rbp), %ymm7
>> vmovaps -304(%rbp), %ymm0
>> vmovaps -112(%rbp), %ymm0
>> vmovaps -8...
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps etc. when loading/storing from memory.
vmovaps - load takes 1 load µop, 3 cycles latency, with a reciprocal throughput of 0.5.
vmovaps - store takes 1 store µop and 1 load µop for address calculation, 3 cycles latency, with a reciprocal throughput of 1.
He does not list vmovsd, but movsd has the same stats as...
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael,
Thanks for the legwork!
It appears that the stats you listed are for movaps [SSE], not vmovaps
[AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
since they are both AVX instructions. Although, yes, I agree that this is
not clear from Agner's report. Please correct me if I am misunderstanding.
As I am sure you are aware, we cannot use SSE (movaps) instructions in an...
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
...q $63, %rcx
shrq $62, %rcx
addq %r8, %rcx
sarq $2, %rcx
movq %rax, %rdx
shlq $5, %rdx
leaq 16(%r9,%rdx), %rsi
orq $16, %rdx
movq 16(%rsp), %rdi
addq %rdx, %rdi
addq 8(%rsp), %rdx
.align 16, 0x90
.LBB0_1:
vmovaps -16(%rdx), %xmm0
vmovaps (%rdx), %xmm1
vmovaps -16(%rdi), %xmm2
vmovaps (%rdi), %xmm3
vmulps %xmm3, %xmm1, %xmm4
vmulps %xmm2, %xmm1, %xmm1
vfmadd213ss %xmm4, %xmm0, %xmm2
vfmsub213ss %xmm1, %xmm0, %xmm3
vmovaps %xmm2, -16(%rsi)...
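The IR that produced this loop is not part of the snippet; the packed operation one would expect here is presumably along the lines of the following <4 x float> fma (a hypothetical reconstruction), which is what makes the scalar vfmadd213ss/vfmsub213ss in the dump stand out as the suspected wrong code:

declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)

define <4 x float> @madd4(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
  ; a*b + c on all four lanes; a correct lowering would use a packed
  ; vfmadd...ps form, not the ...ss scalar form seen above
  %r = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b, <4 x float> %c)
  ret <4 x float> %r
}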
2020 Sep 01
2
Vector evolution?
..._f>:
160: 31 c0 xor %eax,%eax
162: c4 e2 79 18 05 00 00 vbroadcastss 0x0(%rip),%xmm0 # 16b <_Z4fct6PDv4_f+0xb>
169: 00 00
16b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
170: c5 f8 59 0c 07 vmulps (%rdi,%rax,1),%xmm0,%xmm1
175: c5 f8 29 0c 07 vmovaps %xmm1,(%rdi,%rax,1)
17a: c5 f8 59 4c 07 10 vmulps 0x10(%rdi,%rax,1),%xmm0,%xmm1
180: c5 f8 29 4c 07 10 vmovaps %xmm1,0x10(%rdi,%rax,1)
186: c5 f8 59 4c 07 20 vmulps 0x20(%rdi,%rax,1),%xmm0,%xmm1
18c: c5 f8 29 4c 07 20 vmovaps %xmm1,0x20(%rdi,%rax,1)
192: c5 f8 59 4c 07 30 vmulps...
2016 Jun 29
0
avx512 JIT backend generates wrong code on <4 x float>
...%rcx
> sarq $2, %rcx
> movq %rax, %rdx
> shlq $5, %rdx
> leaq 16(%r9,%rdx), %rsi
> orq $16, %rdx
> movq 16(%rsp), %rdi
> addq %rdx, %rdi
> addq 8(%rsp), %rdx
> .align 16, 0x90
> .LBB0_1:
> vmovaps -16(%rdx), %xmm0
> vmovaps (%rdx), %xmm1
> vmovaps -16(%rdi), %xmm2
> vmovaps (%rdi), %xmm3
> vmulps %xmm3, %xmm1, %xmm4
> vmulps %xmm2, %xmm1, %xmm1
> vfmadd213ss %xmm4, %xmm0, %xmm2
> vfmsub213ss %xmm1, %xmm0, %xmm3
...
2016 Jun 30
1
avx512 JIT backend generates wrong code on <4 x float>
...ovq %rax, %rdx
>> shlq $5, %rdx
>> leaq 16(%r9,%rdx), %rsi
>> orq $16, %rdx
>> movq 16(%rsp), %rdi
>> addq %rdx, %rdi
>> addq 8(%rsp), %rdx
>> .align 16, 0x90
>> .LBB0_1:
>> vmovaps -16(%rdx), %xmm0
>> vmovaps (%rdx), %xmm1
>> vmovaps -16(%rdi), %xmm2
>> vmovaps (%rdi), %xmm3
>> vmulps %xmm3, %xmm1, %xmm4
>> vmulps %xmm2, %xmm1, %xmm1
>> vfmadd213ss %xmm4, %xmm0, %xmm2
>> v...
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
...test;
.scl 2;
.type 32;
.endef
.text
.globl test
.align 16, 0x90
test: # @test
# BB#0: # %entry
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
vmovaps %xmm7, -32(%rbp) # 16-byte Spill
vmovaps %xmm6, -16(%rbp) # 16-byte Spill
vmovaps %ymm3, %ymm6
vmovaps %ymm2, %ymm7
vaddps %ymm7, %ymm0, %ymm0
vaddps %ymm6, %ymm1, %ymm1
callq foo
vsubps %ymm7, %ymm0, %ymm0
vsubps %...
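A sketch of IR consistent with this dump (not the original test case): two 256-bit values kept live across a call, which is what pushes ymm2/ymm3 into ymm6/ymm7 even though, on this COFF target, only the xmm halves of those registers are treated as preserved:

declare <8 x float> @foo(<8 x float>, <8 x float>)

define <8 x float> @test(<8 x float> %a, <8 x float> %b, <8 x float> %c, <8 x float> %d) {
entry:
  ; %c and %d survive the call to foo; the dump above keeps them in
  ; ymm6/ymm7, but only xmm6/xmm7 (the low 128 bits) are saved and restored
  %s0 = fadd <8 x float> %a, %c
  %s1 = fadd <8 x float> %b, %d
  %r = call <8 x float> @foo(<8 x float> %s0, <8 x float> %s1)
  %t0 = fsub <8 x float> %r, %c
  %t1 = fsub <8 x float> %t0, %d
  ret <8 x float> %t1
}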
2020 Aug 31
2
Vectorization of math function failed?
...iled with: clang++ -O3 -march=native -mtune=native -c -o vec.o vec.cc -lmvec -fno-math-errno
And here is what I get:
vec.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z4fct1Dv4_f>:
0: 48 83 ec 48 sub $0x48,%rsp
4: c5 f8 29 04 24 vmovaps %xmm0,(%rsp)
9: e8 00 00 00 00 callq e <_Z4fct1Dv4_f+0xe>
e: c5 f8 29 44 24 30 vmovaps %xmm0,0x30(%rsp)
14: c5 fa 16 04 24 vmovshdup (%rsp),%xmm0
19: e8 00 00 00 00 callq 1e <_Z4fct1Dv4_f+0x1e>
1e: c5 f8 29 44 24 20 vmovaps %xmm0,0x20(%rsp)
24:...
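vec.cc is not part of this excerpt; assuming (purely for illustration) that fct1 applies a libm function such as sinf to each lane, the scalarized form the disassembly corresponds to would look roughly like this in IR, whereas the poster expected a single call to a 4-wide libmvec entry point:

declare float @sinf(float)

define <4 x float> @fct1(<4 x float> %x) {
  ; one scalar libm call per lane -- matching the repeated callq /
  ; vmovshdup lane-extraction sequence in the dump above
  %x0 = extractelement <4 x float> %x, i32 0
  %x1 = extractelement <4 x float> %x, i32 1
  %x2 = extractelement <4 x float> %x, i32 2
  %x3 = extractelement <4 x float> %x, i32 3
  %s0 = call float @sinf(float %x0)
  %s1 = call float @sinf(float %x1)
  %s2 = call float @sinf(float %x2)
  %s3 = call float @sinf(float %x3)
  %v0 = insertelement <4 x float> undef, float %s0, i32 0
  %v1 = insertelement <4 x float> %v0, float %s1, i32 1
  %v2 = insertelement <4 x float> %v1, float %s2, i32 2
  %v3 = insertelement <4 x float> %v2, float %s3, i32 3
  ret <4 x float> %v3
}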
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding.
You are misunderstanding [no worries, happens to everyone = )]. The timing...
2012 May 24
4
[LLVMdev] use AVX automatically if present
...un1: # @_fun1
.cfi_startproc
# BB#0: # %_L1
pushq %rbp
.Ltmp2:
.cfi_def_cfa_offset 16
.Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
.Ltmp4:
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %ymm0
vaddps (%rsi), %ymm0, %ymm0
vmovaps %ymm0, (%rdi)
popq %rbp
vzeroupper
ret
.Ltmp5:
.size _fun1, .Ltmp5-_fun1
.cfi_endproc
.section ".note.GNU-stack","", at progbits
I guess y...
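The source for _fun1 is cut off above, but the generated body is consistent with a simple 8-wide add through two pointers, roughly this IR (a sketch, not the poster's input):

define void @_fun1(<8 x float>* %p, <8 x float>* %q) {
  ; *p += *q on a 256-bit vector: with +avx this becomes the
  ; vmovaps / vaddps / vmovaps sequence shown above
  %a = load <8 x float>, <8 x float>* %p, align 32
  %b = load <8 x float>, <8 x float>* %q, align 32
  %s = fadd <8 x float> %a, %b
  store <8 x float> %s, <8 x float>* %p, align 32
  ret void
}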
2012 Jul 26
0
[LLVMdev] X86 FMA4
...
> >Let's look at the VFMADDSD pattern. We're operating on scalars with
> undefineds as the remaining vector elements of the operands. This sounds
> okay, but when one looks closer...
> >
> > vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647
> > vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill
> > vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647
> >
> >
> >The spill here is 16-bytes. But, we're only using the low 8-bytes of
> xmm3. Changing the intrinsics and patterns to accep...
2016 May 06
3
Unnecessary spill/fill issue
...constant vectors at compile time.
Each vector has a single use. In the final asm, I see all of the constant vectors spilled to the stack in a series at the top of the function, and then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4...
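A reduced sketch of the pattern (hypothetical; the real kernel has 128 such vectors and much higher register pressure): every constant vector has exactly one use, so folding or rematerializing the constant-pool load at that use would avoid the up-front load-and-spill described above.

define <8 x i32> @one_use_constants(<8 x i32> %x, <8 x i32> %y) {
  ; each vector constant below has a single use; the complaint is that such
  ; constants are loaded into ymm registers at the top of the function,
  ; spilled to the stack, and only then folded from the spill slot
  %a = add <8 x i32> %x, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %b = add <8 x i32> %y, <i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>
  %r = add <8 x i32> %a, %b
  ret <8 x i32> %r
}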
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote:
>
> On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote:
>
>> I'll explain what we see in the code.
>> 1. The caller saves XMM registers across the call if needed (according to DEFS definition).
>> YMMs are not in the set, so caller does not take care.
>
> This is not how the register allocator
2012 Jul 25
6
[LLVMdev] X86 FMA4
We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
Why is VFMADDSD4 defined with vector types? Is this simply because the
gcc intrinsic uses vector types? It's quite unnatural if you have a
compiler that generates FMAs as opposed to requiring user intrinsics.
-Dave
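For contrast, a plain scalar FMA in IR needs no vector types at all (a sketch using the generic llvm.fma intrinsic, not the FMA4 builtin under discussion):

declare double @llvm.fma.f64(double, double, double)

define double @scalar_fma(double %a, double %b, double %c) {
  ; a compiler-generated scalar FMA is naturally f64-typed; the question
  ; above is why the VFMADDSD4 pattern instead takes vector operands with
  ; undef upper elements
  %r = call double @llvm.fma.f64(double %a, double %b, double %c)
  ret double %r
}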
2012 May 24
0
[LLVMdev] use AVX automatically if present
..._fun1
> .cfi_startproc
> # BB#0: # %_L1
> pushq %rbp
> .Ltmp2:
> .cfi_def_cfa_offset 16
> .Ltmp3:
> .cfi_offset %rbp, -16
> movq %rsp, %rbp
> .Ltmp4:
> .cfi_def_cfa_register %rbp
> vmovaps (%rdi), %ymm0
> vaddps (%rsi), %ymm0, %ymm0
> vmovaps %ymm0, (%rdi)
> popq %rbp
> vzeroupper
> ret
> .Ltmp5:
> .size _fun1, .Ltmp5-_fun1
> .cfi_endproc
>
>
> .section ".note.GNU-stack","...
2015 Jul 14
4
[LLVMdev] Poor register allocation (constants causing spilling)
...rall performance improvement of 3%.
*** The Problem
Compile the attached testcase as follows:
llc -mcpu=btver2 test.ll
Examining the assembly in test.s we can see a constant is being loaded
into %xmm8 (second instruction in foo). Tracing the constant we can
see the following:
foo:
...
vmovaps .LCPI0_0(%rip), %xmm8 # xmm8 = [6.366197e-01,6.366197e-01,...]
...
vmulps %xmm8, %xmm0, %xmm1 # first use of constant
vmovaps %xmm8, %xmm9 # move constant into another register
...
vmovaps %xmm0, -40(%rsp) # 16-byte Spill
vmovaps %xmm9, %xmm0...
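The attached test.ll is not included in this result; a reduced sketch of the pattern (hypothetical) is a splat constant with several uses spread across a long live range, which ends up pinned in %xmm8 for the whole function instead of being rematerialized from .LCPI0_0 at each use:

define <4 x float> @uses_two_over_pi(<4 x float> %a, <4 x float> %b) {
  ; 0x3FE45F3060000000 is ~0.6366197 (2/pi) as a float splat, matching the
  ; constant shown above; each use could reload it from the constant pool
  %m0 = fmul <4 x float> %a, <float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000>
  %m1 = fmul <4 x float> %b, <float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000>
  %s = fadd <4 x float> %m0, %m1
  ret <4 x float> %s
}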