Displaying 20 results from an estimated 59 matches for "vmovaps".
2012 Mar 01
3
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0
vmovaps...
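basic.ll itself is not included in this result; a minimal sketch along these lines (hypothetical, not the poster's file) forces the same kind of ymm spill around a call when compiled with the command above:

declare void @clobber()

define <8 x float> @spill_ymm(<8 x float> %a, <8 x float> %b) {
entry:
  ; %a and %b are 256-bit values that must survive the call; with AVX they
  ; are caller-saved, and the thread reports that llc spills them with
  ; 32-byte-aligned vmovaps even though -stack-alignment=16 only
  ; guarantees 16-byte slots.
  call void @clobber()
  %sum = fadd <8 x float> %a, %b
  ret <8 x float> %sum
}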
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
vmovaps -176(%rbp), %ymm14
vmovaps -144(%rbp), %ymm11
vmovaps -240(%rbp), %ymm13
vmovaps -208(%rbp), %ymm9
vmovaps -272(%rbp), %ymm7
vmovaps -304(%rbp), %ymm0
vmovaps -112(%rbp), %ymm0
vmovaps -80(%rbp), %ymm1
vmovaps -112(%rbp), %ymm0...
2012 Mar 01
2
[LLVMdev] Stack alignment in kernel
I'm running in AVX mode, but the stack before the call to the kernel is aligned to 16 bytes.
Could you please tell me where it should be specified?
Thank you.
- Elena
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
>> ./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp
>> vmovaps -176(%rbp), %ymm14
>> vmovaps -144(%rbp), %ymm11
>> vmovaps -240(%rbp), %ymm13
>> vmovaps -208(%rbp), %ymm9
>> vmovaps -272(%rbp), %ymm7
>> vmovaps -304(%rbp), %ymm0
>> vmovaps -112(%rbp), %ymm0
>> vmovaps -8...
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps etc. when loading/storing from memory.
vmovaps - load takes 1 load µop, 3 cycles latency, with a reciprocal throughput of 0.5.
vmovaps - store takes 1 store µop and 1 load µop for address calculation, 3 cycles latency, with a reciprocal throughput of 1.
He does not list vmovsd, but movsd has the same stats as...
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael,
Thanks for the legwork!
It appears that the stats you listed are for movaps [SSE], not vmovaps
[AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
since they are both AVX instructions. Although, yes, I agree that this is
not clear from Agner's report. Please correct me if I am misunderstanding.
As I am sure you are aware, we cannot use SSE (movaps) instructions in an...
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
...q $63, %rcx
shrq $62, %rcx
addq %r8, %rcx
sarq $2, %rcx
movq %rax, %rdx
shlq $5, %rdx
leaq 16(%r9,%rdx), %rsi
orq $16, %rdx
movq 16(%rsp), %rdi
addq %rdx, %rdi
addq 8(%rsp), %rdx
.align 16, 0x90
.LBB0_1:
vmovaps -16(%rdx), %xmm0
vmovaps (%rdx), %xmm1
vmovaps -16(%rdi), %xmm2
vmovaps (%rdi), %xmm3
vmulps %xmm3, %xmm1, %xmm4
vmulps %xmm2, %xmm1, %xmm1
vfmadd213ss %xmm4, %xmm0, %xmm2
vfmsub213ss %xmm1, %xmm0, %xmm3
vmovaps %xmm2, -16(%rsi)...
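The IR that produced this loop is not part of the snippet; the packed operation one would expect here is presumably along the lines of the following <4 x float> fma (a hypothetical reconstruction), which is what makes the scalar vfmadd213ss/vfmsub213ss in the dump stand out as the suspected wrong code:

declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)

define <4 x float> @madd4(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
  ; a*b + c on all four lanes; a correct lowering would use a packed
  ; vfmadd...ps form, not the ...ss scalar form seen above
  %r = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b, <4 x float> %c)
  ret <4 x float> %r
}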
2020 Sep 01
2
Vector evolution?
..._f>:
160: 31 c0 xor %eax,%eax
162: c4 e2 79 18 05 00 00 vbroadcastss 0x0(%rip),%xmm0 # 16b <_Z4fct6PDv4_f+0xb>
169: 00 00
16b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
170: c5 f8 59 0c 07 vmulps (%rdi,%rax,1),%xmm0,%xmm1
175: c5 f8 29 0c 07 vmovaps %xmm1,(%rdi,%rax,1)
17a: c5 f8 59 4c 07 10 vmulps 0x10(%rdi,%rax,1),%xmm0,%xmm1
180: c5 f8 29 4c 07 10 vmovaps %xmm1,0x10(%rdi,%rax,1)
186: c5 f8 59 4c 07 20 vmulps 0x20(%rdi,%rax,1),%xmm0,%xmm1
18c: c5 f8 29 4c 07 20 vmovaps %xmm1,0x20(%rdi,%rax,1)
192: c5 f8 59 4c 07 30 vmulps...
2016 Jun 29
0
avx512 JIT backend generates wrong code on <4 x float>
...%rcx
> sarq $2, %rcx
> movq %rax, %rdx
> shlq $5, %rdx
> leaq 16(%r9,%rdx), %rsi
> orq $16, %rdx
> movq 16(%rsp), %rdi
> addq %rdx, %rdi
> addq 8(%rsp), %rdx
> .align 16, 0x90
> .LBB0_1:
> vmovaps -16(%rdx), %xmm0
> vmovaps (%rdx), %xmm1
> vmovaps -16(%rdi), %xmm2
> vmovaps (%rdi), %xmm3
> vmulps %xmm3, %xmm1, %xmm4
> vmulps %xmm2, %xmm1, %xmm1
> vfmadd213ss %xmm4, %xmm0, %xmm2
> vfmsub213ss %xmm1, %xmm0, %xmm3
...
2016 Jun 30
1
avx512 JIT backend generates wrong code on <4 x float>
...ovq %rax, %rdx
>> shlq $5, %rdx
>> leaq 16(%r9,%rdx), %rsi
>> orq $16, %rdx
>> movq 16(%rsp), %rdi
>> addq %rdx, %rdi
>> addq 8(%rsp), %rdx
>> .align 16, 0x90
>> .LBB0_1:
>> vmovaps -16(%rdx), %xmm0
>> vmovaps (%rdx), %xmm1
>> vmovaps -16(%rdi), %xmm2
>> vmovaps (%rdi), %xmm3
>> vmulps %xmm3, %xmm1, %xmm4
>> vmulps %xmm2, %xmm1, %xmm1
>> vfmadd213ss %xmm4, %xmm0, %xmm2
>> v...
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
...test;
.scl 2;
.type 32;
.endef
.text
.globl test
.align 16, 0x90
test: # @test
# BB#0: # %entry
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
vmovaps %xmm7, -32(%rbp) # 16-byte Spill
vmovaps %xmm6, -16(%rbp) # 16-byte Spill
vmovaps %ymm3, %ymm6
vmovaps %ymm2, %ymm7
vaddps %ymm7, %ymm0, %ymm0
vaddps %ymm6, %ymm1, %ymm1
callq foo
vsubps %ymm7, %ymm0, %ymm0
vsubps %...
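A sketch of IR consistent with this dump (not the original test case): two 256-bit values kept live across a call, which is what pushes ymm2/ymm3 into ymm6/ymm7 even though, on this COFF target, only the xmm halves of those registers are treated as preserved:

declare <8 x float> @foo(<8 x float>, <8 x float>)

define <8 x float> @test(<8 x float> %a, <8 x float> %b, <8 x float> %c, <8 x float> %d) {
entry:
  ; %c and %d survive the call to foo; the dump above keeps them in
  ; ymm6/ymm7, but only xmm6/xmm7 (the low 128 bits) are saved and restored
  %s0 = fadd <8 x float> %a, %c
  %s1 = fadd <8 x float> %b, %d
  %r = call <8 x float> @foo(<8 x float> %s0, <8 x float> %s1)
  %t0 = fsub <8 x float> %r, %c
  %t1 = fsub <8 x float> %t0, %d
  ret <8 x float> %t1
}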
2020 Aug 31
2
Vectorization of math function failed?
...iled with: clang++ -O3 -march=native -mtune=native -c -o vec.o vec.cc -lmvec -fno-math-errno
And here is what I get:
vec.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z4fct1Dv4_f>:
0: 48 83 ec 48 sub $0x48,%rsp
4: c5 f8 29 04 24 vmovaps %xmm0,(%rsp)
9: e8 00 00 00 00 callq e <_Z4fct1Dv4_f+0xe>
e: c5 f8 29 44 24 30 vmovaps %xmm0,0x30(%rsp)
14: c5 fa 16 04 24 vmovshdup (%rsp),%xmm0
19: e8 00 00 00 00 callq 1e <_Z4fct1Dv4_f+0x1e>
1e: c5 f8 29 44 24 20 vmovaps %xmm0,0x20(%rsp)
24:...
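vec.cc is not part of this excerpt; assuming (purely for illustration) that fct1 applies a libm function such as sinf to each lane, the scalarized form the disassembly corresponds to would look roughly like this in IR, whereas the poster expected a single call to a 4-wide libmvec entry point:

declare float @sinf(float)

define <4 x float> @fct1(<4 x float> %x) {
  ; one scalar libm call per lane -- matching the repeated callq /
  ; vmovshdup lane-extraction sequence in the dump above
  %x0 = extractelement <4 x float> %x, i32 0
  %x1 = extractelement <4 x float> %x, i32 1
  %x2 = extractelement <4 x float> %x, i32 2
  %x3 = extractelement <4 x float> %x, i32 3
  %s0 = call float @sinf(float %x0)
  %s1 = call float @sinf(float %x1)
  %s2 = call float @sinf(float %x2)
  %s3 = call float @sinf(float %x3)
  %v0 = insertelement <4 x float> undef, float %s0, i32 0
  %v1 = insertelement <4 x float> %v0, float %s1, i32 1
  %v2 = insertelement <4 x float> %v1, float %s2, i32 2
  %v3 = insertelement <4 x float> %v2, float %s3, i32 3
  ret <4 x float> %v3
}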
2012 Jul 27
3
[LLVMdev] X86 FMA4
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding.
You are misunderstanding [no worries, happens to everyone = )]. The timing...
2012 May 24
4
[LLVMdev] use AVX automatically if present
...un1: # @_fun1
.cfi_startproc
# BB#0: # %_L1
pushq %rbp
.Ltmp2:
.cfi_def_cfa_offset 16
.Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
.Ltmp4:
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %ymm0
vaddps (%rsi), %ymm0, %ymm0
vmovaps %ymm0, (%rdi)
popq %rbp
vzeroupper
ret
.Ltmp5:
.size _fun1, .Ltmp5-_fun1
.cfi_endproc
.section ".note.GNU-stack","", at progbits
I guess y...
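The source for _fun1 is cut off above, but the generated body is consistent with a simple 8-wide add through two pointers, roughly this IR (a sketch, not the poster's input):

define void @_fun1(<8 x float>* %p, <8 x float>* %q) {
  ; *p += *q on a 256-bit vector: with +avx this becomes the
  ; vmovaps / vaddps / vmovaps sequence shown above
  %a = load <8 x float>, <8 x float>* %p, align 32
  %b = load <8 x float>, <8 x float>* %q, align 32
  %s = fadd <8 x float> %a, %b
  store <8 x float> %s, <8 x float>* %p, align 32
  ret void
}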
2012 Jul 26
0
[LLVMdev] X86 FMA4
...
> >Let's look at the VFMADDSD pattern. We're operating on scalars with
> undefineds as the remaining vector elements of the operands. This sounds
> okay, but when one looks closer...
> >
> > vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647
> > vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill
> > vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647
> >
> >
> >The spill here is 16-bytes. But, we're only using the low 8-bytes of
> xmm3. Changing the intrinsics and patterns to accep...
2016 May 06
3
Unnecessary spill/fill issue
...constant vectors at compile time.
Each vector has a single use. In the final asm, I see all of the constant vectors spilled to the stack in a series at the top of the function, and then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4...
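A reduced sketch of the pattern (hypothetical; the real kernel has 128 such vectors and much higher register pressure): every constant vector has exactly one use, so folding or rematerializing the constant-pool load at that use would avoid the up-front load-and-spill described above.

define <8 x i32> @one_use_constants(<8 x i32> %x, <8 x i32> %y) {
  ; each vector constant below has a single use; the complaint is that such
  ; constants are loaded into ymm registers at the top of the function,
  ; spilled to the stack, and only then folded from the spill slot
  %a = add <8 x i32> %x, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %b = add <8 x i32> %y, <i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>
  %r = add <8 x i32> %a, %b
  ret <8 x i32> %r
}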
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote:
>
> On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote:
>
>> I'll explain what we see in the code.
>> 1. The caller saves XMM registers across the call if needed (according to DEFS definition).
>> YMMs are not in the set, so caller does not take care.
>
> This is not how the register allocator
2012 Jul 25
6
[LLVMdev] X86 FMA4
We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
Why is VFMADDSD4 defined with vector types? Is this simply because the
gcc intrinsic uses vector types? It's quite unnatural if you have a
compiler that generates FMAs as opposed to requiring user intrinsics.
-Dave
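For contrast, a plain scalar FMA in IR needs no vector types at all (a sketch using the generic llvm.fma intrinsic, not the FMA4 builtin under discussion):

declare double @llvm.fma.f64(double, double, double)

define double @scalar_fma(double %a, double %b, double %c) {
  ; a compiler-generated scalar FMA is naturally f64-typed; the question
  ; above is why the VFMADDSD4 pattern instead takes vector operands with
  ; undef upper elements
  %r = call double @llvm.fma.f64(double %a, double %b, double %c)
  ret double %r
}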
2012 May 24
0
[LLVMdev] use AVX automatically if present
..._fun1
> .cfi_startproc
> # BB#0: # %_L1
> pushq %rbp
> .Ltmp2:
> .cfi_def_cfa_offset 16
> .Ltmp3:
> .cfi_offset %rbp, -16
> movq %rsp, %rbp
> .Ltmp4:
> .cfi_def_cfa_register %rbp
> vmovaps (%rdi), %ymm0
> vaddps (%rsi), %ymm0, %ymm0
> vmovaps %ymm0, (%rdi)
> popq %rbp
> vzeroupper
> ret
> .Ltmp5:
> .size _fun1, .Ltmp5-_fun1
> .cfi_endproc
>
>
> .section ".note.GNU-stack","...
2015 Jul 14
4
[LLVMdev] Poor register allocation (constants causing spilling)
...rall performance improvement of 3%.
*** The Problem
Compile the attached testcase as follows:
llc -mcpu=btver2 test.ll
Examining the assembly in test.s we can see a constant is being loaded
into %xmm8 (second instruction in foo). Tracing the constant we can
see the following:
foo:
...
vmovaps .LCPI0_0(%rip), %xmm8 # xmm8 = [6.366197e-01,6.366197e-01,...]
...
vmulps %xmm8, %xmm0, %xmm1 # first use of constant
vmovaps %xmm8, %xmm9 # move constant into another register
...
vmovaps %xmm0, -40(%rsp) # 16-byte Spill
vmovaps %xmm9, %xmm0...
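The attached test.ll is not included in this result; a reduced sketch of the pattern (hypothetical) is a splat constant with several uses spread across a long live range, which ends up pinned in %xmm8 for the whole function instead of being rematerialized from .LCPI0_0 at each use:

define <4 x float> @uses_two_over_pi(<4 x float> %a, <4 x float> %b) {
  ; 0x3FE45F3060000000 is ~0.6366197 (2/pi) as a float splat, matching the
  ; constant shown above; each use could reload it from the constant pool
  %m0 = fmul <4 x float> %a, <float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000>
  %m1 = fmul <4 x float> %b, <float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000, float 0x3FE45F3060000000>
  %s = fadd <4 x float> %m0, %m1
  ret <4 x float> %s
}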