search for: vmovup

Displaying 20 results from an estimated 37 matches for "vmovup".

2020 Sep 01
2
Vector evolution?
...0x0(%rax,%rax,1)
 1f0: c5 fc 59 0c 87          vmulps  (%rdi,%rax,4),%ymm0,%ymm1
 1f5: c5 fc 59 54 87 20       vmulps  0x20(%rdi,%rax,4),%ymm0,%ymm2
 1fb: c5 fc 59 5c 87 40       vmulps  0x40(%rdi,%rax,4),%ymm0,%ymm3
 201: c5 fc 59 64 87 60       vmulps  0x60(%rdi,%rax,4),%ymm0,%ymm4
 207: c5 fc 11 0c 87          vmovups %ymm1,(%rdi,%rax,4)
 20c: c5 fc 11 54 87 20       vmovups %ymm2,0x20(%rdi,%rax,4)
 212: c5 fc 11 5c 87 40       vmovups %ymm3,0x40(%rdi,%rax,4)
 218: c5 fc 11 64 87 60       vmovups %ymm4,0x60(%rdi,%rax,4)
 21e: c5 fc 59 8c 87 80 00    vmulps  0x80(%rdi,%rax,4),%ymm0,%ymm1
 225: 00 00
 227: c5 fc 59 94 87 a0 0...
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...ks very good. Later on, standard vector type legalization kicks in, but only the argument and return data are legalized.

    vmovaps      %ymm0, %ymm1
    vcvtdq2pd    %xmm1, %ymm0
    vextractf128 $1, %ymm1, %xmm1
    vcvtdq2pd    %xmm1, %ymm1
    callq        __svml_sin8
    vmovups      %ymm1, 32(%r15,%r12,8)
    vmovups      %ymm0, (%r15,%r12,8)

Unfortunately, __svml_sin8() doesn't use this form of input/output: it takes zmm0 and returns zmm0, i.e., it is not legal to use for AVX. What we need to see instead is two calls to __svml_sin4(), like below.

    vmovaps %ymm0, %ymm1...
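For reference, a minimal C sketch of the kind of loop this pattern comes from (the function name and the -fveclib=SVML flag are assumptions, not taken from the thread): an int-to-double conversion feeding sin(), which the vectorizer can turn into __svml_sin* calls when an SVML-style vector math library is enabled.

    #include <math.h>

    /* Hypothetical sketch: with the vectorizer allowed to use SVML
     * (e.g. clang -O3 -mavx -fveclib=SVML, assumed flags), this loop can
     * lower to vcvtdq2pd plus a call to an __svml_sin* variant, matching
     * the instruction sequence quoted above. */
    void sin_of_ints(double *out, const int *in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = sin((double)in[i]);
    }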
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...ks very good. Later on, standard vector type legalization kicks in, but only the argument and return data are legalized.

    vmovaps      %ymm0, %ymm1
    vcvtdq2pd    %xmm1, %ymm0
    vextractf128 $1, %ymm1, %xmm1
    vcvtdq2pd    %xmm1, %ymm1
    callq        __svml_sin8
    vmovups      %ymm1, 32(%r15,%r12,8)
    vmovups      %ymm0, (%r15,%r12,8)

Unfortunately, __svml_sin8() doesn't use this form of input/output: it takes zmm0 and returns zmm0, i.e., it is not legal to use for AVX. What we need to see instead is two calls to __svml_sin4(), like below.

    vmovaps %ymm0, %ymm1...
2013 Jul 10
4
[LLVMdev] unaligned AVX store gets split into two instructions
...,__text,regular,pure_instructions
    .globl  _vstore
    .align  4, 0x90
_vstore:                                ## @vstore
    .cfi_startproc
## BB#0:                                ## %entry
    pushq   %rbp
Ltmp2:
    .cfi_def_cfa_offset 16
Ltmp3:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp4:
    .cfi_def_cfa_register %rbp
    vmovups (%rdi), %ymm0
    popq    %rbp
    ret
    .cfi_endproc

----------------------------------------------------------------
Running llvm-33/bin/llc vstore.ll creates:

    .section    __TEXT,__text,regular,pure_instructions
    .globl  _main
    .align  4, 0x90
_main:...
2018 Jul 02
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...legalized.
>>
>>         vmovaps %ymm0, %ymm1
>>         vcvtdq2pd %xmm1, %ymm0
>>         vextractf128 $1, %ymm1, %xmm1
>>         vcvtdq2pd %xmm1, %ymm1
>>         callq __svml_sin8
>>         vmovups %ymm1, 32(%r15,%r12,8)
>>         vmovups %ymm0, (%r15,%r12,8)
>>
>> Unfortunately, __svml_sin8() doesn’t use this form of input/output. It
>> takes zmm0 and returns zmm0.
>>
>> i.e., not legal to use for AVX.
>>
>> What we...
2015 Jul 01
3
[LLVMdev] SLP vectorizer on AVX feature
...he IR to machine code. However, the generated assembly doesn't seem to
>> support this assumption :-(
>>
>> main:
>>     .cfi_startproc
>>     xorl %eax, %eax
>>     xorl %esi, %esi
>>     .align 16, 0x90
>> .LBB0_1:
>>     vmovups (%r8,%rax), %xmm0
>>     vaddps (%rcx,%rax), %xmm0, %xmm0
>>     vmovups %xmm0, (%rdx,%rax)
>>     addq $4, %rsi
>>     addq $16, %rax
>>     cmpq $61, %rsi
>>     jb .LBB0_1
>>     retq
>>
>> I played with -mcpu and -march...
2018 Jul 02
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...n but only the argument
> and return data are legalized.
>
>         vmovaps %ymm0, %ymm1
>         vcvtdq2pd %xmm1, %ymm0
>         vextractf128 $1, %ymm1, %xmm1
>         vcvtdq2pd %xmm1, %ymm1
>         callq __svml_sin8
>         vmovups %ymm1, 32(%r15,%r12,8)
>         vmovups %ymm0, (%r15,%r12,8)
>
> Unfortunately, __svml_sin8() doesn’t use this form of input/output. It
> takes zmm0 and returns zmm0.
>
> i.e., not legal to use for AVX.
>
> What we need to see instead is two calls to __svml_...
2015 Jul 01
3
[LLVMdev] SLP vectorizer on AVX feature
...n lowering the IR to machine code. However, the generated assembly doesn't seem to support this assumption :-(
>>
>> main:
>>     .cfi_startproc
>>     xorl %eax, %eax
>>     xorl %esi, %esi
>>     .align 16, 0x90
>> .LBB0_1:
>>     vmovups (%r8,%rax), %xmm0
>>     vaddps (%rcx,%rax), %xmm0, %xmm0
>>     vmovups %xmm0, (%rdx,%rax)
>>     addq $4, %rsi
>>     addq $16, %rax
>>     cmpq $61, %rsi
>>     jb .LBB0_1
>>     retq
>>
>> I played with -mcpu and -march switc...
2013 Dec 12
0
[LLVMdev] AVX code gen
...tmp3:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp4:
    .cfi_def_cfa_register %rbp
    xorl    %eax, %eax
    .align  4, 0x90
LBB0_1:                                 ## %vector.body
                                        ## =>This Inner Loop Header: Depth=1
    vmovups (%rdx,%rax,4), %ymm0
    vmulps  (%rsi,%rax,4), %ymm0, %ymm0
    vaddps  (%rdi,%rax,4), %ymm0, %ymm0
    vmovups %ymm0, (%rdi,%rax,4)
    addq    $8, %rax
    cmpq    $256, %rax              ## imm = 0x100
    jne     LBB0_1
## BB#2:                                ## %for.e...
2013 Nov 16
1
[LLVMdev] Limit loop vectorizer to SSE
...would emit = load <8 x i32> (which has the semantics of “= load <8 x i32>, align 0”, meaning the address is aligned to the target ABI alignment; see http://llvm.org/docs/LangRef.html#load-instruction). When the backend generates code for the former it will emit an unaligned move:

    = vmovups ...

whereas for the latter it will use an aligned move:

    = vmovaps …

vmovups can load from unaligned addresses while vmovaps cannot. No, we currently don’t peel loops for alignment.

Best,
Arnold

On Nov 15, 2013, at 7:23 PM, Frank Winter <fwinter at jlab.org> wrote:
> I confirm that r1...
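The same aligned/unaligned distinction can be seen at the source level with the AVX intrinsics; the sketch below is illustrative and not taken from the thread.

    #include <immintrin.h>

    /* Sketch: the unaligned-load intrinsic maps to vmovups and accepts any
     * address, while the aligned-load intrinsic maps to vmovaps and requires
     * p to be 32-byte aligned. */
    __m256 load_any(const float *p)     { return _mm256_loadu_ps(p); } /* vmovups */
    __m256 load_aligned(const float *p) { return _mm256_load_ps(p);  } /* vmovaps */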
2018 Jul 02
8
[RFC][VECLIB] how should we legalize VECLIB calls?
... vmovaps %ymm0, %ymm1
> > vcvtdq2pd %xmm1, %ymm0
> > vextractf128 $1, %ymm1, %xmm1
> > vcvtdq2pd %xmm1, %ymm1
> > callq __svml_sin8
> > vmovups %ymm1, 32(%r15,%r12,8)
> > vmovups %ymm0, (%r15,%r12,8)
> >
> > Unfortunately, __svml_sin8() doesn’t use this form of
> > input/output. It takes zmm0 and returns zmm0.
> >
> > i.e., not legal to use for AVX.
> > ...
2013 Dec 11
2
[LLVMdev] AVX code gen
Hello - I found this post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang/llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2, for example) but I cannot get clang/llvm to generate such
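A minimal sketch of a loop that usually triggers this (the file name and flags are assumptions, not from the message): built with something like clang -O3 -mavx -S scale.c, the loop vectorizer will typically emit 256-bit vmulps/vmovups on %ymm registers.

    /* Illustrative only; compile flags assumed (e.g. clang -O3 -mavx). */
    void scale(float *restrict a, float s, int n) {
        for (int i = 0; i < n; ++i)
            a[i] *= s;   /* vectorizes to vmulps/vmovups on %ymm with AVX */
    }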
2015 Jul 01
3
[LLVMdev] SLP vectorizer on AVX feature
...oesn't. Then I thought that maybe the YMM registers get used when lowering the IR to machine code. However, the generated assembly doesn't seem to support this assumption :-(

main:
    .cfi_startproc
    xorl    %eax, %eax
    xorl    %esi, %esi
    .align  16, 0x90
.LBB0_1:
    vmovups (%r8,%rax), %xmm0
    vaddps  (%rcx,%rax), %xmm0, %xmm0
    vmovups %xmm0, (%rdx,%rax)
    addq    $4, %rsi
    addq    $16, %rax
    cmpq    $61, %rsi
    jb      .LBB0_1
    retq

I played with -mcpu and -march switches without success. In any case, the target architecture should b...
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
When the stack is unaligned, LLVM should generate vmovups instead of vmovaps.

- Elena

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Joerg Sonnenberger
Sent: Thursday, March 01, 2012 20:31
To: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Stack alignment on X86 AVX seems incorrect
O...
2013 Sep 27
2
[LLVMdev] Trip count and Loop Vectorizer
...n the outer loop is unrolled since the trip count is constant (4). The 4 calls to memcpy are not efficient. * Therefore, I disabled the memcpy optimization for such cases, and found that the LLVM LoopVectorizer successfully vectorizes and unrolls the inner loop. However, in order to take the fast path (vmovups) it must copy at least 32 ints, whereas in this case we only do an 8-int copy. ** Upon closer look, LoopVectorizer obtains the TripCount for the inner loop using getSmallConstantTripCount(Loop,...). This value is 0 for a loop with an unknown trip count. Loop unrolling is disabled when TC > 0. Sho...
2012 Mar 01
2
[LLVMdev] Stack alignment on X86 AVX seems incorrect
On Thu, Mar 01, 2012 at 06:16:46PM +0000, Demikhovsky, Elena wrote: > vmovaps should not access stack if it is not aligned to 32 I'm not completely sure I understand your problem. Are you saying that the generated code assumes 256bit alignment, your default stack alignment is 128bit and LLVM doesn't adjust it automatically? Joerg
2019 Sep 02
3
AVX2 codegen - question reg. FMA generation
...VX2 FMA instructions. Here's the snippet in the output it generates:

$ llc -O3 -mcpu=skylake
---------------------
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
    vbroadcastss (%rsi,%rdx,4), %ymm0
    vmulps  (%rdi,%rcx), %ymm0, %ymm0
    vaddps  (%rax,%rcx), %ymm0, %ymm0
    vmovups %ymm0, (%rax,%rcx)
    incq    %rdx
    addq    $32, %rcx
    cmpq    $15, %rdx
    jle     .LBB0_2
-----------------------

$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake
(llvm commit 198009ae8db11d7c0b0517f17358870dc486fcfb from...
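A hedged sketch of what usually makes the difference (the flags are an assumption, not from the thread): the backend only forms vfmadd* when the incoming IR permits contraction (llvm.fmuladd or fast-math/contract flags), which a C frontend can be asked to produce with, for example, -ffp-contract=fast.

    /* Illustrative only; built with e.g. clang -O3 -march=skylake -ffp-contract=fast
     * (assumed flags), the multiply and add below are normally contracted into a
     * single vfmadd* instruction instead of separate vmulps/vaddps. */
    void muladd(float *restrict y, const float *restrict a,
                const float *restrict b, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a[i] * b[i];
    }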
2015 Aug 28
6
Aligned vector spills and variably sized stack frames
...pile things which need to spill vector registers. This is actually what we do today and has worked out fairly well in practice. This is what I'm hoping to move away from. Option 3 - Add an option in the x86 backend to not require aligned spill slots for AVX2 registers. In particular, the VMOVUPS instruction can be used to spill vector registers into an 8 or 16 byte aligned spill slot and not require dynamic frame realignment. This seems like it might be useful in other context as well, but I can't name any at the moment. One thing that occurs to me is that many spills are down rar...
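To illustrate the property Option 3 relies on, here is an intrinsics-level sketch (not the proposed backend change itself): a 256-bit unaligned store is fine with a destination that is only 8- or 16-byte aligned, which is why a ymm spill slot need not force dynamic frame realignment.

    #include <immintrin.h>

    /* Sketch: _mm256_storeu_ps lowers to vmovups, which does not require a
     * 32-byte-aligned destination, so an ordinary 16-byte-aligned stack slot
     * is enough to hold the spilled %ymm value. */
    void save_ymm(__m256 v, float *slot /* assumed only 16-byte aligned */) {
        _mm256_storeu_ps(slot, v);
    }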
2015 Jan 29
2
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...%xmm0 ## xmm0
> = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
> vmovdqu %xmm0, 0x20(%rax)
> turning into:
> vshufps $0x2, %xmm5, %xmm0, %xmm0   ## xmm0 = xmm0[2,0],xmm5[0,0]
> vshufps $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
> vmovups %xmm0, 0x20(%rax)

All of these stem from what I think is the same core weakness of the current algorithm: we prefer the fully general shufps+shufps 4-way shuffle/blend far too often. Here is how I would more precisely classify the two things missing here:
- Check if either inputs are...
2011 Nov 30
0
[PATCH 2/4] x86/emulator: add emulation of SIMD FP moves
...*/
+               /* vmovap{s,d} ymm/m256,ymm */
+    case 0x29: /* {,v}movap{s,d} xmm,xmm/m128 */
+               /* vmovap{s,d} ymm,ymm/m256 */
+        fail_if(vex.pfx & VEX_PREFIX_SCALAR_MASK);
+        /* fall through */
+    case 0x10: /* {,v}movup{s,d} xmm/m128,xmm */
+               /* vmovup{s,d} ymm/m256,ymm */
+               /* {,v}movss xmm/m32,xmm */
+               /* {,v}movsd xmm/m64,xmm */
+    case 0x11: /* {,v}movup{s,d} xmm,xmm/m128 */
+               /* vmovup{s,d} ymm,ymm/m256 */
+               /* {,v}movss xmm,xmm/m32 */
+               /* {,v}movsd xmm,xmm/m64 */
+...