Displaying 20 results from an estimated 37 matches for "vmovup".
2020 Sep 01
2
Vector evolution?
...0x0(%rax,%rax,1)
1f0: c5 fc 59 0c 87 vmulps (%rdi,%rax,4),%ymm0,%ymm1
1f5: c5 fc 59 54 87 20 vmulps 0x20(%rdi,%rax,4),%ymm0,%ymm2
1fb: c5 fc 59 5c 87 40 vmulps 0x40(%rdi,%rax,4),%ymm0,%ymm3
201: c5 fc 59 64 87 60 vmulps 0x60(%rdi,%rax,4),%ymm0,%ymm4
207: c5 fc 11 0c 87 vmovups %ymm1,(%rdi,%rax,4)
20c: c5 fc 11 54 87 20 vmovups %ymm2,0x20(%rdi,%rax,4)
212: c5 fc 11 5c 87 40 vmovups %ymm3,0x40(%rdi,%rax,4)
218: c5 fc 11 64 87 60 vmovups %ymm4,0x60(%rdi,%rax,4)
21e: c5 fc 59 8c 87 80 00 vmulps 0x80(%rdi,%rax,4),%ymm0,%ymm1
225: 00 00
227: c5 fc 59 94 87 a0 0...
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...ks very good.
Later on, standard vector type legalization kicks in, but only the argument and return data are legalized.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin8
vmovups %ymm1, 32(%r15,%r12,8)
vmovups %ymm0, (%r15,%r12,8)
Unfortunately, __svml_sin8() doesn't use this form of input/output. It takes zmm0 and returns zmm0.
i.e., not legal to use for AVX.
What we need to see instead is two calls to __svml_sin4(), like below.
vmovaps %ymm0, %ymm1...
2013 Jul 10
4
[LLVMdev] unaligned AVX store gets split into two instructions
...,__text,regular,pure_instructions
.globl _vstore
.align 4, 0x90
_vstore: ## @vstore
.cfi_startproc
## BB#0: ## %entry
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
vmovups (%rdi), %ymm0
popq %rbp
ret
.cfi_endproc
----------------------------------------------------------------
Running llvm-33/bin/llc vstore.ll creates:
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main:...
2018 Jul 02
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...legalized.
>> vmovaps %ymm0, %ymm1
>> vcvtdq2pd %xmm1, %ymm0
>> vextractf128 $1, %ymm1, %xmm1
>> vcvtdq2pd %xmm1, %ymm1
>> callq __svml_sin8
>> vmovups %ymm1, 32(%r15,%r12,8)
>> vmovups %ymm0, (%r15,%r12,8)
>>
>> Unfortunately, __svml_sin8() doesn't use this form of input/output. It
>> takes zmm0 and returns zmm0.
>> i.e., not legal to use for AVX.
>>
>> What we...
2015 Jul 01
3
[LLVMdev] SLP vectorizer on AVX feature
...he IR to machine code. However, the generated assembly doesn't seem to
>> support this assumption :-(
>>
>>
>> main:
>> .cfi_startproc
>> xorl %eax, %eax
>> xorl %esi, %esi
>> .align 16, 0x90
>> .LBB0_1:
>> vmovups (%r8,%rax), %xmm0
>> vaddps (%rcx,%rax), %xmm0, %xmm0
>> vmovups %xmm0, (%rdx,%rax)
>> addq $4, %rsi
>> addq $16, %rax
>> cmpq $61, %rsi
>> jb .LBB0_1
>> retq
>>
>> I played with -mcpu and -march...
2018 Jul 02
2
[RFC][VECLIB] how should we legalize VECLIB calls?
...n but only the argument
> and return data are legalized.
>
> vmovaps %ymm0, %ymm1
> vcvtdq2pd %xmm1, %ymm0
> vextractf128 $1, %ymm1, %xmm1
> vcvtdq2pd %xmm1, %ymm1
> callq __svml_sin8
> vmovups %ymm1, 32(%r15,%r12,8)
> vmovups %ymm0, (%r15,%r12,8)
>
> Unfortunately, __svml_sin8() doesn't use this form of input/output. It
> takes zmm0 and returns zmm0.
> i.e., not legal to use for AVX.
>
> What we need to see instead is two calls to __svml_...
2013 Dec 12
0
[LLVMdev] AVX code gen
...tmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
xorl %eax, %eax
.align 4, 0x90
LBB0_1: ## %vector.body
## =>This Inner Loop Header: Depth=1
vmovups (%rdx,%rax,4), %ymm0
vmulps (%rsi,%rax,4), %ymm0, %ymm0
vaddps (%rdi,%rax,4), %ymm0, %ymm0
vmovups %ymm0, (%rdi,%rax,4)
addq $8, %rax
cmpq $256, %rax ## imm = 0x100
jne LBB0_1
## BB#2: ## %for.e...
2013 Nov 16
1
[LLVMdev] Limit loop vectorizer to SSE
...would emit
= load <8 x i32>
(which has the semantics of “= load <8 x i32>, align 0”, meaning the address is aligned to the target ABI alignment; see http://llvm.org/docs/LangRef.html#load-instruction).
When the backend generates code for the former it will emit an unaligned move:
= vmovups ...
whereas for the latter it will use an aligned move:
= vmovaps …
vmovups can load from unaligned addresses, while vmovaps cannot.
No, we currently don’t peel loops for alignment.
Best,
Arnold
On Nov 15, 2013, at 7:23 PM, Frank Winter <fwinter at jlab.org> wrote:
> I confirm that r1...
2018 Jul 02
8
[RFC][VECLIB] how should we legalize VECLIB calls?
... vmovaps %ymm0, %ymm1
>
> vcvtdq2pd %xmm1, %ymm0
> vextractf128 $1, %ymm1, %xmm1
> vcvtdq2pd %xmm1, %ymm1
> callq __svml_sin8
> vmovups %ymm1, 32(%r15,%r12,8)
> vmovups %ymm0, (%r15,%r12,8)
>
> Unfortunately, __svml_sin8() doesn't use this form of
> input/output. It takes zmm0 and returns zmm0.
> i.e., not legal to use for AVX.
>
> ...
2013 Dec 11
2
[LLVMdev] AVX code gen
Hello -
I found this post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I can not get clang/llvm to generate such
2015 Jul 01
3
[LLVMdev] SLP vectorizer on AVX feature
...oesn't. Then I thought that maybe the YMM registers get used when
lowering the IR to machine code. However, the generated assembly doesn't
seem to support this assumption :-(
main:
.cfi_startproc
xorl %eax, %eax
xorl %esi, %esi
.align 16, 0x90
.LBB0_1:
vmovups (%r8,%rax), %xmm0
vaddps (%rcx,%rax), %xmm0, %xmm0
vmovups %xmm0, (%rdx,%rax)
addq $4, %rsi
addq $16, %rax
cmpq $61, %rsi
jb .LBB0_1
retq
I played with -mcpu and -march switches without success. In any case,
the target architecture should b...
2012 Mar 01
0
[LLVMdev] Stack alignment on X86 AVX seems incorrect
When the stack is unaligned, LLVM should generate vmovups instead of vmovaps.
- Elena
-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Joerg Sonnenberger
Sent: Thursday, March 01, 2012 20:31
To: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Stack alignment on X86 AVX seems incorrect
O...
2013 Sep 27
2
[LLVMdev] Trip count and Loop Vectorizer
...n the outer loop is unrolled since the trip count is constant (4). The 4 calls to memcpy are not efficient.
* Therefore, I disabled the memcpy optimization for such cases and found that the LLVM LoopVectorizer successfully vectorizes and unrolls the inner loop. However, in order to take the fast path (vmovups) it must copy at least 32 ints, whereas in this case we only do an 8-int copy.
** Upon closer look, LoopVectorizer obtains the TripCount for the inner loop using getSmallConstantTripCount(Loop,...). This value is 0 for the loop with unknown trip count. Loop unrolling is disabled when TC > 0. Sho...
2012 Mar 01
2
[LLVMdev] Stack alignment on X86 AVX seems incorrect
On Thu, Mar 01, 2012 at 06:16:46PM +0000, Demikhovsky, Elena wrote:
> vmovaps should not access stack if it is not aligned to 32
I'm not completely sure I understand your problem. Are you saying that
the generated code assumes 256bit alignment, your default stack
alignment is 128bit and LLVM doesn't adjust it automatically?
Joerg
2019 Sep 02
3
AVX2 codegen - question reg. FMA generation
...VX2 FMA instructions. Here's the snippet in the output it generates:
$ llc -O3 -mcpu=skylake
---------------------
.LBB0_2: # =>This Inner Loop Header: Depth=1
vbroadcastss (%rsi,%rdx,4), %ymm0
vmulps (%rdi,%rcx), %ymm0, %ymm0
vaddps (%rax,%rcx), %ymm0, %ymm0
vmovups %ymm0, (%rax,%rcx)
incq %rdx
addq $32, %rcx
cmpq $15, %rdx
jle .LBB0_2
-----------------------
$ llc --version
LLVM (http://llvm.org/):
LLVM version 8.0.0
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
(llvm commit 198009ae8db11d7c0b0517f17358870dc486fcfb from...
2015 Aug 28
6
Aligned vector spills and variably sized stack frames
...pile things which need to spill vector registers.
This is actually what we do today and has worked out fairly well in
practice. This is what I'm hoping to move away from.
Option 3 - Add an option in the x86 backend to not require aligned spill
slots for AVX2 registers. In particular, the VMOVUPS instruction can be
used to spill vector registers into an 8- or 16-byte-aligned spill slot
and not require dynamic frame realignment. This seems like it might be
useful in other contexts as well, but I can't name any at the moment.
One thing that occurs to me is that many spills are down rar...
2015 Jan 29
2
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
...%xmm0 ## xmm0
> = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
> vmovdqu %xmm0, 0x20(%rax)
> turning into:
> vshufps $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
> vshufps $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
> vmovups %xmm0, 0x20(%rax)
>
All of these stem from what I think is the same core weakness of the
current algorithm: we prefer the fully general shufps+shufps 4-way
shuffle/blend far too often. Here is how I would more precisely classify
the two things missing here:
- Check if either inputs are...
2011 Nov 30
0
[PATCH 2/4] x86/emulator: add emulation of SIMD FP moves
...*/
+ /* vmovap{s,d} ymm/m256,ymm */
+ case 0x29: /* {,v}movap{s,d} xmm,xmm/m128 */
+ /* vmovap{s,d} ymm,ymm/m256 */
+ fail_if(vex.pfx & VEX_PREFIX_SCALAR_MASK);
+ /* fall through */
+ case 0x10: /* {,v}movup{s,d} xmm/m128,xmm */
+ /* vmovup{s,d} ymm/m256,ymm */
+ /* {,v}movss xmm/m32,xmm */
+ /* {,v}movsd xmm/m64,xmm */
+ case 0x11: /* {,v}movup{s,d} xmm,xmm/m128 */
+ /* vmovup{s,d} ymm,ymm/m256 */
+ /* {,v}movss xmm,xmm/m32 */
+ /* {,v}movsd xmm,xmm/m64 */
+...