search for: paddd

Displaying 20 results from an estimated 21 matches for "paddd".

2014 Jul 23
4
[LLVMdev] the clang 3.5 loop optimizer seems to jump in unintentional for simple loops
...xmm1, %xmm1
  .align 16, 0x90
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
  movdqa %xmm1, %xmm2
  movdqa %xmm0, %xmm3
  movdqu -16(%rdi), %xmm0
  movdqu (%rdi), %xmm1
  paddd %xmm3, %xmm0
  paddd %xmm2, %xmm1
  addq $32, %rdi
  addq $-8, %rdx
  jne .LBB0_3
# BB#4:
  movq %r8, %rdi
  movq %rax, %rdx
  jmp .LBB0_5
.LBB0_1:
  pxor %xmm1, %xmm1
.LBB0_5:                                # %middle.block
  paddd %...
2006 May 25
2
Compilation issues with s390
Hi all, I'm trying to compile Asterisk on the mainframe (s390 / s390x) and I am running into issues. I was wondering if somebody could give me a hand? I'm thinking that I should be able to do this. I have noticed that Debian even has binary RPMs out for Asterisk now. I'm trying to do this on SuSE SLES8 (with the 2.4 kernel). What I see is an issue that arch=s390 isn't
2016 Aug 05
3
enabling interleaved access loop vectorization
...ctorized code is actually fairly decent - e.g. forcing vectorization, with SSE4.2, we get:
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
  movdqu (%rdi,%rax,4), %xmm3
  movd %xmm0, %rcx
  movdqu 4(%rdi,%rcx,4), %xmm4
  paddd %xmm3, %xmm4
  movdqu 8(%rdi,%rcx,4), %xmm3
  paddd %xmm4, %xmm3
  movdqa %xmm1, %xmm4
  paddq %xmm4, %xmm4
  movdqa %xmm0, %xmm5
  paddq %xmm5, %xmm5
  movd %xmm5, %rcx
  pextrq $1, %xmm5, %rdx
  movd %xmm4, %r8
  pextrq $1, %xmm4, %r9
  movd (%rdi,%rcx,4), %xmm4         # xmm4 = mem[0],zero,zero,zero
  pinsrd $1, (%rdi,%rdx,...
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet. We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutations (shuffle masks). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer
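
As a minimal sketch (not taken from the thread) of the kind of interleaved access this cost question is about, assume a simple stride-2 loop in C:

/* Sketch only: a stride-2 (interleaved) loop. Vectorizing it with wide
 * vector loads requires extra shuffles to separate the even- and
 * odd-indexed elements of each load before the adds can happen. */
void sum_pairs(const int *a, int n, int *even_sum, int *odd_sum) {
    int e = 0, o = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        e += a[i];      /* element 0 of each pair */
        o += a[i + 1];  /* element 1 of each pair */
    }
    *even_sum = e;
    *odd_sum = o;
}

Vectorizing such a loop with wide loads forces de-interleaving shuffles, and it is the count and placement of those shuffles that the cost model has to estimate.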
2015 Nov 19
5
[RFC] Introducing a vector reduction add instruction.
...m1, %xmm1
  .align 16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
  movd b+1024(%rax), %xmm2          # xmm2 = mem[0],zero,zero,zero
  movd a+1024(%rax), %xmm3          # xmm3 = mem[0],zero,zero,zero
  psadbw %xmm2, %xmm3
  paddd %xmm3, %xmm0
  movd b+1028(%rax), %xmm2          # xmm2 = mem[0],zero,zero,zero
  movd a+1028(%rax), %xmm3          # xmm3 = mem[0],zero,zero,zero
  psadbw %xmm2, %xmm3
  paddd %xmm3, %xmm1
  addq $8, %rax
  jne .LBB0_1
# BB#2:                                 # %middle.block
  paddd %xmm0, %xmm1
  pshufd $78, %xmm1, %xmm0...
2016 Aug 05
2
enabling interleaved access loop vectorization
...with SSE4.2, we get:
> >
> > .LBB0_3:                            # %vector.body
> >                                     # =>This Inner Loop Header: Depth=1
> > movdqu (%rdi,%rax,4), %xmm3
> > movd %xmm0, %rcx
> > movdqu 4(%rdi,%rcx,4), %xmm4
> > paddd %xmm3, %xmm4
> > movdqu 8(%rdi,%rcx,4), %xmm3
> > paddd %xmm4, %xmm3
> > movdqa %xmm1, %xmm4
> > paddq %xmm4, %xmm4
> > movdqa %xmm0, %xmm5
> > paddq %xmm5, %xmm5
> > movd %xmm5, %rcx
> > pextrq $1, %xmm5, %rdx
> > movd %xmm4, %r8
> > pext...
2004 Sep 10
2
An assembly optimization and fix
...0
+    psubd     mm5, mm2  ; mm5 = 0:last_error_1
+    punpckldq mm3, mm5  ; mm3 = last_error_1:last_error_0
+    psubd     mm2, mm1  ; mm2 = 0:data[-2] - data[-3]
+    psubd     mm5, mm2  ; mm5 = 0:last_error_2
+    movq      mm4, mm5  ; mm4 = 0:last_error_2
+    psubd     mm4, mm2  ; mm4 = 0:last_error_2 - (data[-2] - data[-3])
+    paddd     mm4, mm1  ; mm4 = 0:last_error_2 - (data[-2] - 2 * data[-3])
+    psubd     mm4, mm0  ; mm4 = 0:last_error_3
+    punpckldq mm4, mm5  ; mm4 = last_error_2:last_error_3
+    pxor      mm0, mm0  ; mm0 = total_error_1:total_error_0
+    pxor      mm1, mm1  ; mm1 = total_error_2:total_error_3
+    pxor      mm2, mm2  ; mm2 = 0:tot...
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
...# =>This Inner Loop Header:
>> Depth=1
>> movd b+1024(%rax), %xmm2          # xmm2 = mem[0],zero,zero,zero
>> movd a+1024(%rax), %xmm3          # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm0
>> movd b+1028(%rax), %xmm2          # xmm2 = mem[0],zero,zero,zero
>> movd a+1028(%rax), %xmm3          # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm1
>> addq $8, %rax
>> jne .LBB0_1
>> # BB#2:...
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
..., 0x90
> >> .LBB0_1:                           # %vector.body
> >>                                    # =>This Inner Loop Header:
> >> Depth=1
> >> movd b+1024(%rax), %xmm2          # xmm2 = mem[0],zero,zero,zero
> >> movd a+1024(%rax), %xmm3          # xmm3 = mem[0],zero,zero,zero
> >> psadbw %xmm2, %xmm3
> >> paddd %xmm3, %xmm0
> >> movd b+1028(%rax), %xmm2          # xmm2 = mem[0],zero,zero,zero
> >> movd a+1028(%rax), %xmm3          # xmm3 = mem[0],zero,zero,zero
> >> psadbw %xmm2, %xmm3
> >> paddd %xmm3, %xmm1
> >> addq $8, %rax
> >> jne .LBB0_1
> >> # BB#2: # %...
2014 Oct 13
2
[LLVMdev] Unexpected spilling of vector register during lane extraction on some x86_64 targets
...ct-aliasing -funroll-loops -ffast-math -march=native -mtune=native -DSPILLING_ENSUES=0 /* no spilling */
$ objdump -dC --no-show-raw-insn ./a.out
...
00000000004004f0 <main>:
  4004f0: vmovdqa 0x2004c8(%rip),%xmm0       # 6009c0 <x>
  4004f8: vpsrld $0x17,%xmm0,%xmm0
  4004fd: vpaddd 0x17b(%rip),%xmm0,%xmm0      # 400680 <__dso_handle+0x8>
  400505: vcvtdq2ps %xmm0,%xmm1
  400509: vdivps 0x17f(%rip),%xmm1,%xmm1      # 400690 <__dso_handle+0x18>
  400511: vcvttps2dq %xmm1,%xmm1
  400515: vpmullw 0x183(%rip),%xmm1,%xmm1     # 4006a0 <__dso_handle+0x2...
2009 Aug 30
3
experimental patch for libtheora1.1beta3
..."
+      "sub 32,%[ret]\n\t"
       "movq 0x40(%[buf]),%%mm0\n\t"
       "cmp %[ret2],%[ret]\n\t"
       "movq 0x48(%[buf]),%%mm4\n\t"
@@ -511,7 +514,11 @@ static unsigned oc_int_frag_satd_thresh_mmxext(const u
       "punpckhdq %%mm0,%%mm0\n\t"
       "paddd %%mm0,%%mm4\n\t"
       "movd %%mm4,%[ret2]\n\t"
-      "lea (%[ret],%[ret2],2),%[ret]\n\t"
+      /* Not working "lea (%[ret],%[ret2],2),%[ret]\n\t" */
+      /* Like ret = ret2*2 + ret */
+      "mov %[ret2], %%eax\n\t"
+      "add %%eax, %%eax\n\t"
+...
2010 May 11
0
[LLVMdev] How does SSEDomainFix work?
...x, <4 x i32> %y, <4 x i32> %z) nounwind readnone {
entry:
  %0 = add <4 x i32> %x, %z
  %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = and <4 x i32> %not, %y
  %2 = xor <4 x i32> %0, %1
  ret <4 x i32> %2
}

_intfoo:
  movdqa %xmm0, %xmm3
  paddd %xmm2, %xmm3
  pandn %xmm1, %xmm2
  movdqa %xmm2, %xmm0
  pxor %xmm3, %xmm0
  ret

All the instructions moved to the int domain because the add forced them.

> Please tell me if I am doing something wrong.

You should measure whether LLVM's code is actually slower than the code you want. If it i...
2010 May 11
2
[LLVMdev] How does SSEDomainFix work?
Hello. This is my first post. I have tried the SSE execution domain fixup pass, but I am not able to see any improvements. I expect the example below to use MOVDQA, PAND, etc. (On Nehalem, ANDPS is much slower than PAND.) Please tell me if I am doing something wrong. Thank you. Takumi

Host: i386-mingw32
Build: trunk at 103373

foo.ll:
define <4 x i32> @foo(<4 x i32> %x,
2016 Apr 01
2
RFC: A proposal for vectorizing loops with calls to math functions using SVML
...label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
  movd %ebx, %xmm0
  pshufd $0, %xmm0, %xmm0           # xmm0 = xmm0[0,0,0,0]
  paddd .LCPI0_0(%rip), %xmm0
  cvtdq2ps %xmm0, %xmm0
  movaps %xmm0, 16(%rsp)            # 16-byte Spill
  shufps $231, %xmm0, %xmm0         # xmm0 = xmm0[3,1,2,3]
  callq sinf
  movaps %xmm0, (%rsp)              # 16-byte Spill
  movaps 16(%rsp), %xmm0            #...
2016 Apr 04
2
RFC: A proposal for vectorizing loops with calls to math functions using SVML
...label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
  movd %ebx, %xmm0
  pshufd $0, %xmm0, %xmm0           # xmm0 = xmm0[0,0,0,0]
  paddd .LCPI0_0(%rip), %xmm0
  cvtdq2ps %xmm0, %xmm0
  movaps %xmm0, 16(%rsp)            # 16-byte Spill
  shufps $231, %xmm0, %xmm0         # xmm0 = xmm0[3,1,2,3]
  callq sinf
  movaps %xmm0, (%rsp)              # 16-byte Spill
  movaps 16(%rsp), %xmm0            #...
2004 Aug 24
5
MMX/mmxext optimisations
Quite some speed improvement indeed. Attached is the updated patch to apply to svn/trunk. j

-------------- next part --------------
A non-text attachment was scrubbed...
Name: theora-mmx.patch.gz
Type: application/x-gzip
Size: 8648 bytes
Desc: not available
Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20040824/5a5f2731/theora-mmx.patch-0001.bin
2016 Jun 15
8
[RFC] Allow loop vectorizer to choose vector widths that generate illegal types
Hello, Currently the loop vectorizer will, by default, not consider vectorization factors that would make it generate types that do not fit into the target platform's vector registers. That is, if the widest scalar type in the scalar loop is i64, and the platform's largest vector register is 256-bit wide, we will not consider a VF above 4. We have a command line option (-mllvm
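
As an illustration added here (not part of the original message): with i64 elements and 256-bit vector registers, four lanes fill one register, so the default cap described above works out to VF = 4 for a loop like the following.

/* Illustration only, assuming 256-bit vector registers (e.g. AVX2):
 * four 64-bit lanes fill one register, so the vectorizer's default
 * choice is capped at VF = 4 here, even if a wider VF emulated with
 * several registers might be more profitable. */
long long sum_i64(const long long *a, int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}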
2012 Nov 28
0
[LLVMdev] [llvm-commits] [dragonegg] r168787 - in /dragonegg/trunk: src/x86/Target.cpp src/x86/x86_builtins test/validator/c/copysignp.c
...tRHS, SignMask);
> +    Value *Abs = Builder.CreateAnd(IntLHS, ConstantExpr::getNot(SignMask));
> +    Value *IntRes = Builder.CreateOr(Abs, Sign);
> +    Result = Builder.CreateBitCast(IntRes, VecTy);
> +    return true;
> +  }
>   case paddb:
>   case paddw:
>   case paddd:
>
> Modified: dragonegg/trunk/src/x86/x86_builtins
> URL: http://llvm.org/viewvc/llvm-project/dragonegg/trunk/src/x86/x86_builtins?rev=168787&r1=168786&r2=168787&view=diff
> ==============================================================================
> --- dragonegg/tr...
2013 Oct 15
0
[LLVMdev] [llvm-commits] r192750 - Enable MI Sched for x86.
...
>> +++ llvm/trunk/test/CodeGen/X86/widen_cast-1.ll Tue Oct 15 18:33:07 2013
>> @@ -1,8 +1,8 @@
>>  ; RUN: llc -march=x86 -mcpu=generic -mattr=+sse4.2 < %s | FileCheck %s
>>  ; RUN: llc -march=x86 -mcpu=atom < %s | FileCheck -check-prefix=ATOM %s
>>
>> -; CHECK: paddd
>>  ; CHECK: movl
>> +; CHECK: paddd
>>  ; CHECK: movlpd
>>
>>  ; Scheduler causes produce a different instruction order
>>
>> Modified: llvm/trunk/test/CodeGen/X86/win64_alloca_dynalloca.ll
>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/C...