search for: addps

Displaying 20 results from an estimated 38 matches for "addps".

2015 Jul 29
0
[LLVMdev] x86-64 backend generates aligned ADDPS with unaligned address
...introducing this load without proving that the vector is 16-byte aligned, then that's a bug. On Wed, Jul 29, 2015 at 1:02 PM, Frank Winter <fwinter at jlab.org> wrote: > When I compile attached IR with LLVM 3.6 > > llc -march=x86-64 -o f.S f.ll > > it generates an aligned ADDPS with unaligned address. See attached f.S, > here an extract: > > addq $12, %r9 # $12 is not a multiple of 16, thus for > xmm0 this is unaligned > xorl %esi, %esi > .align 16, 0x90 > .LBB0_1: # %loop2 >...
2011 Sep 27
2
[LLVMdev] Poor code generation for odd sized vectors
...rks really well when the vector length (16 in the above) is an integer multiple of the SSE vector register width (4) resulting in the following assembler code: vector_add_float: # @vector_add_float .Leh_func_begin0: # BB#0: # %entry addps %xmm4, %xmm0 addps %xmm5, %xmm1 addps %xmm6, %xmm2 addps %xmm7, %xmm3 ret However, when the vector length is increased to say 18, the generated code is rather poor, or rather is code that could easily be improved by hand. Is this a known issue? Should LLVM be doing better? Should I raise a bug...
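For readers skimming these results, a hand-written sketch (not from the thread; names invented) of what reasonable code for the length-18 case looks like: four packed addps lanes plus a two-element scalar tail, which is roughly what the poster hoped the backend would produce.

    #include <xmmintrin.h>   /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

    /* Illustrative sketch only: length-18 vector add as 4 packed ops + a scalar tail. */
    void vector_add_float18(float *r, const float *a, const float *b)
    {
        for (int i = 0; i < 16; i += 4)        /* 16 of the 18 elements, 4 per addps */
            _mm_storeu_ps(r + i,
                          _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
        r[16] = a[16] + b[16];                 /* leftover elements 16 and 17 */
        r[17] = a[17] + b[17];
    }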
2015 Jul 29
2
[LLVMdev] x86-64 backend generates aligned ADDPS with unaligned address
When I compile attached IR with LLVM 3.6 llc -march=x86-64 -o f.S f.ll it generates an aligned ADDPS with unaligned address. See attached f.S, here an extract: addq $12, %r9 # $12 is not a multiple of 16, thus for xmm0 this is unaligned xorl %esi, %esi .align 16, 0x90 .LBB0_1: # %loop2...
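As context for this thread, a minimal intrinsics sketch (not code from the report; names invented) of the distinction at issue: aligned loads and packed ops with a memory operand trap when the address is not 16-byte aligned, while the unaligned forms do not.

    #include <xmmintrin.h>

    /* Sketch, assuming p may be misaligned: _mm_loadu_ps lowers to movups and is always
     * safe; _mm_load_ps (movaps) or an addps with a memory operand faults unless the
     * address is 16-byte aligned. addps on two registers never traps. */
    __m128 accumulate4(const float *p, __m128 acc)
    {
        __m128 v = _mm_loadu_ps(p);   /* unaligned load */
        return _mm_add_ps(acc, v);    /* register-register addps */
    }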
2013 Aug 22
2
New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
...ients_asm_ia32 cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32_mmx @@ -596,7 +597,7 @@ movss xmm3, xmm2 movss xmm2, xmm0 - ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm3:xmm3:xmm2 + ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm4:xmm3:xmm2 movaps xmm1, xmm0 mulps xmm1, xmm2 addps xmm5, xmm1 @@ -619,6 +620,95 @@ ret ALIGN 16 +cident FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16 + ;[ebp + 20] == autoc[] + ;[ebp + 16] == lag + ;[ebp + 12] == data_len + ;[ebp + 8] == data[] + ;[esp] == __m128 + ;[esp + 16] == __m128 + + push ebp + mov ebp, esp + and esp, -16 ; s...
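For orientation, this is the computation the new SSE routine implements, written as a plain-C sketch (illustrative; not FLAC's actual source).

    /* Lagged autocorrelation, plain-C sketch: autoc[l] = sum_i data[i] * data[i - l].
     * The SSE routine above keeps several lags in xmm registers and accumulates the
     * products with mulps/addps. */
    void autocorrelation(const float *data, int data_len, int lag, float *autoc)
    {
        for (int l = 0; l < lag; l++) {
            float sum = 0.0f;
            for (int i = l; i < data_len; i++)
                sum += data[i] * data[i - l];
            autoc[l] = sum;
        }
    }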
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...for x86_64 with SSE: >> >> [...] >> movaps 32(%rdi), %xmm3 >> movaps 48(%rdi), %xmm2 >> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 >> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 >> addps %xmm0, %xmm1 >> movaps %xmm1, -16(%rbp) ## 16-byte Spill >> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3 >> [...] >> >> xmm3 loaded, duplicated into 2 registers, and then discarded as other >> data is loaded int...
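A small intrinsics sketch (invented example, not from the FFT codelets) of why those movaps copies appear: two-operand SSE arithmetic destroys one source, so a value with remaining uses has to be copied first, and with many such values live at once the allocator starts spilling.

    #include <xmmintrin.h>

    /* 'a' is used twice; addps/subps each overwrite a source register, so the backend
     * must emit a movaps copy of 'a' before one of the uses. Scale this up to an FFT
     * codelet with dozens of simultaneously live values and spills follow. */
    __m128 two_uses(__m128 a, __m128 b, __m128 c)
    {
        __m128 t1 = _mm_add_ps(a, b);
        __m128 t2 = _mm_sub_ps(a, c);
        return _mm_mul_ps(t1, t2);
    }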
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...wrote: >>> [...] >>> movaps 32(%rdi), %xmm3 >>> movaps 48(%rdi), %xmm2 >>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 >>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 >>> addps %xmm0, %xmm1 >>> movaps %xmm1, -16(%rbp) ## 16-byte Spill >>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3 >>> [...] >>> >>> xmm3 loaded, duplicated into 2 registers, and then discarded as other >>...
2011 Sep 27
0
[LLVMdev] Poor code generation for odd sized vectors
...h (16 in the above) is > an integer multiple of the SSE vector register width (4) resulting > in the following assembler code: > > vector_add_float: # @vector_add_float > .Leh_func_begin0: > # BB#0: # %entry > addps %xmm4, %xmm0 > addps %xmm5, %xmm1 > addps %xmm6, %xmm2 > addps %xmm7, %xmm3 > ret > > However, when the vector length is increased to say 18, the generated > code is rather poor, or rather is code that could easily be improved > by hand. > > Is this a known issue? S...
2009 Dec 17
1
[LLVMdev] Merging AVX
...nstrSSE.td entirely with a set of patterns that covers all SIMD instructions. But that's going to be gradual so we need to maintain both as we go along. So these foundational templates need to be somewhere accessible to both sets of patterns. Then I'll start with a simple instruction like ADDPS/D / VADDPS/D. I will add all of the base templates needed to implement that and then add the pattern itself, replacing the various ADDPS/D patterns in X86InstrSSE.td. We'll do instructions one by one until we're done. When we get to things like shuffles where we've identified major re...
2013 Feb 26
2
[LLVMdev] passing vector of booleans to functions
...> %a, <4 x float> %b) { entry: %cmp = fcmp olt <4 x float> %a, %b %add = fadd <4 x float> %a, %b %sel = select <4 x i1> %cmp, <4 x float> %add, <4 x float> %a ret <4 x float> %sel } I will get (on SSE): movaps %xmm0, %xmm2 cmpltps %xmm1, %xmm0 addps %xmm2, %xmm1 blendvps %xmm1, %xmm2 movaps %xmm2, %xmm0 ret great :) But now, let us try to pass a mask to a function. define <4 x float> @masked_add_1(<4 x i1> %mask, <4 x float> %a, <4 x float> %b) { entry: %add = fadd <4 x float> %a, %b %sel = select <4 x...
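The intrinsics-level equivalent of the first IR function above, as a sketch (function name invented), showing why no extra mask fixup is needed when the mask comes straight from a compare.

    #include <smmintrin.h>   /* SSE4.1: _mm_blendv_ps */

    /* cmpltps already yields an all-ones/all-zeros lane mask, which blendvps consumes
     * directly via each lane's sign bit. */
    __m128 simple_select(__m128 a, __m128 b)
    {
        __m128 mask = _mm_cmplt_ps(a, b);     /* lane-wise a < b         */
        __m128 add  = _mm_add_ps(a, b);       /* addps                   */
        return _mm_blendv_ps(a, add, mask);   /* add where a < b, else a */
    }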
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
...xmm1, 0x00 + + movaps xmm2, [eax+4] + movaps xmm3, [ebx+4] + mulps xmm2, xmm0 + mulps xmm3, xmm1 + movaps xmm4, [eax+20] + mulps xmm4, xmm0 + addps xmm2, [ecx+4] + movaps xmm5, [ebx+20] + mulps xmm5, xmm1 + addps xmm4, [ecx+20] + subps xmm2, xmm3 + movups [ecx], xmm2 + subps xmm4, xmm5 +...
2012 May 24
4
[LLVMdev] use AVX automatically if present
..." .text .globl _fun1 .align 16, 0x90 .type _fun1, at function _fun1: # @_fun1 .cfi_startproc # BB#0: # %_L1 movaps (%rdi), %xmm0 movaps 16(%rdi), %xmm1 addps (%rsi), %xmm0 addps 16(%rsi), %xmm1 movaps %xmm1, 16(%rdi) movaps %xmm0, (%rdi) ret .Ltmp0: .size _fun1, .Ltmp0-_fun1 .cfi_endproc .section ".note.GNU-stack","", at progbits $ llc -o - -mattr avx...
2013 Aug 19
2
[LLVMdev] Duplicate loading of double constants
...dant as it's dominated by the first one. Two xorps come from 2 FsFLD0SD generated by instruction selection and never eliminated by machine passes. My guess would be machine CSE should take care of it. A variation of this case without indirection shows the same problem, as well as not commuting addps, resulting in an extra movps: $ cat t.c double f(double p, int n) { double s = 0; if (n) s += p; return s; } $ clang -S -O3 t.c -o - ... f: # @f .cfi_startproc # BB#0: xorps %xmm1, %xmm1 testl %edi, %edi j...
2020 Aug 20
2
Question about llvm vectors
...g, Thank you very much for your answer. I did not want to discuss the exact semantics and name of one operation but instead to raise the question "would it be beneficial to have more vector builtins?". You wrote that the compiler will recognize a pattern and replace it with __builtin_ia32_haddps when possible, but how can I be sure of that? I would have to disassemble the generated code, right? That is very impractical, isn't it? And it leads me to understand that each CPU target has a bank of patterns which it can recognize, but wouldn't it be very similar to have advanced generic vector...
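For reference, the explicit-intrinsic route the thread is weighing against pattern recognition: _mm_hadd_ps is the portable spelling of __builtin_ia32_haddps (SSE3). Sketch, function name invented.

    #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

    /* Horizontal sum of one __m128 using haddps twice. */
    float sum4(__m128 v)
    {
        __m128 t = _mm_hadd_ps(v, v);   /* {v0+v1, v2+v3, v0+v1, v2+v3} */
        t = _mm_hadd_ps(t, t);          /* {v0+v1+v2+v3, ...}           */
        return _mm_cvtss_f32(t);
    }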
2008 Jun 17
2
[LLVMdev] VFCmp failing when unordered or UnsafeFPMath on x86
...[1] < 0) v[1] += 1.0f; if(v[2] < 0) v[2] += 1.0f; if(v[3] < 0) v[3] += 1.0f; With SSE assembly this would be as simple as: movaps xmm1, xmm0 // v in xmm0 cmpltps xmm1, zero // zero = {0.0f, 0.0f, 0.0f, 0.0f} andps xmm1, one // one = {1.0f, 1.0f, 1.0f, 1.0f} addps xmm0, xmm1 With the current definition of VFCmp this seems hard if not impossible to achieve. Vector compare instructions that return all 1's or all 0's per element are very common, and they are quite powerful in my opinion (effectively allowing to implement a per-element Select). It s...
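The same branch-free pattern the post's assembly shows, written with intrinsics as a sketch (function name invented): the compare mask is ANDed with 1.0f and added, i.e. "if (v[i] < 0) v[i] += 1.0f" per element.

    #include <xmmintrin.h>

    __m128 wrap_negative(__m128 v)
    {
        __m128 mask = _mm_cmplt_ps(v, _mm_setzero_ps());    /* cmpltps: all-ones where v < 0 */
        __m128 bump = _mm_and_ps(mask, _mm_set1_ps(1.0f));  /* andps with {1,1,1,1}          */
        return _mm_add_ps(v, bump);                         /* addps                         */
    }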
2011 Dec 05
0
[LLVMdev] RFC: Machine Instruction Bundle
...) R3 = memw(R4) } Constraining spill code insertion ================================= It is important to note that bundling instructions doesn't constrain the register allocation problem. For example, this bundle would be impossible with sequential value constraints: { call foo %vr0 = addps %vr1, %vr2 call bar } The calls clobber the xmm registers, so it is impossible to register allocate this code without breaking up the bundle and inserting spill code between the calls. With our definition of bundle value semantics, the addps is reading %vr1 and %vr2 outside the bundle, and the...
2010 Jun 18
1
[LLVMdev] argpromotion not working
Hi all, I have the following C code. static int addp(int *c, int a,int b) { int x = *c + a + b; return(x); } I want to replace *c with a scalar. So I tried the -argpromotion pass. However, it fails to do anything to the resulting llvm file. List of commands: clang add.c -c -o add.bc clang add.c -S -o add.ll opt -argpromotion -stats add.bc -o add_a.bc llvm-dis < add_a.bc > add_a.ll Also,
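For readers unfamiliar with the pass being asked about: argument promotion rewrites an internal function so a pointer argument is passed by value instead, moving the load into the caller. A C-level sketch of the intended before/after (illustrative only; not the poster's output).

    /* Before: static int addp(int *c, int a, int b) { return *c + a + b; }
     * After promotion, the load of *c happens at each call site: */
    static int addp_promoted(int c_val, int a, int b)
    {
        return c_val + a + b;
    }

    int caller(int *c)
    {
        return addp_promoted(*c, 1, 2);   /* caller performs the load */
    }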
2013 Feb 26
0
[LLVMdev] passing vector of booleans to functions
...sked_add_1(<4 x i1> %mask, <4 x float> %a, <4 x float> %b) { > entry: > %add = fadd <4 x float> %a, %b > %sel = select <4 x i1> %mask, <4 x float> %add, <4 x float> %a > ret <4 x float> %sel > } > > I will get: > > addps %xmm1, %xmm2 > pslld $31, %xmm0 > blendvps %xmm2, %xmm1 > movaps %xmm1, %xmm0 > ret > > While this is correct and works, I'm unhappy with the pslld. Apparently, > LLVM uses a <4 x i32> to hold the <4 x i1> while the LSB holds the mask > bit. But blend...
2013 Apr 15
1
[LLVMdev] State of Loop Unrolling and Vectorization in LLVM
...and clang, I am not able to see the loop being unrolled / vectorized. The microbenchmark which runs the function g() over a billion times shows quite some performance difference between gcc and clang: Gcc - 8.6 seconds, Clang - 12.7 seconds. Evidently, the addition operation can be vectorized to use addps (clang does addss), and the loop can be unrolled for better performance. Any idea why this is happening? Thanks Sriram -- Sriram Murali SSG/DPD/ECDL/DMP +1 (519) 772 - 2579
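The thread's g() is not shown in this snippet; a minimal stand-in with the relevant property is sketched below. A floating-point sum is a reduction, so the vectorizer may only turn the scalar addss loop into packed addps when it is allowed to reassociate FP math (e.g. -ffast-math or an explicit reduction pragma).

    /* Stand-in reduction loop (illustrative; not the benchmark's actual g()). */
    float g(const float *x, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += x[i];          /* reassociation needed before this can become addps */
        return s;
    }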
2013 Aug 20
0
[LLVMdev] Duplicate loading of double constants
...one. Two xorps come from 2 FsFLD0SD > generated by > instruction selection and never eliminated by machine passes. My guess > would be > machine CSE should take care of it. > > A variation of this case without indirection shows the same problem, as > well as > not commuting addps, resulting in an extra movps: > > $ cat t.c > double f(double p, int n) > { > double s = 0; > if (n) > s += p; > return s; > } > $ clang -S -O3 t.c -o - > ... > f: # @f > .cfi_startproc > # BB...
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...s from a > 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE: > > [...] > movaps 32(%rdi), %xmm3 > movaps 48(%rdi), %xmm2 > movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1 > movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4 > addps %xmm0, %xmm1 > movaps %xmm1, -16(%rbp) ## 16-byte Spill > movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3 > [...] > > xmm3 loaded, duplicated into 2 registers, and then discarded as other > data is loaded into it. Can anyone shed some light on w...