Displaying 20 results from an estimated 38 matches for "addps".
2015 Jul 29
0
[LLVMdev] x86-64 backend generates aligned ADDPS with unaligned address
...introducing this load without proving that
the vector is 16-byte aligned, then that's a bug
On Wed, Jul 29, 2015 at 1:02 PM, Frank Winter <fwinter at jlab.org> wrote:
> When I compile attached IR with LLVM 3.6
>
> llc -march=x86-64 -o f.S f.ll
>
> it generates an aligned ADDPS with unaligned address. See attached f.S,
> here an extract:
>
> addq $12, %r9 # $12 is not a multiple of 16, so the
> 16-byte access into xmm0 is unaligned
> xorl %esi, %esi
> .align 16, 0x90
> .LBB0_1: # %loop2
>...
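For context, the constraint at issue in this thread is that ADDPS with a memory
operand (or the MOVAPS feeding it) requires a 16-byte-aligned address. A minimal
intrinsics sketch of the safe and unsafe forms, assuming a base pointer with no
particular alignment (this is not the attached f.ll, which is not reproduced in
the excerpt; names are illustrative):

#include <xmmintrin.h>

__m128 add_from_offset(const float *base, __m128 acc)
{
    const float *p = base + 3;     /* base + 12 bytes, as in the addq $12 above */
    __m128 v = _mm_loadu_ps(p);    /* movups: no alignment requirement */
    return _mm_add_ps(acc, v);     /* addps on registers only */
    /* The buggy form corresponds to _mm_load_ps(p) -- movaps, or addps with a
     * memory operand -- which faults unless p happens to be 16-byte aligned. */
}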
2011 Sep 27
2
[LLVMdev] Poor code generation for odd sized vectors
...rks really well when the vector length (16 in the above) is
an integer multiple of the SSE vector register width (4), resulting
in the following assembler code:
vector_add_float: # @vector_add_float
.Leh_func_begin0:
# BB#0: # %entry
addps %xmm4, %xmm0
addps %xmm5, %xmm1
addps %xmm6, %xmm2
addps %xmm7, %xmm3
ret
However, when the vector length is increased to, say, 18, the generated
code is rather poor, or rather it is code that could easily be improved
by hand.
Is this a known issue? Should LLVM be doing better? Should I raise a
bug...
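The contrast described here can be reproduced with a plain C loop; a sketch with
N as the vector length under discussion (the original passed LLVM vector types by
value, so this is only an approximation of the setup):

#define N 18   /* 16 maps exactly onto four 4-wide SSE registers; 18 leaves a 2-element tail */

void vector_add_float(float *restrict out,
                      const float *restrict a,
                      const float *restrict b)
{
    /* With N a multiple of 4 the whole loop can lower to addps; with N = 18
     * the remainder needs scalar or partially filled vector code. */
    for (int i = 0; i < N; ++i)
        out[i] = a[i] + b[i];
}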
2015 Jul 29
2
[LLVMdev] x86-64 backend generates aligned ADDPS with unaligned address
When I compile attached IR with LLVM 3.6
llc -march=x86-64 -o f.S f.ll
it generates an aligned ADDPS with unaligned address. See attached f.S,
here an extract:
addq $12, %r9 # $12 is not a multiple of 16, so the
16-byte access into xmm0 is unaligned
xorl %esi, %esi
.align 16, 0x90
.LBB0_1: # %loop2...
2013 Aug 22
2
New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
...ients_asm_ia32
cglobal FLAC__lpc_compute_residual_from_qlp_coefficients_asm_ia32_mmx
@@ -596,7 +597,7 @@
movss xmm3, xmm2
movss xmm2, xmm0
- ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm3:xmm3:xmm2
+ ; xmm7:xmm6:xmm5 += xmm0:xmm0:xmm0 * xmm4:xmm3:xmm2
movaps xmm1, xmm0
mulps xmm1, xmm2
addps xmm5, xmm1
@@ -619,6 +620,95 @@
ret
ALIGN 16
+cident FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16
+ ;[ebp + 20] == autoc[]
+ ;[ebp + 16] == lag
+ ;[ebp + 12] == data_len
+ ;[ebp + 8] == data[]
+ ;[esp] == __m128
+ ;[esp + 16] == __m128
+
+ push ebp
+ mov ebp, esp
+ and esp, -16 ; s...
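For reference, the quantity the lag-16 SSE routine accumulates is the standard
short-time autocorrelation computed by the portable FLAC__lpc_compute_autocorrelation
reference; a scalar sketch (names chosen for illustration, not copied from the
reference source):

void autocorrelation_scalar(const float *data, unsigned data_len,
                            unsigned lag, float *autoc)
{
    /* autoc[l] = sum over i of data[i] * data[i - l]; the SSE version above
     * instead keeps the running sums for all lags resident in xmm registers
     * (four lags per register) and updates them per input sample. */
    for (unsigned l = 0; l < lag; l++) {
        float sum = 0.0f;
        for (unsigned i = l; i < data_len; i++)
            sum += data[i] * data[i - l];
        autoc[l] = sum;
    }
}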
2012 Jul 06
2
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...for x86_64 with SSE:
>>
>> [...]
>> movaps 32(%rdi), %xmm3
>> movaps 48(%rdi), %xmm2
>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>> addps %xmm0, %xmm1
>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>> [...]
>>
>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>> data is loaded int...
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...wrote:
>>> [...]
>>> movaps 32(%rdi), %xmm3
>>> movaps 48(%rdi), %xmm2
>>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>>> addps %xmm0, %xmm1
>>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>>> [...]
>>>
>>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>>...
2011 Sep 27
0
[LLVMdev] Poor code generation for odd sized vectors
...h (16 in the above) is
> an integer multiple of the SSE vector register width (4), resulting
> in the following assembler code:
>
> vector_add_float: # @vector_add_float
> .Leh_func_begin0:
> # BB#0: # %entry
> addps %xmm4, %xmm0
> addps %xmm5, %xmm1
> addps %xmm6, %xmm2
> addps %xmm7, %xmm3
> ret
>
> However, when the vector length is increased to, say, 18, the generated
> code is rather poor, or rather it is code that could easily be improved
> by hand.
>
> Is this a known issue? S...
2009 Dec 17
1
[LLVMdev] Merging AVX
...nstrSSE.td
entirely with a set of patterns that covers all SIMD instructions. But
that's going to be gradual so we need to maintain both as we go along.
So these foundational templates need to be somewhere accessible to
both sets of patterns.
Then I'll start with a simple instruction like ADDPS/D / VADDPS/D. I will add
all of the base templates needed to implement that and then add the
pattern itself, replacing the various ADDPS/D patterns in X86InstrSSE.td.
We'll do instructions one by one until we're done.
When we get to things like shuffles where we've identified major re...
2013 Feb 26
2
[LLVMdev] passing vector of booleans to functions
...> %a, <4 x float> %b) {
entry:
%cmp = fcmp olt <4 x float> %a, %b
%add = fadd <4 x float> %a, %b
%sel = select <4 x i1> %cmp, <4 x float> %add, <4 x float> %a
ret <4 x float> %sel
}
I will get (on SSE):
movaps %xmm0, %xmm2
cmpltps %xmm1, %xmm0
addps %xmm2, %xmm1
blendvps %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
great :)
But now, let us try to pass a mask to a function.
define <4 x float> @masked_add_1(<4 x i1> %mask, <4 x float> %a, <4 x float> %b) {
entry:
%add = fadd <4 x float> %a, %b
%sel = select <4 x...
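The codegen shown for the first example corresponds to the following SSE4.1
intrinsics pattern; a sketch for comparison, assuming the same
compare-then-select semantics (the function name is illustrative):

#include <smmintrin.h>   /* SSE4.1: _mm_blendv_ps */

__m128 masked_add(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmplt_ps(a, b);    /* cmpltps: all-ones / all-zeros per lane */
    __m128 sum  = _mm_add_ps(a, b);      /* addps */
    return _mm_blendv_ps(a, sum, mask);  /* blendvps keys off each lane's sign bit */
}

When the mask instead arrives as a <4 x i1> argument, it is materialized as 0/1
integers, which is why the follow-up further down shows a pslld $31 to move the
bit into the sign position before blendvps.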
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
...xmm1, 0x00
+
+ movaps xmm2, [eax+4]
+ movaps xmm3, [ebx+4]
+ mulps xmm2, xmm0
+ mulps xmm3, xmm1
+ movaps xmm4, [eax+20]
+ mulps xmm4, xmm0
+ addps xmm2, [ecx+4]
+ movaps xmm5, [ebx+20]
+ mulps xmm5, xmm1
+ addps xmm4, [ecx+20]
+ subps xmm2, xmm3
+ movups [ecx], xmm2
+ subps xmm4, xmm5
+...
2012 May 24
4
[LLVMdev] use AVX automatically if present
..."
.text
.globl _fun1
.align 16, 0x90
.type _fun1,@function
_fun1: # @_fun1
.cfi_startproc
# BB#0: # %_L1
movaps (%rdi), %xmm0
movaps 16(%rdi), %xmm1
addps (%rsi), %xmm0
addps 16(%rsi), %xmm1
movaps %xmm1, 16(%rdi)
movaps %xmm0, (%rdi)
ret
.Ltmp0:
.size _fun1, .Ltmp0-_fun1
.cfi_endproc
.section ".note.GNU-stack","", at progbits
$ llc -o - -mattr avx...
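A C-level equivalent of the _fun1 above (the original IR is not reproduced in
this excerpt): an in-place add of eight floats. Built for plain SSE it lowers to
two 128-bit add sequences like the listing; with AVX enabled (llc -mattr=+avx,
or clang -mavx) the backend can instead use 256-bit ymm registers and vaddps:

void fun1(float *restrict dst, const float *restrict src)
{
    /* eight floats: two 128-bit SSE vectors, or one 256-bit AVX vector */
    for (int i = 0; i < 8; ++i)
        dst[i] += src[i];
}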
2013 Aug 19
2
[LLVMdev] Duplicate loading of double constants
...dant
as it's dominated by the first one. Two xorps come from 2 FsFLD0SD
generated by instruction selection and never eliminated by machine passes.
My guess would be that machine CSE should take care of it.
A variation of this case without indirection shows the same problem, as
well as
not commuting addps, resulting in an extra movps:
$ cat t.c
double f(double p, int n)
{
double s = 0;
if (n)
s += p;
return s;
}
$ clang -S -O3 t.c -o -
...
f: # @f
.cfi_startproc
# BB#0:
xorps %xmm1, %xmm1
testl %edi, %edi
j...
2020 Aug 20
2
Question about llvm vectors
...g,
Thank you very much for your answer.
I did not want to discuss the exact semantics and name of one operation,
but instead to raise the question "would it be beneficial to have more vector
builtins?".
You wrote that the compiler will recognize a pattern and replace it by
__builtin_ia32_haddps when possible, but how can I be sure of that? I would
have to disassemble the generated code, right? That is very
impractical, isn't it? And it leads me to understand that each CPU target has
a bank of patterns it can recognize, but wouldn't it be very similar
to have advanced generic vector...
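The two options being weighed here can be written out explicitly; a sketch
(hadd_explicit/hadd_generic are illustrative names), where confirming which
instruction was actually emitted still means inspecting the assembly:

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps (pulls in the SSE/SSE2 headers) */

__m128 hadd_explicit(__m128 a, __m128 b)
{
    return _mm_hadd_ps(a, b);   /* lowers directly to haddps */
}

__m128 hadd_generic(__m128 a, __m128 b)
{
    /* the same pairwise sums via portable shuffles; whether the backend
     * matches this back into a single haddps depends on target and version */
    __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* a0 a2 b0 b2 */
    __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* a1 a3 b1 b3 */
    return _mm_add_ps(even, odd);
}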
2008 Jun 17
2
[LLVMdev] VFCmp failing when unordered or UnsafeFPMath on x86
...[1] < 0) v[1] += 1.0f;
if(v[2] < 0) v[2] += 1.0f;
if(v[3] < 0) v[3] += 1.0f;
With SSE assembly this would be as simple as:
movaps xmm1, xmm0 // v in xmm0
cmpltps xmm1, zero // zero = {0.0f, 0.0f, 0.0f, 0.0f}
andps xmm1, one // one = {1.0f, 1.0f, 1.0f, 1.0f}
addps xmm0, xmm1
With the current definition of VFCmp this seems hard if not impossible to
achieve. Vector compare instructions that return all 1's or all 0's per
element are very common, and they are quite powerful in my opinion
(effectively allowing one to implement a per-element Select). It s...
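The three-instruction sequence sketched above transcribes directly to SSE
intrinsics; a minimal version for reference (the zero and one constants are
built inline here rather than loaded from memory as in the listing):

#include <xmmintrin.h>

__m128 wrap_negative(__m128 v)
{
    __m128 mask = _mm_cmplt_ps(v, _mm_setzero_ps());   /* cmpltps: lanes with v < 0 */
    __m128 add  = _mm_and_ps(mask, _mm_set1_ps(1.0f)); /* andps: 1.0f in those lanes */
    return _mm_add_ps(v, add);                         /* addps */
}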
2011 Dec 05
0
[LLVMdev] RFC: Machine Instruction Bundle
...) R3 = memw(R4)
}
Constraining spill code insertion
=================================
It is important to note that bundling instructions doesn't constrain the register allocation problem.
For example, this bundle would be impossible with sequential value constraints:
{
call foo
%vr0 = addps %vr1, %vr2
call bar
}
The calls clobber the xmm registers, so it is impossible to register allocate this code without breaking up the bundle and inserting spill code between the calls.
With our definition of bundle value semantics, the addps is reading %vr1 and %vr2 outside the bundle, and the...
2010 Jun 18
1
[LLVMdev] argpromotion not working
Hi all,
I have the following C code.
static int addp(int *c, int a,int b)
{
int x = *c + a + b;
return(x);
}
I want to replace *c with a scalar. So I tried the -argpromotion pass.
However, it fails to do anything to the resulting llvm file.
List of commands:
clang add.c -c -o add.bc
clang add.c -S -o add.ll
opt -argpromotion -stats add.bc -o add_a.bc
llvm-dis < add_a.bc > add_a.ll
Also,
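For reference, what argument promotion is expected to do here, sketched at the
C level (the pass itself rewrites the IR, and it is only applicable because addp
merely reads through the pointer): the int* parameter is replaced by its loaded
value, passed at each call site.

static int addp_promoted(int c_val, int a, int b)   /* *c replaced by its value */
{
    int x = c_val + a + b;
    return x;
}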
2013 Feb 26
0
[LLVMdev] passing vector of booleans to functions
...sked_add_1(<4 x i1> %mask, <4 x float> %a, <4 x float>
%b) {
> entry:
> %add = fadd <4 x float> %a, %b
> %sel = select <4 x i1> %mask, <4 x float> %add, <4 x float> %a
> ret <4 x float> %sel
> }
>
> I will get:
>
> addps %xmm1, %xmm2
> pslld $31, %xmm0
> blendvps %xmm2, %xmm1
> movaps %xmm1, %xmm0
> ret
>
> While this is correct and works, I'm unhappy with the pssld. Apparently,
> LLVM uses a <4 x i32> to hold the <4 x i1> while the LSB holds the mask
> bit. But blend...
2013 Apr 15
1
[LLVMdev] State of Loop Unrolling and Vectorization in LLVM
...and clang, I am not able to see the loop being unrolled / vectorized. The microbenchmark, which runs the function g() over a billion times, shows quite a performance difference between gcc and clang:
Gcc - 8.6 seconds
Clang - 12.7 seconds
Evidently, the addition operation can be vectorized to use addps (clang emits addss), and the loop can be unrolled for better performance. Any idea why this is happening?
Thanks
Sriram
--
Sriram Murali
SSG/DPD/ECDL/DMP
+1 (519) 772 - 2579
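g() itself is not included in this excerpt; a hypothetical reduction loop of the
kind the addss-vs-addps observation suggests (name and body are assumptions, not
the original benchmark):

float g(const float *x, int n)
{
    float s = 0.0f;
    /* A floating-point reduction: without permission to reassociate
     * (-ffast-math or similar), clang keeps this as a serial addss chain,
     * whereas a reassociating build can unroll and vectorize it into addps. */
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}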
2013 Aug 20
0
[LLVMdev] Duplicate loading of double constants
...one. Two xorps come from 2 FsFLD0SD generated by
> instruction selection and never eliminated by machine passes. My guess
> would be that machine CSE should take care of it.
>
> A variation of this case without indirection shows the same problem, as
> well as
> not commuting addps, resulting in an extra movps:
>
> $ cat t.c
> double f(double p, int n)
> {
> double s = 0;
> if (n)
> s += p;
> return s;
> }
> $ clang -S -O3 t.c -o -
> ...
> f: # @f
> .cfi_startproc
> # BB...
2012 Jul 06
0
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
...s from a
> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>
> [...]
> movaps 32(%rdi), %xmm3
> movaps 48(%rdi), %xmm2
> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
> addps %xmm0, %xmm1
> movaps %xmm1, -16(%rbp) ## 16-byte Spill
> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
> [...]
>
> xmm3 loaded, duplicated into 2 registers, and then discarded as other
> data is loaded into it. Can anyone shed some light on w...