search for: vmul

Displaying 20 results from an estimated 59 matches for "vmul".

2013 Dec 19
4
[LLVMdev] LLVM ARM VMLA instruction
...idn't do so on a > whim. > > > The performance gain with vmla instruction is huge. > > Is it, on Cortex-A8? The TRM refers to them jumping across pipelines > in odd ways, and that was a very primitive core so it's almost > certainly not going to be just as good as a vmul (in fact if I'm > reading correctly, it takes pretty much exactly the same time as > separate vmul and vadd instructions, 10 cycles vs 2 * 5). > It may seem that total number of cycles are more or less same for single vmla and vmul+vadd. However, when vmul+vadd combination is used ins...
2013 Mar 19
4
[LLVMdev] ARM NEON VMUL.f32 issue
Hi folks, I just "fixed" a bug on ARM LNT regarding lowering of a VMUL.f32 as NEON and not VFP. The former is not IEEE 754 compliant, while the latter is, and that was failing TSVC. The question is: * is this a problem with the test, that shouldn't be expecting values below FLT_MIN, or * is it a bug in the lowering, that should only be lowering to NEON's VM...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote: > It may seem that total number of cycles are more or less same for single > vmla and vmul+vadd. However, when vmul+vadd combination is used instead of > vmla, then intermediate results will be generated which needs to be stored > in memory for future access. This will lead to lot of load/store ops being > inserted which degrade performance. Correct me if i am wrong on this, but...
2013 Mar 19
0
[LLVMdev] ARM NEON VMUL.f32 issue
...ou mind making this change depend on platform? Darwin should continue to use NEON by default for these operations. -Jim On Mar 19, 2013, at 11:17 AM, Renato Golin <renato.golin at linaro.org> wrote: > Hi folks, > > I just "fixed" a bug on ARM LNT regarding lowering of a VMUL.f32 as NEON and not VFP. The former is not IEEE 754 compliant, while the latter is, and that was failing TSVC. > > The question is: > * is this a problem with the test, that shouldn't be expecting values below FLT_MIN, or > * is it a bug in the lowering, that should only be lower...
2013 Mar 20
0
[LLVMdev] ARM NEON VMUL.f32 issue
Hi, | The question is: | * is this a problem with the test, that shouldn't be expecting values below FLT_MIN, or | * is it a bug in the lowering, that should only be lowering to NEON's VMUL when unsafe-math is on, or | * neither, and people should disable that when they want correctness? Note that if you go for the second option, IMO unsafe-math is _far_ too "aggressive" an option to control whether multiplies should be allowed to produce denormals. I can imagine plenty o...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
...ersion of the ARM architecture reference manual (v7 A & R) lists versions requiring NEON and versions requiring VFP. (Section A8.8.337). Split in just the way you'd expect (SIMD variants need NEON). > It may seem that total number of cycles are more or less same for single vmla > and vmul+vadd. However, when vmul+vadd combination is used instead of vmla, > then intermediate results will be generated which needs to be stored in memory > for future access. Well, it increases register pressure slightly I suppose, but there's no need to store anything to memory unless that ge...
2013 Dec 19
1
[LLVMdev] LLVM ARM VMLA instruction
...> versions requiring NEON and versions requiring VFP. (Section > A8.8.337). Split in just the way you'd expect (SIMD variants need > NEON). > I will check on this part. > > > It may seem that total number of cycles are more or less same for single > vmla > > and vmul+vadd. However, when vmul+vadd combination is used instead of > vmla, > > then intermediate results will be generated which needs to be stored in > memory > > for future access. > > Well, it increases register pressure slightly I suppose, but there's > no need to store...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
...who added it in the first place didn't do so on a whim. > The performance gain with vmla instruction is huge. Is it, on Cortex-A8? The TRM refers to them jumping across pipelines in odd ways, and that was a very primitive core so it's almost certainly not going to be just as good as a vmul (in fact if I'm reading correctly, it takes pretty much exactly the same time as separate vmul and vadd instructions, 10 cycles vs 2 * 5). Cheers. Tim.
2013 Dec 19
2
[LLVMdev] LLVM ARM VMLA instruction
On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org>wrote: > On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote: > >> It may seem that total number of cycles are more or less same for single >> vmla and vmul+vadd. However, when vmul+vadd combination is used instead of >> vmla, then intermediate results will be generated which needs to be stored >> in memory for future access. This will lead to lot of load/store ops being >> inserted which degrade performance. Correct me if i am wrong...
2013 Jun 07
0
[LLVMdev] NEON vector instructions and the fast math IR flags
...ld not be used ; to implement this function as it does not comply to the full precision ; requirements (NEON rounds e.g. denormals to zero which reduces precision) define <4 x float> @fooP(<4 x float> %A, <4 x float> %B) { %C = fmul <4 x float> %A, %B ; CHECK: fooP ; CHECK: vmul.f32 s ; CHECK: vmul.f32 s ; CHECK: vmul.f32 s ; CHECK: vmul.f32 s ret <4 x float> %C } ; fooR() performs a vector floating point multiplication with relaxed precision ; requirements. In this case the precision loss introduced by neon is acceptable ; and we should generate NEON instructions...
2013 Oct 15
0
[LLVMdev] MI scheduler produce badly code with inline function
On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote: > Hi all, > I met this problem when compiling the STREAM benchmark (http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched > > The small function will be scheduled as good code, but if opt inlines this function, the inlined part will be scheduled as bad code. A bug report for this is welcome. Pretty soon, I'll
2011 Sep 01
0
[PATCH 5/5] resample: Add NEON optimized inner_product_single for floating point
...+ uint32_t remainder = len % 16; + len = len - remainder; + + asm volatile (" cmp %[len], #0\n" + " bne 1f\n" + " vld1.32 {q4}, [%[b]]!\n" + " vld1.32 {q8}, [%[a]]!\n" + " subs %[remainder], %[remainder], #4\n" + " vmul.f32 q0, q4, q8\n" + " bne 4f\n" + " b 5f\n" + "1:" + " vld1.32 {q4, q5}, [%[b]]!\n" + " vld1.32 {q8, q9}, [%[a]]!\n" + " vld1.32 {q6, q7}, [%[b]]!\n" + " vld1.32 {q10, q11}, [%[a]]!\n" + &q...
2013 Jun 07
3
[LLVMdev] NEON vector instructions and the fast math IR flags
On 7 June 2013 07:05, Owen Anderson <resistor at mac.com> wrote: > Darwin uses NEON for floating point, but does *not* (and should not) > globally enable fast math flags. Use of NEON for FP needs to remain > achievable without globally setting the fast math flags. Fast math may > reasonably imply NEON, but the opposite direction is not accurate. > > That said, I
2013 Oct 14
2
[LLVMdev] MI scheduler produce badly code with inline function
Hi all, I met this problem when compiling the STREAM benchmark ( http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched The small function will be scheduled as good code, but if opt inlines this function, the inlined part will be scheduled as bad code. So I rewrote a simple test case, attached (foo.c), and compiled it with two different methods: *method A:* *$clang -O3 foo.c -static -S
2013 Oct 21
1
[LLVMdev] MI scheduler produce badly code with inline function
Hi Andy, I'm working on defining new machine model for my target, But I don't understand how to define the in-order machine (reservation tables) in new model. For example, if target has IF ID EX WB stages should I do: let BufferSize=0 in { def IF: ProcResource<1>; def ID: ProcResource<1>; def EX: ProcResource<1>; def WB: ProcResource<1>; } def :
2013 Dec 19
3
[LLVMdev] LLVM ARM VMLA instruction
Hi all, Thanks for the info. Few observations from my side : LLVM : cortex-a8 vfpv3 : no vmla or vfma instruction emitted cortex-a8 vfpv4 : no vmla or vfma instruction emitted (This is invalid though as cortex-a8 does not have vfpv4) cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted ( this seems a bug to me!! If cortex-a8 doesn't come with vfpv4 then vfma instructions
2013 Oct 16
3
[LLVMdev] MI scheduler produce badly code with inline function
...r-operand cost model : Scale: push {lr} movw r12, :lower16:c movw lr, :lower16:b movw r3, #9216 movt r12, :upper16:c mov r1, #0 vmov.f64 d16, #3.000000e+00 movt lr, :upper16:b movt r3, #244 .LBB0_1: add r0, r12, r1 add r2, lr, r1 vldr d17, [r0] add r1, r1, #32 vmul.f64 d17, d17, d16 cmp r1, r3 vstr d17, [r2] vldr d17, [r0, #8] vmul.f64 d17, d17, d16 vstr d17, [r2, #8] vldr d17, [r0, #16] vmul.f64 d17, d17, d16 vstr d17, [r2, #16] vldr d17, [r0, #24] vmul.f64 d17, d17, d16 vstr d17, [r2, #24] bne .LBB0_1 pop {lr}...
2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...a.data[i] * b.data[i]; return result; } void TestVec4Multiply(vec4& a, vec4& b, vec4& result) { result = a * b; } With -O3 the loop gets vectorized and the code generated looks optimal: __Z16TestVec4MultiplyR4vec4S0_S0_: @ BB#0: vld1.32 {d16, d17}, [r1] vld1.32 {d18, d19}, [r0] vmul.f32 q8, q9, q8 vst1.32 {d16, d17}, [r2] bx lr However if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer figured out optimal code in this case anyway, but that wasn't true for my real situation) then the temporary "result" seems to be kept in the...
2013 Dec 19
3
[LLVMdev] LLVM ARM VMLA instruction
...llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/ray.cpp 40 llvm/projects/test-suite/SingleSource/Benchmarks/Misc/ffbench.c 8 llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c 18 llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c 36 With vmul+vadd instruction pair comes extra overhead of load/store ops, as seen in assembly generated. With -mcpu=cortex-a15 option clang performs better, as it emits vmla instructions. > > This was tested on real hardware. Time taken for a 4x4 matrix >> multiplication: >> > > What...
2013 Oct 14
0
[LLVMdev] Vectorization of pointer PHI nodes
...read[i+1]* 4.0; float a2_2 = *read[i+3+1]* 4.0; … float a3 = *read[i+2] * 5.0; float a3_2 = *read[i+3+2] * 5.0; write[i] = a1; write[i+3] = a1_2; … write[i+1] = a2; write[i+1+3] = a2_2; ... } VLD3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [read+i] a1..a1_4 = VMUL a1..a1_4, #3.0 a2..a2_4 = VMUL a2..a2_4, #4.0 a3..a3_4 = VMUL a3..a3_4, #5.0 VST3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [read+i] On Oct 14, 2013, at 12:15 PM, Nadav Rotem <nrotem at apple.com> wrote: > This is almost ideal for SLP vectorization, except for two problems: > > 1. We...