thr3ads.net - search: "vmulq

Displaying 6 results from an estimated 6 matches for "vmulq_f32".

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

...CE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#include "../kiss_fft.h"
+#include <arm_neon.h>
+
+#define C_MUL_NEON(m, a, b, t, ones, tv) \
+    do{ \
+        t = vrev64q_f32(b); \
+        m = vmulq_f32(a, b); \
+        m = vmulq_f32(m, ones); \
+        t = vmulq_f32(a, t); \
+        tv = vtrnq_f32(m, t); \
+        m = vaddq_f32(tv.val[0], tv.val[1]); \
+    }while(0)
+
+#define ONES_MINUS_ONE 0xbf8000003f800000 //{-1.0, 1.0}
+#define MINUS_ONE 0xbf800000bf800000 // {-1.0, -1.0}
+
+...
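As a rough illustration of what the C_MUL_NEON macro in the snippet above appears to compute, the following is a scalar C model of its lane operations. It is a sketch under stated assumptions, not the patch's code: the `f32x4` type and the helper names (`rev64`, `mul`, `trn_add`, `c_mul_model`, `c_mul_model_selfcheck`) are hypothetical, each vector is assumed to hold two interleaved complex numbers as {re, im, re, im}, and `ones` is assumed to hold {1.0, -1.0} per complex pair, since the snippet does not show how it is loaded. Under those assumptions the sequence reduces to an ordinary complex multiply.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for float32x4_t: four float lanes. */
typedef struct { float v[4]; } f32x4;

/* Models vrev64q_f32: swap the two 32-bit lanes inside each 64-bit half. */
static f32x4 rev64(f32x4 x) {
    f32x4 r = {{ x.v[1], x.v[0], x.v[3], x.v[2] }};
    return r;
}

/* Models vmulq_f32: lane-wise multiply. */
static f32x4 mul(f32x4 x, f32x4 y) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = x.v[i] * y.v[i];
    return r;
}

/* Models vtrnq_f32(m, t) followed by vaddq_f32(tv.val[0], tv.val[1]).
   The transpose yields val[0] = {m0, t0, m2, t2} and val[1] = {m1, t1, m3, t3},
   so the sum is {m0+m1, t0+t1, m2+m3, t2+t3}. */
static f32x4 trn_add(f32x4 m, f32x4 t) {
    f32x4 r = {{ m.v[0] + m.v[1], t.v[0] + t.v[1],
                 m.v[2] + m.v[3], t.v[2] + t.v[3] }};
    return r;
}

/* Scalar model of the macro body; a and b each hold two complex numbers. */
f32x4 c_mul_model(f32x4 a, f32x4 b) {
    /* Assumption: sign mask {1, -1} per pair (not shown in the snippet). */
    const f32x4 ones = {{ 1.0f, -1.0f, 1.0f, -1.0f }};
    f32x4 t = rev64(b);   /* {b.im, b.re, ...}                  */
    f32x4 m = mul(a, b);  /* {a.re*b.re, a.im*b.im, ...}        */
    m = mul(m, ones);     /* {a.re*b.re, -a.im*b.im, ...}       */
    t = mul(a, t);        /* {a.re*b.im, a.im*b.re, ...}        */
    return trn_add(m, t); /* {re(a*b), im(a*b), ...}            */
}

/* Usage: (1+2i)*(5+6i) = -7+16i and (3+4i)*(7+8i) = -11+52i. */
void c_mul_model_selfcheck(void) {
    f32x4 a = {{ 1.0f, 2.0f, 3.0f, 4.0f }};
    f32x4 b = {{ 5.0f, 6.0f, 7.0f, 8.0f }};
    f32x4 m = c_mul_model(a, b);
    assert(m.v[0] == -7.0f && m.v[1] == 16.0f);
    assert(m.v[2] == -11.0f && m.v[3] == 52.0f);
}
```

The model trades the single 128-bit register for a plain struct so it can be compiled anywhere; on an ARM target the same steps would map one-to-one onto the intrinsics named in the macro.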

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 07

...way, but that wasn't true for my real situation) then the temporary "result" seems to be kept in the generated code for the test function, and triggers the bad penalty of a load after a NEON store.
    vec4 operator* (vec4& a, vec4& b)
    {
        vec4 result;
        float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
        vst1q_f32(result.data, result_data);
        return result;
    }
    __Z16TestVec4MultiplyR4vec4S0_S0_:
    @ BB#0:
        sub sp, #16
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        mov r0, sp
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r0]
        vld1.32 {d16, d17}, [r0]
        vst1.32...

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 08

...intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code?
>
> If I had to guess, I'd say the intrinsic got in the way of recognising
> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
> vld1q_f32 is still kept as an intrinsic, so register allocators and
> schedulers get confused and, when lowering to assembly, you're left
> with garbage around it.
>
> Creating a bug for this is probably the best thing to do...

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

Hello, This patch introduces ARM NEON intrinsics to optimize the kf_bfly4 routine in the celt part of libopus. Using the NEON-optimized kf_bfly4(_neon) routine improved the performance of the opus_fft_impl function by about 21.4%. The end use case, decoding a music Opus Ogg file, saw a performance improvement of about 4.47%. This patch has 2 components i. Actual neon code to improve

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 10

...s the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code?
>>>
>>> If I had to guess, I'd say the intrinsic got in the way of recognising
>>> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
>>> vld1q_f32 is still kept as an intrinsic, so register allocators and
>>> schedulers get confused and, when lowering to assembly, you're left
>>> with garbage around it.
>
> FWIW, with top of tree clang, I...

[RFC PATCH v1 0/3] Introducing ARM SIMD Support

2014 Sep 10

libvorbis does not currently have any SIMD/vectorization. The following patches add a generic framework for SIMD/vectorization and, on top of it, ARM NEON vectorization using intrinsics. I was able to get over 34% performance improvement on my Beaglebone Black, which has a single Cortex-A8 CPU. You can find more information on the metrics and the procedure I used to measure at
