thr3ads.net - search: "vcombine

Displaying 5 results from an estimated 5 matches for "vcombine_f32".

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

...);
+ /* scratch_2[1] *= (1, -1) */
+ scratch_2[1] = vmul_f32(scratch_2[1], ones_2);
+ Fout_2[1] = vadd_f32(scratch_2[0], scratch_2[1]);
+
+ /* scratch_2[1] *= (-1, -1) */
+ scratch_2[1] = vmul_f32(scratch_2[1], minusones_2);
+ Fout_2[3] = vadd_f32(scratch_2[0], scratch_2[1]);
+
+ Fout_4[0] = vcombine_f32(Fout_2[0], Fout_2[1]);
+ Fout_4[1] = vcombine_f32(Fout_2[2], Fout_2[3]);
+
+ vst1q_f32(bi, Fout_4[0]);
+ bi += 4;
+ vst1q_f32(bi, Fout_4[1]);
+ bi += 4;
+ }
+}
+
+static void kf_bfly4_neon_m8(kiss_fft_cpx * Fout,
+ const size_t fstride,
+ const kiss_fft_...
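For readers without an ARM toolchain at hand, the combine-then-store step at the end of the excerpt can be modelled in portable C. This is only a sketch: the types `f32x2_model` and `f32x4_model` and the functions `combine_model` and `store4_model` are hypothetical stand-ins, not part of the patch or of `arm_neon.h`. They mimic the documented lane order of `vcombine_f32` (first argument fills the low two lanes, second argument the high two) and the four-float store of `vst1q_f32`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for NEON's float32x2_t and float32x4_t. */
typedef struct { float lane[2]; } f32x2_model;
typedef struct { float lane[4]; } f32x4_model;

/* Models vcombine_f32(low, high):
 * low fills lanes 0..1, high fills lanes 2..3. */
static f32x4_model combine_model(f32x2_model low, f32x2_model high)
{
    f32x4_model out = {{ low.lane[0], low.lane[1],
                         high.lane[0], high.lane[1] }};
    return out;
}

/* Models vst1q_f32(dst, v): writes four consecutive floats to dst. */
static void store4_model(float *dst, f32x4_model v)
{
    assert(dst != NULL);
    memcpy(dst, v.lane, sizeof v.lane);
}
```

Under this model, combining (1, 2) with (3, 4) and storing the result writes 1, 2, 3, 4 to memory in that order, which is why the excerpt can advance `bi` by 4 after each store.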

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

Hello, This patch introduces ARM NEON intrinsics to optimize the kf_bfly4 routine in the celt part of libopus. Using the NEON-optimized kf_bfly4 (_neon) routine improved the performance of the opus_fft_impl function by about 21.4%. The end use case was decoding a music Opus Ogg file, which saw a performance improvement of about 4.47%. This patch has 2 components i. Actual neon code to improve

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

...d is much simpler to read than the celt_pitch_xcorr_arm.s below. So, I request to leave it simple to read for now.

float32x2_t YY_2;
while (len > 0) {
  switch(len) {
  case 4:
  case 3:
    XX_2 = vld1_f32(xi);
    xi += 2;
    YY_2 = vld1_f32(yi+4);
    YY[1] = vcombine_f32(YY_2, YY_2);
    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
    YEXT[0] = vextq_f32(YY[0], YY[1], 1);
    SUMM = vmlaq_lane_f32(SUMM, YEXT[0], XX_2, 1);
    YY[0] = vcombine_f32(vget_high_f32(YY[0]), YY_2);
    len -= 2;
    break;
  case 2:
    XX_2 = vld1_f...
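As far as the truncated excerpt shows, each pass multiplies a window of four `y` values by one `x` sample and adds the result to a four-lane accumulator, then slides the window by one element for the next `x` sample. A scalar sketch of that arithmetic may make the intent easier to follow. The function name `xcorr4_scalar` and its signature are hypothetical and are not taken from the patch; the sketch also assumes `y` has at least `len + 3` readable elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference, not code from the patch.
 * Accumulates four adjacent correlation lags:
 *   sum[k] += x[j] * y[j + k]   for k = 0..3 and j = 0..len-1
 * Assumes y holds at least len + 3 readable floats. */
static void xcorr4_scalar(const float *x, const float *y,
                          float sum[4], size_t len)
{
    assert(x != NULL && y != NULL && sum != NULL);
    for (size_t j = 0; j < len; j++) {
        /* One x sample scales a 4-wide window of y starting at j. */
        for (size_t k = 0; k < 4; k++) {
            sum[k] += x[j] * y[j + k];
        }
    }
}
```

In the excerpt, the inner loop over `k` corresponds to a single `vmlaq_lane_f32` call, and the window shift corresponds to `vextq_f32(YY[0], YY[1], 1)`.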

[RFC PATCH v1 0/3] Introducing ARM SIMD Support

2014 Sep 10

libvorbis does not currently have any SIMD/vectorization. The following patches add a generic framework for SIMD/vectorization and, on top of it, ARM NEON vectorization using intrinsics. I was able to get over 34% performance improvement on my BeagleBone Black, which is based on a single Cortex-A8 CPU. You can find more information on the metrics and the procedure I used to measure at

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

Hi,

Optimizes celt_pitch_xcorr for ARM NEON floating point.

Changes from RFCv3:
- celt_neon_intr.c
- removed warnings due to not having constant pointers
- Put simpler loop to take care of corner cases. Unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s
- Makefile.am Removed explicit -O3 optimization
- test_unit_mathops.c,