Displaying 5 results from an estimated 5 matches for "vcombine_f32".
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...);
+ /* scratch_2[1] *= (1, -1) */
+ scratch_2[1] = vmul_f32(scratch_2[1], ones_2);
+ Fout_2[1] = vadd_f32(scratch_2[0], scratch_2[1]);
+
+ /* scratch_2[1] *= (-1, -1) */
+ scratch_2[1] = vmul_f32(scratch_2[1], minusones_2);
+ Fout_2[3] = vadd_f32(scratch_2[0], scratch_2[1]);
+
+ Fout_4[0] = vcombine_f32(Fout_2[0], Fout_2[1]);
+ Fout_4[1] = vcombine_f32(Fout_2[2], Fout_2[3]);
+
+ vst1q_f32(bi, Fout_4[0]);
+ bi += 4;
+ vst1q_f32(bi, Fout_4[1]);
+ bi += 4;
+ }
+}
+
+static void kf_bfly4_neon_m8(kiss_fft_cpx * Fout,
+ const size_t fstride,
+ const kiss_fft_...
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello,
This patch introduces ARM NEON intrinsics to optimize the
kf_bfly4 routine in the CELT part of libopus.
Using the NEON-optimized kf_bfly4(_neon) routine improved the
performance of the opus_fft_impl function by about 21.4%. The
end use case, decoding a music Opus Ogg file, saw a
performance improvement of about 4.47%.
This patch has 2 components
i. Actual neon code to improve
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...d is much simpler to read than below
celt_pitch_xcorr_arm.s. So, I request that it be left simple to read for now.
float32x2_t YY_2;
while (len > 0) {
switch(len) {
case 4:
case 3:
XX_2 = vld1_f32(xi);
xi += 2;
YY_2 = vld1_f32(yi+4);
YY[1] = vcombine_f32(YY_2, YY_2);
SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
YEXT[0] = vextq_f32(YY[0], YY[1], 1);
SUMM = vmlaq_lane_f32(SUMM, YEXT[0], XX_2, 1);
YY[0] = vcombine_f32(vget_high_f32(YY[0]), YY_2);
len -= 2;
break;
case 2:
XX_2 = vld1_f...
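The sliding-window pattern in the excerpt above (multiply-accumulate against YY[0], then against a one-lane-shifted window built with vextq_f32, then advance the window by two samples) can be modelled in scalar C. This is a sketch under stated assumptions, not the patch's loop: f32x4_model, ext_model, mla_model and xcorr4_model are hypothetical names, len is assumed to be even, and y is assumed to have at least len + 4 readable floats.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for float32x4_t. */
typedef struct { float lane[4]; } f32x4_model;

/* Models vextq_f32(a, b, n): lanes n..3 of a, then lanes 0..n-1 of b. */
static f32x4_model ext_model(f32x4_model a, f32x4_model b, int n)
{
    f32x4_model r;
    int i;
    assert(n >= 0 && n < 4);
    for (i = 0; i < 4; i++)
        r.lane[i] = (i + n < 4) ? a.lane[i + n] : b.lane[i + n - 4];
    return r;
}

/* Models a lane-wise multiply-accumulate by a scalar: acc + v * s. */
static f32x4_model mla_model(f32x4_model acc, f32x4_model v, float s)
{
    int i;
    for (i = 0; i < 4; i++)
        acc.lane[i] += v.lane[i] * s;
    return acc;
}

/* Computes sum[k] = sum over i of x[i] * y[i + k] for k = 0..3,
 * consuming two x samples per iteration.
 * Assumption: len is even and y has at least len + 4 readable floats. */
static void xcorr4_model(const float *x, const float *y, int len, float sum[4])
{
    f32x4_model acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    f32x4_model y0, y1;
    int i, k;
    assert(len >= 0 && len % 2 == 0);
    for (k = 0; k < 4; k++)
        y0.lane[k] = y[k];
    for (i = 0; i < len; i += 2) {
        /* Next two y samples, duplicated like vcombine_f32(YY_2, YY_2). */
        y1.lane[0] = y1.lane[2] = y[i + 4];
        y1.lane[1] = y1.lane[3] = y[i + 5];
        acc = mla_model(acc, y0, x[i]);
        acc = mla_model(acc, ext_model(y0, y1, 1), x[i + 1]);
        /* Slide by two, like vcombine_f32(vget_high_f32(YY[0]), YY_2). */
        y0.lane[0] = y0.lane[2];
        y0.lane[1] = y0.lane[3];
        y0.lane[2] = y1.lane[0];
        y0.lane[3] = y1.lane[1];
    }
    for (k = 0; k < 4; k++)
        sum[k] = acc.lane[k];
}
```

Each 4-lane accumulator lane holds one lag of the cross-correlation, so the window only needs to advance along y while x is consumed one scalar at a time.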
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any SIMD/vectorization.
The following patches add a generic framework for SIMD/vectorization
and, on top of it, add ARM NEON SIMD vectorization using intrinsics.
I was able to get over 34% performance improvement on my
BeagleBone Black, which has a single Cortex-A8 based CPU.
You can find more information on metrics and procedure I used
to measure at
2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for ARM NEON floating point.
Changes from RFCv3:
- celt_neon_intr.c
- removed warnings due to not having constant pointers
- Added a simpler loop to take care of corner cases. Unrolling using
intrinsics did not map well to what was done
in celt_pitch_xcorr_arm.s
- Makefile.am
Removed explicit -O3 optimization
- test_unit_mathops.c,