Displaying 14 results from an estimated 14 matches for "vget_low_f32".
2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for ARM NEON floating point.
Changes from RFCv3:
- celt_neon_intr.c
- removed warnings due to not having constant pointers
- Used a simpler loop to handle corner cases. Unrolling with
intrinsics did not map well to what is done
in celt_pitch_xcorr_arm.s
- Makefile.am
Removed explicit -O3 optimization
- test_unit_mathops.c,
2014 Dec 07
2
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv1:
- Rebased on top of commit
aad281878: Fix celt_pitch_xcorr_c signature.
which got rid of ugly code around CELT_PITCH_XCORR_IMPL
passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more in-line with algorithm in
celt_pitch_xcorr_arm.s
Viswanath Puttagunta
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...sure len > 8 and not len >= 8
+ */
+ while (len > 8) {
+ yi += 4;
+ YY[1] = vld1q_f32(yi);
+ yi += 4;
+ YY[2] = vld1q_f32(yi);
+
+ XX[0] = vld1q_f32(xi);
+ xi += 4;
+ XX[1] = vld1q_f32(xi);
+ xi += 4;
+
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0);
+ YEXT[0] = vextq_f32(YY[0], YY[1], 1);
+ SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1);
+ YEXT[1] = vextq_f32(YY[0], YY[1], 2);
+ SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0);
+ YEXT[2] = vextq_f32(YY[0], YY[1], 3);
+ SUMM =...
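For readers following the truncated snippet above: each vmlaq_lane_f32 call multiplies all four lanes of a y vector by a single x sample and accumulates into SUMM, and each vextq_f32 call shifts the y window along by one sample. Lane k of SUMM therefore appears to accumulate the correlation at lag k. The portable scalar sketch below shows that arithmetic only. The name xcorr4_ref and its signature are hypothetical and are not part of the patch, and it assumes y holds at least len + 3 samples.

```c
#include <stddef.h>

/* Hypothetical scalar reference (not from the patch): accumulates four
 * adjacent correlation lags at once, one per "lane" of sum[].
 * Assumes y points to at least len + 3 readable floats. */
static void xcorr4_ref(const float *x, const float *y, float sum[4], size_t len)
{
    size_t j, k;

    for (k = 0; k < 4; k++)
        sum[k] = 0.0f;

    for (j = 0; j < len; j++) {
        /* One x sample is broadcast against four consecutive y samples,
         * mirroring one vmlaq_lane_f32 step on a shifted y window. */
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
    }
}
```

A scalar version like this can serve as a check against an intrinsics version, although the two may differ in the last bits because the order of floating-point additions is not the same.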
2014 Dec 07
0
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...sure len > 8 and not len >= 8
+ */
+ while (len > 8) {
+ yi += 4;
+ YY[1] = vld1q_f32(yi);
+ yi += 4;
+ YY[2] = vld1q_f32(yi);
+
+ XX[0] = vld1q_f32(xi);
+ xi += 4;
+ XX[1] = vld1q_f32(xi);
+ xi += 4;
+
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0);
+ YEXT[0] = vextq_f32(YY[0], YY[1], 1);
+ SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1);
+ YEXT[1] = vextq_f32(YY[0], YY[1], 2);
+ SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0);
+ YEXT[2] = vextq_f32(YY[0], YY[1], 3);
+ SUMM =...
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...sure len > 8 and not len >= 8
+ */
+ while (len > 8) {
+ yi += 4;
+ YY[1] = vld1q_f32(yi);
+ yi += 4;
+ YY[2] = vld1q_f32(yi);
+
+ XX[0] = vld1q_f32(xi);
+ xi += 4;
+ XX[1] = vld1q_f32(xi);
+ xi += 4;
+
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0);
+ YEXT[0] = vextq_f32(YY[0], YY[1], 1);
+ SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1);
+ YEXT[1] = vextq_f32(YY[0], YY[1], 2);
+ SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0);
+ YEXT[2] = vextq_f32(YY[0], YY[1], 3);
+ SUMM =...
2014 Dec 10
2
[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv2:
- Applied the changes recommended by Timothy for celt_neon_intr.c,
except that the unrolled loop is left unrolled
- configure.ac
- use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
- Moved compile flags into Makefile.am
- OPUS_ARM_NEON_INR --> typo --> OPUS_ARM_NEON_INTR
Viswanath Puttagunta (1):
armv7:
2014 Dec 07
3
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv1:
- Rebased on top of commit
aad281878: Fix celt_pitch_xcorr_c signature.
which got rid of ugly code around CELT_PITCH_XCORR_IMPL
passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...(len > 8) {
> + yi += 4;
> + YY[1] = vld1q_f32(yi);
> + yi += 4;
> + YY[2] = vld1q_f32(yi);
> +
> + XX[0] = vld1q_f32(xi);
> + xi += 4;
> + XX[1] = vld1q_f32(xi);
> + xi += 4;
> +
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0);
> + YEXT[0] = vextq_f32(YY[0], YY[1], 1);
> + SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1);
> + YEXT[1] = vextq_f32(YY[0], YY[1], 2);
> + SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0);
> + YEXT[2] = vextq_f32(YY[0],...
2014 Dec 09
1
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ycle */
s/cycle/iteration/ (here and below). In performance-critical code when
you say cycle I think machine cycle, and NEON definitely can't process
16 floats in one of those.
> + while (len >= 16) {
> + /* Accumulate results into single float */
> + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> + tv = vtrn_f32(tv.val[0], ZERO);
> + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
> +
> + vst1_lane_f32(&sumi, tv.val[0], 0);
Accessing tv.val[0] and tv.val[1] directly seems to send these values
through the stack, e.g.,
f4: f3ba7085...
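The quoted code reduces the four lanes of SUMM to one float by adding the low and high halves and then combining the remaining pair. A minimal sketch of one alternative reduction follows. It uses vpadd_f32 for the final pair, which is an assumption for illustration and not necessarily what the thread settled on. The helper name hsum4 is hypothetical, and a scalar fallback with the same addition order is included so the block also builds on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the patch): sums the four floats in v[]. */
static float hsum4(const float v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t q = vld1q_f32(v);
    /* (v0 + v2, v1 + v3) */
    float32x2_t d = vadd_f32(vget_low_f32(q), vget_high_f32(q));
    /* Pairwise add folds the two remaining lanes into lane 0. */
    d = vpadd_f32(d, d);
    return vget_lane_f32(d, 0);
#else
    /* Scalar fallback using the same addition order as the NEON path. */
    return (v[0] + v[2]) + (v[1] + v[3]);
#endif
}
```

Whether this avoids the stack traffic the reviewer describes depends on the compiler and its version, so the generated assembly would need to be inspected.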
2017 Mar 23
0
[PATCH] Use NEON intrinsics detection that fails with gcc 4.8.
...45b9a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -471,7 +471,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
]],
[[
static float32x4_t A0, A1, SUMM;
- SUMM = vmlaq_f32(SUMM, A0, A1);
+ SUMM = vmlaq_lane_f32(SUMM, A0, vget_low_f32(A1), 0);
return (int)vgetq_lane_f32(SUMM, 0);
]]
)
--
2.9.3
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...* In theory, one could use Q regs instead of
+ * D regs, but you need to consider case when N is odd
+ * One can do that if it justifies performance improvement
+ */
+
+ for (i = 0; i < N; i++) {
+ Fout_4[0] = vld1q_f32(ai);
+ ai += 4;
+ Fout_4[1] = vld1q_f32(ai);
+ ai += 4;
+ Fout_2[0] = vget_low_f32(Fout_4[0]);
+ Fout_2[1] = vget_high_f32(Fout_4[0]);
+ Fout_2[2] = vget_low_f32(Fout_4[1]);
+ Fout_2[3] = vget_high_f32(Fout_4[1]);
+
+ scratch_2[0] = vsub_f32(Fout_2[0], Fout_2[2]);
+ Fout_2[0] = vadd_f32(Fout_2[0], Fout_2[2]);
+ scratch_2[1] = vadd_f32(Fout_2[1], Fout_2[3]);
+ Fout_2[2] = v...
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
celt/bands.c | 7 ++++---
celt/bands.h | 2 +-
celt/celt_encoder.c | 6 +++---
celt/pitch.c | 2 +-
src/opus_multistream_encoder.c | 2 +-
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello,
This patch introduces ARM NEON Intrinsics to optimize
kf_bfly4 routine in celt part of libopus.
Using NEON optimized kf_bfly4(_neon) routine helped improve
performance of opus_fft_impl function by about 21.4%. The
end use case was decoding a music opus ogg file. The end
use case saw performance improvement of about 4.47%.
This patch has 2 components
i. Actual neon code to improve
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any simd/vectorization.
Following patches add generic framework for simd/vectorization
and on top, add ARM-NEON simd vectorization using intrinsics.
I was able to get over 34% performance improvement on my
Beaglebone Black, which has a single Cortex-A8 CPU.
You can find more information on metrics and procedure I used
to measure at