Displaying 12 results from an estimated 12 matches for "vadd_f32".
2014 Dec 09
1
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ues per cycle */
s/cycle/iteration/ (here and below). In performance-critical code when
you say cycle I think machine cycle, and NEON definitely can't process
16 floats in one of those.
> + while (len >= 16) {
> + /* Accumulate results into single float */
> + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> + tv = vtrn_f32(tv.val[0], ZERO);
> + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
> +
> + vst1_lane_f32(&sumi, tv.val[0], 0);
Accessing tv.val[0] and tv.val[1] directly seems to send these values
through the stack, e.g.,
f4:...
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...= vld1q_f32(ai);
+ ai += 4;
+ Fout_4[1] = vld1q_f32(ai);
+ ai += 4;
+ Fout_2[0] = vget_low_f32(Fout_4[0]);
+ Fout_2[1] = vget_high_f32(Fout_4[0]);
+ Fout_2[2] = vget_low_f32(Fout_4[1]);
+ Fout_2[3] = vget_high_f32(Fout_4[1]);
+
+ scratch_2[0] = vsub_f32(Fout_2[0], Fout_2[2]);
+ Fout_2[0] = vadd_f32(Fout_2[0], Fout_2[2]);
+ scratch_2[1] = vadd_f32(Fout_2[1], Fout_2[3]);
+ Fout_2[2] = vsub_f32(Fout_2[0], scratch_2[1]);
+ Fout_2[0] = vadd_f32(Fout_2[0], scratch_2[1]);
+ scratch_2[1] = vsub_f32(Fout_2[1], Fout_2[3]);
+
+ scratch_2[1] = vrev64_f32(scratch_2[1]);
+ /* scratch_2[1] *= (1, -1)...
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello,
This patch introduces ARM NEON Intrinsics to optimize
kf_bfly4 routine in celt part of libopus.
Using the NEON-optimized kf_bfly4(_neon) routine improved
performance of the opus_fft_impl function by about 21.4%. The
end use case, decoding a music Opus Ogg file, saw a
performance improvement of about 4.47%.
This patch has 2 components
i. Actual neon code to improve
2014 Dec 07
2
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv1:
- Rebased on top of commit
aad281878: Fix celt_pitch_xcorr_c signature.
which got rid of ugly code around CELT_PITCH_XCORR_IMPL
passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more in line with the algorithm in
celt_pitch_xcorr_arm.s
Viswanath Puttagunta
2014 Dec 07
0
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...-= 8;
+ }
+
+ /* Work on 4 values per cycle */
+ while (len >= 4) {
+ XX[0] = vld1q_f32(xi);
+ xi += 4;
+ YY[0] = vld1q_f32(yi);
+ yi += 4;
+ SUMM = vmlaq_f32(SUMM, YY[0], XX[0]);
+ len -= 4;
+ }
+ /* Accumulate results into single float */
+ tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
+ tv = vtrn_f32(tv.val[0], ZERO);
+ tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
+
+ vst1_lane_f32(&sumi, tv.val[0], 0);
+ for (i = 0; i < len; i++)
+ sumi += xi[i] * yi[i];
+ *sum = sumi;
+}
+
+void celt_pitch_xcorr_float_neon(const op...
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
celt/bands.c | 7 ++++---
celt/bands.h | 2 +-
celt/celt_encoder.c | 6 +++---
celt/pitch.c | 2 +-
src/opus_multistream_encoder.c | 2 +-
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
2014 Dec 07
3
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv1:
- Rebased on top of commit
aad281878: Fix celt_pitch_xcorr_c signature.
which got rid of ugly code around CELT_PITCH_XCORR_IMPL
passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more
2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for ARM NEON floating point.
Changes from RFCv3:
- celt_neon_intr.c
- removed warnings due to not having constant pointers
- Put simpler loop to take care of corner cases. Unrolling using
intrinsics was not really mapping well to what was done
in celt_pitch_xcorr_arm.s
- Makefile.am
Removed explicit -O3 optimization
- test_unit_mathops.c,
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...+ if (len >= 2) {
+ /* While at it, consume 2 more values if available */
+ XX_2 = vld1_f32(xi);
+ xi += 2;
+ YY_2 = vld1_f32(yi);
+ yi += 2;
+ SUMM_2[0] = vmla_f32(SUMM_2[0], YY_2, XX_2);
+ len -= 2;
+ }
+ SUMM_2[1] = vget_high_f32(SUMM);
+ SUMM_2[0] = vadd_f32(SUMM_2[0], SUMM_2[1]);
+ SUMM_2[0] = vpadd_f32(SUMM_2[0], SUMM_2[0]);
+ /* Ok, now we have result accumulated in SUMM_2[0].0 */
+
+ if (len > 0) {
+ /* Case when you have one value left */
+ XX_2 = vld1_dup_f32(xi);
+ YY_2 = vld1_dup_f32(yi);
+ SUMM_2[0] = vmla_f32(SUMM...
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...+ if (len >= 2) {
+ /* While at it, consume 2 more values if available */
+ XX_2 = vld1_f32(xi);
+ xi += 2;
+ YY_2 = vld1_f32(yi);
+ yi += 2;
+ SUMM_2[0] = vmla_f32(SUMM_2[0], YY_2, XX_2);
+ len -= 2;
+ }
+ SUMM_2[1] = vget_high_f32(SUMM);
+ SUMM_2[0] = vadd_f32(SUMM_2[0], SUMM_2[1]);
+ SUMM_2[0] = vpadd_f32(SUMM_2[0], SUMM_2[0]);
+ /* Ok, now we have result accumulated in SUMM_2[0].0 */
+
+ if (len > 0) {
+ /* Case when you have one value left */
+ XX_2 = vld1_dup_f32(xi);
+ YY_2 = vld1_dup_f32(yi);
+ SUMM_2[0] = vmla_f32(SUMM...
2014 Dec 10
2
[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv2:
- Changes recommended by Timothy for celt_neon_intr.c
everything except, left the unrolled loop still unrolled
- configure.ac
- use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
- Moved compile flags into Makefile.am
- OPUS_ARM_NEON_INR --> typo --> OPUS_ARM_NEON_INTR
Viswanath Puttagunta (1):
armv7:
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...consume 2 more values if available */
> + XX_2 = vld1_f32(xi);
> + xi += 2;
> + YY_2 = vld1_f32(yi);
> + yi += 2;
> + SUMM_2[0] = vmla_f32(SUMM_2[0], YY_2, XX_2);
> + len -= 2;
> + }
> + SUMM_2[1] = vget_high_f32(SUMM);
> + SUMM_2[0] = vadd_f32(SUMM_2[0], SUMM_2[1]);
> + SUMM_2[0] = vpadd_f32(SUMM_2[0], SUMM_2[0]);
> + /* Ok, now we have result accumulated in SUMM_2[0].0 */
> +
> + if (len > 0) {
> + /* Case when you have one value left */
> + XX_2 = vld1_dup_f32(xi);
> + YY_2 = vld1_dup_f32(yi...