thr3ads.net - search: "vget_low

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Hi, Optimizes celt_pitch_xcorr for ARM NEON floating point. Changes from RFCv3: - celt_neon_intr.c - removed warnings due to not having constant pointers - Put simpler loop to take care of corner cases. Unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s - Makefile.am Removed explicit -O3 optimization - test_unit_mathops.c,

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

2

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more in-line with algorithm in celt_pitch_xcorr_arm.s Viswanath Puttagunta

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

0

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...sure len > 8 and not len >= 8 + */ + while (len > 8) { + yi += 4; + YY[1] = vld1q_f32(yi); + yi += 4; + YY[2] = vld1q_f32(yi); + + XX[0] = vld1q_f32(xi); + xi += 4; + XX[1] = vld1q_f32(xi); + xi += 4; + + SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0); + YEXT[0] = vextq_f32(YY[0], YY[1], 1); + SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1); + YEXT[1] = vextq_f32(YY[0], YY[1], 2); + SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0); + YEXT[2] = vextq_f32(YY[0], YY[1], 3); + SUMM =...

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

0

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...sure len > 8 and not len >= 8 + */ + while (len > 8) { + yi += 4; + YY[1] = vld1q_f32(yi); + yi += 4; + YY[2] = vld1q_f32(yi); + + XX[0] = vld1q_f32(xi); + xi += 4; + XX[1] = vld1q_f32(xi); + xi += 4; + + SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0); + YEXT[0] = vextq_f32(YY[0], YY[1], 1); + SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1); + YEXT[1] = vextq_f32(YY[0], YY[1], 2); + SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0); + YEXT[2] = vextq_f32(YY[0], YY[1], 3); + SUMM =...

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

0

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...sure len > 8 and not len >= 8 + */ + while (len > 8) { + yi += 4; + YY[1] = vld1q_f32(yi); + yi += 4; + YY[2] = vld1q_f32(yi); + + XX[0] = vld1q_f32(xi); + xi += 4; + XX[1] = vld1q_f32(xi); + xi += 4; + + SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0); + YEXT[0] = vextq_f32(YY[0], YY[1], 1); + SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1); + YEXT[1] = vextq_f32(YY[0], YY[1], 2); + SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0); + YEXT[2] = vextq_f32(YY[0], YY[1], 3); + SUMM =...

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

2

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv2: - Changes recommended by Timothy for celt_neon_intr.c everything except, left the unrolled loop still unrolled - configure.ac - use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE - Moved compile flags into Makefile.am - OPUS_ARM_NEON_INR --> typo --> OPUS_ARM_NEON_INTR Viswanath Puttagunta (1): armv7:

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

3

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...(len > 8) { > + yi += 4; > + YY[1] = vld1q_f32(yi); > + yi += 4; > + YY[2] = vld1q_f32(yi); > + > + XX[0] = vld1q_f32(xi); > + xi += 4; > + XX[1] = vld1q_f32(xi); > + xi += 4; > + > + SUMM = vmlaq_lane_f32(SUMM, YY[0], vget_low_f32(XX[0]), 0); > + YEXT[0] = vextq_f32(YY[0], YY[1], 1); > + SUMM = vmlaq_lane_f32(SUMM, YEXT[0], vget_low_f32(XX[0]), 1); > + YEXT[1] = vextq_f32(YY[0], YY[1], 2); > + SUMM = vmlaq_lane_f32(SUMM, YEXT[1], vget_high_f32(XX[0]), 0); > + YEXT[2] = vextq_f32(YY[0],...

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 09

1

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...ycle */ s/cycle/iteration/ (here and below). In performance-critical code when you say cycle I think machine cycle, and NEON definitely can't process 16 floats in one of those. > + while (len >= 16) { > + /* Accumulate results into single float */ > + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM)); > + tv = vtrn_f32(tv.val[0], ZERO); > + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]); > + > + vst1_lane_f32(&sumi, tv.val[0], 0); Accessing tv.val[0] and tv.val[1] directly seems to send these values through the stack, e.g., f4: f3ba7085...

[PATCH] Use NEON intrinsics detection that fails with gcc 4.8.

2017 Mar 23

0

[PATCH] Use NEON intrinsics detection that fails with gcc 4.8.

...45b9a 100644 --- a/configure.ac +++ b/configure.ac @@ -471,7 +471,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ ]], [[ static float32x4_t A0, A1, SUMM; - SUMM = vmlaq_f32(SUMM, A0, A1); + SUMM = vmlaq_lane_f32(SUMM, A0, vget_low_f32(A1), 0); return (int)vgetq_lane_f32(SUMM, 0); ]] ) -- 2.9.3

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

0

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

...* In theory, one could use Q regs instead of + * D regs, but you need to consider case when N is odd + * One can do that if it justifies performance improment + */ + + for (i = 0; i < N; i++) { + Fout_4[0] = vld1q_f32(ai); + ai += 4; + Fout_4[1] = vld1q_f32(ai); + ai += 4; + Fout_2[0] = vget_low_f32(Fout_4[0]); + Fout_2[1] = vget_high_f32(Fout_4[0]); + Fout_2[2] = vget_low_f32(Fout_4[1]); + Fout_2[3] = vget_high_f32(Fout_4[1]); + + scratch_2[0] = vsub_f32(Fout_2[0], Fout_2[2]); + Fout_2[0] = vadd_f32(Fout_2[0], Fout_2[2]); + scratch_2[1] = vadd_f32(Fout_2[1], Fout_2[3]); + Fout_2[2] = v...

[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)

2016 Sep 13

4

[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)

Should call celt_inner_prod(). --- celt/bands.c | 7 ++++--- celt/bands.h | 2 +- celt/celt_encoder.c | 6 +++--- celt/pitch.c | 2 +- src/opus_multistream_encoder.c | 2 +- 5 files changed, 10 insertions(+), 9 deletions(-) diff --git a/celt/bands.c b/celt/bands.c index bbe8a4c..1ab24aa 100644 --- a/celt/bands.c +++ b/celt/bands.c

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

3

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

Hello, This patch introduces ARM NEON Intrinsics to optimize kf_bfly4 routine in celt part of libopus. Using NEON optimized kf_bfly4(_neon) routine helped improve performance of opus_fft_impl function by about 21.4%. The end use case was decoding a music opus ogg file. The end use case saw performance improvement of about 4.47%. This patch has 2 components i. Actual neon code to improve

[RFC PATCH v1 0/3] Introducing ARM SIMD Support

2014 Sep 10

4

[RFC PATCH v1 0/3] Introducing ARM SIMD Support

libvorbis does not currently have any simd/vectorization. Following patches add generic framework for simd/vectorization and on top, add ARM-NEON simd vectorization using intrinsics. I was able to get over 34% performance improvement on my Beaglebone Black which is single Cortex-A8 based CPU. You can find more information on metrics and procedure I used to measure at

search for: vget_low_f32