thr3ads.net - search: "vst1_lane

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

3

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...nst float32_t (in both xcorr_kernel_neon_float and xcorr_kernel_neon_float_process1). They're currently causing a ton of warning spew. float32_t appears to not be considered equivalent to float, which means you'll also need casts here: > + vst1q_f32(sum, SUMM); and here: > + vst1_lane_f32(sum, SUMM_2[0], 0);

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 09

1

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...ess 16 floats in one of those. > + while (len >= 16) { > + /* Accumulate results into single float */ > + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM)); > + tv = vtrn_f32(tv.val[0], ZERO); > + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]); > + > + vst1_lane_f32(&sumi, tv.val[0], 0); Accessing tv.val[0] and tv.val[1] directly seems to send these values through the stack, e.g., f4: f3ba7085 vtrn.32 d7, d5 f8: ed0b7b0f vstr d7, [fp, #-60] fc: ed0b5b0d vstr d5, [fp, #-52] ... 114: ed1b6b09 vldr d6...

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

0

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...eon_float and > xcorr_kernel_neon_float_process1). They're currently causing a ton of > warning spew. float32_t appears to not be considered equivalent to > float, which means you'll also need casts here: > >> + vst1q_f32(sum, SUMM); > > and here: > >> + vst1_lane_f32(sum, SUMM_2[0], 0); Thanks, will do. > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 18

2

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Almost there... just a few nits left. Viswanath Puttagunta wrote: > +if OPUS_ARM_NEON_INTR > +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) > +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3 I'll repeat: I don't think you should change the optimization level here. > + /* Just unroll the rest of the loop */ I saw you decided to keep this unrolled, but you didn't actually

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

2

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more in-line with algorithm in celt_pitch_xcorr_arm.s Viswanath Puttagunta

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Hi, Optimizes celt_pitch_xcorr for ARM NEON floating point. Changes from RFCv3: - celt_neon_intr.c - removed warnings due to not having constant pointers - Put simpler loop to take care of corner cases. Unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s - Makefile.am Removed explicit -O3 optimization - test_unit_mathops.c,

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

0

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

..._2[0] = vpadd_f32(SUMM_2[0], SUMM_2[0]); + /* Ok, now we have result accumulated in SUMM_2[0].0 */ + + if (len > 0) { + /* Case when you have one value left */ + XX_2 = vld1_dup_f32(xi); + YY_2 = vld1_dup_f32(yi); + SUMM_2[0] = vmla_f32(SUMM_2[0], XX_2, YY_2); + } + + vst1_lane_f32(sum, SUMM_2[0], 0); +} + +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch) { + int i; + celt_assert(max_pitch > 0); + celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0); + + f...

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

0

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

...1q_f32(yi); + yi += 4; + SUMM = vmlaq_f32(SUMM, YY[0], XX[0]); + len -= 4; + } + /* Accumulate results into single float */ + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM)); + tv = vtrn_f32(tv.val[0], ZERO); + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]); + + vst1_lane_f32(&sumi, tv.val[0], 0); + for (i = 0; i < len; i++) + sumi += xi[i] * yi[i]; + *sum = sumi; +} + +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch) { + int i; + celt_assert(max_pitch &gt...

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

0

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

..._2[0] = vpadd_f32(SUMM_2[0], SUMM_2[0]); + /* Ok, now we have result accumulated in SUMM_2[0].0 */ + + if (len > 0) { + /* Case when you have one value left */ + XX_2 = vld1_dup_f32(xi); + YY_2 = vld1_dup_f32(yi); + SUMM_2[0] = vmla_f32(SUMM_2[0], XX_2, YY_2); + } + + vst1_lane_f32(sum, SUMM_2[0], 0); +} + +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch) { + int i; + celt_assert(max_pitch > 0); + celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0); + + f...

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

2

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv2: - Changes recommended by Timothy for celt_neon_intr.c everything except, left the unrolled loop still unrolled - configure.ac - use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE - Moved compile flags into Makefile.am - OPUS_ARM_NEON_INR --> typo --> OPUS_ARM_NEON_INTR Viswanath Puttagunta (1): armv7:

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

3

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

.../* Ok, now we have result accumulated in SUMM_2[0].0 */ > + > + if (len > 0) { > + /* Case when you have one value left */ > + XX_2 = vld1_dup_f32(xi); > + YY_2 = vld1_dup_f32(yi); > + SUMM_2[0] = vmla_f32(SUMM_2[0], XX_2, YY_2); > + } > + > + vst1_lane_f32(sum, SUMM_2[0], 0); > +} > + > +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, > + opus_val32 *xcorr, int len, int max_pitch) { > + int i; > + celt_assert(max_pitch > 0); > + celt_assert((((unsigned char *)_x-(unsign...

[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)

2016 Sep 13

4

[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)

Should call celt_inner_prod(). --- celt/bands.c | 7 ++++--- celt/bands.h | 2 +- celt/celt_encoder.c | 6 +++--- celt/pitch.c | 2 +- src/opus_multistream_encoder.c | 2 +- 5 files changed, 10 insertions(+), 9 deletions(-) diff --git a/celt/bands.c b/celt/bands.c index bbe8a4c..1ab24aa 100644 --- a/celt/bands.c +++ b/celt/bands.c

search for: vst1_lane_f32