thr3ads.net - search: "summ

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

Hi,

Optimizes celt_pitch_xcorr for ARM NEON floating point.

Changes from RFCv3:
- celt_neon_intr.c
  - Removed warnings caused by not using const pointers
  - Added a simpler loop to handle corner cases. Unrolling with intrinsics did not map well to what was done in celt_pitch_xcorr_arm.s
- Makefile.am: Removed explicit -O3 optimization
- test_unit_mathops.c,
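For reference on what is being optimized: celt_pitch_xcorr correlates a signal x against y at each candidate pitch lag. Below is a plain scalar C model of that computation, not the Opus implementation. The names inner_prod_ref and pitch_xcorr_ref are hypothetical, and the sketch assumes y holds at least len + max_pitch - 1 samples. The NEON patches vectorize the inner product; this version does not.

```c
/* Hypothetical scalar helper: dot product of two length-len vectors. */
static float inner_prod_ref(const float *x, const float *y, int len)
{
    float s = 0.0f;
    for (int j = 0; j < len; j++)
        s += x[j] * y[j];
    return s;
}

/* Hypothetical reference model: xcorr[i] = sum_j x[j] * y[i + j]
 * for every lag i in [0, max_pitch).
 * Assumes y has at least len + max_pitch - 1 valid samples. */
static void pitch_xcorr_ref(const float *x, const float *y, float *xcorr,
                            int len, int max_pitch)
{
    for (int i = 0; i < max_pitch; i++)
        xcorr[i] = inner_prod_ref(x, y + i, len);
}
```

For example, with x = {1, 2, 3, 4}, y = {1, 0, 0, 1, 0, 0}, len = 4 and max_pitch = 3, pitch_xcorr_ref fills xcorr with {5, 3, 2}.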

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

0

...utes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+      float *sum, int len) {
+   float32x4_t XX[4];
+   float32x4_t YY[4];
+   float32x2_t XX_2;
+   float32x2_t YY_2;
+   float32x4_t SUMM;
+   float32x2_t SUMM_2[2];
+   const float *xi = x;
+   const float *yi = y;
+
+   SUMM = vdupq_n_f32(0);
+
+   /* Work on 16 values per iteration */
+   while (len >= 16) {
+      XX[0] = vld1q_f32(xi);
+      xi += 4;
+      XX[1] = vld1q_f32(xi);
+      xi += 4;
+      XX[2] = vld1q_f32(xi);
+      xi += 4;
+...
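The structure of the kernel above (a main loop that consumes 16 values per iteration, a four-lane accumulator, and a separate loop for leftover corner cases) can be modeled in portable scalar C. This is a sketch under assumptions: xcorr_process1_ref is a hypothetical name, acc[0..3] only stand in for the four lanes of SUMM, and the tail loop is a guess at what "a simpler loop for corner cases" looks like, not the patch's code.

```c
/* Hypothetical scalar model of a single-lag correlation kernel.
 * acc[0..3] stand in for the four lanes of a float32x4_t accumulator. */
static void xcorr_process1_ref(const float *x, const float *y,
                               float *sum, int len)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    float tail = 0.0f;
    int j;

    /* Main loop: 16 values per iteration, i.e. four 4-wide
     * multiply-accumulates into the same four lanes. */
    while (len >= 16) {
        for (int blk = 0; blk < 4; blk++)
            for (j = 0; j < 4; j++)
                acc[j] += x[4 * blk + j] * y[4 * blk + j];
        x += 16;
        y += 16;
        len -= 16;
    }

    /* Simple loop for the remaining 0..15 values. */
    for (j = 0; j < len; j++)
        tail += x[j] * y[j];

    /* Reduce the lanes and store the single correlation value. */
    *sum = (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail;
}
```

Because the lanes are summed separately and combined at the end, the floating-point result can differ in the last bits from a strictly sequential loop; any lane-parallel vector version reorders the additions in a similar way.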

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

0

...utes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+      float *sum, int len) {
+   float32x4_t XX[4];
+   float32x4_t YY[4];
+   float32x2_t XX_2;
+   float32x2_t YY_2;
+   float32x4_t SUMM;
+   float32x2_t SUMM_2[2];
+   float *xi = x;
+   float *yi = y;
+
+   SUMM = vdupq_n_f32(0);
+
+   /* Work on 16 values per iteration */
+   while (len >= 16) {
+      XX[0] = vld1q_f32(xi);
+      xi += 4;
+      XX[1] = vld1q_f32(xi);
+      xi += 4;
+      XX[2] = vld1q_f32(xi);
+      xi += 4;
+      XX[3] = vld1...

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

2

Hi,

Optimizes celt_pitch_xcorr for floating point.

Changes from RFCv2:
- celt_neon_intr.c: applied the changes recommended by Timothy, except that the unrolled loop was left unrolled
- configure.ac: use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
- Moved compile flags into Makefile.am
- Fixed typo: OPUS_ARM_NEON_INR --> OPUS_ARM_NEON_INTR

Viswanath Puttagunta (1):
  armv7:

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

...*sum
> + */
> +static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
> +      float *sum, int len) {
> +   float32x4_t XX[4];
> +   float32x4_t YY[4];
> +   float32x2_t XX_2;
> +   float32x2_t YY_2;
> +   float32x4_t SUMM;
> +   float32x2_t SUMM_2[2];
> +   const float *xi = x;
> +   const float *yi = y;
> +
> +   SUMM = vdupq_n_f32(0);
> +
> +   /* Work on 16 values per iteration */
> +   while (len >= 16) {
> +      XX[0] = vld1q_f32(xi);
> +      xi += 4;
> +      XX[1] = vld1q_f32(xi);
> +      xi += 4...

[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

2015 May 15

0

...kernel_neon_fixed_process1(const int16_t *x,
+      const int16_t *y,
+      int32_t *sum, int len) {
+   int16x8_t XX[2];
+   int16x8_t YY[2];
+
+   int16x4_t XX_2;
+   int16x4_t YY_2;
+
+   int32x4_t SUMM;
+   int32x2_t SUMM_2;
+   const int16_t *xi = x;
+   const int16_t *yi = y;
+
+   SUMM = vdupq_n_s32(0);
+
+   /* Work on 16 values per iteration */
+   while (len >= 16) {
+      XX[0] = vld1q_s16(xi);
+      xi += 8;
+      XX[1] = vld1q_s16(xi);
+      xi += 8;
+
+      YY[0] = vld1q_s16(yi);
+      yi += 8;
+...
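The fixed-point kernel follows the same pattern with 16-bit inputs and a 32-bit accumulator. A scalar sketch, assuming each product is widened to 32 bits before accumulation and that the total never overflows int32_t (signed overflow is undefined behavior in C). The name xcorr_fixed_process1_ref is hypothetical, and storing (rather than accumulating) into *sum is an assumption carried over from the float kernel's comment.

```c
#include <stdint.h>

/* Hypothetical scalar model of a fixed-point single-lag kernel.
 * acc[0..3] stand in for the four lanes of an int32x4_t accumulator. */
static void xcorr_fixed_process1_ref(const int16_t *x, const int16_t *y,
                                     int32_t *sum, int len)
{
    int32_t acc[4] = {0, 0, 0, 0};
    int32_t tail = 0;
    int j;

    /* Main loop: 16 samples per iteration; each 16x16-bit product is
     * widened to 32 bits before it is added to a lane. */
    while (len >= 16) {
        for (j = 0; j < 16; j++)
            acc[j & 3] += (int32_t)x[j] * (int32_t)y[j];
        x += 16;
        y += 16;
        len -= 16;
    }

    /* Remaining 0..15 samples. */
    for (j = 0; j < len; j++)
        tail += (int32_t)x[j] * (int32_t)y[j];

    /* Reduce the lanes and store the single correlation value. */
    *sum = acc[0] + acc[1] + acc[2] + acc[3] + tail;
}
```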

[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

2015 May 08

0

...kernel_neon_fixed_process1(const int16_t *x,
+      const int16_t *y,
+      int32_t *sum, int len) {
+   int16x8_t XX[2];
+   int16x8_t YY[2];
+
+   int16x4_t XX_2;
+   int16x4_t YY_2;
+
+   int32x4_t SUMM;
+   int32x2_t SUMM_2;
+   const int16_t *xi = x;
+   const int16_t *yi = y;
+
+   SUMM = vdupq_n_s32(0);
+
+   /* Work on 16 values per iteration */
+   while (len >= 16) {
+      XX[0] = vld1q_s16(xi);
+      xi += 8;
+      XX[1] = vld1q_s16(xi);
+      xi += 8;
+
+      YY[0] = vld1q_s16(yi);
+      yi += 8;
+...

[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)

2016 Sep 13

4

Should call celt_inner_prod().
---
 celt/bands.c                   | 7 ++++---
 celt/bands.h                   | 2 +-
 celt/celt_encoder.c            | 6 +++---
 celt/pitch.c                   | 2 +-
 src/opus_multistream_encoder.c | 2 +-
 5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
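Calling a generic entry point instead of the _c function lets a build route the call to an optimized implementation for the running CPU. A minimal sketch of such per-architecture dispatch, assuming a function-pointer table indexed by a runtime arch value; every name here (inner_prod_fn, INNER_PROD_IMPL, ARCH_MASK, inner_prod_dispatch) is hypothetical and not taken from the patch or from Opus.

```c
/* Signature shared by all inner-product implementations (hypothetical). */
typedef float (*inner_prod_fn)(const float *x, const float *y, int n);

/* Portable C fallback. */
static float inner_prod_c(const float *x, const float *y, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Hypothetical table with one slot per arch level. A real build would put
 * SIMD versions in the higher slots; here every slot uses the C fallback. */
#define ARCH_MASK 3
static const inner_prod_fn INNER_PROD_IMPL[ARCH_MASK + 1] = {
    inner_prod_c, inner_prod_c, inner_prod_c, inner_prod_c
};

/* Callers pass the arch detected at runtime instead of naming a specific
 * implementation, so optimized versions are picked up automatically. */
static float inner_prod_dispatch(const float *x, const float *y, int n,
                                 int arch)
{
    return INNER_PROD_IMPL[arch & ARCH_MASK](x, y, n);
}
```

The design point is that call sites such as those touched in celt/bands.c stay identical across platforms; only the table contents change.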

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

3

...oth xcorr_kernel_neon_float and xcorr_kernel_neon_float_process1). They're currently causing a ton of warning spew. float32_t appears to not be considered equivalent to float, which means you'll also need casts here:

> +   vst1q_f32(sum, SUMM);

and here:

> +   vst1_lane_f32(sum, SUMM_2[0], 0);

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

0

...xcorr_kernel_neon_float_process1). They're currently causing a ton of
> warning spew. float32_t appears to not be considered equivalent to
> float, which means you'll also need casts here:
>
>> +   vst1q_f32(sum, SUMM);
>
> and here:
>
>> +   vst1_lane_f32(sum, SUMM_2[0], 0);

Thanks, will do.

> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 18

2

Almost there... just a few nits left.

Viswanath Puttagunta wrote:
> +if OPUS_ARM_NEON_INTR
> +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
> +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3

I'll repeat: I don't think you should change the optimization level here.

> +   /* Just unroll the rest of the loop */

I saw you decided to keep this unrolled, but you didn't actually

[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series

2015 Mar 31

6

Hi Timothy,

As I mentioned earlier [1], I have now fixed the compile issues with fixed point and am resubmitting the patch. I also have a new patch that does intrinsics optimizations for celt_pitch_xcorr targeting aarch64. You can find my latest work-in-progress branch at [2]. For reference, you can use the Ne10 pre-built libraries at [3]. Note that I am working with Phil at ARM to get my patch at [4]

[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]

2015 May 08

8

Hi All,

As per Timothy's suggestion, this disables mdct_forward for fixed point. This only affects "armv7,armv8: Extend fixed fft NE10 optimizations to mdct". The rest of the patches are the same as in [1]. For reference, the latest wip code for opus is at [2]. I am still working with the NE10 team at ARM on corner cases of mdct_forward, and will update with another patch when the issue in NE10 gets fixed.

Regards,
Vish

[1]:

[RFC V3 0/8] Ne10 fft fixed and previous

2015 May 15

11

Hi All,

Changes from RFC v2 [1]:

armv7,armv8: Extend fixed fft NE10 optimizations to mdct
- Overflow issue fixed by Phil at ARM. Ne10 wip at [2]. Should be upstream soon.
- So, re-enabled using fixed fft for mdct_forward, which was disabled in RFCv2

armv7,armv8: Optimize fixed point fft using NE10 library
- Thanks to Jonathan Lennox, fixed some build issues on iOS and some copy-paste errors

Rest

[RFC PATCH v1 0/8] Ne10 fft fixed and previous

2015 Apr 28

10

Hello Timothy / Jean-Marc / opus-dev,

This patch series is a follow-up to work I posted at [1]. In addition to what was posted at [1], this patch series mainly integrates the fixed-point FFT implementations in the NE10 library into opus. You can view my opus wip code at [2]. Note that while I found some issues both with the NE10 library (fixed fft) and with the Linaro toolchain (armv8 intrinsics), the work
