thr3ads.net - search: "celt

Antw: Re: celt_inner_prod() and dual_inner_prod() NEON intrinsics

2017 Jun 06

4

...> Assuming there's no issue with the patches, next week isn't too late.
>>
>> Also, I've started looking at your patches. So far there's one thing
>> that puzzles me a bit. In the OPUS_CHECK_ASM section of patch 0004, you
>> have:
>>
>> + celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);
>>
>> Given the normal range of the values (the xy values are often much
>> larger than one) and the precision involved here (24-bit mantissa), it
>> seems like this test can only succeed if the two values are actually
>> equal. I...

celt_inner_prod() and dual_inner_prod() NEON intrinsics

2017 Jun 06

0

Thanks, Ulrich! Yes, using

celt_assert(1.0 + celt_inner_prod_neon_float_c_simulation(x, y, N) == 1.0 + xy);
celt_assert(1.0 + xy1_c == 1.0 + *xy1);
celt_assert(1.0 + xy2_c == 1.0 + *xy2);

can avoid the usage of VERY_SMALL.

Hi Jean-Marc, I added

{ const opus_val32 xy_c = celt_inner_prod_neon_float_c_simulat...

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 28

2

...oat32x4_t XX[4];
> + float32x2_t XX_2;
> + float32x4_t SUMM[4];
> + float *xi = x;
> + float *yi = y;
> + int cd = len/4;
> + int cr = len%4;

len is signed, so / and % are NOT equivalent to the corresponding >> and & (they are much slower).

> + int j;
> +
> + celt_assert(len>=3);
> +
> + /* Initialize sums to 0 */
> + SUMM[0] = vdupq_n_f32(0);
> + SUMM[1] = vdupq_n_f32(0);
> + SUMM[2] = vdupq_n_f32(0);
> + SUMM[3] = vdupq_n_f32(0);
> +
> + YY[0] = vld1q_f32(yi);
> +
> + /* Each loop consumes 8 floats in y vector
> + * and 4 floa...
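The review comment about signed `/` and `%` can be illustrated with a small sketch. The helper names below are hypothetical and do not come from the patch. The sketch assumes a two's-complement target and only shows why a compiler cannot blindly turn `len/4` and `len%4` into a shift and a mask when `len` is a signed `int`: the results differ for negative values, so extra fix-up code is needed unless the value is known to be non-negative.

```c
#include <assert.h>

/* Hypothetical helpers, for illustration only. */

/* Signed division truncates toward zero: -7 / 4 is -1. */
static int groups_div(int len) { return len / 4; }

/* The remainder takes the sign of the dividend: -7 % 4 is -3. */
static int rem_mod(int len) { return len % 4; }

/* Shift and mask match / and % only for non-negative values.
 * Right-shifting a negative int is implementation-defined in C,
 * so this sketch refuses negative input. */
static int groups_shift(int len) { assert(len >= 0); return len >> 2; }
static int rem_mask(int len) { assert(len >= 0); return len & 3; }
```

For a non-negative length such as 11, both pairs agree (2 groups, remainder 3). For -7, `rem_mod` gives -3 while `-7 & 3` gives 1 on a two's-complement machine, which is the mismatch the reviewer refers to.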

[PATCH] Fix ectest to not check a case which isn't guaranteed to work, and which we don't use.

2008 Dec 21

0

...;
+#include "arch.h"
 void ec_byte_readinit(ec_byte_buffer *_b,unsigned char *_buf,long _bytes){
@@ -106,6 +107,8 @@ ec_uint32 ec_dec_uint(ec_dec *_this,ec_uint32 _ft){
 unsigned s;
 int ftb;
 t=0;
+ /*In order to optimize EC_ILOG(), it is undefined for the value 0.*/
+ celt_assert(_ft>1);
 _ft--;
 ftb=EC_ILOG(_ft);
 if(ftb>EC_UNIT_BITS){
diff --git a/libcelt/entenc.c b/libcelt/entenc.c
index 3da351e..d0cbb0c 100644
--- a/libcelt/entenc.c
+++ b/libcelt/entenc.c
@@ -100,8 +100,10 @@ void ec_enc_uint(ec_enc *_this,ec_uint32 _fl,ec_uint32 _ft){
 unsigned ft;
 un...
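The reason for the added assertion can be sketched as follows. This is not the libcelt implementation: `ilog_sketch` and `bits_for_range` are hypothetical names, and the sketch assumes that EC_ILOG(v) means "number of bits needed to represent v", which is only meaningful here for v > 0. Because the decoder decrements `_ft` before taking the log, requiring `_ft > 1` guarantees the argument is at least 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for EC_ILOG(): bit length of v.
 * Like the optimized macro described in the patch, it is
 * treated as undefined for 0, so it asserts v > 0. */
static int ilog_sketch(uint32_t v) {
    int n = 0;
    assert(v > 0);
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Mirrors the "_ft--; ftb=EC_ILOG(_ft);" step from the diff:
 * ft > 1 ensures ft - 1 >= 1, so the log is well defined. */
static int bits_for_range(uint32_t ft) {
    assert(ft > 1);
    return ilog_sketch(ft - 1);
}
```

With this definition, a range of 256 values needs 8 bits, since the largest value, 255, has a bit length of 8.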

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 01

0

...>> + float *xi = x;
>> + float *yi = y;
>> + int cd = len/4;
>> + int cr = len%4;
>
> len is signed, so / and % are NOT equivalent to the corresponding >> and
> & (they are much slower).
>
>> + int j;
>> +
>> + celt_assert(len>=3);
>> +
>> + /* Initialize sums to 0 */
>> + SUMM[0] = vdupq_n_f32(0);
>> + SUMM[1] = vdupq_n_f32(0);
>> + SUMM[2] = vdupq_n_f32(0);
>> + SUMM[3] = vdupq_n_f32(0);
>> +
>> + YY[0] = vld1q_f32(yi);
>> +
>> +...

[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

2015 May 15

0

...void xcorr_kernel_neon_fixed(const int16_t *x, const int16_t *y,
+ int32_t sum[4], int len) {
+ int16x8_t YY[3];
+ int16x4_t YEXT[3];
+ int16x8_t XX[2];
+ int16x4_t XX_2, YY_2;
+ int32x4_t SUMM;
+ const int16_t *xi = x;
+ const int16_t *yi = y;
+
+ celt_assert(len>4);
+
+ YY[0] = vld1q_s16(yi);
+ YY_2 = vget_low_s16(YY[0]);
+
+ SUMM = vdupq_n_s32(0);
+
+ /* Consume 16 elements in x vector and 20 elements in y
+ * vector. However, the y[19] and beyond don't get accessed
+ * So, if len == 16, then we must only access y[0] to y[18]
+ * So...

[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

2015 May 08

0

...void xcorr_kernel_neon_fixed(const int16_t *x, const int16_t *y,
+ int32_t sum[4], int len) {
+ int16x8_t YY[3];
+ int16x4_t YEXT[3];
+ int16x8_t XX[2];
+ int16x4_t XX_2, YY_2;
+ int32x4_t SUMM;
+ const int16_t *xi = x;
+ const int16_t *yi = y;
+
+ celt_assert(len>4);
+
+ YY[0] = vld1q_s16(yi);
+ YY_2 = vget_low_s16(YY[0]);
+
+ SUMM = vdupq_n_s32(0);
+
+ /* Consume 16 elements in x vector and 20 elements in y
+ * vector. However, the y[19] and beyond don't get accessed
+ * So, if len == 16, then we must only access y[0] to y[18]
+ * So...

[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 21

4

Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 [1] for FFT optimizations for opus use cases. However, these FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize the celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON

celt_inner_prod() and dual_inner_prod() NEON intrinsics

2017 Jun 06

3

...> don't wait if it's too late for 1.2 release.

Assuming there's no issue with the patches, next week isn't too late.

Also, I've started looking at your patches. So far there's one thing that puzzles me a bit. In the OPUS_CHECK_ASM section of patch 0004, you have:

+ celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);

Given the normal range of the values (the xy values are often much larger than one) and the precision involved here (24-bit mantissa), it seems like this test can only succeed if the two values are actually equal. Is the float patch actually bit-exact? If so,...
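The precision argument above can be checked with a short sketch. It assumes IEEE-754 single precision and assumes VERY_SMALL is on the order of 1e-30; the thread does not state the value, so `VERY_SMALL_ASSUMED` and both helper names are hypothetical. The idea is that for a value much larger than one, even the closest neighbouring float differs by far more than such a threshold, so the absolute-difference test only passes for exactly equal values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed threshold; the real VERY_SMALL is not quoted in the thread. */
#define VERY_SMALL_ASSUMED 1e-30f

/* Same shape as the check in the patch: |a - b| <= VERY_SMALL. */
static int within_very_small(float a, float b) {
    float d = a - b;
    if (d < 0) d = -d;
    return d <= VERY_SMALL_ASSUMED;
}

/* Next representable float above a positive, finite float.
 * Assumes IEEE-754 binary32: incrementing the bit pattern of a
 * positive finite value yields its successor. */
static float next_float_up(float a) {
    uint32_t bits;
    assert(a > 0);
    memcpy(&bits, &a, sizeof bits);
    bits += 1;
    memcpy(&a, &bits, sizeof a);
    return a;
}
```

Near 1000.0f, adjacent floats are 2^-14 (about 6e-5) apart, so `within_very_small(1000.0f, next_float_up(1000.0f))` is false even though the two values differ by a single unit in the last place.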

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 21

0

...<arm_neon.h>
+#include "../arch.h"
+
+static void xcorr_kernel_neon_float(float *x, float *y, float sum[4], int len) {
+ float32x4_t YY[5];
+ float32x4_t XX[4];
+ float32x2_t XX_2;
+ float32x4_t SUMM[4];
+ float *xi = x;
+ float *yi = y;
+ int cd = len/4;
+ int cr = len%4;
+ int j;
+
+ celt_assert(len>=3);
+
+ /* Initialize sums to 0 */
+ SUMM[0] = vdupq_n_f32(0);
+ SUMM[1] = vdupq_n_f32(0);
+ SUMM[2] = vdupq_n_f32(0);
+ SUMM[3] = vdupq_n_f32(0);
+
+ YY[0] = vld1q_f32(yi);
+
+ /* Each loop consumes 8 floats in y vector
+ * and 4 floats in x vector
+ */
+ for (j = 0; j < cd; j++) {
+...

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

Hi,

Optimizes celt_pitch_xcorr for ARM NEON floating point.

Changes from RFCv3:
- celt_neon_intr.c
  - removed warnings due to not having constant pointers
  - Put simpler loop to take care of corner cases. Unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s
- Makefile.am
  Removed explicit -O3 optimization
- test_unit_mathops.c,

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

0

...]
+ */
+static void xcorr_kernel_neon_float(const float *x, const float *y,
+ float sum[4], int len) {
+ float32x4_t YY[3];
+ float32x4_t YEXT[3];
+ float32x4_t XX[2];
+ float32x2_t XX_2;
+ float32x4_t SUMM;
+ const float *xi = x;
+ const float *yi = y;
+
+ celt_assert(len>0);
+
+ YY[0] = vld1q_f32(yi);
+ SUMM = vdupq_n_f32(0);
+
+ /* Consume 8 elements in x vector and 12 elements in y
+ * vector. However, the 12'th element never really gets
+ * touched in this loop. So, if len == 8, then we only
+ * must access y[0] to y[10]. y[11] must not...
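The access pattern described in the comment is easier to see in a scalar form. The following is a sketch, not code from the patch: `xcorr_kernel_scalar` is a hypothetical name, and it assumes the kernel computes four correlation lags at once and overwrites `sum[4]` with the result, which the truncated snippet does not fully confirm. Each output lag k accumulates x[j] * y[j + k], so the highest y index touched is len + 2; for len == 8 that is y[10], matching the comment.

```c
#include <assert.h>

/* Hypothetical scalar reference for a 4-lag cross-correlation kernel.
 * Reads x[0..len-1] and y[0..len+2]; writes sum[0..3]. */
static void xcorr_kernel_scalar(const float *x, const float *y,
                                float sum[4], int len) {
    int j, k;
    assert(len > 0);
    for (k = 0; k < 4; k++)
        sum[k] = 0;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            /* Lag k pairs x[j] with y[j + k]. */
            sum[k] += x[j] * y[j + k];
        }
    }
}
```

A reference like this is also one way to cross-check a vectorized kernel in a unit test, bearing in mind that floating-point results may differ in the last bits when the summation order differs.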

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

0

...tores them in sum[4]
+ */
+void xcorr_kernel_neon_float(const float *x, const float *y,
+ float sum[4], int len) {
+ float32x4_t YY[3];
+ float32x4_t YEXT[3];
+ float32x4_t XX[2];
+ float32x2_t XX_2;
+ float32x4_t SUMM;
+ float *xi = x;
+ float *yi = y;
+
+ celt_assert(len>0);
+
+ YY[0] = vld1q_f32(yi);
+ SUMM = vdupq_n_f32(0);
+
+ /* Consume 8 elements in x vector and 12 elements in y
+ * vector. However, the 12'th element never really gets
+ * touched in this loop. So, if len == 8, then we only
+ * must access y[0] to y[10]. y[11] must not...

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

0

...hem in sum[4]
+ */
+static void xcorr_kernel_neon_float(const float *x, const float *y,
+ float sum[4], int len) {
+ float32x4_t YY[3];
+ float32x4_t YEXT[3];
+ float32x4_t XX[2];
+ float32x2_t XX_2;
+ float32x4_t SUMM;
+ float *xi = x;
+ float *yi = y;
+
+ celt_assert(len>0);
+
+ YY[0] = vld1q_f32(yi);
+ SUMM = vdupq_n_f32(0);
+
+ /* Consume 8 elements in x vector and 12 elements in y
+ * vector. However, the 12'th element never really gets
+ * touched in this loop. So, if len == 8, then we only
+ * must access y[0] to y[10]. y[11] must not...

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

2

Hi,

Optimizes celt_pitch_xcorr for floating point.

Changes from RFCv2:
- Applied the changes Timothy recommended for celt_neon_intr.c, except that the unrolled loop is still unrolled
- configure.ac
  - use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
  - Moved compile flags into Makefile.am
- Fixed typo: OPUS_ARM_NEON_INR --> OPUS_ARM_NEON_INTR

Viswanath Puttagunta (1): armv7:

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

3

From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>

Hi,

Optimizes celt_pitch_xcorr for floating point.

Changes from RFCv1:
- Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

2

...float *x, const float *y,
> + float sum[4], int len) {
> + float32x4_t YY[3];
> + float32x4_t YEXT[3];
> + float32x4_t XX[2];
> + float32x2_t XX_2;
> + float32x4_t SUMM;
> + const float *xi = x;
> + const float *yi = y;
> +
> + celt_assert(len>0);
> +
> + YY[0] = vld1q_f32(yi);
> + SUMM = vdupq_n_f32(0);
> +
> + /* Consume 8 elements in x vector and 12 elements in y
> + * vector. However, the 12'th element never really gets
> + * touched in this loop. So, if len == 8, then we only
> + * m...

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

2

Hi,

Optimizes celt_pitch_xcorr for floating point.

Changes from RFCv1:
- Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more in-line with algorithm in celt_pitch_xcorr_arm.s

Viswanath Puttagunta

ectest failed with gcc-4.2.4

2008 Dec 20

5

Hi, compiling the latest release 0.5.1 (as well as from git) with gcc-4.2.4 on zenwalk (slackware current), ectest fails; using gcc-3.4.6, all tests succeed.

[AArch64 neon intrinsics v4 0/5] Rework Neon intrinsic code for Aarch64 patchset

2015 Dec 23

6

Following Tim's comments, here are my reworked patches for the Neon intrinsic function patches of my Aarch64 patchset, i.e. replacing patches 5-8 of the v2 series. Patches 1-4 and 9-18 of the old series still apply unmodified. The one new (as opposed to changed) patch is the first one in this series, to add named constants for the ARM architecture variants. There are also some minor code
