Displaying 15 results from an estimated 15 matches for "summ_2".
2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for ARM NEON floating point.
Changes from RFCv3:
- celt_neon_intr.c
- removed warnings caused by non-const pointers
- Put in a simpler loop to handle corner cases. Unrolling with
intrinsics did not map well to what was done
in celt_pitch_xcorr_arm.s
- Makefile.am
Removed explicit -O3 optimization
- test_unit_mathops.c,
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...utes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+ float *sum, int len) {
+ float32x4_t XX[4];
+ float32x4_t YY[4];
+ float32x2_t XX_2;
+ float32x2_t YY_2;
+ float32x4_t SUMM;
+ float32x2_t SUMM_2[2];
+ const float *xi = x;
+ const float *yi = y;
+
+ SUMM = vdupq_n_f32(0);
+
+ /* Work on 16 values per iteration */
+ while (len >= 16) {
+ XX[0] = vld1q_f32(xi);
+ xi += 4;
+ XX[1] = vld1q_f32(xi);
+ xi += 4;
+ XX[2] = vld1q_f32(xi);
+ xi += 4;
+...
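For readers without an ARM target at hand, here is a scalar C sketch of what a process1 kernel like the one above computes. It takes a single dot product of `x` and `y` over `len` samples, accumulated in four lanes (the role `SUMM` plays) over blocks of 16, with a simple tail loop for the remaining samples as the v1 cover letter describes. This is an illustrative reference, not code from the patch. The name `xcorr_process1_ref` is hypothetical, and the sketch assumes the kernel overwrites `*sum` rather than accumulating into it.

```c
#include <assert.h>

/* Hypothetical scalar reference: *sum = sum of x[i] * y[i], i < len.
 * Mirrors the NEON structure: four lane accumulators (like SUMM),
 * a main loop over blocks of 16 values, and a simple tail loop. */
static void xcorr_process1_ref(const float *x, const float *y,
                               float *sum, int len)
{
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f}; /* stands in for SUMM */
    const float *xi = x; /* const pointers, as in the v1 patch */
    const float *yi = y;
    int i;

    assert(len >= 0);

    /* 16 values per iteration: element i lands in lane i % 4,
     * as four 4-lane multiply-accumulates would arrange it. */
    while (len >= 16) {
        for (i = 0; i < 16; i++)
            lanes[i & 3] += xi[i] * yi[i];
        xi += 16;
        yi += 16;
        len -= 16;
    }

    /* Simple loop for the remaining 0..15 values (corner cases). */
    for (i = 0; i < len; i++)
        lanes[0] += xi[i] * yi[i];

    /* Horizontal reduction: add low/high halves, then the pair. */
    *sum = (lanes[0] + lanes[2]) + (lanes[1] + lanes[3]);
}
```

Comparing a reference like this against the NEON kernel on random inputs is one way to test the intrinsics. The comparison needs a small tolerance, since the summation order (and any fused multiply-add) can change the rounding.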
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...utes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+ float *sum, int len) {
+ float32x4_t XX[4];
+ float32x4_t YY[4];
+ float32x2_t XX_2;
+ float32x2_t YY_2;
+ float32x4_t SUMM;
+ float32x2_t SUMM_2[2];
+ float *xi = x;
+ float *yi = y;
+
+ SUMM = vdupq_n_f32(0);
+
+ /* Work on 16 values per iteration */
+ while (len >= 16) {
+ XX[0] = vld1q_f32(xi);
+ xi += 4;
+ XX[1] = vld1q_f32(xi);
+ xi += 4;
+ XX[2] = vld1q_f32(xi);
+ xi += 4;
+ XX[3] = vld1...
2014 Dec 10
2
[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv2:
- Applied the changes Timothy recommended for celt_neon_intr.c,
except that the unrolled loop is still unrolled
- configure.ac
- use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
- Moved compile flags into Makefile.am
- Fixed typo: OPUS_ARM_NEON_INR --> OPUS_ARM_NEON_INTR
Viswanath Puttagunta (1):
armv7:
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...*sum
> + */
> +static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
> + float *sum, int len) {
> + float32x4_t XX[4];
> + float32x4_t YY[4];
> + float32x2_t XX_2;
> + float32x2_t YY_2;
> + float32x4_t SUMM;
> + float32x2_t SUMM_2[2];
> + const float *xi = x;
> + const float *yi = y;
> +
> + SUMM = vdupq_n_f32(0);
> +
> + /* Work on 16 values per iteration */
> + while (len >= 16) {
> + XX[0] = vld1q_f32(xi);
> + xi += 4;
> + XX[1] = vld1q_f32(xi);
> + xi += 4...
2015 May 15
0
[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
...kernel_neon_fixed_process1(const int16_t *x,
+ const int16_t *y,
+ int32_t *sum, int len) {
+ int16x8_t XX[2];
+ int16x8_t YY[2];
+
+ int16x4_t XX_2;
+ int16x4_t YY_2;
+
+ int32x4_t SUMM;
+ int32x2_t SUMM_2;
+ const int16_t *xi = x;
+ const int16_t *yi = y;
+
+ SUMM = vdupq_n_s32(0);
+
+ /* Work on 16 values per iteration */
+ while (len >= 16) {
+ XX[0] = vld1q_s16(xi);
+ xi += 8;
+ XX[1] = vld1q_s16(xi);
+ xi += 8;
+
+ YY[0] = vld1q_s16(yi);
+ yi += 8;
+...
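The fixed-point variant follows the same pattern with 16-bit inputs and 32-bit accumulators. Below is a scalar sketch of it. It assumes each int16 product is widened to int32 before accumulation, as a NEON widening multiply-accumulate would do. It also assumes the caller scales inputs so the running sums do not overflow. The name `xcorr_fixed_process1_ref` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference for a fixed-point process1 kernel:
 * int16 inputs, each product widened to int32 and accumulated in
 * four int32 lanes (the role of SUMM). Assumes no int32 overflow. */
static void xcorr_fixed_process1_ref(const int16_t *x, const int16_t *y,
                                     int32_t *sum, int len)
{
    int32_t lanes[4] = {0, 0, 0, 0};
    const int16_t *xi = x;
    const int16_t *yi = y;
    int i;

    assert(len >= 0);

    /* 16 values per iteration, like two int16x8_t loads per input. */
    while (len >= 16) {
        for (i = 0; i < 16; i++)
            lanes[i & 3] += (int32_t)xi[i] * (int32_t)yi[i];
        xi += 16;
        yi += 16;
        len -= 16;
    }

    /* Tail loop for the remaining 0..15 values. */
    for (i = 0; i < len; i++)
        lanes[0] += (int32_t)xi[i] * (int32_t)yi[i];

    *sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Unlike the float case, integer addition gives the same result in any order as long as nothing overflows. A reference like this can therefore be compared against the intrinsics for exact equality.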
2015 May 08
0
[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
...kernel_neon_fixed_process1(const int16_t *x,
+ const int16_t *y,
+ int32_t *sum, int len) {
+ int16x8_t XX[2];
+ int16x8_t YY[2];
+
+ int16x4_t XX_2;
+ int16x4_t YY_2;
+
+ int32x4_t SUMM;
+ int32x2_t SUMM_2;
+ const int16_t *xi = x;
+ const int16_t *yi = y;
+
+ SUMM = vdupq_n_s32(0);
+
+ /* Work on 16 values per iteration */
+ while (len >= 16) {
+ XX[0] = vld1q_s16(xi);
+ xi += 8;
+ XX[1] = vld1q_s16(xi);
+ xi += 8;
+
+ YY[0] = vld1q_s16(yi);
+ yi += 8;
+...
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
celt/bands.c | 7 ++++---
celt/bands.h | 2 +-
celt/celt_encoder.c | 6 +++---
celt/pitch.c | 2 +-
src/opus_multistream_encoder.c | 2 +-
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
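Calling `celt_inner_prod()` rather than `celt_inner_prod_c()` lets the wrapper route to an optimized implementation when one is available. Below is a minimal sketch of that kind of run-time dispatch, using a function-pointer table indexed by an `arch` value. Every name here (`celt_inner_prod_sketch`, `INNER_PROD_IMPL`, `ARCH_MASK`, `inner_prod_simd_stub`) is hypothetical and not taken from the Opus sources.

```c
#include <assert.h>

typedef float (*inner_prod_fn)(const float *x, const float *y, int n);

/* Plain C implementation: the generic fallback. */
static float inner_prod_c(const float *x, const float *y, int n)
{
    float acc = 0.0f;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Hypothetical stand-in for an optimized (e.g. SIMD) variant;
 * here it simply reuses the C code. */
static float inner_prod_simd_stub(const float *x, const float *y, int n)
{
    return inner_prod_c(x, y, n);
}

#define ARCH_MASK 1
static const inner_prod_fn INNER_PROD_IMPL[ARCH_MASK + 1] = {
    inner_prod_c,         /* arch 0: generic C */
    inner_prod_simd_stub  /* arch 1: optimized path (stub) */
};

/* Callers use this wrapper instead of inner_prod_c(), so an
 * optimized implementation is picked up wherever one exists. */
static float celt_inner_prod_sketch(const float *x, const float *y, int n,
                                    int arch)
{
    return INNER_PROD_IMPL[arch & ARCH_MASK](x, y, n);
}
```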
2014 Dec 19
3
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...oth xcorr_kernel_neon_float and
xcorr_kernel_neon_float_process1). They're currently causing a ton of
warning spew. float32_t appears to not be considered equivalent to
float, which means you'll also need casts here:
> + vst1q_f32(sum, SUMM);
and here:
> + vst1_lane_f32(sum, SUMM_2[0], 0);
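The casts Timothy asks for can be shown in a small helper. It reduces a four-lane accumulator to one value and stores lane 0, casting `float *` to `float32_t *` on the NEON path. The helper name `reduce4_store` and the add/pairwise-add reduction sequence are assumptions, not taken from the patch. The scalar fallback exists only so the sketch also builds on non-ARM hosts.

```c
#include <assert.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Reduce lanes[0..3] to one float and store it in *sum (NEON path). */
static void reduce4_store(float *sum, const float *lanes)
{
    float32x4_t v;
    float32x2_t half;
    assert(sum && lanes);
    /* Explicit casts: float32_t may not be treated as the same type
     * as float, which is what triggers the warnings. */
    v = vld1q_f32((const float32_t *)lanes);
    /* Add low and high halves: {l0 + l2, l1 + l3}. */
    half = vadd_f32(vget_low_f32(v), vget_high_f32(v));
    /* Pairwise add: lane 0 becomes (l0 + l2) + (l1 + l3). */
    half = vpadd_f32(half, half);
    vst1_lane_f32((float32_t *)sum, half, 0);
}
#else
/* Portable fallback using the same summation order as the NEON path. */
static void reduce4_store(float *sum, const float *lanes)
{
    assert(sum && lanes);
    *sum = (lanes[0] + lanes[2]) + (lanes[1] + lanes[3]);
}
#endif
```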
2014 Dec 19
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...xcorr_kernel_neon_float_process1). They're currently causing a ton of
> warning spew. float32_t appears to not be considered equivalent to
> float, which means you'll also need casts here:
>
>> + vst1q_f32(sum, SUMM);
>
> and here:
>
>> + vst1_lane_f32(sum, SUMM_2[0], 0);
Thanks, will do.
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
2014 Dec 18
2
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Almost there... just a few nits left.
Viswanath Puttagunta wrote:
> +if OPUS_ARM_NEON_INTR
> +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
> +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3
I'll repeat: I don't think you should change the optimization level here.
> + /* Just unroll the rest of the loop */
I saw you decided to keep this unrolled, but you didn't actually
2015 Mar 31
6
[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series
Hi Timothy,
As I mentioned earlier [1], I have now fixed the compile issues
with fixed point and am resubmitting the patch.
I also have a new patch that does intrinsics optimizations
for celt_pitch_xcorr targeting aarch64.
You can find my latest work-in-progress branch at [2]
For reference, you can use the Ne10 pre-built libraries
at [3]
Note that I am working with Phil at ARM to get my patch at [4]
2015 May 08
8
[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]
Hi All,
As per Timothy's suggestion, I am disabling mdct_forward
for fixed point. This only affects
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
Rest of patches are same as in [1]
For reference, latest wip code for opus is at [2]
I am still working with the NE10 team at ARM to resolve corner cases of
mdct_forward. I will update with another patch
when the issue in NE10 gets fixed.
Regards,
Vish
[1]:
2015 May 15
11
[RFC V3 0/8] Ne10 fft fixed and previous
Hi All,
Changes from RFC v2 [1]
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
- Overflow issue fixed by Phil at ARM. Ne10 wip at [2]. Should be upstream soon.
- So, re-enabled the fixed fft for mdct_forward, which was disabled in RFCv2
armv7,armv8: Optimize fixed point fft using NE10 library
- Thanks to Jonathan Lennox, fixed some build issues on iOS and some copy-paste errors
Rest
2015 Apr 28
10
[RFC PATCH v1 0/8] Ne10 fft fixed and previous
Hello Timothy / Jean-Marc / opus-dev,
This patch series is a follow-up to the work I posted in [1].
In addition to what was posted in [1], this patch series mainly
integrates the fixed-point FFT implementations in the NE10 library into opus.
You can view my opus wip code at [2].
Note that while I found some issues both with the NE10 library (fixed fft)
and with the Linaro toolchain (armv8 intrinsics), the work