Displaying 9 results from an estimated 9 matches for "xcorr_kernel_neon_float_process1".
2014 Dec 19
3
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...to the switch
statement, that would change my opinion, but if there isn't, then either
of the other two approaches seems better.
One more nit I hadn't noticed before:
> + float *xi = x;
> + float *yi = y;
These need to be const float32_t (in both xcorr_kernel_neon_float and
xcorr_kernel_neon_float_process1). They're currently causing a ton of
warning spew. float32_t appears to not be considered equivalent to
float, which means you'll also need casts here:
> + vst1q_f32(sum, SUMM);
and here:
> + vst1_lane_f32(sum, SUMM_2[0], 0);
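The nit above could be addressed roughly as in the sketch below. It is not the code from the patch: mul4_store_sketch is a hypothetical helper, and the scalar branch exists only so the snippet also builds on non-ARM machines. It shows read-only inputs being walked through const float32_t pointers and an explicit cast on the store, for compilers that do not treat float32_t as equivalent to float.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper (not from the patch): sum[k] = x[0] * y[k], k = 0..3. */
static void mul4_store_sketch(const float *x, const float *y, float *sum) {
    /* Inputs are only read, so the working pointers are const float32_t. */
    const float32_t *xi = (const float32_t *)x;
    const float32_t *yi = (const float32_t *)y;

    float32x4_t YY = vld1q_f32(yi);        /* y[0..3] */
    float32x2_t XX_2 = vld1_dup_f32(xi);   /* x[0] in both lanes */
    float32x4_t SUMM = vmulq_lane_f32(YY, XX_2, 0);

    /* Explicit cast on the store, in case float32_t is a distinct type. */
    vst1q_f32((float32_t *)sum, SUMM);
}
#else
/* Scalar fallback with the same result, for builds without NEON. */
static void mul4_store_sketch(const float *x, const float *y, float *sum) {
    int k;
    for (k = 0; k < 4; k++)
        sum[k] = x[0] * y[k];
}
#endif
```

The same cast pattern would apply to the vst1_lane_f32 store quoted above.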
2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for ARM NEON floating point.
Changes from RFCv3:
- celt_neon_intr.c
  - Removed warnings caused by non-const pointers
  - Put in a simpler loop to take care of corner cases. Unrolling with
    intrinsics did not map well onto what is done
    in celt_pitch_xcorr_arm.s
- Makefile.am
  - Removed the explicit -O3 optimization flag
- test_unit_mathops.c,
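As a rough scalar picture of the "simpler loop for corner cases" idea in the change list above, the sketch below assumes the kernel computes four lagged sums, sum[k] = sum over j of x[j] * y[j + k], which is what the intrinsics in the quoted snippets suggest. The function name xcorr4_scalar_sketch and the block size of four are assumptions for illustration, not the patch's code.

```c
/* Hypothetical scalar model (not from the patch):
 * sum[k] = sum over j in [0, len) of x[j] * y[j + k], for k = 0..3.
 * y must provide at least len + 3 readable samples. */
static void xcorr4_scalar_sketch(const float *x, const float *y,
                                 float *sum, int len) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int j = 0;
    int k;

    /* Main loop: four x samples per iteration, like an unrolled body. */
    for (; j + 4 <= len; j += 4) {
        int u;
        for (u = 0; u < 4; u++)
            for (k = 0; k < 4; k++)
                acc[k] += x[j + u] * y[j + u + k];
    }

    /* Tail: a plain loop covers the 0..3 leftover samples, so no
     * switch statement or hand-unrolled remainder is needed. */
    for (; j < len; j++)
        for (k = 0; k < 4; k++)
            acc[k] += x[j] * y[j + k];

    for (k = 0; k < 4; k++)
        sum[k] = acc[k];
}
```

With len = 5, the main loop runs once and the tail loop handles the single remaining sample.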
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ YY[0]= vld1q_f32(yi++);
+ len--;
+ }
+
+ if (len > 0) {
+ XX_2 = vld1_dup_f32(xi);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ }
+
+ vst1q_f32(sum, SUMM);
+}
+
+/*
+ * Function: xcorr_kernel_neon_float_process1
+ * ---------------------------------
+ * Computes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+ float *sum, int len) {
+ float32x4_t XX[4];
+ float32x4_t YY[4];
+ float32x2_t XX_2;
+ float32x2...
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...d1q_f32(yi++);
+ case 2:
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ YY[0] = vld1q_f32(yi++);
+ case 1:
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ }
+
+ vst1q_f32(sum, SUMM);
+}
+
+/*
+ * Function: xcorr_kernel_neon_float_process1
+ * ---------------------------------
+ * Computes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+ float *sum, int len) {
+ float32x4_t XX[4];
+ float32x4_t YY[4];
+ float32x2_t XX_2;
+ float32x2...
2014 Dec 10
2
[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv2:
- celt_neon_intr.c
  - Applied the changes Timothy recommended, except that the
    unrolled loop is still left unrolled
- configure.ac
  - Use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
  - Moved compile flags into Makefile.am
  - Fixed typo: OPUS_ARM_NEON_INR --> OPUS_ARM_NEON_INTR
Viswanath Puttagunta (1):
armv7:
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...= vmlaq_lane_f32(SUMM, YY[1], XX_2, 1);
len -= 2;
break;
case 1:
XX_2 = vld1_f32(xi);
SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
len--;
break;
}
}
> +
> + vst1q_f32(sum, SUMM);
> +}
> +
> +/*
> + * Function: xcorr_kernel_neon_float_process1
> + * ---------------------------------
> + * Computes single correlation values and stores in *sum
> + */
> +static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
> + float *sum, int len) {
> + float32x4_t XX[4];
> + float32x4_t YY[4];...
2014 Dec 19
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..., thanks for your explanation. I will take a closer look at this part
of celt_pitch_xcorr_arm.s
>
>
> One more nit I hadn't noticed before:
>
>> + float *xi = x;
>> + float *yi = y;
>
> These need to be const float32_t (in both xcorr_kernel_neon_float and
> xcorr_kernel_neon_float_process1). They're currently causing a ton of
> warning spew. float32_t appears to not be considered equivalent to
> float, which means you'll also need casts here:
>
>> + vst1q_f32(sum, SUMM);
>
> and here:
>
>> + vst1_lane_f32(sum, SUMM_2[0], 0);
Thanks, will do....
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
celt/bands.c | 7 ++++---
celt/bands.h | 2 +-
celt/celt_encoder.c | 6 +++---
celt/pitch.c | 2 +-
src/opus_multistream_encoder.c | 2 +-
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
2014 Dec 18
2
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Almost there... just a few nits left.
Viswanath Puttagunta wrote:
> +if OPUS_ARM_NEON_INTR
> +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
> +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3
I'll repeat: I don't think you should change the optimization level here.
> + /* Just unroll the rest of the loop */
I saw you decided to keep this unrolled, but you didn't actually