thr3ads.net - search: "vst1q

Displaying 20 results from an estimated 25 matches for "vst1q_f32".

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

...t_2[1] = vadd_f32(scratch_2[0], scratch_2[1]); + + /* scratch_2[1] *= (-1, -1) */ + scratch_2[1] = vmul_f32(scratch_2[1], minusones_2); + Fout_2[3] = vadd_f32(scratch_2[0], scratch_2[1]); + + Fout_4[0] = vcombine_f32(Fout_2[0], Fout_2[1]); + Fout_4[1] = vcombine_f32(Fout_2[2], Fout_2[3]); + + vst1q_f32(bi, Fout_4[0]); + bi += 4; + vst1q_f32(bi, Fout_4[1]); + bi += 4; + } +} + +static void kf_bfly4_neon_m8(kiss_fft_cpx * Fout, + const size_t fstride, + const kiss_fft_state *st, + int m, + int N, +...

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

Hello, This patch introduces ARM NEON Intrinsics to optimize kf_bfly4 routine in celt part of libopus. Using NEON optimized kf_bfly4(_neon) routine helped improve performance of opus_fft_impl function by about 21.4%. The end use case was decoding a music opus ogg file. The end use case saw performance improvement of about 4.47%. This patch has 2 components i. Actual neon code to improve

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

...> + float *yi = y; These need to be const float32_t (in both xcorr_kernel_neon_float and xcorr_kernel_neon_float_process1). They're currently causing a ton of warning spew. float32_t appears to not be considered equivalent to float, which means you'll also need casts here: > + vst1q_f32(sum, SUMM); and here: > + vst1_lane_f32(sum, SUMM_2[0], 0);

[RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

2015 Jan 29

..._fft_impl(). > + > + out = (float *)tempin; These are pretty confusing names (if you have to keep this scaling here). Ideally they'd be related since they refer to the same memory (e.g., scaled and scaledp or something). Also, float is _not_ compatible with float32_t (which is what vst1q_f32 takes) in all compiler versions. Please do not mix and match them. > + scale = vld1_dup_f32(&st->scale); Needs a (const float32_t *) cast. > + for (i = 0; i < N2; i++) { > + inq = vld1q_f32(in); > + in += 4; > + outq = vmulq_lane_f32(inq, scale, 0); ...
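The review comment above asks for explicit casts wherever a plain float pointer is handed to a NEON load or store intrinsic. A minimal sketch of that pattern follows. The function name scale_floats and the HAVE_NEON_SKETCH macro are hypothetical and do not come from the patch, and the scalar fallback is there only so the sketch also builds on a non-ARM host.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1   /* hypothetical macro, not from the patch */
#else
#define HAVE_NEON_SKETCH 0
#endif

/* Hypothetical helper: out[i] = in[i] * (*scale) for n floats.
 * Assumes n is a multiple of 4 so no tail handling is needed. */
void scale_floats(float *out, const float *in, const float *scale, size_t n)
{
    assert(n % 4 == 0);
#if HAVE_NEON_SKETCH
    /* Cast float pointers to float32_t pointers explicitly, since some
     * compiler versions do not treat the two types as compatible. */
    float32x2_t s = vld1_dup_f32((const float32_t *)scale);
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t v = vld1q_f32((const float32_t *)(in + i));
        v = vmulq_lane_f32(v, s, 0);
        vst1q_f32((float32_t *)(out + i), v);
    }
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * (*scale);
    }
#endif
}
```

Keeping the casts at the intrinsic call sites lets the rest of the code keep using float, at the cost of some visual noise around each load and store.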

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 07

...n) then the temporary "result" seems to be kept in the generated code for the test function, and triggers the bad penalty of a load after a NEON store. vec4 operator* (vec4& a, vec4& b) { vec4 result; float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data)); vst1q_f32(result.data, result_data); return result; } __Z16TestVec4MultiplyR4vec4S0_S0_: @ BB#0: sub sp, #16 vld1.32 {d16, d17}, [r1] vld1.32 {d18, d19}, [r0] mov r0, sp vmul.f32 q8, q9, q8 vst1.32 {d16, d17}, [r0] vld1.32 {d16, d17}, [r0] vst1.32 {d16, d17}, [r2] add sp, #16 bx lr Is there som...

[RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

2015 Jan 29

...ut = (float *)tempin; > > > These are pretty confusing names (if you have to keep this scaling here). > Ideally they'd be related since they refer to the same memory (e.g., scaled > and scaledp or something). > > Also, float is _not_ compatible with float32_t (which is what vst1q_f32 > takes) in all compiler versions. Please do not mix and match them. > >> + scale = vld1_dup_f32(&st->scale); > > > Needs a (const float32_t *) cast. > >> + for (i = 0; i < N2; i++) { >> + inq = vld1q_f32(in); >> + in += 4; >>...

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 09

...32(yi++); > + case 2: > + XX_2 = vld1_dup_f32(xi++); > + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); > + YY[0] = vld1q_f32(yi++); > + case 1: > + XX_2 = vld1_dup_f32(xi++); > + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); > + } > + > + vst1q_f32(sum, SUMM); > +} > + > +/* > + * Function: xcorr3to1_kernel_neon_float > + * --------------------------------- > + * Computes single correlation values and stores in *sum > + */ > +void xcorr3to1_kernel_neon_float(const float *x, const float *y, > + float *s...

[RFC PATCH v1 0/3] Introducing ARM SIMD Support

2014 Sep 10

libvorbis does not currently have any simd/vectorization. Following patches add generic framework for simd/vectorization and on top, add ARM-NEON simd vectorization using intrinsics. I was able to get over 34% performance improvement on my Beaglebone Black which is single Cortex-A8 based CPU. You can find more information on metrics and procedure I used to measure at

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

...> These need to be const float32_t (in both xcorr_kernel_neon_float and > xcorr_kernel_neon_float_process1). They're currently causing a ton of > warning spew. float32_t appears to not be considered equivalent to > float, which means you'll also need casts here: > >> + vst1q_f32(sum, SUMM); > > and here: > >> + vst1_lane_f32(sum, SUMM_2[0], 0); Thanks, will do.

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 28

...). Use vld1_dup_f32() instead. It's faster and breaks dependencies. > + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0); > + } > + > + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]); > + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]); > + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]); > + > + vst1q_f32(sum, SUMM[0]); > +} > + > +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, > + opus_val32 *xcorr, int len, int max_pitch, int arch) { arch is unused. There's no reason to pass it here. If we're here, we know what the arch is. > + int i, j; ...

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 18

Almost there... just a few nits left. Viswanath Puttagunta wrote: > +if OPUS_ARM_NEON_INTR > +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) > +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3 I'll repeat: I don't think you should change the optimization level here. > + /* Just unroll the rest of the loop */ I saw you decided to keep this unrolled, but you didn't actually

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more in-line with algorithm in celt_pitch_xcorr_arm.s Viswanath Puttagunta

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 01

...nks. > >> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0); >> + } >> + >> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]); >> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]); >> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]); >> + >> + vst1q_f32(sum, SUMM[0]); >> +} >> + >> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, >> + opus_val32 *xcorr, int len, int max_pitch, int arch) { > > arch is unused. There's no reason to pass it here. If we're here, we ...

[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 21

Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 [1] for FFT optimizations for opus use cases. However, these FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize the celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 21

...tions = 3 */ + for (j = 0; j < cr; j++) { + YY[0] = vld1q_f32(yi++); + XX_2 = vld1_lane_f32(xi++, XX_2, 0); + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0); + } + + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]); + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]); + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]); + + vst1q_f32(sum, SUMM[0]); +} + +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch, int arch) { + int i, j; + + celt_assert(max_pitch > 0); + celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0); + + for (i = 0; i < (...

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

Hi, Optimizes celt_pitch_xcorr for ARM NEON floating point. Changes from RFCv3: - celt_neon_intr.c - removed warnings due to not having constant pointers - Put simpler loop to take care of corner cases. Unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s - Makefile.am Removed explicit -O3 optimization - test_unit_mathops.c,

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

...+ } + + yi++; + while (len > 1) { + XX_2 = vld1_dup_f32(xi++); + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); + YY[0]= vld1q_f32(yi++); + len--; + } + + if (len > 0) { + XX_2 = vld1_dup_f32(xi); + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); + } + + vst1q_f32(sum, SUMM); +} + +/* + * Function: xcorr_kernel_neon_float_process1 + * --------------------------------- + * Computes single correlation values and stores in *sum + */ +static void xcorr_kernel_neon_float_process1(const float *x, const float *y, + float *sum, int len) { + float32x4...
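The loop in the snippet above multiplies four consecutive y values by one broadcast x value and accumulates, which yields four correlation lags at once. A minimal sketch of that idea follows, assuming the accumulator starts at zero and that y holds at least len + 3 readable floats. The name xcorr_kernel_sketch and the HAVE_NEON_SKETCH macro are hypothetical. This is a simplified form and not the patch's actual kernel, which is only partly visible here.

```c
#include <assert.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1   /* hypothetical macro, not from the patch */
#else
#define HAVE_NEON_SKETCH 0
#endif

/* Hypothetical sketch: sum[k] = sum over j of x[j] * y[j + k], k = 0..3.
 * Assumes y has at least len + 3 readable elements. */
void xcorr_kernel_sketch(const float *x, const float *y, float sum[4], int len)
{
    assert(len > 0);
#if HAVE_NEON_SKETCH
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int j = 0; j < len; j++) {
        /* Four neighbouring y values, one per correlation lag. */
        float32x4_t yy = vld1q_f32((const float32_t *)(y + j));
        /* Broadcast x[j] to both lanes; lane 0 is used below. */
        float32x2_t xx = vld1_dup_f32((const float32_t *)(x + j));
        acc = vmlaq_lane_f32(acc, yy, xx, 0);
    }
    vst1q_f32((float32_t *)sum, acc);
#else
    /* Scalar fallback computing the same four lags. */
    for (int k = 0; k < 4; k++) {
        sum[k] = 0.0f;
    }
    for (int j = 0; j < len; j++) {
        for (int k = 0; k < 4; k++) {
            sum[k] += x[j] * y[j + k];
        }
    }
#endif
}
```

For example, with x = {1, 2}, y = {1, 2, 3, 4, 5} and len = 2, the four lags come out as 5, 8, 11 and 14.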

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

..._f32(SUMM, YY[0], XX_2, 0); + YY[0] = vld1q_f32(yi++); + case 2: + XX_2 = vld1_dup_f32(xi++); + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); + YY[0] = vld1q_f32(yi++); + case 1: + XX_2 = vld1_dup_f32(xi++); + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); + } + + vst1q_f32(sum, SUMM); +} + +/* + * Function: xcorr3to1_kernel_neon_float + * --------------------------------- + * Computes single correlation values and stores in *sum + */ +void xcorr3to1_kernel_neon_float(const float *x, const float *y, + float *sum, int len) { + int i; + float32x4_t XX[...

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

..._f32(SUMM, YY[0], XX_2, 0); + YY[0] = vld1q_f32(yi++); + case 2: + XX_2 = vld1_dup_f32(xi++); + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); + YY[0] = vld1q_f32(yi++); + case 1: + XX_2 = vld1_dup_f32(xi++); + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0); + } + + vst1q_f32(sum, SUMM); +} + +/* + * Function: xcorr_kernel_neon_float_process1 + * --------------------------------- + * Computes single correlation values and stores in *sum + */ +static void xcorr_kernel_neon_float_process1(const float *x, const float *y, + float *sum, int len) { + float32x4...

[RFC PATCH v1 0/2] Encode optimize using libNE10

2015 Jan 20

Hello opus-dev, I've been cooking up this patchset to integrate the NE10 library into opus. The current patchset focuses on the encode use case, mainly affecting performance of clt_mdct_forward() and opus_fft() (for float only). Glad to report the following on the encode use case: (Measured on my Beaglebone Black Cortex-A8 board) - Performance improvement for encode use case ~= 12.34% (Based on time -p
