search for: vld1q_f32

Displaying 20 results from an estimated 24 matches for "vld1q_f32".

2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi, Optimizes celt_pitch_xcorr for ARM NEON floating point. Changes from RFCv3:
- celt_neon_intr.c
  - removed warnings due to not having constant pointers
  - put in a simpler loop to take care of corner cases; unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s
- Makefile.am
  - removed explicit -O3 optimization
- test_unit_mathops.c,
2014 Dec 07
0
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._kernel_neon_float(const float *x, const float *y,
+                        float sum[4], int len) {
+    float32x4_t YY[3];
+    float32x4_t YEXT[3];
+    float32x4_t XX[2];
+    float32x2_t XX_2;
+    float32x4_t SUMM;
+    float *xi = x;
+    float *yi = y;
+
+    celt_assert(len>0);
+
+    YY[0] = vld1q_f32(yi);
+    SUMM = vdupq_n_f32(0);
+
+    /* Consume 8 elements in x vector and 12 elements in y
+     * vector. However, the 12'th element never really gets
+     * touched in this loop. So, if len == 8, then we only
+     * must access y[0] to y[10]. y[11] must not be accessed
+     * hence make sure...
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._kernel_neon_float(const float *x, const float *y,
+                        float sum[4], int len) {
+    float32x4_t YY[3];
+    float32x4_t YEXT[3];
+    float32x4_t XX[2];
+    float32x2_t XX_2;
+    float32x4_t SUMM;
+    float *xi = x;
+    float *yi = y;
+
+    celt_assert(len>0);
+
+    YY[0] = vld1q_f32(yi);
+    SUMM = vdupq_n_f32(0);
+
+    /* Consume 8 elements in x vector and 12 elements in y
+     * vector. However, the 12'th element never really gets
+     * touched in this loop. So, if len == 8, then we only
+     * must access y[0] to y[10]. y[11] must not be accessed
+     * hence make sure...
2014 Dec 10
2
[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv2:
- celt_neon_intr.c: applied the changes Timothy recommended, except the unrolled loop, which is still unrolled
- configure.ac: use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
- Moved compile flags into Makefile.am
- Fixed typo: OPUS_ARM_NEON_INR --> OPUS_ARM_NEON_INTR

Viswanath Puttagunta (1): armv7:
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._float(const float *x, const float *y,
+         float sum[4], int len) {
+    float32x4_t YY[3];
+    float32x4_t YEXT[3];
+    float32x4_t XX[2];
+    float32x2_t XX_2;
+    float32x4_t SUMM;
+    const float *xi = x;
+    const float *yi = y;
+
+    celt_assert(len>0);
+
+    YY[0] = vld1q_f32(yi);
+    SUMM = vdupq_n_f32(0);
+
+    /* Consume 8 elements in x vector and 12 elements in y
+     * vector. However, the 12'th element never really gets
+     * touched in this loop. So, if len == 8, then we only
+     * must access y[0] to y[10]. y[11] must not be accessed
+     * hence make sure...
2014 Dec 07
3
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1:
- Rebased on top of commit aad281878 ("Fix celt_pitch_xcorr_c signature"), which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of the "arch" parameter
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more
2014 Dec 07
2
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1:
- Rebased on top of commit aad281878 ("Fix celt_pitch_xcorr_c signature"), which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of the "arch" parameter
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more in line with the algorithm in celt_pitch_xcorr_arm.s

Viswanath Puttagunta
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...float sum[4], int len) {
> + float32x4_t YY[3];
> + float32x4_t YEXT[3];
> + float32x4_t XX[2];
> + float32x2_t XX_2;
> + float32x4_t SUMM;
> + const float *xi = x;
> + const float *yi = y;
> +
> + celt_assert(len>0);
> +
> + YY[0] = vld1q_f32(yi);
> + SUMM = vdupq_n_f32(0);
> +
> + /* Consume 8 elements in x vector and 12 elements in y
> +  * vector. However, the 12'th element never really gets
> +  * touched in this loop. So, if len == 8, then we only
> +  * must access y[0] to y[10]. y[11] must not be a...
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
 celt/bands.c                   | 7 ++++---
 celt/bands.h                   | 2 +-
 celt/celt_encoder.c            | 6 +++---
 celt/pitch.c                   | 2 +-
 src/opus_multistream_encoder.c | 2 +-
 5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
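The change itself is mechanical: each direct call to the scalar celt_inner_prod_c() becomes a call through the celt_inner_prod() dispatcher, so RTCD can substitute an optimized variant. Roughly (signatures recalled from the opus tree, so treat the exact form as an assumption):

/* before: hard-wired to the plain C implementation */
sum = celt_inner_prod_c(x, y, N);

/* after: dispatches through the arch table, picking up a SIMD
 * variant when the running CPU provides one */
sum = celt_inner_prod(x, y, N, arch);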
2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...hat wasn't true for my real situation) then the temporary "result" seems to be kept in the generated code for the test function, and triggers the bad penalty of a load after a NEON store.

vec4 operator* (vec4& a, vec4& b) {
    vec4 result;
    float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
    vst1q_f32(result.data, result_data);
    return result;
}

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
    sub sp, #16
    vld1.32 {d16, d17}, [r1]
    vld1.32 {d18, d19}, [r0]
    mov r0, sp
    vmul.f32 q8, q9, q8
    vst1.32 {d16, d17}, [r0]
    vld1.32 {d16, d17}, [r0]
    vst1.32 {d16, d17}...
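For readers skimming the thread, here is a self-contained C rendering of the pattern under discussion (hypothetical vec4 struct; the original is C++). The store into the stack temporary followed by the caller's reload is exactly the redundant vst1.32/vld1.32 pair visible in the assembly above:

#include <arm_neon.h>

typedef struct { float data[4]; } vec4;

/* The NEON store into result.data followed by the caller's
 * immediate reload is the round trip the optimizer failed to
 * eliminate at the time. */
static vec4 vec4_mul(const vec4 *a, const vec4 *b)
{
    vec4 result;
    vst1q_f32(result.data, vmulq_f32(vld1q_f32(a->data),
                                     vld1q_f32(b->data)));
    return result;
}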
2014 Dec 09
1
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...'re not going to special case the last 2+1+1 samples, is there a measurable performance difference compared to simply looping?

> + yi++;
> + switch(len) {
> + case 4:
> +    XX_2 = vld1_dup_f32(xi++);
> +    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +    YY[0] = vld1q_f32(yi++);
> + case 3:
> +    XX_2 = vld1_dup_f32(xi++);
> +    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +    YY[0] = vld1q_f32(yi++);
> + case 2:
> +    XX_2 = vld1_dup_f32(xi++);
> +    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +    YY[0] = vld1q_f32...
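The "simply looping" alternative the reviewer alludes to would look something like this (a sketch, not code from the thread; names mirror the patch):

#include <arm_neon.h>

/* Accumulate the 1..3 leftover samples with a plain loop instead of
 * a fall-through switch: one dup-load of x, one 4-float load of y,
 * and one lane multiply-accumulate per remaining sample. */
static float32x4_t xcorr_tail(const float *xi, const float *yi,
                              float32x4_t SUMM, int rem)
{
    while (rem-- > 0) {
        float32x2_t XX_2 = vld1_dup_f32(xi++);
        float32x4_t YY   = vld1q_f32(yi++);
        SUMM = vmlaq_lane_f32(SUMM, YY, XX_2, 0);
    }
    return SUMM;
}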
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...al C code, except each neon
+  * instruction consumes 1 complex number (2 floats).
+  * In theory, one could use Q regs instead of
+  * D regs, but you need to consider the case when N is odd.
+  * One can do that if it justifies performance improvement.
+  */
+
+   for (i = 0; i < N; i++) {
+      Fout_4[0] = vld1q_f32(ai);
+      ai += 4;
+      Fout_4[1] = vld1q_f32(ai);
+      ai += 4;
+      Fout_2[0] = vget_low_f32(Fout_4[0]);
+      Fout_2[1] = vget_high_f32(Fout_4[0]);
+      Fout_2[2] = vget_low_f32(Fout_4[1]);
+      Fout_2[3] = vget_high_f32(Fout_4[1]);
+
+      scratch_2[0] = vsub_f32(Fout_2[0], Fout_2[2]);
+      Fout_2[0] = vadd_f32(Fo...
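For orientation, the excerpt is the twiddle-free radix-4 butterfly; a minimal standalone sketch of that butterfly on one group of four complex floats, one complex number per D register as in the patch (illustrative only; forward-transform sign convention assumed):

#include <arm_neon.h>

/* f points at 4 interleaved complex floats (re,im). */
static void bfly4_unit_twiddles(float *f)
{
    float32x2_t f0 = vld1_f32(f),     f1 = vld1_f32(f + 2);
    float32x2_t f2 = vld1_f32(f + 4), f3 = vld1_f32(f + 6);

    float32x2_t s02 = vsub_f32(f0, f2), a02 = vadd_f32(f0, f2);
    float32x2_t s13 = vsub_f32(f1, f3), a13 = vadd_f32(f1, f3);

    /* multiply s13 by -i: (re,im) -> (im,-re) */
    float32x2_t r  = vrev64_f32(s13);
    float32x2_t mi = vset_lane_f32(-vget_lane_f32(r, 1), r, 1);

    vst1_f32(f,     vadd_f32(a02, a13));   /* X0 = a02 + a13    */
    vst1_f32(f + 2, vadd_f32(s02, mi));    /* X1 = s02 - i*s13  */
    vst1_f32(f + 4, vsub_f32(a02, a13));   /* X2 = a02 - a13    */
    vst1_f32(f + 6, vsub_f32(s02, mi));    /* X3 = s02 + i*s13  */
}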
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello, This patch introduces ARM NEON intrinsics to optimize the kf_bfly4 routine in the celt part of libopus. Using the NEON-optimized kf_bfly4(_neon) routine improved the performance of the opus_fft_impl function by about 21.4%. The end use case was decoding a music opus ogg file, and it saw a performance improvement of about 4.47%. This patch has 2 components: i. actual neon code to improve
2014 Dec 08
2
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code?
>
> If I had to guess, I'd say the intrinsic got in the way of recognising
> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
> vld1q_f32 is still kept as an intrinsic, so register allocators and
> schedulers get confused and, when lowering to assembly, you're left
> with garbage around it.
>
> Creating a bug for this is probably the best thing to do, since this
> is a common pattern that needs looking into to pro...
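A workaround that keeps the load visible to the optimizer (my own suggestion, not advice from this thread): avoid the opaque load intrinsic entirely, so the compiler emits an ordinary IR load that store-to-load forwarding can see through. A sketch:

#include <arm_neon.h>
#include <string.h>

/* Hypothetical drop-in for vld1q_f32(): memcpy into a vector-typed
 * variable reaches the IR as a normal load, and stays well-defined
 * with respect to alignment and aliasing. */
static inline float32x4_t load_f32x4(const float *p)
{
    float32x4_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}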
2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ponding >> and & (they are much slower).

> + int j;
> +
> + celt_assert(len>=3);
> +
> + /* Initialize sums to 0 */
> + SUMM[0] = vdupq_n_f32(0);
> + SUMM[1] = vdupq_n_f32(0);
> + SUMM[2] = vdupq_n_f32(0);
> + SUMM[3] = vdupq_n_f32(0);
> +
> + YY[0] = vld1q_f32(yi);
> +
> + /* Each loop consumes 8 floats in y vector
> +  * and 4 floats in x vector
> +  */
> + for (j = 0; j < cd; j++) {
> +    yi += 4;
> +    YY[4] = vld1q_f32(yi);

If len == 4, then in the first iteration you will have loaded 8 y values, but only 7 are guaranteed to b...
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...
>> + celt_assert(len>=3);
>> +
>> + /* Initialize sums to 0 */
>> + SUMM[0] = vdupq_n_f32(0);
>> + SUMM[1] = vdupq_n_f32(0);
>> + SUMM[2] = vdupq_n_f32(0);
>> + SUMM[3] = vdupq_n_f32(0);
>> +
>> + YY[0] = vld1q_f32(yi);
>> +
>> + /* Each loop consumes 8 floats in y vector
>> +  * and 4 floats in x vector
>> +  */
>> + for (j = 0; j < cd; j++) {
>> +    yi += 4;
>> +    YY[4] = vld1q_f32(yi);
>
> If len == 4, then in the first...
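One way to honour the bound the reviewer points out, sketched here as an assumption rather than the fix that was actually adopted: shorten the vectorized loop so its 4-float look-ahead load never reaches past y[len+2], and let the scalar remainder path absorb the extra iterations.

/* With four running sums, only y[0] .. y[len+2] may be read.
 * Each iteration pre-loads y[4*(j+1)] .. y[4*(j+1)+3], so require
 * 4*cd + 3 <= len + 2, i.e. cd <= (len - 1) / 4. */
int cd = (len - 1) / 4;   /* was: len / 4 */
int cr = len - 4 * cd;    /* remainder loop absorbs the difference */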
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 for FFT optimizations in opus use cases. However, those FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...at32x2_t XX_2;
+   float32x4_t SUMM[4];
+   float *xi = x;
+   float *yi = y;
+   int cd = len/4;
+   int cr = len%4;
+   int j;
+
+   celt_assert(len>=3);
+
+   /* Initialize sums to 0 */
+   SUMM[0] = vdupq_n_f32(0);
+   SUMM[1] = vdupq_n_f32(0);
+   SUMM[2] = vdupq_n_f32(0);
+   SUMM[3] = vdupq_n_f32(0);
+
+   YY[0] = vld1q_f32(yi);
+
+   /* Each loop consumes 8 floats in y vector
+    * and 4 floats in x vector
+    */
+   for (j = 0; j < cd; j++) {
+      yi += 4;
+      YY[4] = vld1q_f32(yi);
+      YY[1] = vextq_f32(YY[0], YY[4], 1);
+      YY[2] = vextq_f32(YY[0], YY[4], 2);
+      YY[3] = vextq_f32(YY[0], YY[4], 3);
+
+      XX[0] = vld1q_dup_f...
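The heart of this excerpt is the vextq_f32() trick: two adjacent loads of y yield all four shifted 4-float windows needed for the four correlation lags, with no further memory traffic. A standalone illustration (names hypothetical):

#include <arm_neon.h>

/* Build the windows y[0..3], y[1..4], y[2..5], y[3..6] from two
 * loads; vextq_f32 shifts across the concatenated register pair. */
static void y_windows(const float *y, float32x4_t w[4])
{
    float32x4_t y0 = vld1q_f32(y);      /* y[0..3] */
    float32x4_t y4 = vld1q_f32(y + 4);  /* y[4..7] */
    w[0] = y0;
    w[1] = vextq_f32(y0, y4, 1);        /* y[1..4] */
    w[2] = vextq_f32(y0, y4, 2);        /* y[2..5] */
    w[3] = vextq_f32(y0, y4, 3);        /* y[3..6] */
}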
2015 Jan 29
2
[RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
...n it cause issues elsewhere (and using the style encourages others to do it).

> + SAVE_STACK;
> + int N2 = st->nfft >> 1;
> + float32x4_t inq, outq;
> + float32x2_t scale;
> + float *in = (float *)fin;

You're dropping the const qualifier for no reason. Also, vld1q_f32() takes a const float32_t *, NOT a float *, and they are not compatible on all compiler versions.

> + float *out;
> + int i;
> + ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t);
> + ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);

This seems like a fairly large increase...
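A const-correct version of the flagged line might look like the following sketch (float32_t is the element type arm_neon.h declares for vld1q_f32(); `fin` is the patch's input pointer, everything else here is hypothetical):

#include <arm_neon.h>

/* Keep the caller's const qualifier and cast to float32_t, the
 * pointee type vld1q_f32() is actually declared with. */
static float32x4_t load_input(const void *fin)
{
    const float32_t *in = (const float32_t *)fin;
    return vld1q_f32(in);  /* two complex floats = 4 lanes */
}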
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any SIMD/vectorization. The following patches add a generic framework for SIMD/vectorization and, on top of it, add ARM NEON vectorization using intrinsics. I was able to get over a 34% performance improvement on my BeagleBone Black, which has a single Cortex-A8 CPU. You can find more information on the metrics and procedure I used to measure at