search for: vld1q_f32

Displaying 20 results from an estimated 24 matches for "vld1q_f32".

2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi, Optimizes celt_pitch_xcorr for ARM NEON floating point. Changes from RFCv3:
- celt_neon_intr.c
  - removed warnings due to not having constant pointers
  - put in a simpler loop to take care of corner cases; unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s
- Makefile.am
  - removed explicit -O3 optimization
- test_unit_mathops.c,
2014 Dec 07
0
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._kernel_neon_float(const float *x, const float *y,
+                        float sum[4], int len) {
+    float32x4_t YY[3];
+    float32x4_t YEXT[3];
+    float32x4_t XX[2];
+    float32x2_t XX_2;
+    float32x4_t SUMM;
+    float *xi = x;
+    float *yi = y;
+
+    celt_assert(len>0);
+
+    YY[0] = vld1q_f32(yi);
+    SUMM = vdupq_n_f32(0);
+
+    /* Consume 8 elements in x vector and 12 elements in y
+     * vector. However, the 12'th element never really gets
+     * touched in this loop. So, if len == 8, then we only
+     * must access y[0] to y[10]. y[11] must not be accessed
+     * hence make sure...
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._kernel_neon_float(const float *x, const float *y,
+                        float sum[4], int len) {
+    float32x4_t YY[3];
+    float32x4_t YEXT[3];
+    float32x4_t XX[2];
+    float32x2_t XX_2;
+    float32x4_t SUMM;
+    float *xi = x;
+    float *yi = y;
+
+    celt_assert(len>0);
+
+    YY[0] = vld1q_f32(yi);
+    SUMM = vdupq_n_f32(0);
+
+    /* Consume 8 elements in x vector and 12 elements in y
+     * vector. However, the 12'th element never really gets
+     * touched in this loop. So, if len == 8, then we only
+     * must access y[0] to y[10]. y[11] must not be accessed
+     * hence make sure...
2014 Dec 10
2
[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv2:
- celt_neon_intr.c: applied the changes Timothy recommended, except the unrolled loop, which is still unrolled
- configure.ac: use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE
- Moved compile flags into Makefile.am
- Fixed typo: OPUS_ARM_NEON_INR --> OPUS_ARM_NEON_INTR

Viswanath Puttagunta (1): armv7:
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._float(const float *x, const float *y,
+         float sum[4], int len) {
+    float32x4_t YY[3];
+    float32x4_t YEXT[3];
+    float32x4_t XX[2];
+    float32x2_t XX_2;
+    float32x4_t SUMM;
+    const float *xi = x;
+    const float *yi = y;
+
+    celt_assert(len>0);
+
+    YY[0] = vld1q_f32(yi);
+    SUMM = vdupq_n_f32(0);
+
+    /* Consume 8 elements in x vector and 12 elements in y
+     * vector. However, the 12'th element never really gets
+     * touched in this loop. So, if len == 8, then we only
+     * must access y[0] to y[10]. y[11] must not be accessed
+     * hence make sure...
2014 Dec 07
3
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1:
- Rebased on top of commit aad281878 ("Fix celt_pitch_xcorr_c signature"), which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of the "arch" parameter
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more
2014 Dec 07
2
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1:
- Rebased on top of commit aad281878 ("Fix celt_pitch_xcorr_c signature"), which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of the "arch" parameter
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more in line with the algorithm in celt_pitch_xcorr_arm.s

Viswanath Puttagunta
2014 Dec 19
2
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...float sum[4], int len) {
> + float32x4_t YY[3];
> + float32x4_t YEXT[3];
> + float32x4_t XX[2];
> + float32x2_t XX_2;
> + float32x4_t SUMM;
> + const float *xi = x;
> + const float *yi = y;
> +
> + celt_assert(len>0);
> +
> + YY[0] = vld1q_f32(yi);
> + SUMM = vdupq_n_f32(0);
> +
> + /* Consume 8 elements in x vector and 12 elements in y
> +  * vector. However, the 12'th element never really gets
> +  * touched in this loop. So, if len == 8, then we only
> +  * must access y[0] to y[10]. y[11] must not be a...
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
 celt/bands.c                   | 7 ++++---
 celt/bands.h                   | 2 +-
 celt/celt_encoder.c            | 6 +++---
 celt/pitch.c                   | 2 +-
 src/opus_multistream_encoder.c | 2 +-
 5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
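The change itself is mechanical: each direct call to the scalar celt_inner_prod_c() becomes a call through the celt_inner_prod() dispatcher, so RTCD can substitute an optimized variant. Roughly (signatures recalled from the opus tree, so treat the exact form as an assumption):

/* before: hard-wired to the plain C implementation */
sum = celt_inner_prod_c(x, y, N);

/* after: dispatches through the arch table, picking up a SIMD
 * variant when the running CPU provides one */
sum = celt_inner_prod(x, y, N, arch);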
2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...hat wasn't true for my real situation) then the temporary "result" seems to be kept in the generated code for the test function, and triggers the bad penalty of a load after a NEON store.

vec4 operator* (vec4& a, vec4& b) {
    vec4 result;
    float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
    vst1q_f32(result.data, result_data);
    return result;
}

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
    sub sp, #16
    vld1.32 {d16, d17}, [r1]
    vld1.32 {d18, d19}, [r0]
    mov r0, sp
    vmul.f32 q8, q9, q8
    vst1.32 {d16, d17}, [r0]
    vld1.32 {d16, d17}, [r0]
    vst1.32 {d16, d17}...
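For readers skimming the thread, here is a self-contained C rendering of the pattern under discussion (hypothetical vec4 struct; the original is C++). The store into the stack temporary followed by the caller's reload is exactly the redundant vst1.32/vld1.32 pair visible in the assembly above:

#include <arm_neon.h>

typedef struct { float data[4]; } vec4;

/* The NEON store into result.data followed by the caller's
 * immediate reload is the round trip the optimizer failed to
 * eliminate at the time. */
static vec4 vec4_mul(const vec4 *a, const vec4 *b)
{
    vec4 result;
    vst1q_f32(result.data, vmulq_f32(vld1q_f32(a->data),
                                     vld1q_f32(b->data)));
    return result;
}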
2014 Dec 09
1
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...'re not going to special case the last 2+1+1 samples, is there a measurable performance difference compared to simply looping?

> + yi++;
> + switch(len) {
> + case 4:
> +    XX_2 = vld1_dup_f32(xi++);
> +    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +    YY[0] = vld1q_f32(yi++);
> + case 3:
> +    XX_2 = vld1_dup_f32(xi++);
> +    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +    YY[0] = vld1q_f32(yi++);
> + case 2:
> +    XX_2 = vld1_dup_f32(xi++);
> +    SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +    YY[0] = vld1q_f32...
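The "simply looping" alternative the reviewer alludes to would look something like this (a sketch, not code from the thread; names mirror the patch):

#include <arm_neon.h>

/* Accumulate the 1..3 leftover samples with a plain loop instead of
 * a fall-through switch: one dup-load of x, one 4-float load of y,
 * and one lane multiply-accumulate per remaining sample. */
static float32x4_t xcorr_tail(const float *xi, const float *yi,
                              float32x4_t SUMM, int rem)
{
    while (rem-- > 0) {
        float32x2_t XX_2 = vld1_dup_f32(xi++);
        float32x4_t YY   = vld1q_f32(yi++);
        SUMM = vmlaq_lane_f32(SUMM, YY, XX_2, 0);
    }
    return SUMM;
}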
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...al C code, except each neon
+  * instruction consumes 1 complex number (2 floats).
+  * In theory, one could use Q regs instead of
+  * D regs, but you need to consider the case when N is odd.
+  * One can do that if it justifies performance improvement.
+  */
+
+   for (i = 0; i < N; i++) {
+      Fout_4[0] = vld1q_f32(ai);
+      ai += 4;
+      Fout_4[1] = vld1q_f32(ai);
+      ai += 4;
+      Fout_2[0] = vget_low_f32(Fout_4[0]);
+      Fout_2[1] = vget_high_f32(Fout_4[0]);
+      Fout_2[2] = vget_low_f32(Fout_4[1]);
+      Fout_2[3] = vget_high_f32(Fout_4[1]);
+
+      scratch_2[0] = vsub_f32(Fout_2[0], Fout_2[2]);
+      Fout_2[0] = vadd_f32(Fo...
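For orientation, the excerpt is the twiddle-free radix-4 butterfly; a minimal standalone sketch of that butterfly on one group of four complex floats, one complex number per D register as in the patch (illustrative only; forward-transform sign convention assumed):

#include <arm_neon.h>

/* f points at 4 interleaved complex floats (re,im). */
static void bfly4_unit_twiddles(float *f)
{
    float32x2_t f0 = vld1_f32(f),     f1 = vld1_f32(f + 2);
    float32x2_t f2 = vld1_f32(f + 4), f3 = vld1_f32(f + 6);

    float32x2_t s02 = vsub_f32(f0, f2), a02 = vadd_f32(f0, f2);
    float32x2_t s13 = vsub_f32(f1, f3), a13 = vadd_f32(f1, f3);

    /* multiply s13 by -i: (re,im) -> (im,-re) */
    float32x2_t r  = vrev64_f32(s13);
    float32x2_t mi = vset_lane_f32(-vget_lane_f32(r, 1), r, 1);

    vst1_f32(f,     vadd_f32(a02, a13));   /* X0 = a02 + a13    */
    vst1_f32(f + 2, vadd_f32(s02, mi));    /* X1 = s02 - i*s13  */
    vst1_f32(f + 4, vsub_f32(a02, a13));   /* X2 = a02 - a13    */
    vst1_f32(f + 6, vsub_f32(s02, mi));    /* X3 = s02 + i*s13  */
}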
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello, This patch introduces ARM NEON intrinsics to optimize the kf_bfly4 routine in the celt part of libopus. Using the NEON-optimized kf_bfly4(_neon) routine improved the performance of the opus_fft_impl function by about 21.4%. The end use case was decoding a music opus ogg file, and it saw a performance improvement of about 4.47%. This patch has 2 components: i. actual neon code to improve
2014 Dec 08
2
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code?
>
> If I had to guess, I'd say the intrinsic got in the way of recognising
> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
> vld1q_f32 is still kept as an intrinsic, so register allocators and
> schedulers get confused and, when lowering to assembly, you're left
> with garbage around it.
>
> Creating a bug for this is probably the best thing to do, since this
> is a common pattern that needs looking into to pro...
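A workaround that keeps the load visible to the optimizer (my own suggestion, not advice from this thread): avoid the opaque load intrinsic entirely, so the compiler emits an ordinary IR load that store-to-load forwarding can see through. A sketch:

#include <arm_neon.h>
#include <string.h>

/* Hypothetical drop-in for vld1q_f32(): memcpy into a vector-typed
 * variable reaches the IR as a normal load, and stays well-defined
 * with respect to alignment and aliasing. */
static inline float32x4_t load_f32x4(const float *p)
{
    float32x4_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}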
2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ponding >> and & (they are much slower).

> + int j;
> +
> + celt_assert(len>=3);
> +
> + /* Initialize sums to 0 */
> + SUMM[0] = vdupq_n_f32(0);
> + SUMM[1] = vdupq_n_f32(0);
> + SUMM[2] = vdupq_n_f32(0);
> + SUMM[3] = vdupq_n_f32(0);
> +
> + YY[0] = vld1q_f32(yi);
> +
> + /* Each loop consumes 8 floats in y vector
> +  * and 4 floats in x vector
> +  */
> + for (j = 0; j < cd; j++) {
> +    yi += 4;
> +    YY[4] = vld1q_f32(yi);

If len == 4, then in the first iteration you will have loaded 8 y values, but only 7 are guaranteed to b...
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...
>> + celt_assert(len>=3);
>> +
>> + /* Initialize sums to 0 */
>> + SUMM[0] = vdupq_n_f32(0);
>> + SUMM[1] = vdupq_n_f32(0);
>> + SUMM[2] = vdupq_n_f32(0);
>> + SUMM[3] = vdupq_n_f32(0);
>> +
>> + YY[0] = vld1q_f32(yi);
>> +
>> + /* Each loop consumes 8 floats in y vector
>> +  * and 4 floats in x vector
>> +  */
>> + for (j = 0; j < cd; j++) {
>> +    yi += 4;
>> +    YY[4] = vld1q_f32(yi);
>
> If len == 4, then in the first...
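One way to honour the bound the reviewer points out, sketched here as an assumption rather than the fix that was actually adopted: shorten the vectorized loop so its 4-float look-ahead load never reaches past y[len+2], and let the scalar remainder path absorb the extra iterations.

/* With four running sums, only y[0] .. y[len+2] may be read.
 * Each iteration pre-loads y[4*(j+1)] .. y[4*(j+1)+3], so require
 * 4*cd + 3 <= len + 2, i.e. cd <= (len - 1) / 4. */
int cd = (len - 1) / 4;   /* was: len / 4 */
int cr = len - 4 * cd;    /* remainder loop absorbs the difference */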
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 for FFT optimizations in opus use cases. However, those FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...at32x2_t XX_2;
+   float32x4_t SUMM[4];
+   float *xi = x;
+   float *yi = y;
+   int cd = len/4;
+   int cr = len%4;
+   int j;
+
+   celt_assert(len>=3);
+
+   /* Initialize sums to 0 */
+   SUMM[0] = vdupq_n_f32(0);
+   SUMM[1] = vdupq_n_f32(0);
+   SUMM[2] = vdupq_n_f32(0);
+   SUMM[3] = vdupq_n_f32(0);
+
+   YY[0] = vld1q_f32(yi);
+
+   /* Each loop consumes 8 floats in y vector
+    * and 4 floats in x vector
+    */
+   for (j = 0; j < cd; j++) {
+      yi += 4;
+      YY[4] = vld1q_f32(yi);
+      YY[1] = vextq_f32(YY[0], YY[4], 1);
+      YY[2] = vextq_f32(YY[0], YY[4], 2);
+      YY[3] = vextq_f32(YY[0], YY[4], 3);
+
+      XX[0] = vld1q_dup_f...
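The heart of this excerpt is the vextq_f32() trick: two adjacent loads of y yield all four shifted 4-float windows needed for the four correlation lags, with no further memory traffic. A standalone illustration (names hypothetical):

#include <arm_neon.h>

/* Build the windows y[0..3], y[1..4], y[2..5], y[3..6] from two
 * loads; vextq_f32 shifts across the concatenated register pair. */
static void y_windows(const float *y, float32x4_t w[4])
{
    float32x4_t y0 = vld1q_f32(y);      /* y[0..3] */
    float32x4_t y4 = vld1q_f32(y + 4);  /* y[4..7] */
    w[0] = y0;
    w[1] = vextq_f32(y0, y4, 1);        /* y[1..4] */
    w[2] = vextq_f32(y0, y4, 2);        /* y[2..5] */
    w[3] = vextq_f32(y0, y4, 3);        /* y[3..6] */
}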
2015 Jan 29
2
[RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
...n it cause issues elsewhere (and using the style encourages others to do it).

> + SAVE_STACK;
> + int N2 = st->nfft >> 1;
> + float32x4_t inq, outq;
> + float32x2_t scale;
> + float *in = (float *)fin;

You're dropping the const qualifier for no reason. Also, vld1q_f32() takes a const float32_t *, NOT a float *, and they are not compatible on all compiler versions.

> + float *out;
> + int i;
> + ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t);
> + ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);

This seems like a fairly large increase...
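A const-correct version of the flagged line might look like the following sketch (float32_t is the element type arm_neon.h declares for vld1q_f32(); `fin` is the patch's input pointer, everything else here is hypothetical):

#include <arm_neon.h>

/* Keep the caller's const qualifier and cast to float32_t, the
 * pointee type vld1q_f32() is actually declared with. */
static float32x4_t load_input(const void *fin)
{
    const float32_t *in = (const float32_t *)fin;
    return vld1q_f32(in);  /* two complex floats = 4 lanes */
}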
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any SIMD/vectorization. The following patches add a generic framework for SIMD/vectorization and, on top of it, add ARM NEON vectorization using intrinsics. I was able to get over a 34% performance improvement on my BeagleBone Black, which has a single Cortex-A8 CPU. You can find more information on the metrics and procedure I used to measure at