search for: vld1q_dup_f32

Displaying 4 results from an estimated 4 matches for "vld1q_dup_f32".

2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ch_xcorr_arm.s for details and an example (there are other useful comments there that could shave another cycle or two from this inner loop).

> + YY[1] = vextq_f32(YY[0], YY[4], 1);
> + YY[2] = vextq_f32(YY[0], YY[4], 2);
> + YY[3] = vextq_f32(YY[0], YY[4], 3);
> +
> + XX[0] = vld1q_dup_f32(xi++);
> + XX[1] = vld1q_dup_f32(xi++);
> + XX[2] = vld1q_dup_f32(xi++);
> + XX[3] = vld1q_dup_f32(xi++);

Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by each value. That should cut at least 5 cycles out of this loop.

> +
> + SUMM[0] = vmlaq_f32(SUM...
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...more elegant. This comment applies to rest of feedback on celt_neon_intr.c

>> + YY[1] = vextq_f32(YY[0], YY[4], 1);
>> + YY[2] = vextq_f32(YY[0], YY[4], 2);
>> + YY[3] = vextq_f32(YY[0], YY[4], 3);
>> +
>> + XX[0] = vld1q_dup_f32(xi++);
>> + XX[1] = vld1q_dup_f32(xi++);
>> + XX[2] = vld1q_dup_f32(xi++);
>> + XX[3] = vld1q_dup_f32(xi++);
>
> Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by
> each value. That should cut at least 5 cycles...
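The reviewer's alternative (one vld1q_f32 load of four x values, then a vmlaq_lane_f32 per lane) could look roughly like the sketch below. This is an illustration against the names used in the quoted patch, not the code that eventually landed in Opus; the trailing YY[0] = YY[4] rotation is assumed from the loop structure, since the excerpt is cut off at that point.

```c
#include <arm_neon.h>

/* Sketch only: inner loop of the xcorr kernel with the review comment
 * applied. One vld1q_f32 load replaces four vld1q_dup_f32 loads, and
 * vmlaq_lane_f32 multiplies each YY vector by one lane of XX. */
static void xcorr_kernel_inner_sketch(const float *xi, const float *yi,
                                      float32x4_t SUMM[4], int cd)
{
    int j;
    float32x4_t YY[5];
    YY[0] = vld1q_f32(yi);
    /* Each iteration consumes 8 floats of y and 4 floats of x. */
    for (j = 0; j < cd; j++) {
        float32x4_t XX;
        yi += 4;
        YY[4] = vld1q_f32(yi);
        YY[1] = vextq_f32(YY[0], YY[4], 1);
        YY[2] = vextq_f32(YY[0], YY[4], 2);
        YY[3] = vextq_f32(YY[0], YY[4], 3);

        XX = vld1q_f32(xi);   /* single load instead of 4 dup loads */
        xi += 4;

        /* Multiply-accumulate by each scalar lane of XX. */
        SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], vget_low_f32(XX), 0);
        SUMM[1] = vmlaq_lane_f32(SUMM[1], YY[1], vget_low_f32(XX), 1);
        SUMM[2] = vmlaq_lane_f32(SUMM[2], YY[2], vget_high_f32(XX), 0);
        SUMM[3] = vmlaq_lane_f32(SUMM[3], YY[3], vget_high_f32(XX), 1);

        YY[0] = YY[4];  /* assumed rotation; excerpt is truncated here */
    }
}
```

The saving comes from issuing one 128-bit load per iteration instead of four single-element loads, while vmlaq_lane_f32 reads the scalar multiplier directly from a register lane.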
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 [1] for FFT optimizations for opus use cases. However, these FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize the celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...= vld1q_f32(yi);
+
+   /* Each loop consumes 8 floats in y vector
+    * and 4 floats in x vector
+    */
+   for (j = 0; j < cd; j++) {
+      yi += 4;
+      YY[4] = vld1q_f32(yi);
+      YY[1] = vextq_f32(YY[0], YY[4], 1);
+      YY[2] = vextq_f32(YY[0], YY[4], 2);
+      YY[3] = vextq_f32(YY[0], YY[4], 3);
+
+      XX[0] = vld1q_dup_f32(xi++);
+      XX[1] = vld1q_dup_f32(xi++);
+      XX[2] = vld1q_dup_f32(xi++);
+      XX[3] = vld1q_dup_f32(xi++);
+
+      SUMM[0] = vmlaq_f32(SUMM[0], XX[0], YY[0]);
+      SUMM[1] = vmlaq_f32(SUMM[1], XX[1], YY[1]);
+      SUMM[2] = vmlaq_f32(SUMM[2], XX[2], YY[2]);
+      SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
+      YY[...