search for: vld1q_dup_f32

Displaying 4 results from an estimated 4 matches for "vld1q_dup_f32".

2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ch_xcorr_arm.s for details and an example (there are other useful comments there that could shave another cycle or two from this inner loop).

> + YY[1] = vextq_f32(YY[0], YY[4], 1);
> + YY[2] = vextq_f32(YY[0], YY[4], 2);
> + YY[3] = vextq_f32(YY[0], YY[4], 3);
> +
> + XX[0] = vld1q_dup_f32(xi++);
> + XX[1] = vld1q_dup_f32(xi++);
> + XX[2] = vld1q_dup_f32(xi++);
> + XX[3] = vld1q_dup_f32(xi++);

Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by each value. That should cut at least 5 cycles out of this loop.

> +
> + SUMM[0] = vmlaq_f32(SUM...
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...more elegant. This comment applies to rest of feedback on celt_neon_intr.c

>> + YY[1] = vextq_f32(YY[0], YY[4], 1);
>> + YY[2] = vextq_f32(YY[0], YY[4], 2);
>> + YY[3] = vextq_f32(YY[0], YY[4], 3);
>> +
>> + XX[0] = vld1q_dup_f32(xi++);
>> + XX[1] = vld1q_dup_f32(xi++);
>> + XX[2] = vld1q_dup_f32(xi++);
>> + XX[3] = vld1q_dup_f32(xi++);
>
> Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by
> each value. That should cut at least 5 cycles...
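The reviewer's alternative (one vld1q_f32 load of four x values, then a vmlaq_lane_f32 per lane) could look roughly like the sketch below. This is an illustration against the names used in the quoted patch, not the code that eventually landed in Opus; the trailing YY[0] = YY[4] rotation is assumed from the loop structure, since the excerpt is cut off at that point.

```c
#include <arm_neon.h>

/* Sketch only: inner loop of the xcorr kernel with the review comment
 * applied. One vld1q_f32 load replaces four vld1q_dup_f32 loads, and
 * vmlaq_lane_f32 multiplies each YY vector by one lane of XX. */
static void xcorr_kernel_inner_sketch(const float *xi, const float *yi,
                                      float32x4_t SUMM[4], int cd)
{
    int j;
    float32x4_t YY[5];
    YY[0] = vld1q_f32(yi);
    /* Each iteration consumes 8 floats of y and 4 floats of x. */
    for (j = 0; j < cd; j++) {
        float32x4_t XX;
        yi += 4;
        YY[4] = vld1q_f32(yi);
        YY[1] = vextq_f32(YY[0], YY[4], 1);
        YY[2] = vextq_f32(YY[0], YY[4], 2);
        YY[3] = vextq_f32(YY[0], YY[4], 3);

        XX = vld1q_f32(xi);   /* single load instead of 4 dup loads */
        xi += 4;

        /* Multiply-accumulate by each scalar lane of XX. */
        SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], vget_low_f32(XX), 0);
        SUMM[1] = vmlaq_lane_f32(SUMM[1], YY[1], vget_low_f32(XX), 1);
        SUMM[2] = vmlaq_lane_f32(SUMM[2], YY[2], vget_high_f32(XX), 0);
        SUMM[3] = vmlaq_lane_f32(SUMM[3], YY[3], vget_high_f32(XX), 1);

        YY[0] = YY[4];  /* assumed rotation; excerpt is truncated here */
    }
}
```

The saving comes from issuing one 128-bit load per iteration instead of four single-element loads, while vmlaq_lane_f32 reads the scalar multiplier directly from a register lane.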
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 [1] for FFT optimizations for opus use cases. However, these FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize the celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...= vld1q_f32(yi);
+
+   /* Each loop consumes 8 floats in y vector
+    * and 4 floats in x vector
+    */
+   for (j = 0; j < cd; j++) {
+      yi += 4;
+      YY[4] = vld1q_f32(yi);
+      YY[1] = vextq_f32(YY[0], YY[4], 1);
+      YY[2] = vextq_f32(YY[0], YY[4], 2);
+      YY[3] = vextq_f32(YY[0], YY[4], 3);
+
+      XX[0] = vld1q_dup_f32(xi++);
+      XX[1] = vld1q_dup_f32(xi++);
+      XX[2] = vld1q_dup_f32(xi++);
+      XX[3] = vld1q_dup_f32(xi++);
+
+      SUMM[0] = vmlaq_f32(SUMM[0], XX[0], YY[0]);
+      SUMM[1] = vmlaq_f32(SUMM[1], XX[1], YY[1]);
+      SUMM[2] = vmlaq_f32(SUMM[2], XX[2], YY[2]);
+      SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
+      YY[...