Linfeng Zhang
2017-Feb-07 00:18 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
This is a great idea. But the order (psEncC->shapingLPCOrder) can be configured to 12, 14, 16, 20 and 24 according to complexity parameter. It's hard to get a universal function to handle all these orders efficiently. Any suggestions? Thanks, Linfeng On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:> Hi Linfeng, > > On 06/02/17 02:51 PM, Linfeng Zhang wrote: > > However, the critical thing is that all the states in each stage when > > processing input[i] are reused by the next input[i+1]. That is > > input[i+1] must wait input[i] for 1 stage, and input[i+2] must wait > > input[i+1] for 1 stage, etc. > > That is indeed the tricky part... and the one I think you could do > slightly differently. If you approach the problem in terms of computing > chunks of the inputs N samples at a time, then indeed the approach you > are describing is the only solution. What I was proposing though is to > instead chop the "order" in chunks of N. Using your notation, you would > be doing: > > PROC( in0(s0)) > PROC( in0(s1) in1(s0)) > PROC( in0(s2) in1(s1) in2(s0)) > PROC( in0(s3) in1(s2) in2(s1) in3(s0)) > PROC( in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) > PROC( in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) > PROC( in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0)) > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0)) > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1)in10(s0)) > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2)in10(s1)in11(s0)) > ...and so on until the end of the input vector > > The difference is that it's now the input vector that "slides" and the > "state" values sy that remain in the same place. There's still a > prologue, but you can easily get rid of it by (implicitly) zero-padding > the in vector during the initialization phase (start with a zero vector > and real one value at a time). Getting rid of the epilogue is a little > trickier, but I think it can be done. > > Cheers, > > Jean-Marc > > > Then it becomes this > > > > FOR in=0 to N WITH in+=8 > > PROC(in0(s0)) /* prolog 0 */ > > PROC(in0(s1) in1(s0)) /* prolog 1 */ > > PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */ > > PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */ > > PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */ > > PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */ > > PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* > > prolog 6 */ > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) > > /* fully process 8 inputs */ > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) > > /* continue */ > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) > > /* continue */ > > PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3)) > > /* continue */ > > PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* > > epilog 0 */ > > PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1 */ > > PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */ > > PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */ > > PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */ > > PROC(in6(s10) in7(s9)) /* epilog 5 */ > > PROC(in7(s10)) /* epilog 6 */ > > END FOR > > > > And > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) > > /* fully process 8 inputs */ > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) > > /* continue */ > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) > > /* continue */ > > is actually the expansion of the kernel loop > > FOR i=0 TO order-6 WITH i++ > > PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) > > in6(si+1) in7(si+0)) > > END FOR > > > > The worst thing is that corr_QC[] is so sensitive that any extra > > processing will make them wrong and propagate to the next loop (next 8 > > inputs). state_QS[] is a little better but still very sensitive. For > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7) > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time) > > and remove epilog 0, then all final results will be wrong. > > > > That's why the prolog and epilog cannot be saved to the best of my > > knowledge. > > > > The assembly size of silk_warped_autocorrelation_FIX_neon() is about > > 2,744 bytes. Compared with the C code size (about 452 bytes), it's 2.3 > > KB larger. Considering silk_warped_autocorrelation_FIX_c() is the second > > place CPU heavy function in fixed-point, and our testing shows up to 7% > > CPU run time saving of the total encoder with this optimization (at > > Complexity 8), maybe we can take the I-cache burden even if finally we > > still cannot remove the big chunk of prolog and epilog. > > > > Thanks, > > Linfeng Zhang > > > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca > > <mailto:jmvalin at jmvalin.ca>> wrote: > > > > Hi Felicia, > > > > I've had time to work through the math in the original function and > I'm > > pretty sure it's possible to vectorize this without the huge > > prologue/epilogue. > > > > For the simple case where order = vector size = N (but it should > easily > > generalize to larger order), what I came up with is: > > > > initialize X, Y, M, C to vector of zeros > > > > for i=0 to N+order > > T = [x(i), Y(0:N-2)] > > Y = M + coeff * (Y - T) > > M = T > > X = [x(i), X(0:N-1)] > > C = C + X*Y > > > > I think something similar to this (assuming I didn't mess up any > > details) should give you the correlations in vector C. Did I miss > > anything? > > > > Cheers, > > > > Jean-Marc > > > > > > On 31/01/17 12:30 PM, Felicia Lim wrote: > > > Hi, > > > > > > Attached is a patch with arm neon optimizations for > > > silk_warped_autocorrelation_FIX(). Please review. > > > > > > Thanks, > > > Felicia > > > > > > > > > _______________________________________________ > > > opus mailing list > > > opus at xiph.org <mailto:opus at xiph.org> > > > http://lists.xiph.org/mailman/listinfo/opus > > <http://lists.xiph.org/mailman/listinfo/opus> > > > > > _______________________________________________ > > opus mailing list > > opus at xiph.org <mailto:opus at xiph.org> > > http://lists.xiph.org/mailman/listinfo/opus > > <http://lists.xiph.org/mailman/listinfo/opus> > > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170206/d29bd2a2/attachment.html>
Jean-Marc Valin
2017-Feb-07 01:47 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Linfeng, On 06/02/17 07:18 PM, Linfeng Zhang wrote:> This is a great idea. But the order (psEncC->shapingLPCOrder) can be > configured to 12, 14, 16, 20 and 24 according to complexity parameter. > > It's hard to get a universal function to handle all these orders > efficiently. Any suggestions?I can think of two ways of handling larger orders. The obvious one is simply to add an inner loop of the form: for (i=0;i<order;i+=VECTOR_SIZE) I think what may be more efficient is to simply have a small "order-N" (N=4 or 8) kernel that not only computes the correlation of order N, but then spits out the signal after the N-stage all-pass is applied. The kernel would look like: void autocorr_kernel4(int *corr, int *orig, int *input, int *output, int len) { /* Implement vectorized order-4 filter (could also be order 8) as described in previous email and outputs the filtered signal. */ } and then the full function would run the kernel multiple times and look like: void full_autocorr(int *corr, int *orig, int len, int order) { int i; int tmp[MAX_SIZE]; int *in = orig; for (i=0;i<order;i+=4) { autocorr_kernel4(corr+i, orig, in, tmp, len); /* Make subsequent calls use the filtered signal as input. */ in = tmp; } } I think the should not only reduce/eliminate the prologue/epilogue problem, but it should also be more efficient since almost all vectors processed would use the full size. Maybe a third option (not sure it's a good idea, but still mentioning it) would be to have a function that hardcodes order=24 and discards the larger values that aren't needed. Since the smallest order seems to be 16, it wouldn't be much of a waste and the code might end up running faster for the higher orders. Cheers, Jean-Marc> Thanks, > Linfeng > > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca > <mailto:jmvalin at jmvalin.ca>> wrote: > > Hi Linfeng, > > On 06/02/17 02:51 PM, Linfeng Zhang wrote: > > However, the critical thing is that all the states in each stage when > > processing input[i] are reused by the next input[i+1]. That is > > input[i+1] must wait input[i] for 1 stage, and input[i+2] must wait > > input[i+1] for 1 stage, etc. > > That is indeed the tricky part... and the one I think you could do > slightly differently. If you approach the problem in terms of computing > chunks of the inputs N samples at a time, then indeed the approach you > are describing is the only solution. What I was proposing though is to > instead chop the "order" in chunks of N. Using your notation, you would > be doing: > > PROC( in0(s0)) > PROC( in0(s1) in1(s0)) > PROC( in0(s2) in1(s1) in2(s0)) > PROC( in0(s3) in1(s2) in2(s1) in3(s0)) > PROC( in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) > PROC( in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) > PROC( in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0)) > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0)) > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1)in10(s0)) > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2)in10(s1)in11(s0)) > ...and so on until the end of the input vector > > The difference is that it's now the input vector that "slides" and the > "state" values sy that remain in the same place. There's still a > prologue, but you can easily get rid of it by (implicitly) zero-padding > the in vector during the initialization phase (start with a zero vector > and real one value at a time). Getting rid of the epilogue is a little > trickier, but I think it can be done. > > Cheers, > > Jean-Marc > > > Then it becomes this > > > > FOR in=0 to N WITH in+=8 > > PROC(in0(s0)) /* prolog 0 */ > > PROC(in0(s1) in1(s0)) /* prolog 1 */ > > PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */ > > PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */ > > PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */ > > PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */ > > PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* > > prolog 6 */ > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) > in7(s0)) > > /* fully process 8 inputs */ > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) > in7(s1)) > > /* continue */ > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) > in7(s2)) > > /* continue */ > > PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) > in7(s3)) > > /* continue */ > > PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* > > epilog 0 */ > > PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog > 1 */ > > PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */ > > PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */ > > PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */ > > PROC(in6(s10) in7(s9)) /* epilog 5 */ > > PROC(in7(s10)) /* epilog 6 */ > > END FOR > > > > And > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) > in7(s0)) > > /* fully process 8 inputs */ > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) > in7(s1)) > > /* continue */ > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) > in7(s2)) > > /* continue */ > > is actually the expansion of the kernel loop > > FOR i=0 TO order-6 WITH i++ > > PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) > > in6(si+1) in7(si+0)) > > END FOR > > > > The worst thing is that corr_QC[] is so sensitive that any extra > > processing will make them wrong and propagate to the next loop (next 8 > > inputs). state_QS[] is a little better but still very sensitive. For > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7) > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time) > > and remove epilog 0, then all final results will be wrong. > > > > That's why the prolog and epilog cannot be saved to the best of my > > knowledge. > > > > The assembly size of silk_warped_autocorrelation_FIX_neon() is about > > 2,744 bytes. Compared with the C code size (about 452 bytes), it's 2.3 > > KB larger. Considering silk_warped_autocorrelation_FIX_c() is the > second > > place CPU heavy function in fixed-point, and our testing shows up > to 7% > > CPU run time saving of the total encoder with this optimization (at > > Complexity 8), maybe we can take the I-cache burden even if finally we > > still cannot remove the big chunk of prolog and epilog. > > > > Thanks, > > Linfeng Zhang > > > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin > <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca> > > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>> wrote: > > > > Hi Felicia, > > > > I've had time to work through the math in the original > function and I'm > > pretty sure it's possible to vectorize this without the huge > > prologue/epilogue. > > > > For the simple case where order = vector size = N (but it > should easily > > generalize to larger order), what I came up with is: > > > > initialize X, Y, M, C to vector of zeros > > > > for i=0 to N+order > > T = [x(i), Y(0:N-2)] > > Y = M + coeff * (Y - T) > > M = T > > X = [x(i), X(0:N-1)] > > C = C + X*Y > > > > I think something similar to this (assuming I didn't mess up any > > details) should give you the correlations in vector C. Did I miss > > anything? > > > > Cheers, > > > > Jean-Marc > > > > > > On 31/01/17 12:30 PM, Felicia Lim wrote: > > > Hi, > > > > > > Attached is a patch with arm neon optimizations for > > > silk_warped_autocorrelation_FIX(). Please review. > > > > > > Thanks, > > > Felicia > > > > > > > > > _______________________________________________ > > > opus mailing list > > > opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org > <mailto:opus at xiph.org>> > > > http://lists.xiph.org/mailman/listinfo/opus > <http://lists.xiph.org/mailman/listinfo/opus> > > <http://lists.xiph.org/mailman/listinfo/opus > <http://lists.xiph.org/mailman/listinfo/opus>> > > > > > _______________________________________________ > > opus mailing list > > opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org > <mailto:opus at xiph.org>> > > http://lists.xiph.org/mailman/listinfo/opus > <http://lists.xiph.org/mailman/listinfo/opus> > > <http://lists.xiph.org/mailman/listinfo/opus > <http://lists.xiph.org/mailman/listinfo/opus>> > > > > > >
Linfeng Zhang
2017-Feb-07 16:58 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Jean-Marc, Thanks for your suggestions. Will get back to you once we have some updates. Linfeng On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:> Hi Linfeng, > > On 06/02/17 07:18 PM, Linfeng Zhang wrote: > > This is a great idea. But the order (psEncC->shapingLPCOrder) can be > > configured to 12, 14, 16, 20 and 24 according to complexity parameter. > > > > It's hard to get a universal function to handle all these orders > > efficiently. Any suggestions? > > I can think of two ways of handling larger orders. The obvious one is > simply to add an inner loop of the form: > for (i=0;i<order;i+=VECTOR_SIZE) > I think what may be more efficient is to simply have a small "order-N" > (N=4 or 8) kernel that not only computes the correlation of order N, but > then spits out the signal after the N-stage all-pass is applied. The > kernel would look like: > > void autocorr_kernel4(int *corr, int *orig, int *input, int *output, int > len) { > /* Implement vectorized order-4 filter (could also be order 8) > as described in previous email and outputs the filtered signal. > */ > } > > and then the full function would run the kernel multiple times and look > like: > > void full_autocorr(int *corr, int *orig, int len, int order) { > int i; > int tmp[MAX_SIZE]; > int *in = orig; > for (i=0;i<order;i+=4) { > autocorr_kernel4(corr+i, orig, in, tmp, len); > /* Make subsequent calls use the filtered signal as input. */ > in = tmp; > } > } > > I think the should not only reduce/eliminate the prologue/epilogue > problem, but it should also be more efficient since almost all vectors > processed would use the full size. > > Maybe a third option (not sure it's a good idea, but still mentioning > it) would be to have a function that hardcodes order=24 and discards the > larger values that aren't needed. Since the smallest order seems to be > 16, it wouldn't be much of a waste and the code might end up running > faster for the higher orders. > > Cheers, > > Jean-Marc > > > > Thanks, > > Linfeng > > > > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca > > <mailto:jmvalin at jmvalin.ca>> wrote: > > > > Hi Linfeng, > > > > On 06/02/17 02:51 PM, Linfeng Zhang wrote: > > > However, the critical thing is that all the states in each stage > when > > > processing input[i] are reused by the next input[i+1]. That is > > > input[i+1] must wait input[i] for 1 stage, and input[i+2] must wait > > > input[i+1] for 1 stage, etc. > > > > That is indeed the tricky part... and the one I think you could do > > slightly differently. If you approach the problem in terms of > computing > > chunks of the inputs N samples at a time, then indeed the approach > you > > are describing is the only solution. What I was proposing though is > to > > instead chop the "order" in chunks of N. Using your notation, you > would > > be doing: > > > > PROC( in0(s0)) > > PROC( in0(s1) in1(s0)) > > PROC( in0(s2) in1(s1) in2(s0)) > > PROC( in0(s3) in1(s2) in2(s1) in3(s0)) > > PROC( in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) > > PROC( in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) > > PROC( in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) > > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0)) > > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0)) > > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1)in10(s0)) > > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2)in10(s1)in11(s0)) > > ...and so on until the end of the input vector > > > > The difference is that it's now the input vector that "slides" and > the > > "state" values sy that remain in the same place. There's still a > > prologue, but you can easily get rid of it by (implicitly) > zero-padding > > the in vector during the initialization phase (start with a zero > vector > > and real one value at a time). Getting rid of the epilogue is a > little > > trickier, but I think it can be done. > > > > Cheers, > > > > Jean-Marc > > > > > Then it becomes this > > > > > > FOR in=0 to N WITH in+=8 > > > PROC(in0(s0)) /* prolog 0 */ > > > PROC(in0(s1) in1(s0)) /* prolog 1 */ > > > PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */ > > > PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */ > > > PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */ > > > PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog > 5 */ > > > PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* > > > prolog 6 */ > > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) > > in7(s0)) > > > /* fully process 8 inputs */ > > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) > > in7(s1)) > > > /* continue */ > > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) > > in7(s2)) > > > /* continue */ > > > PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) > > in7(s3)) > > > /* continue */ > > > PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* > > > epilog 0 */ > > > PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog > > 1 */ > > > PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */ > > > PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */ > > > PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */ > > > PROC(in6(s10) in7(s9)) /* epilog 5 */ > > > PROC(in7(s10)) /* epilog 6 */ > > > END FOR > > > > > > And > > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) > > in7(s0)) > > > /* fully process 8 inputs */ > > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) > > in7(s1)) > > > /* continue */ > > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) > > in7(s2)) > > > /* continue */ > > > is actually the expansion of the kernel loop > > > FOR i=0 TO order-6 WITH i++ > > > PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) > > > in6(si+1) in7(si+0)) > > > END FOR > > > > > > The worst thing is that corr_QC[] is so sensitive that any extra > > > processing will make them wrong and propagate to the next loop > (next 8 > > > inputs). state_QS[] is a little better but still very sensitive. > For > > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7) > > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more > time) > > > and remove epilog 0, then all final results will be wrong. > > > > > > That's why the prolog and epilog cannot be saved to the best of my > > > knowledge. > > > > > > The assembly size of silk_warped_autocorrelation_FIX_neon() is > about > > > 2,744 bytes. Compared with the C code size (about 452 bytes), it's > 2.3 > > > KB larger. Considering silk_warped_autocorrelation_FIX_c() is the > > second > > > place CPU heavy function in fixed-point, and our testing shows up > > to 7% > > > CPU run time saving of the total encoder with this optimization (at > > > Complexity 8), maybe we can take the I-cache burden even if > finally we > > > still cannot remove the big chunk of prolog and epilog. > > > > > > Thanks, > > > Linfeng Zhang > > > > > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin > > <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca> > > > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>> wrote: > > > > > > Hi Felicia, > > > > > > I've had time to work through the math in the original > > function and I'm > > > pretty sure it's possible to vectorize this without the huge > > > prologue/epilogue. > > > > > > For the simple case where order = vector size = N (but it > > should easily > > > generalize to larger order), what I came up with is: > > > > > > initialize X, Y, M, C to vector of zeros > > > > > > for i=0 to N+order > > > T = [x(i), Y(0:N-2)] > > > Y = M + coeff * (Y - T) > > > M = T > > > X = [x(i), X(0:N-1)] > > > C = C + X*Y > > > > > > I think something similar to this (assuming I didn't mess up > any > > > details) should give you the correlations in vector C. Did I > miss > > > anything? > > > > > > Cheers, > > > > > > Jean-Marc > > > > > > > > > On 31/01/17 12:30 PM, Felicia Lim wrote: > > > > Hi, > > > > > > > > Attached is a patch with arm neon optimizations for > > > > silk_warped_autocorrelation_FIX(). Please review. > > > > > > > > Thanks, > > > > Felicia > > > > > > > > > > > > _______________________________________________ > > > > opus mailing list > > > > opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org > > <mailto:opus at xiph.org>> > > > > http://lists.xiph.org/mailman/listinfo/opus > > <http://lists.xiph.org/mailman/listinfo/opus> > > > <http://lists.xiph.org/mailman/listinfo/opus > > <http://lists.xiph.org/mailman/listinfo/opus>> > > > > > > > _______________________________________________ > > > opus mailing list > > > opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org > > <mailto:opus at xiph.org>> > > > http://lists.xiph.org/mailman/listinfo/opus > > <http://lists.xiph.org/mailman/listinfo/opus> > > > <http://lists.xiph.org/mailman/listinfo/opus > > <http://lists.xiph.org/mailman/listinfo/opus>> > > > > > > > > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170207/b778d6bc/attachment-0001.html>
Possibly Parallel Threads
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON