Linfeng Zhang
2017-Apr-05 18:13 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Thank Jean-Marc!

The speedup percentages are all relative to the entire encoder.

Comparing to master, this optimization patch speeds up the fixed-point SILK
encoder on NEON as follows:
Complexity 5: 6.1%
Complexity 6: 5.8%
Complexity 8: 5.5%
Complexity 10: 4.0%
when testing on an Acer Chromebook, ARMv7 Processor rev 3 (v7l), CPU max
MHz: 2116.5.

Thanks,
Linfeng

On Wed, Apr 5, 2017 at 11:02 AM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> Hi Linfeng,
>
> Thanks for the updated patch. I'll have a look and get back to you. When
> you report speedup percentages, is that relative to the entire encoder
> or relative to just that function in C? Also, what's the speedup
> compared to master?
>
> Cheers,
>
> Jean-Marc
>
> On 05/04/17 12:14 PM, Linfeng Zhang wrote:
> > I attached a new patch with a small cleanup (the disassembly is
> > identical to the last patch). We have done the same internal testing
> > as usual.
> >
> > Also attached are 2 failed temporary versions which try to reduce code
> > size (for code review reference purposes only).
> >
> > The new patch's silk_warped_autocorrelation_FIX_neon() has a code size
> > of 3,228 bytes (with gcc).
> > smaller_slower.c has a code size of 2,304 bytes, but the encoder is
> > about 1.8% - 2.7% slower.
> > smallest_slowest.c has a code size of 1,656 bytes, but the encoder is
> > about 2.3% - 3.6% slower.
> >
> > Thanks,
> > Linfeng
> >
> > On Mon, Apr 3, 2017 at 3:01 PM, Linfeng Zhang <linfengz at google.com> wrote:
> > > Hi Jean-Marc,
> > >
> > > Attached is the silk_warped_autocorrelation_FIX_neon() which
> > > implements your idea.
> > >
> > > Speed improvement vs the previous optimization:
> > > Complexity 0-4: Doesn't call this function.
> > > Complexity 5: 2.1% (order = 16)
> > > Complexity 6: 1.0% (order = 20)
> > > Complexity 8: 0.1% (order = 24)
> > > Complexity 10: 0.1% (order = 24)
> > >
> > > Code size of silk_warped_autocorrelation_FIX_neon() changes from
> > > 2,644 bytes to 3,228 bytes.
> > >
> > > The reason for the larger code size is that the new optimization
> > > specializes orders 16, 20 and 24. If only the order 24
> > > specialization is kept, the code still works and the code size is
> > > smaller, but the encoder speed drops 4.0% for Complexity 5 and 2.0%
> > > for Complexity 6. Anyway, the new code is easier to understand and
> > > maintain.
> > >
> > > Thanks,
> > >
> > > Linfeng
> > >
> > > On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at google.com> wrote:
> > > > Hi Jean-Marc,
> > > >
> > > > Thanks for your suggestions. Will get back to you once we have
> > > > some updates.
> > > >
> > > > Linfeng
> > > >
> > > > On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > > > > Hi Linfeng,
> > > > >
> > > > > On 06/02/17 07:18 PM, Linfeng Zhang wrote:
> > > > > > This is a great idea. But the order (psEncC->shapingLPCOrder)
> > > > > > can be configured to 12, 14, 16, 20 and 24 according to the
> > > > > > complexity parameter.
> > > > > >
> > > > > > It's hard to get a universal function to handle all these
> > > > > > orders efficiently. Any suggestions?
> > > > >
> > > > > I can think of two ways of handling larger orders. The obvious
> > > > > one is simply to add an inner loop of the form:
> > > > >    for (i=0;i<order;i+=VECTOR_SIZE)
> > > > > I think what may be more efficient is to simply have a small
> > > > > "order-N" (N=4 or 8) kernel that not only computes the
> > > > > correlation of order N, but then spits out the signal after the
> > > > > N-stage all-pass is applied. The kernel would look like:
> > > > >
> > > > > void autocorr_kernel4(int *corr, int *orig, int *input,
> > > > >                       int *output, int len) {
> > > > >    /* Implement vectorized order-4 filter (could also be order
> > > > >       8) as described in the previous email and output the
> > > > >       filtered signal. */
> > > > > }
> > > > >
> > > > > and then the full function would run the kernel multiple times
> > > > > and look like:
> > > > >
> > > > > void full_autocorr(int *corr, int *orig, int len, int order) {
> > > > >    int i;
> > > > >    int tmp[MAX_SIZE];
> > > > >    int *in = orig;
> > > > >    for (i=0;i<order;i+=4) {
> > > > >       autocorr_kernel4(corr+i, orig, in, tmp, len);
> > > > >       /* Make subsequent calls use the filtered signal as
> > > > >          input. */
> > > > >       in = tmp;
> > > > >    }
> > > > > }
> > > > >
> > > > > I think this should not only reduce/eliminate the
> > > > > prologue/epilogue problem, but it should also be more efficient
> > > > > since almost all vectors processed would use the full size.
> > > > >
> > > > > Maybe a third option (not sure it's a good idea, but still
> > > > > mentioning it) would be to have a function that hardcodes
> > > > > order=24 and discards the larger values that aren't needed.
> > > > > Since the smallest order seems to be 16, it wouldn't be much of
> > > > > a waste and the code might end up running faster for the higher
> > > > > orders.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Jean-Marc
> > > > >
> > > > > > Thanks,
> > > > > > Linfeng
> > > > > >
> > > > > > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > > > > > > Hi Linfeng,
> > > > > > >
> > > > > > > On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> > > > > > > > However, the critical thing is that all the states in
> > > > > > > > each stage when processing input[i] are reused by the
> > > > > > > > next input[i+1]. That is, input[i+1] must wait for
> > > > > > > > input[i] for 1 stage, and input[i+2] must wait for
> > > > > > > > input[i+1] for 1 stage, etc.
> > > > > > >
> > > > > > > That is indeed the tricky part... and the one I think you
> > > > > > > could do slightly differently. If you approach the problem
> > > > > > > in terms of computing chunks of the inputs N samples at a
> > > > > > > time, then indeed the approach you are describing is the
> > > > > > > only solution. What I was proposing though is to instead
> > > > > > > chop the "order" in chunks of N. Using your notation, you
> > > > > > > would be doing:
> > > > > > >
> > > > > > > PROC(                                                        in0(s0))
> > > > > > > PROC(                                                in0(s1) in1(s0))
> > > > > > > PROC(                                        in0(s2) in1(s1) in2(s0))
> > > > > > > PROC(                                in0(s3) in1(s2) in2(s1) in3(s0))
> > > > > > > PROC(                        in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))
> > > > > > > PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
> > > > > > > PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
> > > > > > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))
> > > > > > > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0))
> > > > > > > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0))
> > > > > > > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1) in10(s0))
> > > > > > > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2) in10(s1) in11(s0))
> > > > > > > ...and so on until the end of the input vector
> > > > > > >
> > > > > > > The difference is that it's now the input vector that
> > > > > > > "slides" and the "state" values sy that remain in the same
> > > > > > > place. There's still a prologue, but you can easily get rid
> > > > > > > of it by (implicitly) zero-padding the in vector during the
> > > > > > > initialization phase (start with a zero vector and read one
> > > > > > > value at a time). Getting rid of the epilogue is a little
> > > > > > > trickier, but I think it can be done.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Jean-Marc
> > > > > > >
> > > > > > > > Then it becomes this
> > > > > > > >
> > > > > > > > FOR in=0 to N WITH in+=8
> > > > > > > >   PROC(in0(s0)) /* prolog 0 */
> > > > > > > >   PROC(in0(s1) in1(s0)) /* prolog 1 */
> > > > > > > >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
> > > > > > > >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
> > > > > > > >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
> > > > > > > >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */
> > > > > > > >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* prolog 6 */
> > > > > > > >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /* fully process 8 inputs */
> > > > > > > >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /* continue */
> > > > > > > >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /* continue */
> > > > > > > >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3)) /* continue */
> > > > > > > >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* epilog 0 */
> > > > > > > >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1 */
> > > > > > > >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
> > > > > > > >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
> > > > > > > >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
> > > > > > > >   PROC(in6(s10) in7(s9)) /* epilog 5 */
> > > > > > > >   PROC(in7(s10)) /* epilog 6 */
> > > > > > > > END FOR
> > > > > > > >
> > > > > > > > And
> > > > > > > >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /* fully process 8 inputs */
> > > > > > > >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /* continue */
> > > > > > > >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /* continue */
> > > > > > > > is actually the expansion of the kernel loop
> > > > > > > > FOR i=0 TO order-6 WITH i++
> > > > > > > >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) in6(si+1) in7(si+0))
> > > > > > > > END FOR
> > > > > > > >
> > > > > > > > The worst thing is that corr_QC[] is so sensitive that
> > > > > > > > any extra processing will make the results wrong and
> > > > > > > > propagate to the next loop (next 8 inputs). state_QS[] is
> > > > > > > > a little better but still very sensitive. For instance,
> > > > > > > > if PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7)
> > > > > > > > in5(s6) in6(s5) in7(s4)) is added to the kernel loop (by
> > > > > > > > looping one more time) and epilog 0 is removed, then all
> > > > > > > > final results will be wrong.
> > > > > > > >
> > > > > > > > That's why the prolog and epilog cannot be saved, to the
> > > > > > > > best of my knowledge.
> > > > > > > >
> > > > > > > > The assembly size of silk_warped_autocorrelation_FIX_neon()
> > > > > > > > is about 2,744 bytes. Compared with the C code size
> > > > > > > > (about 452 bytes), it's 2.3 KB larger. Considering that
> > > > > > > > silk_warped_autocorrelation_FIX_c() is the second most
> > > > > > > > CPU-heavy function in fixed-point, and our testing shows
> > > > > > > > up to 7% CPU run time saving of the total encoder with
> > > > > > > > this optimization (at Complexity 8), maybe we can take
> > > > > > > > the I-cache burden even if we finally still cannot remove
> > > > > > > > the big chunk of prolog and epilog.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Linfeng Zhang
> > > > > > > >
> > > > > > > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > > > > > > > > Hi Felicia,
> > > > > > > > >
> > > > > > > > > I've had time to work through the math in the original
> > > > > > > > > function and I'm pretty sure it's possible to vectorize
> > > > > > > > > this without the huge prologue/epilogue.
> > > > > > > > >
> > > > > > > > > For the simple case where order = vector size N (but
> > > > > > > > > it should easily generalize to larger order), what I
> > > > > > > > > came up with is:
> > > > > > > > >
> > > > > > > > > initialize X, Y, M, C to vectors of zeros
> > > > > > > > >
> > > > > > > > > for i=0 to N+order
> > > > > > > > >    T = [x(i), Y(0:N-2)]
> > > > > > > > >    Y = M + coeff * (Y - T)
> > > > > > > > >    M = T
> > > > > > > > >    X = [x(i), X(0:N-1)]
> > > > > > > > >    C = C + X*Y
> > > > > > > > >
> > > > > > > > > I think something similar to this (assuming I didn't
> > > > > > > > > mess up any details) should give you the correlations
> > > > > > > > > in vector C. Did I miss anything?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Jean-Marc
> > > > > > > > >
> > > > > > > > > On 31/01/17 12:30 PM, Felicia Lim wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > Attached is a patch with arm neon optimizations for
> > > > > > > > > > silk_warped_autocorrelation_FIX(). Please review.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Felicia
> > > > > > > > > >
> > > > > > > > > > _______________________________________________
> > > > > > > > > > opus mailing list
> > > > > > > > > > opus at xiph.org
> > > > > > > > > > http://lists.xiph.org/mailman/listinfo/opus
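The chained-kernel structure of full_autocorr() quoted above can be checked in plain scalar C. The sketch below is illustrative only, not code from the patch: it uses double-precision floats instead of SILK's fixed point, assumes the first-order warped-allpass recurrence y_i[n] = y_{i-1}[n-1] + lambda * (y_i[n-1] - y_{i-1}[n]) with warping coefficient lambda, requires `order` to be a multiple of the kernel size, and all names are made up.

```c
#include <string.h>

#define KERNEL_ORDER 4
#define MAX_LEN      1024

/* One "order-N" kernel: runs KERNEL_ORDER warped-allpass stages over `in`,
 * accumulates the correlation of each stage output with the *original*
 * signal `orig`, and writes the last stage's output to `out` so further
 * stages can be chained. Safe to call with in == out. */
static void autocorr_kernel4(double *corr, const double *orig,
                             const double *in, double *out,
                             double lambda, int len)
{
    double st[KERNEL_ORDER + 1] = { 0 };  /* st[i] = stage-i output at n-1 */
    int n, i;
    for (n = 0; n < len; n++) {
        double prev_cur  = in[n];  /* y_{i-1}[n]: starts as this chunk's input */
        double prev_last = st[0];  /* y_{i-1}[n-1] */
        st[0] = prev_cur;
        for (i = 1; i <= KERNEL_ORDER; i++) {
            double cur = prev_last + lambda * (st[i] - prev_cur);
            prev_last = st[i];     /* old st[i] becomes y_{i-1}[n-1] next stage */
            st[i]     = cur;
            prev_cur  = cur;
            corr[i - 1] += cur * orig[n];
        }
        out[n] = prev_cur;         /* filtered signal for the next kernel call */
    }
}

/* Full function: corr[0] is the plain input energy; corr[1..order] come from
 * running the kernel order/KERNEL_ORDER times on successively filtered
 * signals, exactly as in the full_autocorr() outline quoted above. */
static void full_autocorr(double *corr, const double *orig,
                          double lambda, int len, int order)
{
    double tmp[MAX_LEN];
    const double *in = orig;
    int n, i;
    memset(corr, 0, (order + 1) * sizeof(*corr));
    for (n = 0; n < len; n++)
        corr[0] += orig[n] * orig[n];
    for (i = 0; i < order; i += KERNEL_ORDER) {
        autocorr_kernel4(corr + 1 + i, orig, in, tmp, lambda, len);
        in = tmp;  /* subsequent calls use the filtered signal as input */
    }
}
```

Because each chained kernel call performs exactly the same arithmetic, in the same order, as a single pass that runs all stages per sample, the chained version matches a direct all-stages-per-sample reference to within floating-point noise.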
Jean-Marc Valin
2017-Apr-05 18:20 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
That's pretty good for a single function!

Jean-Marc

On 05/04/17 02:13 PM, Linfeng Zhang wrote:
> Thank Jean-Marc!
>
> The speedup percentages are all relative to the entire encoder.
>
> Comparing to master, this optimization patch speeds up fixed-point SILK
> encoder on NEON as following: Complexity 5: 6.1% Complexity 6: 5.8%
> Complexity 8: 5.5% Complexity 10: 4.0%
>
> when testing on an Acer Chromebook, ARMv7 Processor rev 3 (v7l), CPU max
> MHz: 2116.5
> [...]
Ulrich Windl
2017-Apr-06 06:27 UTC
[opus] Antw: Re: [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
>>> Linfeng Zhang <linfengz at google.com> wrote on 05.04.2017 at 20:13 in message
<CAKoqLCA2ECpr9dgWZ0Ay_R5oatHS_4Na4Z=2W64p8c3kv26yyA at mail.gmail.com>:
> Thank Jean-Marc!
>
> The speedup percentages are all relative to the entire encoder.
>
> Comparing to master, this optimization patch speeds up fixed-point SILK
> encoder on NEON as following: Complexity 5: 6.1% Complexity 6: 5.8%
> Complexity 8: 5.5% Complexity 10: 4.0%
[...]

As a side note: Linfeng said he's using gcc. In the past I wrote some
utility that provides build information to an executable. If you are tight
on memory you may not want it, but my system provides information like this
(example):

"$Build: 0.3.3 (Git:master-36611c9) #109 on rksapv04 (x86_64) with gcc 4.3.4
(4.3.4 [gcc-4_3-branch revision 152973]) (-I. -Wall -Wextra -Wshadow -pipe
-O2 -DDEBUG_LEVEL=0 -lrt -lm) at Mon Sep 12 13:24:58 CEST 2016 by windl $"

In the auto-generated buildlevel.c the information is created like this (so
individual components are available as well):

/* RCS ident compatible identification string */
const char build_id[] =
    "@(#) $Build: " BUILD_VERSION " #" BUILD_NUMBER
    " on " BUILD_HOST " (" BUILD_ARCH ")"
    " with " BUILD_COMPILER " (" BUILD_CFLAGS ")"
    " at " BUILD_DATE " by " BUILD_USER " $";

When comparing performance, such information (if accurate) seems helpful.
Personally I added it to help me find out which of my many test binaries is
which... ;-)

Regards,
Ulrich
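The buildlevel.c fragment above relies on C's adjacent string-literal concatenation: each BUILD_* macro expands to a quoted string and the pieces fuse into one constant at compile time. A minimal self-contained sketch of the same technique follows; the fallback values are made up and only exist so the file compiles on its own, since a real build script would inject the macros with -D flags.

```c
/* Normally injected by the build system, e.g.
 *   cc -DBUILD_VERSION='"0.3.3"' -DBUILD_NUMBER='"109"' ...
 * The fallbacks below are placeholders for standalone compilation. */
#ifndef BUILD_VERSION
#define BUILD_VERSION "0.0.0"
#endif
#ifndef BUILD_NUMBER
#define BUILD_NUMBER "0"
#endif
#ifndef BUILD_COMPILER
#define BUILD_COMPILER "unknown-cc"
#endif
#ifndef BUILD_DATE
#define BUILD_DATE "unknown-date"
#endif

/* RCS "ident"-compatible identification string: the adjacent string
 * literals are concatenated into a single constant at compile time. */
const char build_id[] =
    "@(#) $Build: " BUILD_VERSION " #" BUILD_NUMBER
    " with " BUILD_COMPILER " at " BUILD_DATE " $";
```

Because the string is embedded in the binary, `ident` or `strings binary | grep '$Build'` recovers it from the executable without running it, which is the point of the technique.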
Jean-Marc Valin
2017-Apr-06 19:55 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Linfeng, I had a closer look at your patch and the code looks good -- and slightly simpler than I had anticipated, so that's good. I did some profiling on a Cortex A57 and I've been seeing slightly less improvement than you're reporting, more like 3.5% at complexity 8. It appears that the warped autocorrelation function itself is only faster by a factor of about 1.35. That's a bit surprising considering I see nothing obviously wrong with the code. I'm not sure what's the problem, but here's a few things that may be worth considering: 1) In calc_state(), rather than splitting the multiply in two instructions, you may be able to simply shift the warping left 16 bits, then use the Neon instruction that does a*b>>32 (i.e. the one that computes the top bits of a 32x32 multiply) 2) If the problem is with the movs at the end of each iteration, then you should be able to get rid of them by unrolling by a factor of two. 3) It seems likely that you have significant register spill going on due to processing up to 24 "taps" at the same time. If that's causing a slowdown, then it should be possible to do the processing in "sections". By that, I mean that you can implement (e.g.) an order-8 "kernel" that computes the correlations and also outputs the last element of state_QS_s32x4[0][0] back to input_QS[], so that it can be used to compute a new secion. 4) It's a minor detail, but the last element of corr_QC[] that's not currently vectorized could simply be vectorized independently outside the loop (and it's the same for all orders). Cheers, Jean-Marc On 05/04/17 02:13 PM, Linfeng Zhang wrote:> Thank Jean-Marc! > > The speedup percentages are all relative to the entire encoder. 
>
> Comparing to master, this optimization patch speeds up the fixed-point SILK
> encoder on NEON as follows: Complexity 5: 6.1%, Complexity 6: 5.8%,
> Complexity 8: 5.5%, Complexity 10: 4.0%
>
> when testing on an Acer Chromebook, ARMv7 Processor rev 3 (v7l), CPU max
> MHz: 2116.5
>
> Thanks,
> Linfeng
>
> On Wed, Apr 5, 2017 at 11:02 AM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
>
> > Hi Linfeng,
> >
> > Thanks for the updated patch. I'll have a look and get back to you. When
> > you report speedup percentages, is that relative to the entire encoder
> > or relative to just that function in C? Also, what's the speedup
> > compared to master?
> >
> > Cheers,
> >
> > Jean-Marc
> >
> > On 05/04/17 12:14 PM, Linfeng Zhang wrote:
> > > I attached a new patch with small cleanup (disassembly is identical to
> > > the last patch). We have done the same internal testing as usual.
> > >
> > > Also attached are 2 failed temporary versions which try to reduce code
> > > size (just for code review reference purposes).
> > >
> > > The new patch of silk_warped_autocorrelation_FIX_neon() has a code size
> > > of 3,228 bytes (with gcc).
> > > smaller_slower.c has a code size of 2,304 bytes, but the encoder is
> > > about 1.8% - 2.7% slower.
> > > smallest_slowest.c has a code size of 1,656 bytes, but the encoder is
> > > about 2.3% - 3.6% slower.
> > >
> > > Thanks,
> > > Linfeng
> >
> > On Mon, Apr 3, 2017 at 3:01 PM, Linfeng Zhang <linfengz at google.com> wrote:
> >
> > Hi Jean-Marc,
> >
> > Attached is the silk_warped_autocorrelation_FIX_neon() which
> > implements your idea.
> >
> > Speed improvement vs the previous optimization:
> >
> > Complexity 0-4: Doesn't call this function.
> > Complexity 5: 2.1% (order = 16)
> > Complexity 6: 1.0% (order = 20)
> > Complexity 8: 0.1% (order = 24)
> > Complexity 10: 0.1% (order = 24)
> >
> > Code size of silk_warped_autocorrelation_FIX_neon() changes from
> > 2,644 bytes to 3,228 bytes.
> >
> > The reason for the larger code size is that the new optimization
> > specializes orders 16, 20 and 24. If only the order-24 specialization
> > is kept, the code still works and the code size is smaller, but the
> > encoder speed will drop 4.0% for Complexity 5 and 2.0% for
> > Complexity 6. Anyway, the new code is easier to understand and maintain.
> >
> > Thanks,
> >
> > Linfeng
> >
> > On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at google.com> wrote:
> >
> > Hi Jean-Marc,
> >
> > Thanks for your suggestions. Will get back to you once we have
> > some updates.
> >
> > Linfeng
> >
> > On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> >
> > Hi Linfeng,
> >
> > On 06/02/17 07:18 PM, Linfeng Zhang wrote:
> > > This is a great idea. But the order (psEncC->shapingLPCOrder) can be
> > > configured to 12, 14, 16, 20 and 24 according to the complexity
> > > parameter.
> > >
> > > It's hard to get a universal function to handle all these orders
> > > efficiently. Any suggestions?
> >
> > I can think of two ways of handling larger orders. The obvious one is
> > simply to add an inner loop of the form:
> >
> >    for (i=0;i<order;i+=VECTOR_SIZE)
> >
> > I think what may be more efficient is to simply have a small "order-N"
> > (N=4 or 8) kernel that not only computes the correlation of order N, but
> > then spits out the signal after the N-stage all-pass is applied.
> > The kernel would look like:
> >
> > void autocorr_kernel4(int *corr, int *orig, int *input, int *output,
> >                       int len) {
> >    /* Implement vectorized order-4 filter (could also be order 8)
> >       as described in previous email and output the filtered signal. */
> > }
> >
> > and then the full function would run the kernel multiple times and look
> > like:
> >
> > void full_autocorr(int *corr, int *orig, int len, int order) {
> >    int i;
> >    int tmp[MAX_SIZE];
> >    int *in = orig;
> >    for (i=0;i<order;i+=4) {
> >       autocorr_kernel4(corr+i, orig, in, tmp, len);
> >       /* Make subsequent calls use the filtered signal as input. */
> >       in = tmp;
> >    }
> > }
> >
> > I think this should not only reduce/eliminate the prologue/epilogue
> > problem, but it should also be more efficient since almost all vectors
> > processed would use the full size.
> >
> > Maybe a third option (not sure it's a good idea, but still mentioning
> > it) would be to have a function that hardcodes order=24 and discards the
> > larger values that aren't needed. Since the smallest order seems to be
> > 16, it wouldn't be much of a waste and the code might end up running
> > faster for the higher orders.
> >
> > Cheers,
> >
> > Jean-Marc
> >
> > > Thanks,
> > > Linfeng
> > >
> > > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > >
> > > Hi Linfeng,
> > >
> > > On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> > > > However, the critical thing is that all the states in each stage
> > > > when processing input[i] are reused by the next input[i+1]. That is,
> > > > input[i+1] must wait for input[i] for 1 stage, and input[i+2] must
> > > > wait for input[i+1] for 1 stage, etc.
> > > That is indeed the tricky part... and the one I think you could do
> > > slightly differently. If you approach the problem in terms of
> > > computing chunks of the inputs N samples at a time, then indeed the
> > > approach you are describing is the only solution. What I was
> > > proposing though is to instead chop the "order" in chunks of N.
> > > Using your notation, you would be doing:
> > >
> > > PROC(                                                        in0(s0))
> > > PROC(                                                in0(s1) in1(s0))
> > > PROC(                                        in0(s2) in1(s1) in2(s0))
> > > PROC(                                in0(s3) in1(s2) in2(s1) in3(s0))
> > > PROC(                        in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))
> > > PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
> > > PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
> > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))
> > > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0))
> > > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0))
> > > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1)in10(s0))
> > > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2)in10(s1)in11(s0))
> > > ...and so on until the end of the input vector
> > >
> > > The difference is that it's now the input vector that "slides" while
> > > the "state" values sy remain in the same place. There's still a
> > > prologue, but you can easily get rid of it by (implicitly)
> > > zero-padding the in vector during the initialization phase (start
> > > with a zero vector and reveal one value at a time). Getting rid of
> > > the epilogue is a little trickier, but I think it can be done.
> > > > Then it becomes this
> > > >
> > > > FOR in=0 to N WITH in+=8
> > > >   PROC(in0(s0)) /* prolog 0 */
> > > >   PROC(in0(s1) in1(s0)) /* prolog 1 */
> > > >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
> > > >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
> > > >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
> > > >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */
> > > >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* prolog 6 */
> > > >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /* fully process 8 inputs */
> > > >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /* continue */
> > > >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /* continue */
> > > >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3)) /* continue */
> > > >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* epilog 0 */
> > > >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1 */
> > > >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
> > > >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
> > > >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
> > > >   PROC(in6(s10) in7(s9)) /* epilog 5 */
> > > >   PROC(in7(s10)) /* epilog 6 */
> > > > END FOR
> > > >
> > > > And
> > > >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /* fully process 8 inputs */
> > > >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /* continue */
> > > >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /* continue */
> > > > is actually the expansion of the kernel loop
> > > > FOR i=0 TO order-6 WITH i++
> > > >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) in6(si+1) in7(si+0))
> > > > END FOR
> > > >
> > > > The worst thing is that corr_QC[] is so sensitive that any extra
> > > > processing will make them wrong and propagate to the next loop
> > > > (next 8 inputs). state_QS[] is a little better but still very
> > > > sensitive. For instance, if adding PROC(in0(s11') in1(s10) in2(s9)
> > > > in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) to the kernel loop (by
> > > > looping one more time) and removing epilog 0, then all final
> > > > results will be wrong.
> > > >
> > > > That's why the prolog and epilog cannot be saved, to the best of my
> > > > knowledge.
> > > >
> > > > The assembly size of silk_warped_autocorrelation_FIX_neon() is
> > > > about 2,744 bytes. Compared with the C code size (about 452 bytes),
> > > > it's 2.3 KB larger. Considering silk_warped_autocorrelation_FIX_c()
> > > > is the second-place CPU-heavy function in fixed-point, and our
> > > > testing shows up to 7% CPU run time saving of the total encoder
> > > > with this optimization (at Complexity 8), maybe we can take the
> > > > I-cache burden even if finally we still cannot remove the big chunk
> > > > of prolog and epilog.
> > > > Thanks,
> > > > Linfeng Zhang
> > > >
> > > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > > >
> > > > Hi Felicia,
> > > >
> > > > I've had time to work through the math in the original function and
> > > > I'm pretty sure it's possible to vectorize this without the huge
> > > > prologue/epilogue.
> > > >
> > > > For the simple case where order = vector size N (but it should
> > > > easily generalize to larger order), what I came up with is:
> > > >
> > > > initialize X, Y, M, C to vectors of zeros
> > > >
> > > > for i=0 to N+order
> > > >    T = [x(i), Y(0:N-2)]
> > > >    Y = M + coeff * (Y - T)
> > > >    M = T
> > > >    X = [x(i), X(0:N-1)]
> > > >    C = C + X*Y
> > > >
> > > > I think something similar to this (assuming I didn't mess up any
> > > > details) should give you the correlations in vector C. Did I miss
> > > > anything?
> > > >
> > > > Cheers,
> > > >
> > > > Jean-Marc
> > > >
> > > > On 31/01/17 12:30 PM, Felicia Lim wrote:
> > > > > Hi,
> > > > >
> > > > > Attached is a patch with arm neon optimizations for
> > > > > silk_warped_autocorrelation_FIX(). Please review.
> > > > > Thanks,
> > > > > Felicia
> > > > >
> > > > > _______________________________________________
> > > > > opus mailing list
> > > > > opus at xiph.org
> > > > > http://lists.xiph.org/mailman/listinfo/opus
> > > >
> > > > _______________________________________________
> > > > opus mailing list
> > > > opus at xiph.org
> > > > http://lists.xiph.org/mailman/listinfo/opus
Linfeng Zhang
2017-Apr-11 23:07 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Jean-Marc,

Thanks for your suggestions! I attached the new patch, with inlined replies
below.

Thanks,
Linfeng

On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:

> I did some profiling on a Cortex A57 and I've been seeing slightly less
> improvement than you're reporting, more like 3.5% at complexity 8. It
> appears that the warped autocorrelation function itself is only faster
> by a factor of about 1.35. That's a bit surprising considering I see
> nothing obviously wrong with the code.

I speed-tested the new patch and got about a 7.8% whole-encoder speed gain
with complexity 8 on my Acer Chromebook. Here is my configure line:

./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf
--disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3
--disable-shared

The testing speech file may also change the speed results.

> 1) In calc_state(), rather than splitting the multiply in two
> instructions, you may be able to simply shift the warping left 16 bits,
> then use the Neon instruction that does a*b>>32 (i.e. the one that
> computes the top bits of a 32x32 multiply)

Done.

> 2) If the problem is with the movs at the end of each iteration, then
> you should be able to get rid of them by unrolling by a factor of two.

We did this previously and got some gains, but the code size was much bigger,
so we abandoned it. I tested again on the new code and got no speed gains.

> 3) It seems likely that you have significant register spill going on due
> to processing up to 24 "taps" at the same time. If that's causing a
> slowdown, then it should be possible to do the processing in "sections".
> By that, I mean that you can implement (e.g.) an order-8 "kernel" that
> computes the correlations and also outputs the last element of
> state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
> compute a new section.

Done. The speed is almost identical (slightly slower); however, the extra
bonus is code size saving.

> 4) It's a minor detail, but the last element of corr_QC[] that's not
> currently vectorized could simply be vectorized independently outside
> the loop (and it's the same for all orders).

Done.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 29664 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.bin>