thr3ads.net - opus - [opus] [PATCH] Optimize silk_warped_autocorrelation

If this information is useful, please help other people find it:
Share via:

Linfeng Zhang

2017-Feb-07 00:18 UTC

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

This is a great idea. But the order (psEncC->shapingLPCOrder) can be
configured to 12, 14, 16, 20 and 24 according to complexity parameter.

It's hard to get a universal function to handle all these orders
efficiently. Any suggestions?

Thanks,
Linfeng

On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Linfeng,
>
> On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> > However, the critical thing is that all the states in each stage when
> > processing input[i] are reused by the next input[i+1]. That is
> > input[i+1] must wait input[i] for 1 stage, and input[i+2] must wait
> > input[i+1] for 1 stage, etc.
>
> That is indeed the tricky part... and the one I think you could do
> slightly differently. If you approach the problem in terms of computing
> chunks of the inputs N samples at a time, then indeed the approach you
> are describing is the only solution. What I was proposing though is to
> instead chop the "order" in chunks of N. Using your notation, you
would
> be doing:
>
> PROC(                                                        in0(s0))
> PROC(                                                in0(s1) in1(s0))
> PROC(                                        in0(s2) in1(s1) in2(s0))
> PROC(                                in0(s3) in1(s2) in2(s1) in3(s0))
> PROC(                        in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))
> PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
> PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
> PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))
> PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0))
> PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0))
> PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1)in10(s0))
> PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2)in10(s1)in11(s0))
> ...and so on until the end of the input vector
>
> The difference is that it's now the input vector that
"slides" and the
> "state" values sy that remain in the same place. There's
still a
> prologue, but you can easily get rid of it by (implicitly) zero-padding
> the in vector during the initialization phase (start with a zero vector
> and real one value at a time). Getting rid of the epilogue is a little
> trickier, but I think it can be done.
>
> Cheers,
>
>         Jean-Marc
>
> > Then it becomes this
> >
> > FOR in=0 to N WITH in+=8
> >   PROC(in0(s0)) /* prolog 0 */
> >   PROC(in0(s1) in1(s0)) /* prolog 1 */
> >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
> >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
> >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
> >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */
> >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /*
> > prolog 6 */
> >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
in7(s0))
> > /* fully process 8 inputs */
> >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
in7(s1))
> > /* continue */
> >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3)
in7(s2))
> > /* continue */
> >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4)
in7(s3))
> > /* continue */
> >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /*
> > epilog 0 */
> >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1
*/
> >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
> >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
> >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
> >   PROC(in6(s10) in7(s9)) /* epilog 5 */
> >   PROC(in7(s10)) /* epilog 6 */
> > END FOR
> >
> > And
> >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
in7(s0))
> > /* fully process 8 inputs */
> >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
in7(s1))
> > /* continue */
> >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3)
in7(s2))
> > /* continue */
> > is actually the expansion of the kernel loop
> > FOR i=0 TO order-6 WITH i++
> >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
> > in6(si+1) in7(si+0))
> > END FOR
> >
> > The worst thing is that corr_QC[] is so sensitive that any extra
> > processing will make them wrong and propagate to the next loop (next 8
> > inputs). state_QS[] is a little better but still very sensitive. For
> > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8)
in4(s7)
> > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time)
> > and remove epilog 0, then all final results will be wrong.
> >
> > That's why the prolog and epilog cannot be saved to the best of my
> > knowledge.
> >
> > The assembly size of silk_warped_autocorrelation_FIX_neon() is about
> > 2,744 bytes. Compared with the C code size (about 452 bytes), it's
2.3
> > KB larger. Considering silk_warped_autocorrelation_FIX_c() is the
second
> > place CPU heavy function in fixed-point, and our testing shows up to
7%
> > CPU run time saving of the total encoder with this optimization (at
> > Complexity 8), maybe we can take the I-cache burden even if finally we
> > still cannot remove the big chunk of prolog and epilog.
> >
> > Thanks,
> > Linfeng Zhang
> >
> > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca
> > <mailto:jmvalin at jmvalin.ca>> wrote:
> >
> >     Hi Felicia,
> >
> >     I've had time to work through the math in the original
function and
> I'm
> >     pretty sure it's possible to vectorize this without the huge
> >     prologue/epilogue.
> >
> >     For the simple case where order = vector size = N (but it should
> easily
> >     generalize to larger order), what I came up with is:
> >
> >     initialize X, Y, M, C to vector of zeros
> >
> >     for i=0 to N+order
> >        T = [x(i), Y(0:N-2)]
> >        Y = M + coeff * (Y - T)
> >        M = T
> >        X = [x(i), X(0:N-1)]
> >        C = C + X*Y
> >
> >     I think something similar to this (assuming I didn't mess up
any
> >     details) should give you the correlations in vector C. Did I miss
> >     anything?
> >
> >     Cheers,
> >
> >             Jean-Marc
> >
> >
> >     On 31/01/17 12:30 PM, Felicia Lim wrote:
> >     > Hi,
> >     >
> >     > Attached is a patch with arm neon optimizations for
> >     > silk_warped_autocorrelation_FIX(). Please review.
> >     >
> >     > Thanks,
> >     > Felicia
> >     >
> >     >
> >     > _______________________________________________
> >     > opus mailing list
> >     > opus at xiph.org <mailto:opus at xiph.org>
> >     > http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>
> >     >
> >     _______________________________________________
> >     opus mailing list
> >     opus at xiph.org <mailto:opus at xiph.org>
> >     http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>
> >
> >
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170206/d29bd2a2/attachment.html>

Jean-Marc Valin

2017-Feb-07 01:47 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Linfeng,

On 06/02/17 07:18 PM, Linfeng Zhang wrote:> This is a great idea. But the order (psEncC->shapingLPCOrder) can be
> configured to 12, 14, 16, 20 and 24 according to complexity parameter.
> 
> It's hard to get a universal function to handle all these orders
> efficiently. Any suggestions?
I can think of two ways of handling larger orders. The obvious one is
simply to add an inner loop of the form:
for (i=0;i<order;i+=VECTOR_SIZE)
I think what may be more efficient is to simply have a small "order-N"
(N=4 or 8) kernel that not only computes the correlation of order N, but
then spits out the signal after the N-stage all-pass is applied. The
kernel would look like:

void autocorr_kernel4(int *corr, int *orig, int *input, int *output, int
len) {
   /* Implement vectorized order-4 filter (could also be order 8)
      as described in previous email and outputs the filtered signal.
   */
}

and then the full function would run the kernel multiple times and look
like:

void full_autocorr(int *corr, int *orig, int len, int order) {
   int i;
   int tmp[MAX_SIZE];
   int *in = orig;
   for (i=0;i<order;i+=4) {
      autocorr_kernel4(corr+i, orig, in, tmp, len);
      /* Make subsequent calls use the filtered signal as input. */
      in = tmp;
   }
}

I think the should not only reduce/eliminate the prologue/epilogue
problem, but it should also be more efficient since almost all vectors
processed would use the full size.

Maybe a third option (not sure it's a good idea, but still mentioning
it) would be to have a function that hardcodes order=24 and discards the
larger values that aren't needed. Since the smallest order seems to be
16, it wouldn't be much of a waste and the code might end up running
faster for the higher orders.

Cheers,

	Jean-Marc

> Thanks,
> Linfeng
> 
> On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca
> <mailto:jmvalin at jmvalin.ca>> wrote:
> 
>     Hi Linfeng,
> 
>     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
>     > However, the critical thing is that all the states in each stage
when
>     > processing input[i] are reused by the next input[i+1]. That is
>     > input[i+1] must wait input[i] for 1 stage, and input[i+2] must
wait
>     > input[i+1] for 1 stage, etc.
> 
>     That is indeed the tricky part... and the one I think you could do
>     slightly differently. If you approach the problem in terms of computing
>     chunks of the inputs N samples at a time, then indeed the approach you
>     are describing is the only solution. What I was proposing though is to
>     instead chop the "order" in chunks of N. Using your notation,
you would
>     be doing:
> 
>     PROC(                                                        in0(s0))
>     PROC(                                                in0(s1) in1(s0))
>     PROC(                                        in0(s2) in1(s1) in2(s0))
>     PROC(                                in0(s3) in1(s2) in2(s1) in3(s0))
>     PROC(                        in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))
>     PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
>     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
>     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))
>     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0))
>     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0))
>     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1)in10(s0))
>     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2)in10(s1)in11(s0))
>     ...and so on until the end of the input vector
> 
>     The difference is that it's now the input vector that
"slides" and the
>     "state" values sy that remain in the same place. There's
still a
>     prologue, but you can easily get rid of it by (implicitly) zero-padding
>     the in vector during the initialization phase (start with a zero vector
>     and real one value at a time). Getting rid of the epilogue is a little
>     trickier, but I think it can be done.
> 
>     Cheers,
> 
>             Jean-Marc
> 
>     > Then it becomes this
>     >
>     > FOR in=0 to N WITH in+=8
>     >   PROC(in0(s0)) /* prolog 0 */
>     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
>     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
>     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
>     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
>     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog
5 */
>     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /*
>     > prolog 6 */
>     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>     in7(s0))
>     > /* fully process 8 inputs */
>     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
>     in7(s1))
>     > /* continue */
>     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3)
>     in7(s2))
>     > /* continue */
>     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4)
>     in7(s3))
>     > /* continue */
>     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4))
/*
>     > epilog 0 */
>     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog
>     1 */
>     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
>     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
>     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
>     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
>     >   PROC(in7(s10)) /* epilog 6 */
>     > END FOR
>     >
>     > And
>     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>     in7(s0))
>     > /* fully process 8 inputs */
>     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
>     in7(s1))
>     > /* continue */
>     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3)
>     in7(s2))
>     > /* continue */
>     > is actually the expansion of the kernel loop
>     > FOR i=0 TO order-6 WITH i++
>     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
>     > in6(si+1) in7(si+0))
>     > END FOR
>     >
>     > The worst thing is that corr_QC[] is so sensitive that any extra
>     > processing will make them wrong and propagate to the next loop
(next 8
>     > inputs). state_QS[] is a little better but still very sensitive.
For
>     > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8)
in4(s7)
>     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more
time)
>     > and remove epilog 0, then all final results will be wrong.
>     >
>     > That's why the prolog and epilog cannot be saved to the best
of my
>     > knowledge.
>     >
>     > The assembly size of silk_warped_autocorrelation_FIX_neon() is
about
>     > 2,744 bytes. Compared with the C code size (about 452 bytes),
it's 2.3
>     > KB larger. Considering silk_warped_autocorrelation_FIX_c() is the
>     second
>     > place CPU heavy function in fixed-point, and our testing shows up
>     to 7%
>     > CPU run time saving of the total encoder with this optimization
(at
>     > Complexity 8), maybe we can take the I-cache burden even if
finally we
>     > still cannot remove the big chunk of prolog and epilog.
>     >
>     > Thanks,
>     > Linfeng Zhang
>     >
>     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
>     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>> wrote:
>     >
>     >     Hi Felicia,
>     >
>     >     I've had time to work through the math in the original
>     function and I'm
>     >     pretty sure it's possible to vectorize this without the
huge
>     >     prologue/epilogue.
>     >
>     >     For the simple case where order = vector size = N (but it
>     should easily
>     >     generalize to larger order), what I came up with is:
>     >
>     >     initialize X, Y, M, C to vector of zeros
>     >
>     >     for i=0 to N+order
>     >        T = [x(i), Y(0:N-2)]
>     >        Y = M + coeff * (Y - T)
>     >        M = T
>     >        X = [x(i), X(0:N-1)]
>     >        C = C + X*Y
>     >
>     >     I think something similar to this (assuming I didn't mess
up any
>     >     details) should give you the correlations in vector C. Did I
miss
>     >     anything?
>     >
>     >     Cheers,
>     >
>     >             Jean-Marc
>     >
>     >
>     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
>     >     > Hi,
>     >     >
>     >     > Attached is a patch with arm neon optimizations for
>     >     > silk_warped_autocorrelation_FIX(). Please review.
>     >     >
>     >     > Thanks,
>     >     > Felicia
>     >     >
>     >     >
>     >     > _______________________________________________
>     >     > opus mailing list
>     >     > opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
>     <mailto:opus at xiph.org>>
>     >     > http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >     >
>     >     _______________________________________________
>     >     opus mailing list
>     >     opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
>     <mailto:opus at xiph.org>>
>     >     http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >
>     >
> 
>

Linfeng Zhang

2017-Feb-07 16:58 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Jean-Marc,

Thanks for your suggestions. Will get back to you once we have some updates.

Linfeng

On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Linfeng,
>
> On 06/02/17 07:18 PM, Linfeng Zhang wrote:
> > This is a great idea. But the order (psEncC->shapingLPCOrder) can
be
> > configured to 12, 14, 16, 20 and 24 according to complexity parameter.
> >
> > It's hard to get a universal function to handle all these orders
> > efficiently. Any suggestions?
>
> I can think of two ways of handling larger orders. The obvious one is
> simply to add an inner loop of the form:
> for (i=0;i<order;i+=VECTOR_SIZE)
> I think what may be more efficient is to simply have a small
"order-N"
> (N=4 or 8) kernel that not only computes the correlation of order N, but
> then spits out the signal after the N-stage all-pass is applied. The
> kernel would look like:
>
> void autocorr_kernel4(int *corr, int *orig, int *input, int *output, int
> len) {
>    /* Implement vectorized order-4 filter (could also be order 8)
>       as described in previous email and outputs the filtered signal.
>    */
> }
>
> and then the full function would run the kernel multiple times and look
> like:
>
> void full_autocorr(int *corr, int *orig, int len, int order) {
>    int i;
>    int tmp[MAX_SIZE];
>    int *in = orig;
>    for (i=0;i<order;i+=4) {
>       autocorr_kernel4(corr+i, orig, in, tmp, len);
>       /* Make subsequent calls use the filtered signal as input. */
>       in = tmp;
>    }
> }
>
> I think the should not only reduce/eliminate the prologue/epilogue
> problem, but it should also be more efficient since almost all vectors
> processed would use the full size.
>
> Maybe a third option (not sure it's a good idea, but still mentioning
> it) would be to have a function that hardcodes order=24 and discards the
> larger values that aren't needed. Since the smallest order seems to be
> 16, it wouldn't be much of a waste and the code might end up running
> faster for the higher orders.
>
> Cheers,
>
>         Jean-Marc
>
>
> > Thanks,
> > Linfeng
> >
> > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca
> > <mailto:jmvalin at jmvalin.ca>> wrote:
> >
> >     Hi Linfeng,
> >
> >     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> >     > However, the critical thing is that all the states in each
stage
> when
> >     > processing input[i] are reused by the next input[i+1]. That
is
> >     > input[i+1] must wait input[i] for 1 stage, and input[i+2]
must wait
> >     > input[i+1] for 1 stage, etc.
> >
> >     That is indeed the tricky part... and the one I think you could do
> >     slightly differently. If you approach the problem in terms of
> computing
> >     chunks of the inputs N samples at a time, then indeed the approach
> you
> >     are describing is the only solution. What I was proposing though
is
> to
> >     instead chop the "order" in chunks of N. Using your
notation, you
> would
> >     be doing:
> >
> >     PROC(                                                       
in0(s0))
> >     PROC(                                                in0(s1)
in1(s0))
> >     PROC(                                        in0(s2) in1(s1)
in2(s0))
> >     PROC(                                in0(s3) in1(s2) in2(s1)
in3(s0))
> >     PROC(                        in0(s4) in1(s3) in2(s2) in3(s1)
in4(s0))
> >     PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
in5(s0))
> >     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
in6(s0))
> >     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
in7(s0))
> >     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)
in8(s0))
> >     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1)
in9(s0))
> >     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
in9(s1)in10(s0))
> >     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
in9(s2)in10(s1)in11(s0))
> >     ...and so on until the end of the input vector
> >
> >     The difference is that it's now the input vector that
"slides" and
> the
> >     "state" values sy that remain in the same place.
There's still a
> >     prologue, but you can easily get rid of it by (implicitly)
> zero-padding
> >     the in vector during the initialization phase (start with a zero
> vector
> >     and real one value at a time). Getting rid of the epilogue is a
> little
> >     trickier, but I think it can be done.
> >
> >     Cheers,
> >
> >             Jean-Marc
> >
> >     > Then it becomes this
> >     >
> >     > FOR in=0 to N WITH in+=8
> >     >   PROC(in0(s0)) /* prolog 0 */
> >     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
> >     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
> >     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
> >     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4
*/
> >     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /*
prolog
> 5 */
> >     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
in6(s0)) /*
> >     > prolog 6 */
> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
> >     in7(s0))
> >     > /* fully process 8 inputs */
> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
> >     in7(s1))
> >     > /* continue */
> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
> >     in7(s2))
> >     > /* continue */
> >     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5)
in6(s4)
> >     in7(s3))
> >     > /* continue */
> >     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5)
in7(s4)) /*
> >     > epilog 0 */
> >     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /*
epilog
> >     1 */
> >     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2
*/
> >     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
> >     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
> >     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
> >     >   PROC(in7(s10)) /* epilog 6 */
> >     > END FOR
> >     >
> >     > And
> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
> >     in7(s0))
> >     > /* fully process 8 inputs */
> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
> >     in7(s1))
> >     > /* continue */
> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
> >     in7(s2))
> >     > /* continue */
> >     > is actually the expansion of the kernel loop
> >     > FOR i=0 TO order-6 WITH i++
> >     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3)
in5(si+2)
> >     > in6(si+1) in7(si+0))
> >     > END FOR
> >     >
> >     > The worst thing is that corr_QC[] is so sensitive that any
extra
> >     > processing will make them wrong and propagate to the next
loop
> (next 8
> >     > inputs). state_QS[] is a little better but still very
sensitive.
> For
> >     > instance, if adding PROC(in0(s11') in1(s10) in2(s9)
in3(s8) in4(s7)
> >     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one
more
> time)
> >     > and remove epilog 0, then all final results will be wrong.
> >     >
> >     > That's why the prolog and epilog cannot be saved to the
best of my
> >     > knowledge.
> >     >
> >     > The assembly size of silk_warped_autocorrelation_FIX_neon()
is
> about
> >     > 2,744 bytes. Compared with the C code size (about 452 bytes),
it's
> 2.3
> >     > KB larger. Considering silk_warped_autocorrelation_FIX_c() is
the
> >     second
> >     > place CPU heavy function in fixed-point, and our testing
shows up
> >     to 7%
> >     > CPU run time saving of the total encoder with this
optimization (at
> >     > Complexity 8), maybe we can take the I-cache burden even if
> finally we
> >     > still cannot remove the big chunk of prolog and epilog.
> >     >
> >     > Thanks,
> >     > Linfeng Zhang
> >     >
> >     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
> >     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
> >     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>> wrote:
> >     >
> >     >     Hi Felicia,
> >     >
> >     >     I've had time to work through the math in the
original
> >     function and I'm
> >     >     pretty sure it's possible to vectorize this without
the huge
> >     >     prologue/epilogue.
> >     >
> >     >     For the simple case where order = vector size = N (but it
> >     should easily
> >     >     generalize to larger order), what I came up with is:
> >     >
> >     >     initialize X, Y, M, C to vector of zeros
> >     >
> >     >     for i=0 to N+order
> >     >        T = [x(i), Y(0:N-2)]
> >     >        Y = M + coeff * (Y - T)
> >     >        M = T
> >     >        X = [x(i), X(0:N-1)]
> >     >        C = C + X*Y
> >     >
> >     >     I think something similar to this (assuming I didn't
mess up
> any
> >     >     details) should give you the correlations in vector C.
Did I
> miss
> >     >     anything?
> >     >
> >     >     Cheers,
> >     >
> >     >             Jean-Marc
> >     >
> >     >
> >     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
> >     >     > Hi,
> >     >     >
> >     >     > Attached is a patch with arm neon optimizations for
> >     >     > silk_warped_autocorrelation_FIX(). Please review.
> >     >     >
> >     >     > Thanks,
> >     >     > Felicia
> >     >     >
> >     >     >
> >     >     > _______________________________________________
> >     >     > opus mailing list
> >     >     > opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
> >     <mailto:opus at xiph.org>>
> >     >     > http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>
> >     >     <http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>>
> >     >     >
> >     >     _______________________________________________
> >     >     opus mailing list
> >     >     opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
> >     <mailto:opus at xiph.org>>
> >     >     http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>
> >     >     <http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>>
> >     >
> >     >
> >
> >
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170207/b778d6bc/attachment-0001.html>

Maybe Matching Threads

Search for more possibly parallel threads

opus - Feb 2017 - [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Maybe Matching Threads