thr3ads.net - opus - [opus] [PATCH] Optimize silk_warped_autocorrelation

If this information is useful, please help other people find it:
Share via:

Linfeng Zhang

2017-Feb-07 16:58 UTC

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Jean-Marc,

Thanks for your suggestions. Will get back to you once we have some updates.

Linfeng

On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Linfeng,
>
> On 06/02/17 07:18 PM, Linfeng Zhang wrote:
> > This is a great idea. But the order (psEncC->shapingLPCOrder) can
be
> > configured to 12, 14, 16, 20 and 24 according to complexity parameter.
> >
> > It's hard to get a universal function to handle all these orders
> > efficiently. Any suggestions?
>
> I can think of two ways of handling larger orders. The obvious one is
> simply to add an inner loop of the form:
> for (i=0;i<order;i+=VECTOR_SIZE)
> I think what may be more efficient is to simply have a small
"order-N"
> (N=4 or 8) kernel that not only computes the correlation of order N, but
> then spits out the signal after the N-stage all-pass is applied. The
> kernel would look like:
>
> void autocorr_kernel4(int *corr, int *orig, int *input, int *output, int
> len) {
>    /* Implement vectorized order-4 filter (could also be order 8)
>       as described in previous email and outputs the filtered signal.
>    */
> }
>
> and then the full function would run the kernel multiple times and look
> like:
>
> void full_autocorr(int *corr, int *orig, int len, int order) {
>    int i;
>    int tmp[MAX_SIZE];
>    int *in = orig;
>    for (i=0;i<order;i+=4) {
>       autocorr_kernel4(corr+i, orig, in, tmp, len);
>       /* Make subsequent calls use the filtered signal as input. */
>       in = tmp;
>    }
> }
>
> I think the should not only reduce/eliminate the prologue/epilogue
> problem, but it should also be more efficient since almost all vectors
> processed would use the full size.
>
> Maybe a third option (not sure it's a good idea, but still mentioning
> it) would be to have a function that hardcodes order=24 and discards the
> larger values that aren't needed. Since the smallest order seems to be
> 16, it wouldn't be much of a waste and the code might end up running
> faster for the higher orders.
>
> Cheers,
>
>         Jean-Marc
>
>
> > Thanks,
> > Linfeng
> >
> > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca
> > <mailto:jmvalin at jmvalin.ca>> wrote:
> >
> >     Hi Linfeng,
> >
> >     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> >     > However, the critical thing is that all the states in each
stage
> when
> >     > processing input[i] are reused by the next input[i+1]. That
is
> >     > input[i+1] must wait input[i] for 1 stage, and input[i+2]
must wait
> >     > input[i+1] for 1 stage, etc.
> >
> >     That is indeed the tricky part... and the one I think you could do
> >     slightly differently. If you approach the problem in terms of
> computing
> >     chunks of the inputs N samples at a time, then indeed the approach
> you
> >     are describing is the only solution. What I was proposing though
is
> to
> >     instead chop the "order" in chunks of N. Using your
notation, you
> would
> >     be doing:
> >
> >     PROC(                                                       
in0(s0))
> >     PROC(                                                in0(s1)
in1(s0))
> >     PROC(                                        in0(s2) in1(s1)
in2(s0))
> >     PROC(                                in0(s3) in1(s2) in2(s1)
in3(s0))
> >     PROC(                        in0(s4) in1(s3) in2(s2) in3(s1)
in4(s0))
> >     PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
in5(s0))
> >     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
in6(s0))
> >     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
in7(s0))
> >     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)
in8(s0))
> >     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1)
in9(s0))
> >     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
in9(s1)in10(s0))
> >     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
in9(s2)in10(s1)in11(s0))
> >     ...and so on until the end of the input vector
> >
> >     The difference is that it's now the input vector that
"slides" and
> the
> >     "state" values sy that remain in the same place.
There's still a
> >     prologue, but you can easily get rid of it by (implicitly)
> zero-padding
> >     the in vector during the initialization phase (start with a zero
> vector
> >     and real one value at a time). Getting rid of the epilogue is a
> little
> >     trickier, but I think it can be done.
> >
> >     Cheers,
> >
> >             Jean-Marc
> >
> >     > Then it becomes this
> >     >
> >     > FOR in=0 to N WITH in+=8
> >     >   PROC(in0(s0)) /* prolog 0 */
> >     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
> >     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
> >     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
> >     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4
*/
> >     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /*
prolog
> 5 */
> >     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
in6(s0)) /*
> >     > prolog 6 */
> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
> >     in7(s0))
> >     > /* fully process 8 inputs */
> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
> >     in7(s1))
> >     > /* continue */
> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
> >     in7(s2))
> >     > /* continue */
> >     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5)
in6(s4)
> >     in7(s3))
> >     > /* continue */
> >     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5)
in7(s4)) /*
> >     > epilog 0 */
> >     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /*
epilog
> >     1 */
> >     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2
*/
> >     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
> >     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
> >     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
> >     >   PROC(in7(s10)) /* epilog 6 */
> >     > END FOR
> >     >
> >     > And
> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
> >     in7(s0))
> >     > /* fully process 8 inputs */
> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
> >     in7(s1))
> >     > /* continue */
> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
> >     in7(s2))
> >     > /* continue */
> >     > is actually the expansion of the kernel loop
> >     > FOR i=0 TO order-6 WITH i++
> >     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3)
in5(si+2)
> >     > in6(si+1) in7(si+0))
> >     > END FOR
> >     >
> >     > The worst thing is that corr_QC[] is so sensitive that any
extra
> >     > processing will make them wrong and propagate to the next
loop
> (next 8
> >     > inputs). state_QS[] is a little better but still very
sensitive.
> For
> >     > instance, if adding PROC(in0(s11') in1(s10) in2(s9)
in3(s8) in4(s7)
> >     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one
more
> time)
> >     > and remove epilog 0, then all final results will be wrong.
> >     >
> >     > That's why the prolog and epilog cannot be saved to the
best of my
> >     > knowledge.
> >     >
> >     > The assembly size of silk_warped_autocorrelation_FIX_neon()
is
> about
> >     > 2,744 bytes. Compared with the C code size (about 452 bytes),
it's
> 2.3
> >     > KB larger. Considering silk_warped_autocorrelation_FIX_c() is
the
> >     second
> >     > place CPU heavy function in fixed-point, and our testing
shows up
> >     to 7%
> >     > CPU run time saving of the total encoder with this
optimization (at
> >     > Complexity 8), maybe we can take the I-cache burden even if
> finally we
> >     > still cannot remove the big chunk of prolog and epilog.
> >     >
> >     > Thanks,
> >     > Linfeng Zhang
> >     >
> >     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
> >     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
> >     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>> wrote:
> >     >
> >     >     Hi Felicia,
> >     >
> >     >     I've had time to work through the math in the
original
> >     function and I'm
> >     >     pretty sure it's possible to vectorize this without
the huge
> >     >     prologue/epilogue.
> >     >
> >     >     For the simple case where order = vector size = N (but it
> >     should easily
> >     >     generalize to larger order), what I came up with is:
> >     >
> >     >     initialize X, Y, M, C to vector of zeros
> >     >
> >     >     for i=0 to N+order
> >     >        T = [x(i), Y(0:N-2)]
> >     >        Y = M + coeff * (Y - T)
> >     >        M = T
> >     >        X = [x(i), X(0:N-1)]
> >     >        C = C + X*Y
> >     >
> >     >     I think something similar to this (assuming I didn't
mess up
> any
> >     >     details) should give you the correlations in vector C.
Did I
> miss
> >     >     anything?
> >     >
> >     >     Cheers,
> >     >
> >     >             Jean-Marc
> >     >
> >     >
> >     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
> >     >     > Hi,
> >     >     >
> >     >     > Attached is a patch with arm neon optimizations for
> >     >     > silk_warped_autocorrelation_FIX(). Please review.
> >     >     >
> >     >     > Thanks,
> >     >     > Felicia
> >     >     >
> >     >     >
> >     >     > _______________________________________________
> >     >     > opus mailing list
> >     >     > opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
> >     <mailto:opus at xiph.org>>
> >     >     > http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>
> >     >     <http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>>
> >     >     >
> >     >     _______________________________________________
> >     >     opus mailing list
> >     >     opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
> >     <mailto:opus at xiph.org>>
> >     >     http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>
> >     >     <http://lists.xiph.org/mailman/listinfo/opus
> >     <http://lists.xiph.org/mailman/listinfo/opus>>
> >     >
> >     >
> >
> >
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170207/b778d6bc/attachment-0001.html>

Jean-Marc Valin

2017-Feb-07 20:30 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

On 07/02/17 11:58 AM, Linfeng Zhang wrote:> Thanks for your suggestions. Will get back to you once we have some
updates.
Thanks. Let me know if you have other of your optimization patches ready
for review.

	Jean-Marc
> Linfeng
> 
> On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca
> <mailto:jmvalin at jmvalin.ca>> wrote:
> 
>     Hi Linfeng,
> 
>     On 06/02/17 07:18 PM, Linfeng Zhang wrote:
>     > This is a great idea. But the order (psEncC->shapingLPCOrder)
can be
>     > configured to 12, 14, 16, 20 and 24 according to complexity
parameter.
>     >
>     > It's hard to get a universal function to handle all these
orders
>     > efficiently. Any suggestions?
> 
>     I can think of two ways of handling larger orders. The obvious one is
>     simply to add an inner loop of the form:
>     for (i=0;i<order;i+=VECTOR_SIZE)
>     I think what may be more efficient is to simply have a small
"order-N"
>     (N=4 or 8) kernel that not only computes the correlation of order N,
but
>     then spits out the signal after the N-stage all-pass is applied. The
>     kernel would look like:
> 
>     void autocorr_kernel4(int *corr, int *orig, int *input, int *output,
int
>     len) {
>        /* Implement vectorized order-4 filter (could also be order 8)
>           as described in previous email and outputs the filtered signal.
>        */
>     }
> 
>     and then the full function would run the kernel multiple times and look
>     like:
> 
>     void full_autocorr(int *corr, int *orig, int len, int order) {
>        int i;
>        int tmp[MAX_SIZE];
>        int *in = orig;
>        for (i=0;i<order;i+=4) {
>           autocorr_kernel4(corr+i, orig, in, tmp, len);
>           /* Make subsequent calls use the filtered signal as input. */
>           in = tmp;
>        }
>     }
> 
>     I think the should not only reduce/eliminate the prologue/epilogue
>     problem, but it should also be more efficient since almost all vectors
>     processed would use the full size.
> 
>     Maybe a third option (not sure it's a good idea, but still
mentioning
>     it) would be to have a function that hardcodes order=24 and discards
the
>     larger values that aren't needed. Since the smallest order seems to
be
>     16, it wouldn't be much of a waste and the code might end up
running
>     faster for the higher orders.
> 
>     Cheers,
> 
>             Jean-Marc
> 
> 
>     > Thanks,
>     > Linfeng
>     >
>     > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>> wrote:
>     >
>     >     Hi Linfeng,
>     >
>     >     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
>     >     > However, the critical thing is that all the states in
each
>     stage when
>     >     > processing input[i] are reused by the next input[i+1].
That is
>     >     > input[i+1] must wait input[i] for 1 stage, and input[i+2]
>     must wait
>     >     > input[i+1] for 1 stage, etc.
>     >
>     >     That is indeed the tricky part... and the one I think you
could do
>     >     slightly differently. If you approach the problem in terms of
>     computing
>     >     chunks of the inputs N samples at a time, then indeed the
>     approach you
>     >     are describing is the only solution. What I was proposing
>     though is to
>     >     instead chop the "order" in chunks of N. Using your
notation,
>     you would
>     >     be doing:
>     >
>     >     PROC(                                                       
>     in0(s0))
>     >     PROC(                                                in0(s1)
>     in1(s0))
>     >     PROC(                                        in0(s2) in1(s1)
>     in2(s0))
>     >     PROC(                                in0(s3) in1(s2) in2(s1)
>     in3(s0))
>     >     PROC(                        in0(s4) in1(s3) in2(s2) in3(s1)
>     in4(s0))
>     >     PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
>     in5(s0))
>     >     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
>     in6(s0))
>     >     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>     in7(s0))
>     >     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)
>     in8(s0))
>     >     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1)
>     in9(s0))
>     >     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
>     in9(s1)in10(s0))
>     >     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
>     in9(s2)in10(s1)in11(s0))
>     >     ...and so on until the end of the input vector
>     >
>     >     The difference is that it's now the input vector that
"slides"
>     and the
>     >     "state" values sy that remain in the same place.
There's still a
>     >     prologue, but you can easily get rid of it by (implicitly)
>     zero-padding
>     >     the in vector during the initialization phase (start with a
>     zero vector
>     >     and real one value at a time). Getting rid of the epilogue is
>     a little
>     >     trickier, but I think it can be done.
>     >
>     >     Cheers,
>     >
>     >             Jean-Marc
>     >
>     >     > Then it becomes this
>     >     >
>     >     > FOR in=0 to N WITH in+=8
>     >     >   PROC(in0(s0)) /* prolog 0 */
>     >     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
>     >     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
>     >     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
>     >     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog
4 */
>     >     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
/*
>     prolog 5 */
>     >     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
>     in6(s0)) /*
>     >     > prolog 6 */
>     >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
>     >     in7(s0))
>     >     > /* fully process 8 inputs */
>     >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
>     >     in7(s1))
>     >     > /* continue */
>     >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
>     >     in7(s2))
>     >     > /* continue */
>     >     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5)
in6(s4)
>     >     in7(s3))
>     >     > /* continue */
>     >     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5)
>     in7(s4)) /*
>     >     > epilog 0 */
>     >     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5))
/*
>     epilog
>     >     1 */
>     >     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /*
epilog 2 */
>     >     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
>     >     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
>     >     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
>     >     >   PROC(in7(s10)) /* epilog 6 */
>     >     > END FOR
>     >     >
>     >     > And
>     >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
>     >     in7(s0))
>     >     > /* fully process 8 inputs */
>     >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
>     >     in7(s1))
>     >     > /* continue */
>     >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
>     >     in7(s2))
>     >     > /* continue */
>     >     > is actually the expansion of the kernel loop
>     >     > FOR i=0 TO order-6 WITH i++
>     >     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3)
>     in5(si+2)
>     >     > in6(si+1) in7(si+0))
>     >     > END FOR
>     >     >
>     >     > The worst thing is that corr_QC[] is so sensitive that
any extra
>     >     > processing will make them wrong and propagate to the next
>     loop (next 8
>     >     > inputs). state_QS[] is a little better but still very
>     sensitive. For
>     >     > instance, if adding PROC(in0(s11') in1(s10) in2(s9)
in3(s8)
>     in4(s7)
>     >     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping
one
>     more time)
>     >     > and remove epilog 0, then all final results will be
wrong.
>     >     >
>     >     > That's why the prolog and epilog cannot be saved to
the best
>     of my
>     >     > knowledge.
>     >     >
>     >     > The assembly size of
silk_warped_autocorrelation_FIX_neon()
>     is about
>     >     > 2,744 bytes. Compared with the C code size (about 452
>     bytes), it's 2.3
>     >     > KB larger. Considering
silk_warped_autocorrelation_FIX_c()
>     is the
>     >     second
>     >     > place CPU heavy function in fixed-point, and our testing
>     shows up
>     >     to 7%
>     >     > CPU run time saving of the total encoder with this
>     optimization (at
>     >     > Complexity 8), maybe we can take the I-cache burden even
if
>     finally we
>     >     > still cannot remove the big chunk of prolog and epilog.
>     >     >
>     >     > Thanks,
>     >     > Linfeng Zhang
>     >     >
>     >     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
>     >     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>
>     >     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>
>     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>>> wrote:
>     >     >
>     >     >     Hi Felicia,
>     >     >
>     >     >     I've had time to work through the math in the
original
>     >     function and I'm
>     >     >     pretty sure it's possible to vectorize this
without the huge
>     >     >     prologue/epilogue.
>     >     >
>     >     >     For the simple case where order = vector size = N
(but it
>     >     should easily
>     >     >     generalize to larger order), what I came up with is:
>     >     >
>     >     >     initialize X, Y, M, C to vector of zeros
>     >     >
>     >     >     for i=0 to N+order
>     >     >        T = [x(i), Y(0:N-2)]
>     >     >        Y = M + coeff * (Y - T)
>     >     >        M = T
>     >     >        X = [x(i), X(0:N-1)]
>     >     >        C = C + X*Y
>     >     >
>     >     >     I think something similar to this (assuming I
didn't
>     mess up any
>     >     >     details) should give you the correlations in vector
C.
>     Did I miss
>     >     >     anything?
>     >     >
>     >     >     Cheers,
>     >     >
>     >     >             Jean-Marc
>     >     >
>     >     >
>     >     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
>     >     >     > Hi,
>     >     >     >
>     >     >     > Attached is a patch with arm neon optimizations
for
>     >     >     > silk_warped_autocorrelation_FIX(). Please
review.
>     >     >     >
>     >     >     > Thanks,
>     >     >     > Felicia
>     >     >     >
>     >     >     >
>     >     >     > _______________________________________________
>     >     >     > opus mailing list
>     >     >     > opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>
<mailto:opus at xiph.org
>     <mailto:opus at xiph.org>
>     >     <mailto:opus at xiph.org <mailto:opus at
xiph.org>>>
>     >     >     > http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>>
>     >     >     >
>     >     >     _______________________________________________
>     >     >     opus mailing list
>     >     >     opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>
<mailto:opus at xiph.org
>     <mailto:opus at xiph.org>
>     >     <mailto:opus at xiph.org <mailto:opus at
xiph.org>>>
>     >     >     http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>>
>     >     >
>     >     >
>     >
>     >
> 
>

Linfeng Zhang

2017-Apr-03 22:01 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Jean-Marc,

Attached is the silk_warped_autocorrelation_FIX_neon() which implements
your idea.

Speed improvement vs the previous optimization:

Complexity 0-4: Doesn't call this function. Complexity 5: 2.1% (order = 16)
Complexity 6: 1.0% (order = 20) Complexity 8: 0.1% (order = 24) Complexity
10: 0.1% (order = 24)

Code size of silk_warped_autocorrelation_FIX_neon() changes from 2,644
bytes to 3,228 bytes.

The reason of larger code size is that the new optimization specializes
order 16, 20 and 24. If only keeping order 24 specialization, the code
still works and the code size is smaller, but the encoder speed will drop
4.0% for Complexity 5 and 2.0% for Complexity 6. Anyway, the new code is
easier to understand and maintain.

Thanks,

Linfeng

On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at google.com>
wrote:
> Hi Jean-Marc,
>
> Thanks for your suggestions. Will get back to you once we have some
> updates.
>
> Linfeng
>
> On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca>
> wrote:
>
>> Hi Linfeng,
>>
>> On 06/02/17 07:18 PM, Linfeng Zhang wrote:
>> > This is a great idea. But the order (psEncC->shapingLPCOrder)
can be
>> > configured to 12, 14, 16, 20 and 24 according to complexity
parameter.
>> >
>> > It's hard to get a universal function to handle all these
orders
>> > efficiently. Any suggestions?
>>
>> I can think of two ways of handling larger orders. The obvious one is
>> simply to add an inner loop of the form:
>> for (i=0;i<order;i+=VECTOR_SIZE)
>> I think what may be more efficient is to simply have a small
"order-N"
>> (N=4 or 8) kernel that not only computes the correlation of order N,
but
>> then spits out the signal after the N-stage all-pass is applied. The
>> kernel would look like:
>>
>> void autocorr_kernel4(int *corr, int *orig, int *input, int *output,
int
>> len) {
>>    /* Implement vectorized order-4 filter (could also be order 8)
>>       as described in previous email and outputs the filtered signal.
>>    */
>> }
>>
>> and then the full function would run the kernel multiple times and look
>> like:
>>
>> void full_autocorr(int *corr, int *orig, int len, int order) {
>>    int i;
>>    int tmp[MAX_SIZE];
>>    int *in = orig;
>>    for (i=0;i<order;i+=4) {
>>       autocorr_kernel4(corr+i, orig, in, tmp, len);
>>       /* Make subsequent calls use the filtered signal as input. */
>>       in = tmp;
>>    }
>> }
>>
>> I think the should not only reduce/eliminate the prologue/epilogue
>> problem, but it should also be more efficient since almost all vectors
>> processed would use the full size.
>>
>> Maybe a third option (not sure it's a good idea, but still
mentioning
>> it) would be to have a function that hardcodes order=24 and discards
the
>> larger values that aren't needed. Since the smallest order seems to
be
>> 16, it wouldn't be much of a waste and the code might end up
running
>> faster for the higher orders.
>>
>> Cheers,
>>
>>         Jean-Marc
>>
>>
>> > Thanks,
>> > Linfeng
>> >
>> > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca
>> > <mailto:jmvalin at jmvalin.ca>> wrote:
>> >
>> >     Hi Linfeng,
>> >
>> >     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
>> >     > However, the critical thing is that all the states in
each stage
>> when
>> >     > processing input[i] are reused by the next input[i+1].
That is
>> >     > input[i+1] must wait input[i] for 1 stage, and input[i+2]
must
>> wait
>> >     > input[i+1] for 1 stage, etc.
>> >
>> >     That is indeed the tricky part... and the one I think you
could do
>> >     slightly differently. If you approach the problem in terms of
>> computing
>> >     chunks of the inputs N samples at a time, then indeed the
approach
>> you
>> >     are describing is the only solution. What I was proposing
though is
>> to
>> >     instead chop the "order" in chunks of N. Using your
notation, you
>> would
>> >     be doing:
>> >
>> >     PROC(
>> in0(s0))
>> >     PROC(                                                in0(s1)
>> in1(s0))
>> >     PROC(                                        in0(s2) in1(s1)
>> in2(s0))
>> >     PROC(                                in0(s3) in1(s2) in2(s1)
>> in3(s0))
>> >     PROC(                        in0(s4) in1(s3) in2(s2) in3(s1)
>> in4(s0))
>> >     PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
>> in5(s0))
>> >     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
>> in6(s0))
>> >     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>> in7(s0))
>> >     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)
>> in8(s0))
>> >     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1)
>> in9(s0))
>> >     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
>> in9(s1)in10(s0))
>> >     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
>> in9(s2)in10(s1)in11(s0))
>> >     ...and so on until the end of the input vector
>> >
>> >     The difference is that it's now the input vector that
"slides" and
>> the
>> >     "state" values sy that remain in the same place.
There's still a
>> >     prologue, but you can easily get rid of it by (implicitly)
>> zero-padding
>> >     the in vector during the initialization phase (start with a
zero
>> vector
>> >     and real one value at a time). Getting rid of the epilogue is
a
>> little
>> >     trickier, but I think it can be done.
>> >
>> >     Cheers,
>> >
>> >             Jean-Marc
>> >
>> >     > Then it becomes this
>> >     >
>> >     > FOR in=0 to N WITH in+=8
>> >     >   PROC(in0(s0)) /* prolog 0 */
>> >     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
>> >     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
>> >     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
>> >     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog
4 */
>> >     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
/* prolog
>> 5 */
>> >     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
in6(s0)) /*
>> >     > prolog 6 */
>> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
>> >     in7(s0))
>> >     > /* fully process 8 inputs */
>> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
>> >     in7(s1))
>> >     > /* continue */
>> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
>> >     in7(s2))
>> >     > /* continue */
>> >     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5)
in6(s4)
>> >     in7(s3))
>> >     > /* continue */
>> >     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5)
in7(s4))
>> /*
>> >     > epilog 0 */
>> >     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5))
/* epilog
>> >     1 */
>> >     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /*
epilog 2 */
>> >     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
>> >     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
>> >     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
>> >     >   PROC(in7(s10)) /* epilog 6 */
>> >     > END FOR
>> >     >
>> >     > And
>> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
>> >     in7(s0))
>> >     > /* fully process 8 inputs */
>> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3)
in6(s2)
>> >     in7(s1))
>> >     > /* continue */
>> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4)
in6(s3)
>> >     in7(s2))
>> >     > /* continue */
>> >     > is actually the expansion of the kernel loop
>> >     > FOR i=0 TO order-6 WITH i++
>> >     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3)
in5(si+2)
>> >     > in6(si+1) in7(si+0))
>> >     > END FOR
>> >     >
>> >     > The worst thing is that corr_QC[] is so sensitive that
any extra
>> >     > processing will make them wrong and propagate to the next
loop
>> (next 8
>> >     > inputs). state_QS[] is a little better but still very
sensitive.
>> For
>> >     > instance, if adding PROC(in0(s11') in1(s10) in2(s9)
in3(s8)
>> in4(s7)
>> >     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping
one more
>> time)
>> >     > and remove epilog 0, then all final results will be
wrong.
>> >     >
>> >     > That's why the prolog and epilog cannot be saved to
the best of my
>> >     > knowledge.
>> >     >
>> >     > The assembly size of
silk_warped_autocorrelation_FIX_neon() is
>> about
>> >     > 2,744 bytes. Compared with the C code size (about 452
bytes),
>> it's 2.3
>> >     > KB larger. Considering
silk_warped_autocorrelation_FIX_c() is the
>> >     second
>> >     > place CPU heavy function in fixed-point, and our testing
shows up
>> >     to 7%
>> >     > CPU run time saving of the total encoder with this
optimization
>> (at
>> >     > Complexity 8), maybe we can take the I-cache burden even
if
>> finally we
>> >     > still cannot remove the big chunk of prolog and epilog.
>> >     >
>> >     > Thanks,
>> >     > Linfeng Zhang
>> >     >
>> >     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
>> >     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>> >     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>>> wrote:
>> >     >
>> >     >     Hi Felicia,
>> >     >
>> >     >     I've had time to work through the math in the
original
>> >     function and I'm
>> >     >     pretty sure it's possible to vectorize this
without the huge
>> >     >     prologue/epilogue.
>> >     >
>> >     >     For the simple case where order = vector size = N
(but it
>> >     should easily
>> >     >     generalize to larger order), what I came up with is:
>> >     >
>> >     >     initialize X, Y, M, C to vector of zeros
>> >     >
>> >     >     for i=0 to N+order
>> >     >        T = [x(i), Y(0:N-2)]
>> >     >        Y = M + coeff * (Y - T)
>> >     >        M = T
>> >     >        X = [x(i), X(0:N-1)]
>> >     >        C = C + X*Y
>> >     >
>> >     >     I think something similar to this (assuming I
didn't mess up
>> any
>> >     >     details) should give you the correlations in vector
C. Did I
>> miss
>> >     >     anything?
>> >     >
>> >     >     Cheers,
>> >     >
>> >     >             Jean-Marc
>> >     >
>> >     >
>> >     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
>> >     >     > Hi,
>> >     >     >
>> >     >     > Attached is a patch with arm neon optimizations
for
>> >     >     > silk_warped_autocorrelation_FIX(). Please
review.
>> >     >     >
>> >     >     > Thanks,
>> >     >     > Felicia
>> >     >     >
>> >     >     >
>> >     >     > _______________________________________________
>> >     >     > opus mailing list
>> >     >     > opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
>> >     <mailto:opus at xiph.org>>
>> >     >     > http://lists.xiph.org/mailman/listinfo/opus
>> >     <http://lists.xiph.org/mailman/listinfo/opus>
>> >     >     <http://lists.xiph.org/mailman/listinfo/opus
>> >     <http://lists.xiph.org/mailman/listinfo/opus>>
>> >     >     >
>> >     >     _______________________________________________
>> >     >     opus mailing list
>> >     >     opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
>> >     <mailto:opus at xiph.org>>
>> >     >     http://lists.xiph.org/mailman/listinfo/opus
>> >     <http://lists.xiph.org/mailman/listinfo/opus>
>> >     >     <http://lists.xiph.org/mailman/listinfo/opus
>> >     <http://lists.xiph.org/mailman/listinfo/opus>>
>> >     >
>> >     >
>> >
>> >
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170403/a85fa831/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 34983 bytes
Desc: not available
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170403/a85fa831/attachment-0001.bin>

Linfeng Zhang

2017-Apr-05 16:14 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

I attached a new patch with small cleanup (disassembly is identical as the
last patch). We have done the same internal testing as usual.

Also, attached 2 failed temporary versions which try to reduce code size
(just for code review reference purpose).

The new patch of silk_warped_autocorrelation_FIX_neon() has a code size of
3,228 bytes (with gcc).
smaller_slower.c has a code size of 2,304 bytes, but the encoder is about
1.8% - 2.7% slower.
smallest_slowest.c has a code size of 1,656 bytes, but the encoder is about
2.3% - 3.6% slower.

Thanks,
Linfeng

On Mon, Apr 3, 2017 at 3:01 PM, Linfeng Zhang <linfengz at google.com>
wrote:
> Hi Jean-Marc,
>
> Attached is the silk_warped_autocorrelation_FIX_neon() which implements
> your idea.
>
> Speed improvement vs the previous optimization:
>
> Complexity 0-4: Doesn't call this function. Complexity 5: 2.1% (order
> 16) Complexity 6: 1.0% (order = 20) Complexity 8: 0.1% (order = 24)
> Complexity 10: 0.1% (order = 24)
>
> Code size of silk_warped_autocorrelation_FIX_neon() changes from 2,644
> bytes to 3,228 bytes.
>
> The reason of larger code size is that the new optimization specializes
> order 16, 20 and 24. If only keeping order 24 specialization, the code
> still works and the code size is smaller, but the encoder speed will drop
> 4.0% for Complexity 5 and 2.0% for Complexity 6. Anyway, the new code is
> easier to understand and maintain.
>
> Thanks,
>
> Linfeng
>
> On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at
google.com> wrote:
>
>> Hi Jean-Marc,
>>
>> Thanks for your suggestions. Will get back to you once we have some
>> updates.
>>
>> Linfeng
>>
>> On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at
jmvalin.ca>
>> wrote:
>>
>>> Hi Linfeng,
>>>
>>> On 06/02/17 07:18 PM, Linfeng Zhang wrote:
>>> > This is a great idea. But the order
(psEncC->shapingLPCOrder) can be
>>> > configured to 12, 14, 16, 20 and 24 according to complexity
parameter.
>>> >
>>> > It's hard to get a universal function to handle all these
orders
>>> > efficiently. Any suggestions?
>>>
>>> I can think of two ways of handling larger orders. The obvious one
is
>>> simply to add an inner loop of the form:
>>> for (i=0;i<order;i+=VECTOR_SIZE)
>>> I think what may be more efficient is to simply have a small
"order-N"
>>> (N=4 or 8) kernel that not only computes the correlation of order
N, but
>>> then spits out the signal after the N-stage all-pass is applied.
The
>>> kernel would look like:
>>>
>>> void autocorr_kernel4(int *corr, int *orig, int *input, int
*output, int
>>> len) {
>>>    /* Implement vectorized order-4 filter (could also be order 8)
>>>       as described in previous email and outputs the filtered
signal.
>>>    */
>>> }
>>>
>>> and then the full function would run the kernel multiple times and
look
>>> like:
>>>
>>> void full_autocorr(int *corr, int *orig, int len, int order) {
>>>    int i;
>>>    int tmp[MAX_SIZE];
>>>    int *in = orig;
>>>    for (i=0;i<order;i+=4) {
>>>       autocorr_kernel4(corr+i, orig, in, tmp, len);
>>>       /* Make subsequent calls use the filtered signal as input. */
>>>       in = tmp;
>>>    }
>>> }
>>>
>>> I think the should not only reduce/eliminate the prologue/epilogue
>>> problem, but it should also be more efficient since almost all
vectors
>>> processed would use the full size.
>>>
>>> Maybe a third option (not sure it's a good idea, but still
mentioning
>>> it) would be to have a function that hardcodes order=24 and
discards the
>>> larger values that aren't needed. Since the smallest order
seems to be
>>> 16, it wouldn't be much of a waste and the code might end up
running
>>> faster for the higher orders.
>>>
>>> Cheers,
>>>
>>>         Jean-Marc
>>>
>>>
>>> > Thanks,
>>> > Linfeng
>>> >
>>> > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin
at jmvalin.ca
>>> > <mailto:jmvalin at jmvalin.ca>> wrote:
>>> >
>>> >     Hi Linfeng,
>>> >
>>> >     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
>>> >     > However, the critical thing is that all the states in
each stage
>>> when
>>> >     > processing input[i] are reused by the next
input[i+1]. That is
>>> >     > input[i+1] must wait input[i] for 1 stage, and
input[i+2] must
>>> wait
>>> >     > input[i+1] for 1 stage, etc.
>>> >
>>> >     That is indeed the tricky part... and the one I think you
could do
>>> >     slightly differently. If you approach the problem in terms
of
>>> computing
>>> >     chunks of the inputs N samples at a time, then indeed the
approach
>>> you
>>> >     are describing is the only solution. What I was proposing
though
>>> is to
>>> >     instead chop the "order" in chunks of N. Using
your notation, you
>>> would
>>> >     be doing:
>>> >
>>> >     PROC(
>>> in0(s0))
>>> >     PROC(                                               
in0(s1)
>>> in1(s0))
>>> >     PROC(                                        in0(s2)
in1(s1)
>>> in2(s0))
>>> >     PROC(                                in0(s3) in1(s2)
in2(s1)
>>> in3(s0))
>>> >     PROC(                        in0(s4) in1(s3) in2(s2)
in3(s1)
>>> in4(s0))
>>> >     PROC(                in0(s5) in1(s4) in2(s3) in3(s2)
in4(s1)
>>> in5(s0))
>>> >     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2)
in5(s1)
>>> in6(s0))
>>> >     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
in6(s1)
>>> in7(s0))
>>> >     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
in7(s1)
>>> in8(s0))
>>> >     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)
in8(s1)
>>> in9(s0))
>>> >     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
>>> in9(s1)in10(s0))
>>> >     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
>>> in9(s2)in10(s1)in11(s0))
>>> >     ...and so on until the end of the input vector
>>> >
>>> >     The difference is that it's now the input vector that
"slides" and
>>> the
>>> >     "state" values sy that remain in the same place.
There's still a
>>> >     prologue, but you can easily get rid of it by (implicitly)
>>> zero-padding
>>> >     the in vector during the initialization phase (start with
a zero
>>> vector
>>> >     and real one value at a time). Getting rid of the epilogue
is a
>>> little
>>> >     trickier, but I think it can be done.
>>> >
>>> >     Cheers,
>>> >
>>> >             Jean-Marc
>>> >
>>> >     > Then it becomes this
>>> >     >
>>> >     > FOR in=0 to N WITH in+=8
>>> >     >   PROC(in0(s0)) /* prolog 0 */
>>> >     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
>>> >     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
>>> >     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3
*/
>>> >     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /*
prolog 4 */
>>> >     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
in5(s0)) /*
>>> prolog 5 */
>>> >     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2)
in5(s1) in6(s0))
>>> /*
>>> >     > prolog 6 */
>>> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3)
in5(s2) in6(s1)
>>> >     in7(s0))
>>> >     > /* fully process 8 inputs */
>>> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4)
in5(s3) in6(s2)
>>> >     in7(s1))
>>> >     > /* continue */
>>> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5)
in5(s4) in6(s3)
>>> >     in7(s2))
>>> >     > /* continue */
>>> >     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6)
in5(s5) in6(s4)
>>> >     in7(s3))
>>> >     > /* continue */
>>> >     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6)
in6(s5) in7(s4))
>>> /*
>>> >     > epilog 0 */
>>> >     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6)
in7(s5)) /*
>>> epilog
>>> >     1 */
>>> >     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /*
epilog 2 */
>>> >     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3
*/
>>> >     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
>>> >     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
>>> >     >   PROC(in7(s10)) /* epilog 6 */
>>> >     > END FOR
>>> >     >
>>> >     > And
>>> >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3)
in5(s2) in6(s1)
>>> >     in7(s0))
>>> >     > /* fully process 8 inputs */
>>> >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4)
in5(s3) in6(s2)
>>> >     in7(s1))
>>> >     > /* continue */
>>> >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5)
in5(s4) in6(s3)
>>> >     in7(s2))
>>> >     > /* continue */
>>> >     > is actually the expansion of the kernel loop
>>> >     > FOR i=0 TO order-6 WITH i++
>>> >     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4)
in4(si+3)
>>> in5(si+2)
>>> >     > in6(si+1) in7(si+0))
>>> >     > END FOR
>>> >     >
>>> >     > The worst thing is that corr_QC[] is so sensitive
that any extra
>>> >     > processing will make them wrong and propagate to the
next loop
>>> (next 8
>>> >     > inputs). state_QS[] is a little better but still very
sensitive.
>>> For
>>> >     > instance, if adding PROC(in0(s11') in1(s10)
in2(s9) in3(s8)
>>> in4(s7)
>>> >     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by
looping one more
>>> time)
>>> >     > and remove epilog 0, then all final results will be
wrong.
>>> >     >
>>> >     > That's why the prolog and epilog cannot be saved
to the best of
>>> my
>>> >     > knowledge.
>>> >     >
>>> >     > The assembly size of
silk_warped_autocorrelation_FIX_neon() is
>>> about
>>> >     > 2,744 bytes. Compared with the C code size (about 452
bytes),
>>> it's 2.3
>>> >     > KB larger. Considering
silk_warped_autocorrelation_FIX_c() is
>>> the
>>> >     second
>>> >     > place CPU heavy function in fixed-point, and our
testing shows up
>>> >     to 7%
>>> >     > CPU run time saving of the total encoder with this
optimization
>>> (at
>>> >     > Complexity 8), maybe we can take the I-cache burden
even if
>>> finally we
>>> >     > still cannot remove the big chunk of prolog and
epilog.
>>> >     >
>>> >     > Thanks,
>>> >     > Linfeng Zhang
>>> >     >
>>> >     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
>>> >     <jmvalin at jmvalin.ca <mailto:jmvalin at
jmvalin.ca>
>>> >     > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin
at jmvalin.ca>>> wrote:
>>> >     >
>>> >     >     Hi Felicia,
>>> >     >
>>> >     >     I've had time to work through the math in the
original
>>> >     function and I'm
>>> >     >     pretty sure it's possible to vectorize this
without the huge
>>> >     >     prologue/epilogue.
>>> >     >
>>> >     >     For the simple case where order = vector size = N
(but it
>>> >     should easily
>>> >     >     generalize to larger order), what I came up with
is:
>>> >     >
>>> >     >     initialize X, Y, M, C to vector of zeros
>>> >     >
>>> >     >     for i=0 to N+order
>>> >     >        T = [x(i), Y(0:N-2)]
>>> >     >        Y = M + coeff * (Y - T)
>>> >     >        M = T
>>> >     >        X = [x(i), X(0:N-1)]
>>> >     >        C = C + X*Y
>>> >     >
>>> >     >     I think something similar to this (assuming I
didn't mess up
>>> any
>>> >     >     details) should give you the correlations in
vector C. Did I
>>> miss
>>> >     >     anything?
>>> >     >
>>> >     >     Cheers,
>>> >     >
>>> >     >             Jean-Marc
>>> >     >
>>> >     >
>>> >     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
>>> >     >     > Hi,
>>> >     >     >
>>> >     >     > Attached is a patch with arm neon
optimizations for
>>> >     >     > silk_warped_autocorrelation_FIX(). Please
review.
>>> >     >     >
>>> >     >     > Thanks,
>>> >     >     > Felicia
>>> >     >     >
>>> >     >     >
>>> >     >     >
_______________________________________________
>>> >     >     > opus mailing list
>>> >     >     > opus at xiph.org <mailto:opus at
xiph.org> <mailto:opus at xiph.org
>>> >     <mailto:opus at xiph.org>>
>>> >     >     > http://lists.xiph.org/mailman/listinfo/opus
>>> >     <http://lists.xiph.org/mailman/listinfo/opus>
>>> >     >     <http://lists.xiph.org/mailman/listinfo/opus
>>> >     <http://lists.xiph.org/mailman/listinfo/opus>>
>>> >     >     >
>>> >     >     _______________________________________________
>>> >     >     opus mailing list
>>> >     >     opus at xiph.org <mailto:opus at xiph.org>
<mailto:opus at xiph.org
>>> >     <mailto:opus at xiph.org>>
>>> >     >     http://lists.xiph.org/mailman/listinfo/opus
>>> >     <http://lists.xiph.org/mailman/listinfo/opus>
>>> >     >     <http://lists.xiph.org/mailman/listinfo/opus
>>> >     <http://lists.xiph.org/mailman/listinfo/opus>>
>>> >     >
>>> >     >
>>> >
>>> >
>>>
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 34156 bytes
Desc: not available
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smaller_slower.c
Type: text/x-csrc
Size: 11374 bytes
Desc: not available
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0002.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smallest_slowest.c
Type: text/x-csrc
Size: 10850 bytes
Desc: not available
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0003.c>

Reasonably Related Threads

Search for more possibly parallel threads

opus - Apr 2017 - [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Reasonably Related Threads