Linfeng Zhang
2017-Feb-06 19:51 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Jean-Marc,

Thanks a lot for reviewing this huge assembly function!

The kernel part of silk_warped_autocorrelation_FIX_c() is

    for( n = 0; n < length; n++ ) {
        tmp1_QS = silk_LSHIFT32( (opus_int32)input[ n ], QS );
        /* Loop over allpass sections */
        for( i = 0; i < order; i++ ) {
            /* Output of allpass section */
            tmp2_QS = silk_SMLAWB( state_QS[ i ], state_QS[ i + 1 ] - tmp1_QS, warping_Q16 );
            state_QS[ i ]  = tmp1_QS;
            corr_QC[ i ]  += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[ 0 ] ), 2 * QS - QC );
            tmp1_QS = tmp2_QS;
        }
        state_QS[ order ]  = tmp1_QS;
        corr_QC[ order ]  += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[ 0 ] ), 2 * QS - QC );
    }

in which corr_QC[ 0 ], corr_QC[ 1 ], ..., corr_QC[ order ] are the only outputs.

Suppose order = 10, and denote the stages of the inner loop by s0, s1, ..., s9 (with s10 for the final state_QS[ order ] / corr_QC[ order ] update). Suppose we process 8 inputs simultaneously in SIMD, in0 through in7, and let PROC(inx(sy)) denote processing input[x] at stage y. If there were no dependency between inx(sy) and in(x+1)(sy), we could simply do

    FOR in=0 TO N WITH in+=8
        FOR y=0 TO order-1 WITH y++
            PROC(in0(sy) in1(sy) in2(sy) in3(sy) in4(sy) in5(sy) in6(sy) in7(sy))
        END FOR
    END FOR

and no prolog or epilog would be needed at all.

However, the critical thing is that all the states of each stage used when processing input[i] are reused by the next input[i+1]. That is, input[i+1] must trail input[i] by 1 stage, input[i+2] must trail input[i+1] by 1 stage, and so on. The loop then becomes

    FOR in=0 TO N WITH in+=8
        PROC(in0(s0))                                                           /* prolog 0 */
        PROC(in0(s1) in1(s0))                                                   /* prolog 1 */
        PROC(in0(s2) in1(s1) in2(s0))                                           /* prolog 2 */
        PROC(in0(s3) in1(s2) in2(s1) in3(s0))                                   /* prolog 3 */
        PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))                           /* prolog 4 */
        PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))                   /* prolog 5 */
        PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))           /* prolog 6 */
        PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))   /* fully process 8 inputs */
        PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1))   /* continue */
        PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2))   /* continue */
        PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3))  /* continue */
        PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4))          /* epilog 0 */
        PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5))                  /* epilog 1 */
        PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6))                          /* epilog 2 */
        PROC(in4(s10) in5(s9) in6(s8) in7(s7))                                  /* epilog 3 */
        PROC(in5(s10) in6(s9) in7(s8))                                          /* epilog 4 */
        PROC(in6(s10) in7(s9))                                                  /* epilog 5 */
        PROC(in7(s10))                                                          /* epilog 6 */
    END FOR

The fully-populated steps

    PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))   /* fully process 8 inputs */
    PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1))   /* continue */
    PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2))   /* continue */

are the expansion of the kernel loop

    FOR i=0 TO order-6 WITH i++
        PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) in6(si+1) in7(si+0))
    END FOR

The worst thing is that corr_QC[] is so sensitive that any extra processing makes it wrong, and the error propagates to the next loop iteration (the next 8 inputs). state_QS[] is a little better but still very sensitive. For instance, if we add PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time) and remove epilog 0, then all final results will be wrong.

That's why the prolog and epilog cannot be avoided, to the best of my knowledge.
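For reference, the kernel above can also be written as a pure recurrence on per-input, per-stage values, which makes the dependency explicit. The following scalar rewrite is only an illustration (it is not the NEON code); it assumes the same variables and Q formats as silk_warped_autocorrelation_FIX_c() and a scratch array with the same MAX_SHAPE_LPC_ORDER + 1 bound the function already uses for state_QS[], and it produces identical corr_QC[] and state_QS[] because every state_QS[] value read by the allpass recursion belongs to the previous input, so the state writes can be deferred:

    /* Illustrative scalar rewrite only -- not the NEON patch. */
    opus_int32 T[ MAX_SHAPE_LPC_ORDER + 1 ];
    for( n = 0; n < length; n++ ) {
        /* T[ i ] is the current input's value after i allpass stages. */
        T[ 0 ] = silk_LSHIFT32( (opus_int32)input[ n ], QS );
        for( i = 0; i < order; i++ ) {
            /* state_QS[ i ] and state_QS[ i + 1 ] still hold the previous input's values. */
            T[ i + 1 ] = silk_SMLAWB( state_QS[ i ], state_QS[ i + 1 ] - T[ i ], warping_Q16 );
        }
        for( i = 0; i <= order; i++ ) {
            state_QS[ i ]  = T[ i ];
            corr_QC[ i ]  += silk_RSHIFT64( silk_SMULL( T[ i ], T[ 0 ] ), 2 * QS - QC );
        }
    }

Computing the current input's T[ i + 1 ] needs the previous input's T[ i + 1 ] (still sitting in state_QS[ i + 1 ]), which is exactly the 1-stage lag between consecutive inputs described above.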
The assembly size of silk_warped_autocorrelation_FIX_neon() is about 2,744 bytes. Compared with the C code size (about 452 bytes), it's about 2.3 KB larger. Considering that silk_warped_autocorrelation_FIX_c() is the second most CPU-heavy function in the fixed-point encoder, and our testing shows up to 7% CPU run-time saving for the total encoder with this optimization (at Complexity 8), maybe we can accept the I-cache burden even if we ultimately cannot remove the big chunk of prolog and epilog.

Thanks,
Linfeng Zhang

On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> Hi Felicia,
>
> I've had time to work through the math in the original function and I'm
> pretty sure it's possible to vectorize this without the huge
> prologue/epilogue.
>
> For the simple case where order = vector size = N (but it should easily
> generalize to larger order), what I came up with is:
>
> initialize X, Y, M, C to vector of zeros
>
> for i=0 to N+order
>     T = [x(i), Y(0:N-2)]
>     Y = M + coeff * (Y - T)
>     M = T
>     X = [x(i), X(0:N-1)]
>     C = C + X*Y
>
> I think something similar to this (assuming I didn't mess up any
> details) should give you the correlations in vector C. Did I miss anything?
>
> Cheers,
>
> Jean-Marc
>
> On 31/01/17 12:30 PM, Felicia Lim wrote:
> > Hi,
> >
> > Attached is a patch with arm neon optimizations for
> > silk_warped_autocorrelation_FIX(). Please review.
> >
> > Thanks,
> > Felicia
Jean-Marc Valin
2017-Feb-06 20:40 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Linfeng,

On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> However, the critical thing is that all the states of each stage used
> when processing input[i] are reused by the next input[i+1]. That is,
> input[i+1] must trail input[i] by 1 stage, input[i+2] must trail
> input[i+1] by 1 stage, and so on.

That is indeed the tricky part... and the one I think you could do slightly differently. If you approach the problem in terms of computing the input in chunks of N samples at a time, then indeed the approach you are describing is the only solution. What I was proposing, though, is to instead chop the "order" into chunks of N. Using your notation, you would be doing:

    PROC(                                                        in0(s0))
    PROC(                                                in0(s1) in1(s0))
    PROC(                                        in0(s2) in1(s1) in2(s0))
    PROC(                                in0(s3) in1(s2) in2(s1) in3(s0))
    PROC(                        in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))
    PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
    PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
    PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))
    PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0))
    PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0))
    PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1) in10(s0))
    PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2) in10(s1) in11(s0))
    ...and so on until the end of the input vector

The difference is that it's now the input vector that "slides" while the "state" values sy remain in the same place. There's still a prologue, but you can easily get rid of it by (implicitly) zero-padding the input vector during the initialization phase (start with a zero vector and read in one value at a time). Getting rid of the epilogue is a little trickier, but I think it can be done.

Cheers,

Jean-Marc
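For concreteness, a plain floating-point transcription of the pseudocode quoted in the previous message might look like the sketch below. It is only an illustration, not the patch and not verified against it: it assumes N = order, keeps the zero initial state of silk_warped_autocorrelation_FIX(), drops the Q-format arithmetic, and accumulates the lag-0 term separately, since the vector C appears to yield only lags 1..order. The function name warped_autocorr_sliding and the MAX_ORDER constant are made up for the example.

    #define MAX_ORDER 24  /* largest shaping LPC order mentioned in this thread */

    /* Hypothetical sketch of the "sliding input, fixed stages" scheme: lane j of Y
     * tracks allpass section j, applied to the input that entered the chain j samples
     * earlier, while X slides over the input.  The zero initial state plays the role
     * of the prologue and zero padding at the end flushes the epilogue. */
    static void warped_autocorr_sliding( double *corr, const double *x,
                                         double coeff, int length, int order )
    {
        double X[ MAX_ORDER ] = { 0 }, Y[ MAX_ORDER ] = { 0 }, M[ MAX_ORDER ] = { 0 };
        double C[ MAX_ORDER ] = { 0 }, T[ MAX_ORDER ];
        double c0 = 0;
        int i, j;

        for( i = 0; i < length + order; i++ ) {
            const double xi = i < length ? x[ i ] : 0;   /* implicit zero padding */
            c0 += xi * xi;                               /* lag 0 handled outside the vector part */
            T[ 0 ] = xi;                                 /* T = [x(i), Y(0:N-2)] */
            for( j = order - 1; j > 0; j-- ) {
                T[ j ] = Y[ j - 1 ];
                X[ j ] = X[ j - 1 ];                     /* X = [x(i), X(0:N-2)] */
            }
            X[ 0 ] = xi;
            for( j = 0; j < order; j++ ) {
                Y[ j ] = M[ j ] + coeff * ( Y[ j ] - T[ j ] );  /* Y = M + coeff * (Y - T) */
                M[ j ] = T[ j ];                                /* M = T */
                C[ j ] += X[ j ] * Y[ j ];                      /* C = C + X*Y */
            }
        }
        corr[ 0 ] = c0;
        for( j = 0; j < order; j++ ) {
            corr[ j + 1 ] = C[ j ];   /* lane j accumulates the lag j+1 correlation */
        }
    }

In this form there is no separate prologue code because the zero-initialized Y and M vectors stand in for the missing history, and the extra iterations over the zero-padded tail of the input take the place of the epilogue.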
Linfeng Zhang
2017-Feb-07 00:18 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
This is a great idea. But the order (psEncC->shapingLPCOrder) can be configured to 12, 14, 16, 20 and 24 according to the complexity parameter. It's hard to write a universal function that handles all these orders efficiently. Any suggestions?

Thanks,
Linfeng