thr3ads.net - opus - [opus] [PATCH] Optimize silk_warped_autocorrelation

If this information is useful, please help other people find it:
Share via:

Felicia Lim

2017-Jan-31 17:30 UTC

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi,

Attached is a patch with arm neon optimizations for
silk_warped_autocorrelation_FIX(). Please review.

Thanks,
Felicia
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170131/9a912bb4/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 48530 bytes
Desc: not available
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170131/9a912bb4/attachment-0001.bin>

Jean-Marc Valin

2017-Jan-31 20:55 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Felicia,

Thanks for the patch. Can you give more details on what checks/tests
you've done so far on this patch?

Thanks,

	Jean-Marc

On 31/01/17 12:30 PM, Felicia Lim wrote:> Hi, 
> 
> Attached is a patch with arm neon optimizations for
> silk_warped_autocorrelation_FIX(). Please review. 
> 
> Thanks,
> Felicia
> 
> 
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>

James Zern

2017-Feb-01 02:15 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

On Tue, Jan 31, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:> Hi Felicia,
>
> Thanks for the patch. Can you give more details on what checks/tests
> you've done so far on this patch?
>
I did builds for x86_64 and arm (USE_CELT_FIR on and off, -O0, -O2,
gcc & clang) that checked for new warnings / c90 compatibility. The
optimized arm gcc builds (with --enable-check-asm) were used to check
bit exactness for complexity 0-10 at 8, 16, 48kHz using a few sample
rtc clips. ASan/UBSan builds pass as well.
I also made a pass through the NEON before this was forwarded, it
looks in line with the C.
> Thanks,
>
>         Jean-Marc
>
> On 31/01/17 12:30 PM, Felicia Lim wrote:
>> Hi,
>>
>> Attached is a patch with arm neon optimizations for
>> silk_warped_autocorrelation_FIX(). Please review.
>>
>> Thanks,
>> Felicia
>>
>>
>> _______________________________________________
>> opus mailing list
>> opus at xiph.org
>> http://lists.xiph.org/mailman/listinfo/opus
>>
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus

Jean-Marc Valin

2017-Feb-02 22:39 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Felicia,

I've not yet really looked into the details, but first here's a few
comments and questions:

1) Why does the patch need to define SKIP_CONFIG_H and CUSTOM_MODES at
the beginning? I can't see a reason for that.

2) The whole code is inside an #ifdef FIXED_POINT. It seems like it the
file shouldn't be compiled at all for float, so that #ifdef would be
unnecessary. Or did I miss something?

3) The code as it is written is pretty hard to follow. Can you explain
at a high level how you're vectorizing this code, i.e. which loop(s)
gets unrolled and the general strategy. That should make it easier to
follow.

4) In general I'm a bit skeptical that so much unrolled prolog/epilog
code is needed. Did you try having less unrolling for the prolog/epilog?
That would be nicer to the I-cache (if possible).

Cheers,

	Jean-Marc

On 31/01/17 12:30 PM, Felicia Lim wrote:> Hi, 
> 
> Attached is a patch with arm neon optimizations for
> silk_warped_autocorrelation_FIX(). Please review. 
> 
> Thanks,
> Felicia
> 
> 
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>

Jean-Marc Valin

2017-Feb-05 00:17 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Felicia,

I've had time to work through the math in the original function and I'm
pretty sure it's possible to vectorize this without the huge
prologue/epilogue.

For the simple case where order = vector size = N (but it should easily
generalize to larger order), what I came up with is:

initialize X, Y, M, C to vector of zeros

for i=0 to N+order
   T = [x(i), Y(0:N-2)]
   Y = M + coeff * (Y - T)
   M = T
   X = [x(i), X(0:N-1)]
   C = C + X*Y

I think something similar to this (assuming I didn't mess up any
details) should give you the correlations in vector C. Did I miss anything?

Cheers,

	Jean-Marc

On 31/01/17 12:30 PM, Felicia Lim wrote:> Hi, 
> 
> Attached is a patch with arm neon optimizations for
> silk_warped_autocorrelation_FIX(). Please review. 
> 
> Thanks,
> Felicia
> 
> 
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>

Linfeng Zhang

2017-Feb-06 19:42 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Jean-Marc,

1) Why does the patch need to define SKIP_CONFIG_H and CUSTOM_MODES at the
beginning? I can't see a reason for that.

A: I just copied them from some CELT optimizations and didn't think too
much. Will fix. Thanks!

2) The whole code is inside an #ifdef FIXED_POINT. It seems like it
the file shouldn't be compiled at all for float, so that #ifdef would be
unnecessary. Or did I miss something?

A: Will fix.

3) The code as it is written is pretty hard to follow. Can you explain at a
high level how you're vectorizing this code, i.e. which loop(s) gets
unrolled and the general strategy. That should make it easier to follow.

A: Please see next email.

4) In general I'm a bit skeptical that so much unrolled prolog/epilog code
is needed. Did you try having less unrolling for the prolog/epilog? That
would be nicer to the I-cache (if possible).

A: Please see next email.

Thanks,
Linfeng Zhang

On Thu, Feb 2, 2017 at 2:39 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Felicia,
>
> I've not yet really looked into the details, but first here's a few
> comments and questions:
>
> 1) Why does the patch need to define SKIP_CONFIG_H and CUSTOM_MODES at
> the beginning? I can't see a reason for that.
>
> 2) The whole code is inside an #ifdef FIXED_POINT. It seems like it the
> file shouldn't be compiled at all for float, so that #ifdef would be
> unnecessary. Or did I miss something?
>
> 3) The code as it is written is pretty hard to follow. Can you explain
> at a high level how you're vectorizing this code, i.e. which loop(s)
> gets unrolled and the general strategy. That should make it easier to
> follow.
>
> 4) In general I'm a bit skeptical that so much unrolled prolog/epilog
> code is needed. Did you try having less unrolling for the prolog/epilog?
> That would be nicer to the I-cache (if possible).
>
> Cheers,
>
>         Jean-Marc
>
> On 31/01/17 12:30 PM, Felicia Lim wrote:
> > Hi,
> >
> > Attached is a patch with arm neon optimizations for
> > silk_warped_autocorrelation_FIX(). Please review.
> >
> > Thanks,
> > Felicia
> >
> >
> > _______________________________________________
> > opus mailing list
> > opus at xiph.org
> > http://lists.xiph.org/mailman/listinfo/opus
> >
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170206/b447a5b7/attachment.html>

Linfeng Zhang

2017-Feb-06 19:51 UTC

head link

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Hi Jean-Marc,

Thanks a lot for reviewing this huge assembly function!

silk_warped_autocorrelation_FIX_c()'s kernel part is
    for( n = 0; n < length; n++ ) {
        tmp1_QS = silk_LSHIFT32( (opus_int32)input[ n ], QS );
        /* Loop over allpass sections */
        for( i = 0; i < order; i++ ) {
            /* Output of allpass section */
            tmp2_QS = silk_SMLAWB( state_QS[ i ], state_QS[ i + 1 ] -
tmp1_QS, warping_Q16 );
            state_QS[ i ]  = tmp1_QS;
            corr_QC[  i ] += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[
0 ] ), 2 * QS - QC );
            tmp1_QS = tmp2_QS;
        }
        state_QS[ order ] = tmp1_QS;
        corr_QC[  order ] += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[
0 ] ), 2 * QS - QC );
    }
in which corr_QC[0, 1, ..., order] is the only output.

Suppose order = 10, and each stage of the inner loop is noted by s0, s1,
..., s9. And suppose we simultaneously process 8 input in SIMD, from in0 to
in7. Let PROC(inx(sy)) denote processing input[x] at stage y.

If there is no dependency between inx(sy) and in(x+1)(sy), then we can do
this

FOR in=0 TO N WITH in+=8
  FOR y=0 TO order-1 WITH y++
   PROC(in0(sy) in1(sy) in2(sy) in3(sy) in4(sy) in5(sy) in6(sy) in7(sy))
  END FOR
END FOR

Definitely there is no any prolog and epilog needed.

However, the critical thing is that all the states in each stage when
processing input[i] are reused by the next input[i+1]. That is input[i+1]
must wait input[i] for 1 stage, and input[i+2] must wait input[i+1] for 1
stage, etc.
Then it becomes this

FOR in=0 to N WITH in+=8
  PROC(in0(s0)) /* prolog 0 */
  PROC(in0(s1) in1(s0)) /* prolog 1 */
  PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
  PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
  PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
  PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */
  PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* prolog 6
*/
  PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /*
fully process 8 inputs */
  PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /*
continue */
  PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /*
continue */
  PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3)) /*
continue */
  PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* epilog
0 */
  PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1 */
  PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
  PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
  PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
  PROC(in6(s10) in7(s9)) /* epilog 5 */
  PROC(in7(s10)) /* epilog 6 */
END FOR

And
  PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /*
fully process 8 inputs */
  PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /*
continue */
  PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /*
continue */
is actually the expansion of the kernel loop
FOR i=0 TO order-6 WITH i++
  PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
in6(si+1) in7(si+0))
END FOR

The worst thing is that corr_QC[] is so sensitive that any extra processing
will make them wrong and propagate to the next loop (next 8 inputs).
state_QS[] is a little better but still very sensitive. For instance, if
adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5)
in7(s4)) to the kernel loop (by looping one more time) and remove epilog 0,
then all final results will be wrong.

That's why the prolog and epilog cannot be saved to the best of my
knowledge.

The assembly size of silk_warped_autocorrelation_FIX_neon() is about 2,744
bytes. Compared with the C code size (about 452 bytes), it's 2.3 KB larger.
Considering silk_warped_autocorrelation_FIX_c() is the second place CPU
heavy function in fixed-point, and our testing shows up to 7% CPU run time
saving of the total encoder with this optimization (at Complexity 8), maybe
we can take the I-cache burden even if finally we still cannot remove the
big chunk of prolog and epilog.

Thanks,
Linfeng Zhang

On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Felicia,
>
> I've had time to work through the math in the original function and
I'm
> pretty sure it's possible to vectorize this without the huge
> prologue/epilogue.
>
> For the simple case where order = vector size = N (but it should easily
> generalize to larger order), what I came up with is:
>
> initialize X, Y, M, C to vector of zeros
>
> for i=0 to N+order
>    T = [x(i), Y(0:N-2)]
>    Y = M + coeff * (Y - T)
>    M = T
>    X = [x(i), X(0:N-1)]
>    C = C + X*Y
>
> I think something similar to this (assuming I didn't mess up any
> details) should give you the correlations in vector C. Did I miss anything?
>
> Cheers,
>
>         Jean-Marc
>
>
> On 31/01/17 12:30 PM, Felicia Lim wrote:
> > Hi,
> >
> > Attached is a patch with arm neon optimizations for
> > silk_warped_autocorrelation_FIX(). Please review.
> >
> > Thanks,
> > Felicia
> >
> >
> > _______________________________________________
> > opus mailing list
> > opus at xiph.org
> > http://lists.xiph.org/mailman/listinfo/opus
> >
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170206/64d409a6/attachment.html>

Possibly Parallel Threads

Search for more maybe matching threads

opus - Feb 2017 - [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Possibly Parallel Threads