Felicia Lim
2017-Jan-31  17:30 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi, Attached is a patch with arm neon optimizations for silk_warped_autocorrelation_FIX(). Please review. Thanks, Felicia -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170131/9a912bb4/attachment-0001.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch Type: text/x-patch Size: 48530 bytes Desc: not available URL: <http://lists.xiph.org/pipermail/opus/attachments/20170131/9a912bb4/attachment-0001.bin>
Jean-Marc Valin
2017-Jan-31  20:55 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Felicia, Thanks for the patch. Can you give more details on what checks/tests you've done so far on this patch? Thanks, Jean-Marc On 31/01/17 12:30 PM, Felicia Lim wrote:> Hi, > > Attached is a patch with arm neon optimizations for > silk_warped_autocorrelation_FIX(). Please review. > > Thanks, > Felicia > > > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus >
James Zern
2017-Feb-01  02:15 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
On Tue, Jan 31, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:> Hi Felicia, > > Thanks for the patch. Can you give more details on what checks/tests > you've done so far on this patch? >I did builds for x86_64 and arm (USE_CELT_FIR on and off, -O0, -O2, gcc & clang) that checked for new warnings / c90 compatibility. The optimized arm gcc builds (with --enable-check-asm) were used to check bit exactness for complexity 0-10 at 8, 16, 48kHz using a few sample rtc clips. ASan/UBSan builds pass as well. I also made a pass through the NEON before this was forwarded, it looks in line with the C.> Thanks, > > Jean-Marc > > On 31/01/17 12:30 PM, Felicia Lim wrote: >> Hi, >> >> Attached is a patch with arm neon optimizations for >> silk_warped_autocorrelation_FIX(). Please review. >> >> Thanks, >> Felicia >> >> >> _______________________________________________ >> opus mailing list >> opus at xiph.org >> http://lists.xiph.org/mailman/listinfo/opus >> > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus
Jean-Marc Valin
2017-Feb-02  22:39 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Felicia, I've not yet really looked into the details, but first here's a few comments and questions: 1) Why does the patch need to define SKIP_CONFIG_H and CUSTOM_MODES at the beginning? I can't see a reason for that. 2) The whole code is inside an #ifdef FIXED_POINT. It seems like it the file shouldn't be compiled at all for float, so that #ifdef would be unnecessary. Or did I miss something? 3) The code as it is written is pretty hard to follow. Can you explain at a high level how you're vectorizing this code, i.e. which loop(s) gets unrolled and the general strategy. That should make it easier to follow. 4) In general I'm a bit skeptical that so much unrolled prolog/epilog code is needed. Did you try having less unrolling for the prolog/epilog? That would be nicer to the I-cache (if possible). Cheers, Jean-Marc On 31/01/17 12:30 PM, Felicia Lim wrote:> Hi, > > Attached is a patch with arm neon optimizations for > silk_warped_autocorrelation_FIX(). Please review. > > Thanks, > Felicia > > > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus >
Jean-Marc Valin
2017-Feb-05  00:17 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Felicia, I've had time to work through the math in the original function and I'm pretty sure it's possible to vectorize this without the huge prologue/epilogue. For the simple case where order = vector size = N (but it should easily generalize to larger order), what I came up with is: initialize X, Y, M, C to vector of zeros for i=0 to N+order T = [x(i), Y(0:N-2)] Y = M + coeff * (Y - T) M = T X = [x(i), X(0:N-1)] C = C + X*Y I think something similar to this (assuming I didn't mess up any details) should give you the correlations in vector C. Did I miss anything? Cheers, Jean-Marc On 31/01/17 12:30 PM, Felicia Lim wrote:> Hi, > > Attached is a patch with arm neon optimizations for > silk_warped_autocorrelation_FIX(). Please review. > > Thanks, > Felicia > > > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus >
Linfeng Zhang
2017-Feb-06  19:42 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Jean-Marc, 1) Why does the patch need to define SKIP_CONFIG_H and CUSTOM_MODES at the beginning? I can't see a reason for that. A: I just copied them from some CELT optimizations and didn't think too much. Will fix. Thanks! 2) The whole code is inside an #ifdef FIXED_POINT. It seems like it the file shouldn't be compiled at all for float, so that #ifdef would be unnecessary. Or did I miss something? A: Will fix. 3) The code as it is written is pretty hard to follow. Can you explain at a high level how you're vectorizing this code, i.e. which loop(s) gets unrolled and the general strategy. That should make it easier to follow. A: Please see next email. 4) In general I'm a bit skeptical that so much unrolled prolog/epilog code is needed. Did you try having less unrolling for the prolog/epilog? That would be nicer to the I-cache (if possible). A: Please see next email. Thanks, Linfeng Zhang On Thu, Feb 2, 2017 at 2:39 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:> Hi Felicia, > > I've not yet really looked into the details, but first here's a few > comments and questions: > > 1) Why does the patch need to define SKIP_CONFIG_H and CUSTOM_MODES at > the beginning? I can't see a reason for that. > > 2) The whole code is inside an #ifdef FIXED_POINT. It seems like it the > file shouldn't be compiled at all for float, so that #ifdef would be > unnecessary. Or did I miss something? > > 3) The code as it is written is pretty hard to follow. Can you explain > at a high level how you're vectorizing this code, i.e. which loop(s) > gets unrolled and the general strategy. That should make it easier to > follow. > > 4) In general I'm a bit skeptical that so much unrolled prolog/epilog > code is needed. Did you try having less unrolling for the prolog/epilog? > That would be nicer to the I-cache (if possible). > > Cheers, > > Jean-Marc > > On 31/01/17 12:30 PM, Felicia Lim wrote: > > Hi, > > > > Attached is a patch with arm neon optimizations for > > silk_warped_autocorrelation_FIX(). Please review. > > > > Thanks, > > Felicia > > > > > > _______________________________________________ > > opus mailing list > > opus at xiph.org > > http://lists.xiph.org/mailman/listinfo/opus > > > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170206/b447a5b7/attachment.html>
Linfeng Zhang
2017-Feb-06  19:51 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Jean-Marc,
Thanks a lot for reviewing this huge assembly function!
silk_warped_autocorrelation_FIX_c()'s kernel part is
    for( n = 0; n < length; n++ ) {
        tmp1_QS = silk_LSHIFT32( (opus_int32)input[ n ], QS );
        /* Loop over allpass sections */
        for( i = 0; i < order; i++ ) {
            /* Output of allpass section */
            tmp2_QS = silk_SMLAWB( state_QS[ i ], state_QS[ i + 1 ] -
tmp1_QS, warping_Q16 );
            state_QS[ i ]  = tmp1_QS;
            corr_QC[  i ] += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[
0 ] ), 2 * QS - QC );
            tmp1_QS = tmp2_QS;
        }
        state_QS[ order ] = tmp1_QS;
        corr_QC[  order ] += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[
0 ] ), 2 * QS - QC );
    }
in which corr_QC[0, 1, ..., order] is the only output.
Suppose order = 10, and each stage of the inner loop is noted by s0, s1,
..., s9. And suppose we simultaneously process 8 input in SIMD, from in0 to
in7. Let PROC(inx(sy)) denote processing input[x] at stage y.
If there is no dependency between inx(sy) and in(x+1)(sy), then we can do
this
FOR in=0 TO N WITH in+=8
  FOR y=0 TO order-1 WITH y++
   PROC(in0(sy) in1(sy) in2(sy) in3(sy) in4(sy) in5(sy) in6(sy) in7(sy))
  END FOR
END FOR
Definitely there is no any prolog and epilog needed.
However, the critical thing is that all the states in each stage when
processing input[i] are reused by the next input[i+1]. That is input[i+1]
must wait input[i] for 1 stage, and input[i+2] must wait input[i+1] for 1
stage, etc.
Then it becomes this
FOR in=0 to N WITH in+=8
  PROC(in0(s0)) /* prolog 0 */
  PROC(in0(s1) in1(s0)) /* prolog 1 */
  PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
  PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
  PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
  PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */
  PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* prolog 6
*/
  PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /*
fully process 8 inputs */
  PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /*
continue */
  PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /*
continue */
  PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3)) /*
continue */
  PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* epilog
0 */
  PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1 */
  PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
  PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
  PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
  PROC(in6(s10) in7(s9)) /* epilog 5 */
  PROC(in7(s10)) /* epilog 6 */
END FOR
And
  PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /*
fully process 8 inputs */
  PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /*
continue */
  PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /*
continue */
is actually the expansion of the kernel loop
FOR i=0 TO order-6 WITH i++
  PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
in6(si+1) in7(si+0))
END FOR
The worst thing is that corr_QC[] is so sensitive that any extra processing
will make them wrong and propagate to the next loop (next 8 inputs).
state_QS[] is a little better but still very sensitive. For instance, if
adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5)
in7(s4)) to the kernel loop (by looping one more time) and remove epilog 0,
then all final results will be wrong.
That's why the prolog and epilog cannot be saved to the best of my
knowledge.
The assembly size of silk_warped_autocorrelation_FIX_neon() is about 2,744
bytes. Compared with the C code size (about 452 bytes), it's 2.3 KB larger.
Considering silk_warped_autocorrelation_FIX_c() is the second place CPU
heavy function in fixed-point, and our testing shows up to 7% CPU run time
saving of the total encoder with this optimization (at Complexity 8), maybe
we can take the I-cache burden even if finally we still cannot remove the
big chunk of prolog and epilog.
Thanks,
Linfeng Zhang
On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Felicia,
>
> I've had time to work through the math in the original function and
I'm
> pretty sure it's possible to vectorize this without the huge
> prologue/epilogue.
>
> For the simple case where order = vector size = N (but it should easily
> generalize to larger order), what I came up with is:
>
> initialize X, Y, M, C to vector of zeros
>
> for i=0 to N+order
>    T = [x(i), Y(0:N-2)]
>    Y = M + coeff * (Y - T)
>    M = T
>    X = [x(i), X(0:N-1)]
>    C = C + X*Y
>
> I think something similar to this (assuming I didn't mess up any
> details) should give you the correlations in vector C. Did I miss anything?
>
> Cheers,
>
>         Jean-Marc
>
>
> On 31/01/17 12:30 PM, Felicia Lim wrote:
> > Hi,
> >
> > Attached is a patch with arm neon optimizations for
> > silk_warped_autocorrelation_FIX(). Please review.
> >
> > Thanks,
> > Felicia
> >
> >
> > _______________________________________________
> > opus mailing list
> > opus at xiph.org
> > http://lists.xiph.org/mailman/listinfo/opus
> >
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170206/64d409a6/attachment.html>
Apparently Analagous Threads
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON