Displaying 13 results from an estimated 13 matches for "state_q".
2016 Jul 01
1
silk_warped_autocorrelation_FIX() NEON optimization
Hi all,
I'm sending the patch "Optimize silk_warped_autocorrelation_FIX() for ARM NEON" in a separate email.
It is based on Tim’s aarch64v8 branch https://git.xiph.org/?p=users/tterribe/opus.git;a=shortlog;h=refs/heads/aarch64v8
Thanks for your comments.
Linfeng
2016 Jul 14
6
Several patches of ARM NEON optimization
I rebased my previous 3 patches to the current master with minor changes.
Patches 1 to 3 replace all my previously submitted patches.
Patches 4 and 5 are new.
Thanks,
Linfeng Zhang
2017 Jan 31
6
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi,
Attached is a patch with ARM NEON optimizations for
silk_warped_autocorrelation_FIX(). Please review.
Thanks,
Felicia
2017 Feb 06
2
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...lation_FIX_c()'s kernel part is
for( n = 0; n < length; n++ ) {
    tmp1_QS = silk_LSHIFT32( (opus_int32)input[ n ], QS );
    /* Loop over allpass sections */
    for( i = 0; i < order; i++ ) {
        /* Output of allpass section */
        tmp2_QS = silk_SMLAWB( state_QS[ i ], state_QS[ i + 1 ] - tmp1_QS, warping_Q16 );
        state_QS[ i ] = tmp1_QS;
        corr_QC[ i ] += silk_RSHIFT64( silk_SMULL( tmp1_QS, state_QS[ 0 ] ), 2 * QS - QC );
        tmp1_QS = tmp2_QS;
    }
    state_QS[ order ] = tmp1_QS;
    corr_QC[ order ] += silk_R...
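For readers who want to experiment with the scalar kernel quoted above outside the Opus tree, here is a minimal standalone sketch. It is not the Opus source: the SILK macros are replaced by plain C stand-ins, the values QS = 14 and QC = 10 and the bound MAX_ORDER are assumptions, the function name warped_autocorr_sketch is hypothetical, and the final corr_QC[ order ] update (truncated in the snippet) is assumed to mirror the in-loop update. It also relies on right shifts of negative values being arithmetic, as on common compilers.

```c
#include <assert.h>
#include <stdint.h>

#define QS 14        /* assumed Q-format of state_QS[] */
#define QC 10        /* assumed Q-format of corr_QC[] */
#define MAX_ORDER 24 /* assumed upper bound on the filter order */

/* Stand-in for silk_SMLAWB(a, b, c): a + ((b * (int16_t)c) >> 16). */
static int32_t smlawb(int32_t a, int32_t b, int32_t c)
{
    return (int32_t)(a + (((int64_t)b * (int16_t)c) >> 16));
}

/* Scalar warped autocorrelation kernel; corr_QC must hold order + 1 entries. */
static void warped_autocorr_sketch(int64_t *corr_QC, const int16_t *input,
                                   int32_t warping_Q16, int length, int order)
{
    int32_t state_QS[MAX_ORDER + 1] = { 0 };
    int32_t tmp1_QS, tmp2_QS;
    int n, i;

    assert(order >= 0 && order <= MAX_ORDER);
    for (i = 0; i <= order; i++) {
        corr_QC[i] = 0;
    }

    for (n = 0; n < length; n++) {
        /* Multiply instead of shifting so negative samples stay well defined. */
        tmp1_QS = (int32_t)input[n] * (1 << QS);
        /* Loop over allpass sections */
        for (i = 0; i < order; i++) {
            /* Output of allpass section */
            tmp2_QS = smlawb(state_QS[i], state_QS[i + 1] - tmp1_QS, warping_Q16);
            state_QS[i] = tmp1_QS;
            corr_QC[i] += ((int64_t)tmp1_QS * state_QS[0]) >> (2 * QS - QC);
            tmp1_QS = tmp2_QS;
        }
        state_QS[order] = tmp1_QS;
        /* Assumption: the last tap accumulates the same way as the others. */
        corr_QC[order] += ((int64_t)tmp1_QS * state_QS[0]) >> (2 * QS - QC);
    }
}
```

With warping_Q16 set to 0 the allpass cascade degenerates into a plain delay line, so the sketch returns the ordinary autocorrelation scaled by 2^QC. That gives a simple sanity check before trying nonzero warping. Note that every section reads state_QS[ i + 1 ] from the previous sample and writes state_QS[ i ] for the next one, which is the dependency chain that makes the NEON version sensitive to any extra processing.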
2017 Feb 07
2
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
> > in6(si+1) in7(si+0))
> > END FOR
> >
> > The worst thing is that corr_QC[] is so sensitive that any extra
> > processing will make them wrong and propagate to the next loop (next 8
> > inputs). state_QS[] is a little better but still very sensitive. For
> > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7)
> > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time)
> > and remove epilog 0, then all final results will be wrong.
> >
>...
2017 Feb 07
3
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...(si+2)
> > > in6(si+1) in7(si+0))
> > > END FOR
> > >
> > > The worst thing is that corr_QC[] is so sensitive that any extra
> > > processing will make them wrong and propagate to the next loop
> (next 8
> > > inputs). state_QS[] is a little better but still very sensitive.
> For
> > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7)
> > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more
> time)
> > > and remove epilog 0, then all final r...
2017 Feb 06
0
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...=0 TO order-6 WITH i++
> PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
> in6(si+1) in7(si+0))
> END FOR
>
> The worst thing is that corr_QC[] is so sensitive that any extra
> processing will make them wrong and propagate to the next loop (next 8
> inputs). state_QS[] is a little better but still very sensitive. For
> instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7)
> in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time)
> and remove epilog 0, then all final results will be wrong.
>
> That's why the...
2017 Apr 05
2
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...>>> > > END FOR
>>> > >
>>> > > The worst thing is that corr_QC[] is so sensitive that any extra
>>> > > processing will make them wrong and propagate to the next loop
>>> (next 8
>>> > > inputs). state_QS[] is a little better but still very sensitive.
>>> For
>>> > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8)
>>> in4(s7)
>>> > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more
>>> time)
>>>...
2017 Feb 07
0
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...) in2(si+5) in3(si+4) in4(si+3) in5(si+2)
> > in6(si+1) in7(si+0))
> > END FOR
> >
> > The worst thing is that corr_QC[] is so sensitive that any extra
> > processing will make them wrong and propagate to the next loop (next 8
> > inputs). state_QS[] is a little better but still very sensitive. For
> > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8) in4(s7)
> > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more time)
> > and remove epilog 0, then all final results will be wrong.
>...
2017 Apr 03
0
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...> in6(si+1) in7(si+0))
>> > > END FOR
>> > >
>> > > The worst thing is that corr_QC[] is so sensitive that any extra
>> > > processing will make them wrong and propagate to the next loop
>> (next 8
>> > > inputs). state_QS[] is a little better but still very sensitive.
>> For
>> > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8)
>> in4(s7)
>> > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more
>> time)
>> > > and remo...
2017 Apr 05
4
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...>
> > > > The worst thing is that corr_QC[] is so sensitive
> > that any extra
> > > > processing will make them wrong and propagate to the
> > next loop (next 8
> > > > inputs). state_QS[] is a little better but still
> > very sensitive. For
> > > > instance, if adding PROC(in0(s11') in1(s10) in2(s9)
> > in3(s8) in4(s7)
> > > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by
> >...
2017 Apr 05
0
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...> > >
> > > The worst thing is that corr_QC[] is so sensitive
> that any extra
> > > processing will make them wrong and propagate to the
> next loop (next 8
> > > inputs). state_QS[] is a little better but still
> very sensitive. For
> > > instance, if adding PROC(in0(s11') in1(s10) in2(s9)
> in3(s8) in4(s7)
> > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by
> looping one mo...
2017 Apr 06
0
[PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
...sing up to 24 "taps" at the same time. If that's causing a
slowdown, then it should be possible to do the processing in "sections".
By that, I mean that you can implement (e.g.) an order-8 "kernel" that
computes the correlations and also outputs the last element of
state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
compute a new section.
4) It's a minor detail, but the last element of corr_QC[] that's not
currently vectorized could simply be vectorized independently outside
the loop (and it's the same for all orders).
Cheers,
Jean-Marc...
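One possible reading of the "sections" idea and of point 4 above, written as scalar C rather than NEON, is sketched below. It is an interpretation of a truncated message, not the code that was proposed or merged. All names (warped_ref, warped_section, warped_sectioned) and the constants QS = 14, QC = 10, SECTION = 8, MAX_ORDER and MAX_LEN are assumptions. Each call to warped_section runs up to 8 allpass stages over the whole input, overwrites the stream with its output so the next section can consume it, and correlates against the original shifted input. The last correlation is then computed in a separate loop.

```c
#include <assert.h>
#include <stdint.h>

#define QS 14         /* assumed Q-format of the state */
#define QC 10         /* assumed Q-format of the correlations */
#define SECTION 8     /* assumed number of allpass stages per kernel call */
#define MAX_ORDER 24
#define MAX_LEN 512

static int32_t smlawb(int32_t a, int32_t b, int32_t c)
{
    return (int32_t)(a + (((int64_t)b * (int16_t)c) >> 16));
}

/* Monolithic reference: all stages processed per input sample. */
static void warped_ref(int64_t *corr, const int16_t *input, int32_t w,
                       int length, int order)
{
    int32_t state[MAX_ORDER + 1] = { 0 };
    int n, i;
    for (i = 0; i <= order; i++) corr[i] = 0;
    for (n = 0; n < length; n++) {
        int32_t tmp1 = (int32_t)input[n] * (1 << QS), tmp2;
        for (i = 0; i < order; i++) {
            tmp2 = smlawb(state[i], state[i + 1] - tmp1, w);
            state[i] = tmp1;
            corr[i] += ((int64_t)tmp1 * state[0]) >> (2 * QS - QC);
            tmp1 = tmp2;
        }
        state[order] = tmp1;
        corr[order] += ((int64_t)tmp1 * state[0]) >> (2 * QS - QC);
    }
}

/* One section: n_sec stages over the whole stream, output written in place. */
static void warped_section(int64_t *corr, int32_t *stream, const int32_t *x_QS,
                           int32_t w, int length, int n_sec)
{
    int32_t state[SECTION + 1] = { 0 }; /* state[n_sec] = previous output */
    int n, i;
    for (n = 0; n < length; n++) {
        int32_t tmp1 = stream[n], tmp2;
        for (i = 0; i < n_sec; i++) {
            tmp2 = smlawb(state[i], state[i + 1] - tmp1, w);
            state[i] = tmp1;
            corr[i] += ((int64_t)tmp1 * x_QS[n]) >> (2 * QS - QC);
            tmp1 = tmp2;
        }
        state[n_sec] = tmp1;
        stream[n] = tmp1; /* becomes the input of the next section */
    }
}

static void warped_sectioned(int64_t *corr, const int16_t *input, int32_t w,
                             int length, int order)
{
    int32_t x_QS[MAX_LEN], stream[MAX_LEN];
    int n, i, start;
    assert(length <= MAX_LEN && order <= MAX_ORDER);
    for (n = 0; n < length; n++) x_QS[n] = stream[n] = (int32_t)input[n] * (1 << QS);
    for (i = 0; i <= order; i++) corr[i] = 0;
    for (start = 0; start < order; start += SECTION) {
        int n_sec = order - start < SECTION ? order - start : SECTION;
        warped_section(corr + start, stream, x_QS, w, length, n_sec);
    }
    /* Last correlation handled on its own, outside the section kernel. */
    for (n = 0; n < length; n++) {
        corr[order] += ((int64_t)stream[n] * x_QS[n]) >> (2 * QS - QC);
    }
}
```

Because each accumulator sees the same integer operations in the same order, the sectioned version should match the monolithic loop bit for bit in this scalar form. Whether a NEON kernel structured this way is actually faster is a separate question that the sketch does not answer.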