thr3ads.net - opus - [opus] Bug fix in celt_lpc.c and some xcorr

If this information is useful, please help other people find it:
Share via:

John Ridges

2013-Jun-07 00:07 UTC

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Hi JM,

At line 221 in celt_lpc.c (the celt_iir function) I think you really 
want the RESTORE_STACK statement to be before the #endif instead of 
after it. Also, I couldn't help notice that your SSE code for 
xcorr_kernel reads more than "len" elements of "_x". I
don't know if
that's really a problem when running the codec, but a tool like valgrind 
will have a fit if it's accessing uninitialized memory. Here's a version
I wrote a few days ago you're welcome to use that doesn't suffer from 
that problem:

static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 
*y, opus_val32 sum[4], int len)
{
     int j;
     __m128 xsum1 = _mm_loadu_ps(sum);
     __m128 xsum2 = _mm_setzero_ps();

     for (j = 0; j < len-3; j += 4) {
         const __m128 x0 = _mm_loadu_ps(x+j);
         const __m128 y0 = _mm_loadu_ps(y+j);
         const __m128 y3 = _mm_loadu_ps(y+j+3);

         xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
         xsum2 = 
_mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49)));
         xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e)));
         xsum2 = 
_mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
     }
     if (j < len) {
         xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
         if (++j < len) {
             xsum2 = 
_mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
             if (++j < len) {
                 xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
             }
         }
     }
     _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
}

Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry I 
don't have a floating-point version, but I only use fixed-point opus in 
ARM):

#include <arm_neon.h>

static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 
*y, opus_val32 sum[4], int len)
{
     int j;
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     for (j = 0; j < len-1; j += 2) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
         xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
     }
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
     }
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}


Cheers,
John Ridges

Jean-Marc Valin

2013-Jun-07 03:22 UTC

head link

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Hi John,

Thanks for the two fixes. They're in git now. Your SSE version seems to
also be slightly faster than mine -- probably due the the partial sums.
As for the NEON code, it would be good to compare the performance with
the code Aur?lien Zanelli posted at
http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch

Cheers,

	Jean-Marc

On 06/06/2013 08:07 PM, John Ridges wrote:> Hi JM,
> 
> At line 221 in celt_lpc.c (the celt_iir function) I think you really 
> want the RESTORE_STACK statement to be before the #endif instead of 
> after it. Also, I couldn't help notice that your SSE code for 
> xcorr_kernel reads more than "len" elements of "_x". I
don't know if
> that's really a problem when running the codec, but a tool like
valgrind
> will have a fit if it's accessing uninitialized memory. Here's a
version
> I wrote a few days ago you're welcome to use that doesn't suffer
from
> that problem:
> 
> static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 
> *y, opus_val32 sum[4], int len)
> {
>      int j;
>      __m128 xsum1 = _mm_loadu_ps(sum);
>      __m128 xsum2 = _mm_setzero_ps();
> 
>      for (j = 0; j < len-3; j += 4) {
>          const __m128 x0 = _mm_loadu_ps(x+j);
>          const __m128 y0 = _mm_loadu_ps(y+j);
>          const __m128 y3 = _mm_loadu_ps(y+j+3);
> 
>          xsum1 = 
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
>          xsum2 = 
>
_mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49)));
>          xsum1 = 
>
_mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e)));
>          xsum2 = 
> _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
>      }
>      if (j < len) {
>          xsum1 = 
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
>          if (++j < len) {
>              xsum2 = 
> _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
>              if (++j < len) {
>                  xsum1 = 
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
>              }
>          }
>      }
>      _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
> }
> 
> Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry
I
> don't have a floating-point version, but I only use fixed-point opus in
> ARM):
> 
> #include <arm_neon.h>
> 
> static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 
> *y, opus_val32 sum[4], int len)
> {
>      int j;
>      int32x4_t xsum1 = vld1q_s32(sum);
>      int32x4_t xsum2 = vdupq_n_s32(0);
> 
>      for (j = 0; j < len-1; j += 2) {
>          xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
>          xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
>      }
>      if (j < len) {
>          xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
>      }
>      vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
> }
> 
> 
> Cheers,
> John Ridges
> 
> 
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>

John Ridges

2013-Jun-07 18:33 UTC

head link

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Hi JM,

I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned 
assembly is bound to be faster than using intrinsics. However I notice 
that his code can also read past the y buffer.

Cheers,
--John

On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:> Hi John,
>
> Thanks for the two fixes. They're in git now. Your SSE version seems to
> also be slightly faster than mine -- probably due the the partial sums.
> As for the NEON code, it would be good to compare the performance with
> the code Aur?lien Zanelli posted at
>
http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>
> Cheers,
>
> 	Jean-Marc
>
>

Reasonably Related Threads

Search for more possibly parallel threads

opus - Jun 2013 - Bug fix in celt_lpc.c and some xcorr_kernel optimizations

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Reasonably Related Threads