John Ridges
2013-Jun-07  00:07 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM,
At line 221 in celt_lpc.c (the celt_iir function) I think you really 
want the RESTORE_STACK statement to be before the #endif instead of 
after it. Also, I couldn't help notice that your SSE code for 
xcorr_kernel reads more than "len" elements of "_x". I
don't know if
that's really a problem when running the codec, but a tool like valgrind 
will have a fit if it's accessing uninitialized memory. Here's a version
I wrote a few days ago you're welcome to use that doesn't suffer from 
that problem:
static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 
*y, opus_val32 sum[4], int len)
{
     int j;
     __m128 xsum1 = _mm_loadu_ps(sum);
     __m128 xsum2 = _mm_setzero_ps();
     for (j = 0; j < len-3; j += 4) {
         const __m128 x0 = _mm_loadu_ps(x+j);
         const __m128 y0 = _mm_loadu_ps(y+j);
         const __m128 y3 = _mm_loadu_ps(y+j+3);
         xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
         xsum2 = 
_mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49)));
         xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e)));
         xsum2 = 
_mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
     }
     if (j < len) {
         xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
         if (++j < len) {
             xsum2 = 
_mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
             if (++j < len) {
                 xsum1 = 
_mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
             }
         }
     }
     _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
}
Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry I 
don't have a floating-point version, but I only use fixed-point opus in 
ARM):
#include <arm_neon.h>
static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 
*y, opus_val32 sum[4], int len)
{
     int j;
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);
     for (j = 0; j < len-1; j += 2) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
         xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
     }
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
     }
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}
Cheers,
John Ridges
Jean-Marc Valin
2013-Jun-07  03:22 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi John, Thanks for the two fixes. They're in git now. Your SSE version seems to also be slightly faster than mine -- probably due the the partial sums. As for the NEON code, it would be good to compare the performance with the code Aur?lien Zanelli posted at http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch Cheers, Jean-Marc On 06/06/2013 08:07 PM, John Ridges wrote:> Hi JM, > > At line 221 in celt_lpc.c (the celt_iir function) I think you really > want the RESTORE_STACK statement to be before the #endif instead of > after it. Also, I couldn't help notice that your SSE code for > xcorr_kernel reads more than "len" elements of "_x". I don't know if > that's really a problem when running the codec, but a tool like valgrind > will have a fit if it's accessing uninitialized memory. Here's a version > I wrote a few days ago you're welcome to use that doesn't suffer from > that problem: > > static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 > *y, opus_val32 sum[4], int len) > { > int j; > __m128 xsum1 = _mm_loadu_ps(sum); > __m128 xsum2 = _mm_setzero_ps(); > > for (j = 0; j < len-3; j += 4) { > const __m128 x0 = _mm_loadu_ps(x+j); > const __m128 y0 = _mm_loadu_ps(y+j); > const __m128 y3 = _mm_loadu_ps(y+j+3); > > xsum1 = > _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0)); > xsum2 = > _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49))); > xsum1 = > _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e))); > xsum2 = > _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3)); > } > if (j < len) { > xsum1 = > _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); > if (++j < len) { > xsum2 = > _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); > if (++j < len) { > xsum1 = > _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); > } > } > } > _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2)); > } > > Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry I > don't have a floating-point version, but I only use fixed-point opus in > ARM): > > #include <arm_neon.h> > > static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 > *y, opus_val32 sum[4], int len) > { > int j; > int32x4_t xsum1 = vld1q_s32(sum); > int32x4_t xsum2 = vdupq_n_s32(0); > > for (j = 0; j < len-1; j += 2) { > xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++)); > xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++)); > } > if (j < len) { > xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y)); > } > vst1q_s32(sum,vaddq_s32(xsum1,xsum2)); > } > > > Cheers, > John Ridges > > > _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus >
John Ridges
2013-Jun-07  18:33 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM, I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned assembly is bound to be faster than using intrinsics. However I notice that his code can also read past the y buffer. Cheers, --John On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:> Hi John, > > Thanks for the two fixes. They're in git now. Your SSE version seems to > also be slightly faster than mine -- probably due the the partial sums. > As for the NEON code, it would be good to compare the performance with > the code Aur?lien Zanelli posted at > http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch > > Cheers, > > Jean-Marc > >
Maybe Matching Threads
- Bug fix in celt_lpc.c and some xcorr_kernel optimizations
- Bug fix in celt_lpc.c and some xcorr_kernel optimizations
- Bug fix in celt_lpc.c and some xcorr_kernel optimizations
- opus Digest, Vol 53, Issue 2
- [RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series