John Ridges
2013-Jun-07 00:07 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM,

At line 221 in celt_lpc.c (the celt_iir function), I think you really want the RESTORE_STACK statement to be before the #endif instead of after it.

Also, I couldn't help noticing that your SSE code for xcorr_kernel reads more than "len" elements of "_x". I don't know if that's really a problem when running the codec, but a tool like valgrind will have a fit if it's accessing uninitialized memory. Here's a version I wrote a few days ago, which you're welcome to use, that doesn't suffer from that problem:

static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
{
    int j;
    __m128 xsum1 = _mm_loadu_ps(sum);
    __m128 xsum2 = _mm_setzero_ps();

    for (j = 0; j < len-3; j += 4) {
        const __m128 x0 = _mm_loadu_ps(x+j);
        const __m128 y0 = _mm_loadu_ps(y+j);
        const __m128 y3 = _mm_loadu_ps(y+j+3);

        xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
        xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49)));
        xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e)));
        xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
    }
    if (j < len) {
        xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
        if (++j < len) {
            xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
            if (++j < len) {
                xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
            }
        }
    }
    _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
}

Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry, I don't have a floating-point version, but I only use fixed-point Opus on ARM):

#include <arm_neon.h>

static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
{
    int j;
    int32x4_t xsum1 = vld1q_s32(sum);
    int32x4_t xsum2 = vdupq_n_s32(0);

    for (j = 0; j < len-1; j += 2) {
        xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
        xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
    }
    if (j < len) {
        xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
    }
    vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}

Cheers,
John Ridges
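For context, the celt_iir fix John describes amounts to keeping RESTORE_STACK inside the conditional branch that actually does the pseudo-stack allocation. A minimal, self-contained sketch of the shape of the problem, not the verbatim Opus source; the dummy SAVE_STACK/RESTORE_STACK macros here merely stand in for Opus's pseudo-stack machinery, which is only meaningful in the branch that allocates:

/* Hypothetical illustration of the RESTORE_STACK placement.
   With RESTORE_STACK after the #endif, the SMALL_FOOTPRINT build
   would execute it without ever having done SAVE_STACK. */
#include <stdio.h>

#define SAVE_STACK    int saved_stack = 0
#define RESTORE_STACK (void)saved_stack

static void iir_sketch(void)
{
#ifdef SMALL_FOOTPRINT
    /* in-place code path: no pseudo-stack allocation here, so a
       RESTORE_STACK placed after the #endif would reference a
       variable this branch never declares */
#else
    SAVE_STACK;
    /* ... filter body using pseudo-stack temporaries ... */
    RESTORE_STACK;  /* the fix: before the #endif, not after it */
#endif
}

int main(void)
{
    iir_sketch();
    printf("builds in both configurations\n");
    return 0;
}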
Jean-Marc Valin
2013-Jun-07 03:22 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi John,

Thanks for the two fixes. They're in git now. Your SSE version seems to also be slightly faster than mine -- probably due to the partial sums. As for the NEON code, it would be good to compare the performance with the code Aurélien Zanelli posted at http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch

Cheers,

Jean-Marc

On 06/06/2013 08:07 PM, John Ridges wrote:
> [...]
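The "partial sums" point is worth unpacking: keeping two accumulators (xsum1/xsum2, combined only at the end) breaks the dependency chain on the adds, so consecutive additions don't each have to wait out the previous add's latency. A hypothetical scalar illustration of the same idea, with function names invented for the example:

/* Why two partial sums run faster: with one accumulator every add
   depends on the previous one; with two, the chains are independent
   and the adds can overlap in the pipeline. */
#include <stddef.h>

float dot_one_acc(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i]*b[i];          /* serialized on FP-add latency */
    return acc;
}

float dot_two_acc(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i]*b[i];         /* these two chains are independent, */
        acc1 += a[i+1]*b[i+1];     /* so their adds can execute in parallel */
    }
    if (i < n)
        acc0 += a[i]*b[i];         /* odd-length tail */
    return acc0 + acc1;            /* combine the partial sums once */
}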
John Ridges
2013-Jun-07 18:33 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM,

I have no doubt that Mr. Zanelli's NEON code is faster, since hand-tuned assembly is bound to be faster than using intrinsics. However, I notice that his code can also read past the end of the "y" buffer.

Cheers,
--John

On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
> [...]
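For what it's worth, one generic way to avoid that kind of over-read, sketched here under stated assumptions rather than as a patch to the posted assembly: stage the final partial vector through a small zero-padded local buffer so the last load stays in bounds. Zero padding is harmless for a correlation sum, since the extra lanes contribute nothing. All names below are invented for the illustration:

/* Hypothetical bounds-safe tail for a kernel that always loads
   VEC_WIDTH elements at a time. */
#include <string.h>

#define VEC_WIDTH 4

/* Stand-in for a SIMD body that unconditionally reads VEC_WIDTH
   elements starting at v. */
static long process_vector(const short *v)
{
    long s = 0;
    for (int i = 0; i < VEC_WIDTH; i++)
        s += v[i];
    return s;
}

static long tail_safe_sum(const short *y, int len)
{
    long total = 0;
    int j;
    for (j = 0; j + VEC_WIDTH <= len; j += VEC_WIDTH)
        total += process_vector(y + j);   /* full vectors: direct loads */
    if (j < len) {
        short pad[VEC_WIDTH] = {0};       /* zero padding for the remainder */
        memcpy(pad, y + j, (size_t)(len - j)*sizeof(*y));
        total += process_vector(pad);     /* tail load stays in bounds */
    }
    return total;
}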