John Ridges
2013-Jun-07 18:33 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM,

I have no doubt that Mr. Zanelli's NEON code is faster, since hand-tuned
assembly is bound to be faster than using intrinsics. However, I notice
that his code can also read past the y buffer.

Cheers,
--John

On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
> Hi John,
>
> Thanks for the two fixes. They're in git now. Your SSE version seems to
> also be slightly faster than mine -- probably due to the partial sums.
> As for the NEON code, it would be good to compare the performance with
> the code Aurélien Zanelli posted at
> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>
> Cheers,
>
> Jean-Marc
Jean-Marc Valin
2013-Jun-07 18:51 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
On 06/07/2013 02:33 PM, John Ridges wrote:
> I have no doubt that Mr. Zanelli's NEON code is faster, since hand-tuned
> assembly is bound to be faster than using intrinsics.

I was mostly curious about comparing the vectorization approaches
(assuming the two are different) rather than the exact code.

> However I notice
> that his code can also read past the y buffer.

Yeah, we'd need to either fix this or make sure that we add some padding
to the buffers. In practice it's unlikely to even trigger valgrind (it's
on the stack and the uninitialized data ends up being discarded), but
it's definitely not clean and could come back and bite us later.

Cheers,

Jean-Marc

> Cheers,
> --John
>
>
> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>> Hi John,
>>
>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>> also be slightly faster than mine -- probably due to the partial sums.
>> As for the NEON code, it would be good to compare the performance with
>> the code Aurélien Zanelli posted at
>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>
>> Cheers,
>>
>> Jean-Marc
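
(A minimal sketch of the padding idea mentioned above, done as a copy into
a padded scratch buffer rather than by padding the original allocations,
which is the option Jean-Marc actually suggests. The helper name
xcorr4_safe, the pad size XCORR_Y_PAD, the C99 variable-length array and
the assumption that the vectorized kernel is named xcorr_kernel_neon are
all illustrative, not the actual Opus fix. The caller copies the len+3
samples the kernel needs into a slightly larger, zero-initialized buffer,
so any vectorized load that runs past y[len+2] still reads initialized
memory that the caller owns.)

#include <string.h>

/* Hypothetical helper, not the actual Opus fix. The pad is sized for the
   worst-case over-read of a 4-lane NEON load. */
#define XCORR_Y_PAD 4

static void xcorr4_safe(const opus_val16 *x, const opus_val16 *y,
                        opus_val32 sum[4], int len)
{
   /* The kernel itself only needs y[0..len+2]; the extra zeroed samples
      just keep any over-read inside this function's own allocation. */
   opus_val16 y_pad[len + 3 + XCORR_Y_PAD];   /* C99 VLA, for the sketch */
   memcpy(y_pad, y, (len + 3)*sizeof(*y_pad));
   memset(y_pad + len + 3, 0, XCORR_Y_PAD*sizeof(*y_pad));
   xcorr_kernel_neon(x, y_pad, sum, len);
}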
John Ridges
2013-Jun-07 22:50 UTC
[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Unfortunately I don't have a setup that lets me easily profile ARM code,
so I really can't tell which method is faster (though I suspect Mr.
Zanelli's code is). Let me offer up another intrinsic version of the NEON
xcorr_kernel that is almost identical to the SSE version, and more in
line with Mr. Zanelli's code:

static inline void xcorr_kernel_neon(const opus_val16 *x, const opus_val16 *y,
                                     opus_val32 sum[4], int len)
{
   int j;
   int32x4_t xsum1 = vld1q_s32(sum);
   int32x4_t xsum2 = vdupq_n_s32(0);

   for (j = 0; j < len-3; j += 4) {
      int16x4_t x0 = vld1_s16(x+j);
      int16x4_t y0 = vld1_s16(y+j);
      int16x4_t y3 = vld1_s16(y+j+3);
      int16x4_t y4 = vext_s16(y3,y3,1);
      xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,0),y0);
      xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,1),vext_s16(y0,y4,1));
      xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,2),vext_s16(y0,y4,2));
      xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,3),y3);
   }
   if (j < len) {
      xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
      if (++j < len) {
         xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         if (++j < len) {
            xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         }
      }
   }
   vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}

Whether or not this version is faster than the first version I submitted
probably depends a lot on how fast unaligned vector memory accesses are
on an ARM processor. Of course hand-coded assembly would be even faster
than using intrinsics (for instance the "vdup_lane_s16"s wouldn't be
needed), but in this case it could be that the multiply-add stalls swamp
most of the inefficiencies in the intrinsic code. It would be cool if
someone out there has a setup that would let us definitively find out
which is fastest and by how much. If the hit from using intrinsics isn't
too bad, I would prefer them since they are compatible with, I think,
nearly all ARM compilers (and in truth I also prefer using intrinsics for
NEON code because I'm just not that familiar with ARM assembly).

Cheers,
--John

On 6/7/2013 12:51 PM, Jean-Marc Valin wrote:
> On 06/07/2013 02:33 PM, John Ridges wrote:
>> I have no doubt that Mr. Zanelli's NEON code is faster, since hand-tuned
>> assembly is bound to be faster than using intrinsics.
> I was mostly curious about comparing the vectorization approaches
> (assuming the two are different) rather than the exact code.
>
>> However I notice
>> that his code can also read past the y buffer.
> Yeah, we'd need to either fix this or make sure that we add some padding
> to the buffers. In practice it's unlikely to even trigger valgrind (it's
> on the stack and the uninitialized data ends up being discarded), but
> it's definitely not clean and could come back and bite us later.
>
> Cheers,
>
> Jean-Marc
>
>> Cheers,
>> --John
>>
>>
>> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>>> Hi John,
>>>
>>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>>> also be slightly faster than mine -- probably due to the partial sums.
>>> As for the NEON code, it would be good to compare the performance with
>>> the code Aurélien Zanelli posted at
>>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>>
>>> Cheers,
>>>
>>> Jean-Marc
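
(For reference, a plain-C sketch of what this kernel computes: four
correlations of x against y at lags 0 through 3, accumulated into
sum[0..3]. This is written from the intrinsic code above rather than
copied from the Opus tree, and like the vector versions it assumes y has
at least len+3 readable samples.)

static void xcorr_kernel_scalar(const opus_val16 *x, const opus_val16 *y,
                                opus_val32 sum[4], int len)
{
   int j;
   for (j = 0; j < len; j++) {
      /* Each input sample x[j] contributes to the correlations at
         lags 0, 1, 2 and 3. */
      sum[0] += (opus_val32)x[j]*y[j];
      sum[1] += (opus_val32)x[j]*y[j+1];
      sum[2] += (opus_val32)x[j]*y[j+2];
      sum[3] += (opus_val32)x[j]*y[j+3];
   }
}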