
John Ridges

2013-Jun-07 18:33 UTC

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Hi JM,

I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned 
assembly is bound to be faster than using intrinsics. However I notice 
that his code can also read past the y buffer.

Cheers,
--John

On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
> Hi John,
>
> Thanks for the two fixes. They're in git now. Your SSE version seems to
> also be slightly faster than mine -- probably due to the partial sums.
> As for the NEON code, it would be good to compare the performance with
> the code Aurélien Zanelli posted at
>
> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>
> Cheers,
>
> 	Jean-Marc
>
>

Jean-Marc Valin

2013-Jun-07 18:51 UTC

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

On 06/07/2013 02:33 PM, John Ridges wrote:
> I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned
> assembly is bound to be faster than using intrinsics.
I was more interested in comparing the vectorization approaches (assuming
the two are different) than the exact code.
> However I notice
> that his code can also read past the y buffer.
Yeah we'd need to either fix this or make sure that we add some padding
to the buffers. In practice it's unlikely to even trigger valgrind (it's
on the stack and the uninitialized data ends up being discarded), but
it's definitely not clean and could come back and bite us later.
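
A minimal sketch of the padding option, assuming 16-bit fixed-point samples. The names `XCORR_PAD` and `copy_with_padding` are hypothetical and not part of Opus, and the pad size of 4 is only an assumption about how far a vector kernel might over-read:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical slack, in elements, left after the real data so that a
   vector load running past the end still lands in initialized memory. */
#define XCORR_PAD 4

/* Copy n samples of y into dst and zero-fill XCORR_PAD extra elements.
   dst must have room for n + XCORR_PAD elements. */
static void copy_with_padding(int16_t *dst, const int16_t *y, int n)
{
    assert(n >= 0);
    memcpy(dst, y, (size_t)n * sizeof(*dst));
    memset(dst + n, 0, XCORR_PAD * sizeof(*dst));
}
```

The extra copy costs time, so fixing the kernel so it never reads past the buffer may well be the better choice; the sketch only shows what "add some padding" could look like.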

Cheers,

	Jean-Marc
> Cheers,
> --John
> 
> 
> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>> Hi John,
>>
>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>> also be slightly faster than mine -- probably due to the partial sums.
>> As for the NEON code, it would be good to compare the performance with
>> the code Aurélien Zanelli posted at
>>
>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>
>>
>> Cheers,
>>
>>     Jean-Marc
>>
>>
> 
> 
>

John Ridges

2013-Jun-07 22:50 UTC

[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Unfortunately I don't have a setup that lets me easily profile ARM code, 
so I really can't tell which method is faster (though I suspect Mr. 
Zanelli's code is). Let me offer up another intrinsic version of the 
NEON xcorr_kernel that is almost identical to the SSE version, and more 
in line with Mr. Zanelli's code:

static inline void xcorr_kernel_neon(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
{
     int j;
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     for (j = 0; j < len-3; j += 4) {
         int16x4_t x0 = vld1_s16(x+j);
         int16x4_t y0 = vld1_s16(y+j);
         int16x4_t y3 = vld1_s16(y+j+3);
         int16x4_t y4 = vext_s16(y3,y3,1);

         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,0),y0);
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,1),vext_s16(y0,y4,1));
         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,2),vext_s16(y0,y4,2));
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,3),y3);
     }
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         if (++j < len) {
             xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             if (++j < len) {
                 xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             }
         }
     }
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}

Whether or not this version is faster than the first version I submitted 
probably depends a lot on how fast unaligned memory vector accesses are 
on an ARM processor. Of course hand-coded assembly would be even faster 
than using intrinsics (for instance the "vdup_lane_s16"s wouldn't be
needed), but in this case it could be that the multiply-add stalls swamp 
most of the inefficiencies in the intrinsic code. It would be cool if 
someone out there has a setup that would let us definitively find out 
which is fastest and by how much.
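
Before timing anything, it helps to confirm that each vectorized variant produces the same sums. The following is a plain C reference sketch, assuming the fixed-point build in which opus_val16 and opus_val32 are 16-bit and 32-bit integers (as the vld1_s16 and vld1q_s32 loads above imply). The names `xcorr_kernel_ref` and `xcorr_sums_equal` are hypothetical helpers, not Opus functions:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: sum[k] += x[j] * y[j + k] for k = 0..3.
   The highest index read is y[len + 2], so y needs len + 3 valid elements. */
static void xcorr_kernel_ref(const int16_t *x, const int16_t *y,
                             int32_t sum[4], int len)
{
    int j, k;
    assert(len >= 0);
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += (int32_t)x[j] * y[j + k];
        }
    }
}

/* Returns 1 if two sets of four accumulators agree, 0 otherwise. */
static int xcorr_sums_equal(const int32_t a[4], const int32_t b[4])
{
    int k;
    for (k = 0; k < 4; k++) {
        if (a[k] != b[k]) return 0;
    }
    return 1;
}
```

Running the reference and a NEON or SSE variant on the same inputs, for several values of len including ones that are not multiples of 4, and comparing with `xcorr_sums_equal` exercises both the main loop and the tail handling.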

If the hit from using intrinsics isn't too bad, I would prefer them, 
since they are compatible with, I think, nearly all ARM compilers (and in 
truth I also prefer using intrinsics for NEON code because I'm just not 
that familiar with ARM assembly).

Cheers,
--John

On 6/7/2013 12:51 PM, Jean-Marc Valin wrote:
> On 06/07/2013 02:33 PM, John Ridges wrote:
>> I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned
>> assembly is bound to be faster than using intrinsics.
> I was more interested in comparing the vectorization approaches (assuming
> the two are different) than the exact code.
>
>> However I notice
>> that his code can also read past the y buffer.
> Yeah we'd need to either fix this or make sure that we add some padding
> to the buffers. In practice it's unlikely to even trigger valgrind (it's
> on the stack and the uninitialized data ends up being discarded), but
> it's definitely not clean and could come back and bite us later.
>
> Cheers,
>
> 	Jean-Marc
>
>> Cheers,
>> --John
>>
>>
>> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>>> Hi John,
>>>
>>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>>> also be slightly faster than mine -- probably due to the partial sums.
>>> As for the NEON code, it would be good to compare the performance with
>>> the code Aurélien Zanelli posted at
>>>
>>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>>
>>>
>>> Cheers,
>>>
>>>      Jean-Marc
>>>
>>>
>>
>>
>
