Displaying 3 results from an estimated 3 matches for "vdup_lane_s16".
2013 Jun 07 · 1 · Bug fix in celt_lpc.c and some xcorr_kernel optimizations
...sum1 = vld1q_s32(sum);
int32x4_t xsum2 = vdupq_n_s32(0);
for (j = 0; j < len-3; j += 4) {
    int16x4_t x0 = vld1_s16(x+j);
    int16x4_t y0 = vld1_s16(y+j);
    int16x4_t y3 = vld1_s16(y+j+3);
    int16x4_t y4 = vext_s16(y3, y3, 1);
    xsum1 = vmlal_s16(xsum1, vdup_lane_s16(x0, 0), y0);
    xsum2 = vmlal_s16(xsum2, vdup_lane_s16(x0, 1), vext_s16(y0, y4, 1));
    xsum1 = vmlal_s16(xsum1, vdup_lane_s16(x0, 2), vext_s16(y0, y4, 2));
    xsum2 = vmlal_s16(xsum2, vdup_lane_s16(x0, 3), y3);
}
if (j < len) {
    xsum1 = vmlal_s16(xsum1, vdup_n_s16(*(x+j)), vl...
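For reference, the NEON loop above accumulates four cross-correlation lags at once: lane k of the combined sums ends up holding sum over i of x[i]*y[i+k] for k = 0..3 (xsum1 carries the even x lanes, xsum2 the odd ones). A plain-C sketch of that same computation (a hypothetical `xcorr_kernel_ref` written for this note, not the actual Opus code) would be:

```c
#include <stdint.h>

/* Scalar sketch of what the vectorized loop computes:
 * sum[k] += sum over i of x[i] * y[i+k], for k = 0..3.
 * Assumes y[] is valid up to index len+2, matching the
 * four-lag access pattern of the NEON version. */
static void xcorr_kernel_ref(const int16_t *x, const int16_t *y,
                             int32_t sum[4], int len)
{
    for (int i = 0; i < len; i++)
        for (int k = 0; k < 4; k++)
            sum[k] += (int32_t)x[i] * y[i + k];
}
```

A scalar reference like this is handy for validating the intrinsics version against known inputs.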
2013 Jun 07 · 2 · Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM,
I have no doubt that Mr. Zanelli's NEON code is faster, since hand-tuned
assembly is bound to be faster than using intrinsics. However, I notice
that his code can also read past the y buffer.
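The extent of that read is easy to quantify from the loop itself: each iteration issues `vld1_s16(y+j+3)`, a four-lane load touching `y[j+3]` through `y[j+6]`. A small sketch (my own arithmetic, not from the thread) that computes the highest `y` index the vector loop reads for a given `len`:

```c
/* Highest y[] index touched by the vectorized loop:
 * for each j in {0, 4, 8, ...} with j < len-3, the load
 * vld1_s16(y + j + 3) reads lanes y[j+3] .. y[j+6].
 * Whether that overruns the buffer depends on how much
 * slack the caller allocates past y[len-1]. */
static int max_y_index_read(int len)
{
    int last = -1;
    for (int j = 0; j < len - 3; j += 4)
        last = j + 6;   /* top lane of vld1_s16(y + j + 3) */
    return last;
}
```

For example, when len is a multiple of 4 the loop consumes all of x, yet still reads up to y[len+2].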
Cheers,
--John
On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
> Hi John,
>
> Thanks for the two fixes. They're in git now. Your SSE version seems to
> also be slightly faster than
2013 Jun 10 · 0 · opus Digest, Vol 53, Issue 2