thr3ads.net - Speex dev - [Speex-dev] Resampler experimental speedups [Apr 2008]

Thorvald Natvig

2008-Apr-04 00:15 UTC

[Speex-dev] Resampler experimental speedups

Hello :)

The attached patch (which is not in any way finished) optimizes the 
resampler. (For those following the discussions on IRC: this version 
includes optimizations for both the direct and interpolate cases.)

Test setup: GCC 4.3 on x86_64, with Valgrind measuring instruction 
counts while resampling 10 frames of 320 floats at quality 3. Direct was 
measured with a 16=>48 resampling, and interpolate with a 16=>44.1 
resampling.

Using just '-O2':
Original: Direct 4548k, Interpolate 9657k
This version: Direct 2992k, Interpolate 9003k

So this version uses only about 66% of the instructions of the one in 
SVN for the direct case, which I think is a decent speedup :) For 
interpolate, there's so much to do in each loop iteration that my tricks 
only give a marginal improvement (about 7%). Note that no loop unrolling 
has been done; for the direct case, unrolling 4 times will reduce the 
instruction count noticeably.

Using '-ftree-vectorize -ffast-math -O3' and a profile run:
Original: Direct 3419k, Interpolate 9255k
This version: Direct 1629k, Interpolate 8588k

My loop transformations allow GCC to recognize it as vectorizable for 
the direct case, giving a very nice speedup. For interpolate, we're 
again hurt by the loop doing too much work. Note though that GCC 
currently does not vectorize the inner loop for interpolate as it's 
unable to recognize that the operations are applied equally to all 
elements in accum[].

On the downside, this will allocate, on the stack, in_len + st->filt_len 
elements to hold a temporary array for the input. In my testcase, this 
means 1472 bytes. If you use larger frames, this will scale accordingly.
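
The stack cost quoted above can be reproduced with a small calculation. This is only a sketch: the function name is hypothetical, and the filter length of 48 is not stated in the message but inferred from the numbers given (1472 bytes / 4 bytes per float = 368 = 320 + 48).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: bytes needed for a
 * temporary copy of the input holding in_len + filt_len floats. */
size_t temp_input_bytes(size_t in_len, size_t filt_len)
{
    assert(in_len > 0);
    return (in_len + filt_len) * sizeof(float);
}
```

With 4-byte floats, temp_input_bytes(320, 48) gives 1472, matching the figure above; a 640-sample frame with the same assumed filter length would need 2752 bytes.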

Unless anyone can spot any glaring mistakes I've made, the plan is to 
fix the double versions, correct the int->float (and vice versa) 
conversions and make sure the magic bytes work. Then it's time for some 
unrolling and _USE_SSE improvements ;)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: resample-both-test.patch
Type: text/x-diff
Size: 9068 bytes
Desc: not available
Url :
http://lists.xiph.org/pipermail/speex-dev/attachments/20080404/eedc0f5a/attachment.patch

Thorvald Natvig

2008-Apr-04 11:48 UTC

[Speex-dev] Resampler experimental speedups

>
> Using just '-O2':
> Original: Direct 4548k, Interpolate 9657k
> This version: Direct 2992k, Interpolate 9003k
> Unless anyone can spot any glaring mistakes I've made, the plan is to 
> fix the double versions, correct the int->float (and vice versa) 
> conversions and make sure the magic bytes work. Then it's time for 
> some unrolling and _USE_SSE improvements ;)

Or I can do SSE improvements first. With -O2, direct is 1043k, 
interpolate is 5166k. With full optimization, direct is 931k, 
interpolate is 5034k. I'll try to scrounge up a core2 machine I can use 
for performance testing to check wallclock time as well.

The interpolate is hurt a LOT by the 'broadcast curr_in to all 4 
elements of register'. Since there is no 'load single to all' 
instruction, we have to copy low-word to high-word and then shuffle. We 
could probably shave 20% off the instruction count if we duplicate the 
input array to have 4 copies of each value, but I think that might be 
overkill ;) ... Or maybe it isn't? For the 320 sample case (speex 
wideband), that would be only 6kB of stack, but would require massive 
rewrites (the buffer would need this copying inside resample.c, meaning 
we move arch-specific stuff into the main file). Opinions wanted.
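
To make the 'duplicate the input array' idea concrete, here is a minimal portable sketch. It is not taken from the patch: both function names are hypothetical, and the layout (4 adjacent copies of each sample, so that a single 4-wide load yields the same value in every lane) is an assumption about what such a buffer would look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write 4 adjacent copies of each input sample.
 * A 4-wide vector load from &dup[4*i] would then see in[i] in every
 * lane, with no shuffle needed. dup must have room for 4*n floats. */
void duplicate_input_x4(const float *in, size_t n, float *dup)
{
    assert(in != NULL && dup != NULL);
    for (size_t i = 0; i < n; i++) {
        float v = in[i];
        dup[4 * i + 0] = v;
        dup[4 * i + 1] = v;
        dup[4 * i + 2] = v;
        dup[4 * i + 3] = v;
    }
}

/* Size in bytes of the duplicated buffer for n input floats. */
size_t duplicated_bytes(size_t n)
{
    return 4 * n * sizeof(float);
}
```

If the temporary buffer holds 320 + 48 floats (the 48 is inferred from the 1472-byte figure in the first message, not stated outright), duplicated_bytes(368) is 5888 with 4-byte floats, which agrees with the rough 6kB estimate above.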

There's a small observation here. The other SSE optimizations in speex 
have the SSE function in a header, and unless override_blah is defined, 
the generic code will define a similar function. For the non-SSE case, 
GCC optimizes this poorly with '-O2'; it is unable to merge the loop 
index and array indexes, which wastes quite a few instructions. 
I therefore implemented it as follows:

...
#ifndef OVERRIDE_INNER_PRODUCT_SINGLE
      float sum1, sum2;
      sum1 = sum2 = 0.0f;

      for(j=0;j<N;j+=2) {
        sum1 += sinc[j]*iptr[j];
        sum2 += sinc[j+1]*iptr[j+1];
      }
      sum = sum1 + sum2;
#else
      sum = inner_product_single(sinc, iptr, N);
#endif
...

This gives optimal results with -O2 for the non-SSE case.
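
For anyone who wants to try the fragment above in isolation, the same two-accumulator loop can be wrapped in a standalone function. This is a sketch, not the patch itself: the function name is hypothetical, and the assert spells out an assumption the fragment makes only implicitly, namely that N is even.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standalone version of the loop above: inner product of
 * sinc[] and iptr[] over n elements, using two independent partial
 * sums. n must be even because the loop consumes two elements per
 * iteration. */
float inner_product_two_acc(const float *sinc, const float *iptr, size_t n)
{
    float sum1 = 0.0f;
    float sum2 = 0.0f;

    assert(n % 2 == 0);
    for (size_t j = 0; j < n; j += 2) {
        sum1 += sinc[j] * iptr[j];         /* even-indexed terms */
        sum2 += sinc[j + 1] * iptr[j + 1]; /* odd-indexed terms */
    }
    return sum1 + sum2;
}
```

Because the terms are summed in a different order than in a single-accumulator loop, the floating-point result can differ from it in the last bits.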

Note that the interpolated SSE case suffers a bit from the same 
problems, and we can shave 5% or so off the instruction count if we move 
it inside the function with #ifdef'ing. The function is declared as 
'static inline' and has no side-effects, so I have no idea why GCC 
doesn't produce the same code :(

How bad would it be to move the SSE code directly into the functions, 
suitably #ifdef'd?
