>
> Using just '-O2':
> Original: Direct 4548 k, Interpolate 9657k
> This version: Direct 2992k, Interpolate 9003k
> Unless anyone can spot any glaring mistakes I've made, the plan is to 
> fix the double versions, correct the int->float (and vice versa) 
> conversions and make sure the magic bytes work. Then it's time for 
> some unrolling and _USE_SSE improvements ;)
Or I can do SSE improvements first. With -O2, direct is 1043k, 
interpolate is 5166k. With full optimization, direct is 931k, 
interpolate is 5034k. I'll try to scrounge up a Core 2 machine I can use 
for performance testing to check wall-clock time as well.
The interpolate case is hurt a LOT by the 'broadcast curr_in to all 4 
elements of register' step. Since SSE has no 'load single to all' 
instruction, we have to copy low-word to high-word and then shuffle. We 
could probably shave 20% off the instruction count if we duplicate the 
input array to have 4 copies of each value, but I think that might be 
overkill ;) ... Or maybe it isn't? For the 320 sample case (speex 
wideband), that would be only 6kB of stack, but it would require massive 
rewrites (the buffer would need this copying inside resample.c, meaning 
we move arch-specific stuff into the main file). Opinions wanted.
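To make the trade-off concrete, here is a minimal sketch of the two variants: broadcasting each input sample inside the loop versus reading a buffer that already holds 4 copies of each sample. The names dup4, mac_broadcast and mac_dup4 are hypothetical and not taken from resample.c, the coefficient layout (4 floats per input sample) is an assumption, and the scalar branch is only there so the example also builds without SSE.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical helper: write 4 copies of each sample, so that
   in4[4*i .. 4*i+3] == in[i]. 320 samples alone take 320*4*4 = 5120
   bytes, before any filter history is added. */
static void dup4(const float *in, float *in4, size_t n)
{
    size_t i, k;
    for (i = 0; i < n; i++)
        for (k = 0; k < 4; k++)
            in4[4 * i + k] = in[i];
}

/* acc[k] = sum over i of in[i] * coef[4*i + k]; in[i] is broadcast
   to all 4 lanes inside the loop. */
static void mac_broadcast(const float *in, const float *coef, size_t n,
                          float acc[4])
{
    size_t i;
#ifdef __SSE__
    __m128 sum = _mm_setzero_ps();
    for (i = 0; i < n; i++) {
        __m128 x = _mm_set1_ps(in[i]);  /* scalar load plus shuffle */
        sum = _mm_add_ps(sum, _mm_mul_ps(x, _mm_loadu_ps(coef + 4 * i)));
    }
    _mm_storeu_ps(acc, sum);
#else
    size_t k;
    for (k = 0; k < 4; k++) acc[k] = 0.0f;
    for (i = 0; i < n; i++)
        for (k = 0; k < 4; k++)
            acc[k] += in[i] * coef[4 * i + k];
#endif
}

/* Same result from the pre-duplicated buffer: one vector load per
   sample and no shuffle in the loop. */
static void mac_dup4(const float *in4, const float *coef, size_t n,
                     float acc[4])
{
    size_t i;
#ifdef __SSE__
    __m128 sum = _mm_setzero_ps();
    for (i = 0; i < n; i++)
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(in4 + 4 * i),
                                         _mm_loadu_ps(coef + 4 * i)));
    _mm_storeu_ps(acc, sum);
#else
    size_t k;
    for (k = 0; k < 4; k++) acc[k] = 0.0f;
    for (i = 0; i < n; i++)
        for (k = 0; k < 4; k++)
            acc[k] += in4[4 * i + k] * coef[4 * i + k];
#endif
}
```

The second variant trades the per-sample shuffle for 4x the input memory traffic, so whether it is a net win would have to be measured.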
One small observation. The other SSE optimizations in speex 
have the SSE function in a header, and unless override_blah is defined, 
will define a similar function. For the non-SSE case, GCC optimizes 
this poorly with '-O2'; it is unable to merge the loop index and the 
array indexes, which wastes quite a few instructions. 
I therefore implemented it as follows:
...
#ifndef OVERRIDE_INNER_PRODUCT_SINGLE
      /* Two independent accumulators; reads sinc[j+1], so N must be even. */
      float sum1, sum2;
      sum1 = sum2 = 0.0f;
      for(j=0;j<N;j+=2) {
        sum1 += sinc[j]*iptr[j];
        sum2 += sinc[j+1]*iptr[j+1];
      }
      sum = sum1 + sum2;
#else
      sum = inner_product_single(sinc, iptr, N);
#endif
...
This gives optimal results with -O2 for the non-SSE case.
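For anyone who wants to try the fragment in isolation, a self-contained version might look like the sketch below. The wrapper name direct_dot is hypothetical, and the prototype for inner_product_single is an assumption about what the SSE header provides; it is only compiled in when OVERRIDE_INNER_PRODUCT_SINGLE is defined.

```c
#include <assert.h>

#ifdef OVERRIDE_INNER_PRODUCT_SINGLE
/* Assumed prototype; the real definition would come from the SSE header. */
float inner_product_single(const float *a, const float *b, unsigned int len);
#endif

/* Hypothetical wrapper around the fragment: dot product of sinc[0..N)
   and iptr[0..N). N must be even because the loop consumes two
   elements per iteration. */
static float direct_dot(const float *sinc, const float *iptr, int N)
{
    float sum;
    assert(N % 2 == 0);
#ifndef OVERRIDE_INNER_PRODUCT_SINGLE
    {
        /* Inline scalar path with two independent accumulators. */
        float sum1 = 0.0f, sum2 = 0.0f;
        int j;
        for (j = 0; j < N; j += 2) {
            sum1 += sinc[j] * iptr[j];
            sum2 += sinc[j + 1] * iptr[j + 1];
        }
        sum = sum1 + sum2;
    }
#else
    sum = inner_product_single(sinc, iptr, N);
#endif
    return sum;
}
```

Note that splitting the sum into two accumulators changes the order of the floating-point additions, so results can differ in the last bits from a single-accumulator loop.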
Note that the interpolated SSE case suffers a bit from the same 
problem, and we can shave 5% or so off the instruction count if we move 
it inside the function with #ifdef'ing. The function is declared as 
'static inline' and has no side-effects, so I have no idea why GCC 
doesn't produce the same code :(
How bad would it be to move the SSE code directly into the functions, 
suitably #ifdef'd?
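As a rough illustration of what that could look like, here is a sketch of an inner product with the SSE body placed directly in the function under #ifdef. The function name dot_inline_simd is hypothetical, the use of the compiler's __SSE__ macro as the switch (rather than _USE_SSE) and the multiple-of-8 length requirement are assumptions made for this example, and it is not the actual resample.c code.

```c
#include <assert.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical example: dot product of a[0..n) and b[0..n) with the
   SSE code written directly inside the function. n is assumed to be a
   multiple of 8. */
static float dot_inline_simd(const float *a, const float *b, int n)
{
    int i;
    float ret;
    assert(n % 8 == 0);
#ifdef __SSE__
    {
        __m128 sum = _mm_setzero_ps();
        for (i = 0; i < n; i += 8) {
            sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a + i),
                                             _mm_loadu_ps(b + i)));
            sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a + i + 4),
                                             _mm_loadu_ps(b + i + 4)));
        }
        /* Horizontal add: fold the upper pair onto the lower pair,
           then add lane 1 to lane 0. */
        sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));
        sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55));
        _mm_store_ss(&ret, sum);
    }
#else
    ret = 0.0f;
    for (i = 0; i < n; i++)
        ret += a[i] * b[i];
#endif
    return ret;
}
```

The obvious cost is that arch-specific code ends up interleaved with the generic code, and each additional architecture adds another #ifdef branch in the same function.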