> I just realized that. I added counters for each of the arithmetic
> macros and discovered that mac16_32_q15 is the most frequent. I'm not
> sure I can do much more without understanding the code better. Q15 is
> Q1.15 format, right? Looking at
> MAC16_32_Q15(long c, short a, long b)
MAC16_32_Q15 means c + ((a*b)>>15), so a and b together carry 15 more
fractional bits ("binary decimals") than c. It could be, for example,
Q16.16 += Q8.8 * Q9.23.
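
To make the format bookkeeping concrete, here is a minimal reference sketch of that operation. The function name is made up for illustration and the 64-bit intermediate is my assumption, not how the Speex macro is written; it only shows the arithmetic.

```c
#include <stdint.h>

/* Hypothetical reference version of c + ((a*b) >> 15).
 * A 64-bit intermediate holds the full 16x32 product (up to 48 bits).
 * Assumes >> on a negative value is an arithmetic shift, which is
 * implementation-defined in C but true on mainstream compilers. */
static int32_t mac16_32_q15_ref(int32_t c, int16_t a, int32_t b)
{
    int64_t prod = (int64_t)a * (int64_t)b;  /* 8 + 23 = 31 fractional bits in the example */
    return c + (int32_t)(prod >> 15);        /* 31 - 15 = 16 fractional bits, same as c */
}
```

With a = 1.5 in Q8.8 (384), b = 2.0 in Q9.23 (16777216) and c = 0.5 in Q16.16 (32768), the result is 229376, which is 3.5 in Q16.16.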
> Is b a Q15 represented as Q17.15 so that the implementation does not
> depend on saturating hardware? If we do have saturating hardware, can I
> keep the b input saturated so that only 1 16x16->32 multiply is required
> instead of 2? Essentially, pushing the saturation back into the inner
> loop because it's almost free.
Not quite. In this case, b is still a 32-bit value, so you can't just
use one 16x16 operation. You might however be able to use something
else. For example, on ARM (5E arch, I think), there's an instruction that
does (a*b)>>16, where a is 16 bits and b is 32 bits.
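
As a rough sketch of how such an instruction could be used, the code below models it with a portable stand-in and builds the MAC on top. Both function names are hypothetical, and doubling b to turn the >>16 into a >>15 is my own assumption: it only works if b has one bit of headroom (|b| < 2^30).

```c
#include <stdint.h>

/* Portable stand-in for a 16x32 multiply that keeps the top 32 bits
 * of the 48-bit product, i.e. (a*b) >> 16. This is not the real
 * instruction, just a model of what it computes. */
static int32_t mulw_16x32_shr16(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 16);
}

/* c + ((a*b) >> 15) built on the >>16 primitive.
 * Assumption: |b| < 2^30, so that b * 2 does not overflow.
 * Then (a * 2b) >> 16 equals (a * b) >> 15 exactly. */
static int32_t mac16_32_q15_mulw(int32_t c, int16_t a, int32_t b)
{
    return c + mulw_16x32_shr16(a, b * 2);
}
```

If b cannot be guaranteed that headroom, shifting the result left by one instead loses the lowest bit, so the two variants are not interchangeable.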
Otherwise, if you look at the line
mem[j] = MAC16_32_Q15(MAC16_32_Q15(mem[j+1], num[j+1],xi), den[j+1],nyi);
in filters.c, you see that some of the operations on xi and nyi (the
shift and the bitwise &) do not depend on j, so they can be moved outside
the loop (make sure your compiler is intelligent enough to do it).
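
Here is a minimal sketch of that hoisting done by hand. It assumes the generic macro splits the 32-bit operand into a high part (b >> 15) and a low part (b & 0x7fff) so that only 16x16 multiplies are needed; the helper names and the loop bounds are mine, not taken from filters.c.

```c
#include <stdint.h>

/* c + ((a*b) >> 15), with b already split into b_hi = b >> 15 and
 * b_lo = b & 0x7fff. Assumes arithmetic >> on negative values and
 * inputs scaled so that nothing overflows 32 bits. */
static int32_t mac_split(int32_t c, int16_t a, int32_t b_hi, int32_t b_lo)
{
    return c + a * b_hi + ((a * b_lo) >> 15);
}

/* Inner-loop update with the shift and mask on xi and nyi hoisted. */
static void update_mem(int32_t *mem, const int16_t *num, const int16_t *den,
                       int ord, int32_t xi, int32_t nyi)
{
    /* Loop-invariant: computed once instead of once per tap. */
    const int32_t xi_hi  = xi >> 15,  xi_lo  = xi & 0x7fff;
    const int32_t nyi_hi = nyi >> 15, nyi_lo = nyi & 0x7fff;

    for (int j = 0; j < ord - 1; j++)
        mem[j] = mac_split(mac_split(mem[j + 1], num[j + 1], xi_hi, xi_lo),
                           den[j + 1], nyi_hi, nyi_lo);
}
```

The split gives the same result as the full (a*b)>>15 because b = b_hi*32768 + b_lo with 0 <= b_lo < 32768, so the high part contributes a*b_hi exactly.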
Last thing, note that all of this has nothing to do with the saturation,
since the code assumes no overflow is possible in the first place (input
is properly scaled). If you have hardware saturation though, it might be
possible to convert some of the calls to filter_mem2 to use 16 bit
arithmetic without increasing the noise level too much. I suggest you
try the other solutions first, though. Oh, and you probably want to use
complexity 2 instead of 3 (or even 1 if you're really tight on CPU).
Jean-Marc