Hi,

Basically, inner_prod() can and should be adapted to the architecture it
will run on. It is not really sensitive to noise, so it's possible to tweak
it a lot. Also, in the current code, I saturate it to +-16384, which is
enough to prevent overflows. I'm not concerned with the case of a constant
-16384 value because it can't really happen in practice (especially after
filtering). BTW, on platforms that have a 40-bit accumulator, it's even
possible to remove the shift from the loop and apply it only at the end.

On Friday, 3 February 2006 at 11:27 -0600, Jerry Trantow wrote:
> I am overriding the inner product routine in ltp.c. To test my replacement,
> I threw some test vectors at it. I understand the loss of resolution caused
> by the shift. I also see a FIXED_POINT danger with the summation of four
> mults overflowing the 32 bit before the shift.
>
> I can fix this by accumulating each term into a long, but if the code scales
> the x[],y[] vectors to avoid this problem I could use parallel 16x16
> multiply/adds.

What do you mean here?

> You can see this problem with the following test case.
>
> for (i=0;i<40;i++)
> {
>    x[i]=-16384;
>    y[i]=-32768;
> }

The value -32768 is not supposed to happen in vectors sent to inner_prod.

> sum0=inner_prod(x, y, 40);
> fprintf(stderr,"inner_prod0(%8d).\n",sum0);

Jean-Marc
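The two accumulation strategies mentioned in the message above can be sketched in portable C. This is only an illustration under stated assumptions, not the actual ltp.c code: the function names are hypothetical, the shift amount of 6 is an assumed value, and int64_t stands in for a 40-bit accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Variant A: sum four products in 32 bits, shift, then accumulate.
   Requires |x[i]| and |y[i]| <= 16384, so that four products stay
   within 2^30 and fit in a signed 32-bit value.
   The shift of 6 is an assumption made for this sketch. */
static int32_t inner_prod_shift_in_loop(const int16_t *x, const int16_t *y, int len)
{
    int32_t sum = 0;
    assert(len % 4 == 0);            /* the sketch handles whole groups of four only */
    for (int i = 0; i < len; i += 4) {
        int32_t part = 0;
        for (int k = 0; k < 4; k++)
            part += (int32_t)x[i + k] * y[i + k];
        sum += part >> 6;            /* low bits are dropped once per group */
    }
    return sum;
}

/* Variant B: wide accumulator (int64_t standing in for a 40-bit register),
   with a single shift applied after the loop.
   Right-shifting a negative value is assumed to be arithmetic here. */
static int32_t inner_prod_shift_at_end(const int16_t *x, const int16_t *y, int len)
{
    int64_t acc = 0;
    for (int i = 0; i < len; i++)
        acc += (int32_t)x[i] * y[i];
    return (int32_t)(acc >> 6);      /* low bits are dropped only once */
}
```

With x[i]=3 and y[i]=5 over 40 samples, variant A returns 0 and variant B returns 9, which shows the resolution lost by shifting inside the loop. With the quoted test case (x[i]=-16384, y[i]=-32768), one group of four products sums to 2^31, which does not fit in a signed 32-bit value, so only variant B is safe for that input.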
Ok, I hadn't verified that inner product was called with values scaled to
< +-16384. That would make it safe to do a 32 bit add of the intermediate
terms. I have implemented the 40-bit accumulator.

>> by the shift. I also see a FIXED_POINT danger with the summation of four
>> mults overflowing the 32 bit before the shift.
>>
>> I can fix this by accumulating each term into a long, but if the code scales
>> the x[],y[] vectors to avoid this problem I could use parallel 16x16
>> multiply/adds.
>
> What do you mean here?

The C64x has a _dotp2() instruction that does two 16x16 multiplies and adds
the products together. Since the values are scaled to 16384, I can add the
results of the two _dotp2()s together before the long add without worrying
about overflow.

I didn't understand that inner_prod() was always passed scaled vectors.
That's the danger of optimizing routines without knowing how they are called.

I split a norm_shift() out of your normalize16(). This function can also be
used twice in pitch_gain_search_3tap(). Are there any other places that
would benefit from this optimized routine?

/* Returns number of shifts to normalize a 32 bit vector to [-16384,+16384). */
static inline int norm_shift(const spx_sig_t *x, spx_sig_t max_scale, int len)
{
   /* Start from 31, the largest count _norm() reports for a 32 bit value,
      so that the running minimum below is well defined. */
   int sig_shift_ti=31;
   int i;

#warn Using the optimized normalize16() function.

   /* Directly find the min(_norm(x[i])) rather than searching for
      max(abs(x[i])) and taking _norm. */
#pragma MUST_ITERATE(24,184,4)
   for (i=0;i<len;i++)
   {
      sig_shift_ti=min(sig_shift_ti,_norm(x[i]));
   }
   sig_shift_ti=max(0,_norm(max_scale-1)-sig_shift_ti);

   /* Return the shift value. */
   return(sig_shift_ti);
} // norm_shift().

PS. Here are the C64x MIPS vs Complexity results for the original code. I
have been able to reduce the complexity 1 encoder to 15.7 MIPS.

Encoder
Complexity   Original 32   Original 16
    1           31.2          29.6
    2           41.7          39.8
    3           51.4          49.0
    4           61.6
    5
    6
    7           93.1
    8
    9          120.8

Jerry J. Trantow
Applied Signal Processing, Inc.
jtrantow@ieee.org
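For readers without a C64x toolchain, the idea behind norm_shift() can be sketched in portable C. This is a minimal sketch under assumptions: norm32() is a hypothetical stand-in for the _norm() intrinsic, assumed to return the number of redundant sign bits (0 to 31), and int32_t is assumed in place of spx_sig_t.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for _norm(): counts the redundant sign
   bits of a 32-bit value (assumed range 0..31). */
static int norm32(int32_t v)
{
    /* Map negative values onto non-negative ones with the same count. */
    uint32_t u = (v < 0) ? ~(uint32_t)v : (uint32_t)v;
    int n = 0;
    /* Count zero bits below the sign bit, from bit 30 downwards. */
    for (uint32_t mask = UINT32_C(1) << 30; mask != 0 && (u & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* Number of right shifts needed to bring every x[i] into
   [-max_scale, +max_scale), assuming max_scale is a power of two. */
static int norm_shift_portable(const int32_t *x, int32_t max_scale, int len)
{
    int min_norm = 31;               /* largest value norm32() can return */
    for (int i = 0; i < len; i++) {
        int n = norm32(x[i]);
        if (n < min_norm)
            min_norm = n;            /* the largest magnitude has the fewest sign bits */
    }
    int shift = norm32(max_scale - 1) - min_norm;
    return shift > 0 ? shift : 0;    /* never shift left */
}
```

Taking the minimum of the per-sample counts avoids a separate max(abs(x[i])) pass, which is the point made in the comment of the posted routine.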
On Saturday, 4 February 2006 at 11:38 -0600, Jerry Trantow wrote:
> Ok, I hadn't verified inner product was called with values scaled to <
> +-16384. That would make it safe to do a 32 bit add of the intermediate
> terms. I have implemented the 40-bit accumulator.

Actually, if you have a 40-bit accumulator, you can just do a loop on
"accumulator += *x++ * *y++" without worrying about anything.

> The C64x has a _dotp2() instruction that does two 16x16 multiplies and adds
> the products together. Since the values are scaled to 16384, I can add the
> results of the two _dotp2()s together before the long add without worrying
> about overflow.

Why would you do that instead of just accumulating directly?

> I split a norm_shift() out of your normalize16(). This function can also be
> used twice in pitch_gain_search_3tap(). Are there any other places that
> would benefit from this optimized routine?

Not sure I see what it does exactly...

> PS. Here are the C64x MIPS vs Complexity results for the original code. I
> have been able to reduce the complexity 1 encoder to 15.7 MIPS.
>
> Encoder
> Complexity   Original 32   Original 16
>     1           31.2          29.6
>     2           41.7          39.8
>     3           51.4          49.0
>     4           61.6
>     5
>     6
>     7           93.1
>     8
>     9          120.8

Could you explain what this means and what the 15.7 MIPS value means? And
what bit-rate?

Jean-Marc
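The question of pairwise sums versus direct accumulation can be made concrete with a small portable sketch. Everything here is an assumption for illustration: dotp2_emul() is a hypothetical stand-in for a packed dual 16x16 multiply-add such as _dotp2(), taking its four operands unpacked, and int64_t stands in for the 40-bit accumulator.

```c
#include <stdint.h>

/* Hypothetical stand-in for a dual 16x16 multiply-add: a0*b0 + a1*b1.
   The 32-bit result can overflow if all four inputs are -32768. */
static int32_t dotp2_emul(int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Pairwise form: two dual products are added in 32 bits, then the result
   goes into the wide accumulator. Safe only while |x[i]| and |y[i]| are
   at most 16384, so four products stay within 2^30.
   Assumes len is a multiple of 4; trailing samples are ignored otherwise. */
static int64_t inner_prod_pairs(const int16_t *x, const int16_t *y, int len)
{
    int64_t acc = 0;
    for (int i = 0; i + 3 < len; i += 4) {
        int32_t lo = dotp2_emul(x[i], x[i + 1], y[i], y[i + 1]);
        int32_t hi = dotp2_emul(x[i + 2], x[i + 3], y[i + 2], y[i + 3]);
        acc += lo + hi;              /* one wide add per four samples */
    }
    return acc;
}

/* Direct form: every product goes straight into the wide accumulator. */
static int64_t inner_prod_direct(const int16_t *x, const int16_t *y, int len)
{
    int64_t acc = 0;
    for (int i = 0; i < len; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}
```

For inputs bounded by 16384 in magnitude both forms return the same sum. The pairwise form only reduces the number of wide additions, while the direct form has no restriction on the input range.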