thr3ads.net - Speex dev - [Speex-dev] Speex inner_prod(), normalize, C64 MIPS [Feb 2006]


Jean-Marc Valin

2006-Feb-03 18:55 UTC

[Speex-dev] Speex inner_prod()

Hi,

Basically, inner_prod() can and should be adapted to the architecture it
will run on. It is not really sensitive to noise, so it's possible to
tweak it a lot. Also, in the current code, I saturate it to +-16384,
which is OK to prevent overflows. I'm not concerned with the case of a
constant -16384 value because it can't really happen in practice
(especially after filtering). BTW, on platforms that have a 40-bit
accumulator, it's possible to even remove the shift from the loop and
apply it only at the end.
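
The trade-off described above can be sketched in portable C. This is only an illustration, not the actual ltp.c code: the function names, the shift amount SKETCH_SHIFT, and the use of int64_t as a stand-in for a 40-bit accumulator are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SHIFT 6  /* assumed shift amount, for illustration only */

/* Shift inside the loop: each group of four products is scaled down
   before it is added to the 32-bit running sum. Right-shifting a
   negative value is implementation-defined in C; most compilers
   perform an arithmetic shift. */
static int32_t inner_prod_shift_in_loop(const int16_t *x, const int16_t *y, int len)
{
    int32_t sum = 0;
    int i;
    assert(len % 4 == 0);
    for (i = 0; i < len; i += 4) {
        int32_t part = 0;
        part += (int32_t)x[i]     * y[i];
        part += (int32_t)x[i + 1] * y[i + 1];
        part += (int32_t)x[i + 2] * y[i + 2];
        part += (int32_t)x[i + 3] * y[i + 3];
        sum += part >> SKETCH_SHIFT;
    }
    return sum;
}

/* Wide accumulator: int64_t stands in for a 40-bit DSP accumulator,
   so the shift is applied once, after the loop. */
static int32_t inner_prod_shift_at_end(const int16_t *x, const int16_t *y, int len)
{
    int64_t acc = 0;
    int i;
    for (i = 0; i < len; i++)
        acc += (int32_t)x[i] * y[i];
    return (int32_t)(acc >> SKETCH_SHIFT);
}
```

With 40 samples of x[i]=y[i]=3, the first version discards every partial sum (36 >> 6 is 0), while the second keeps 360 >> 6 = 5, which shows the resolution gained by shifting only at the end.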

On Friday, 3 February 2006 at 11:27 -0600, Jerry Trantow wrote:
> I am overriding the inner product routine in ltp.c.  To test my replacement,
> I threw some test vectors at it.  I understand the loss of resolution caused
> by the shift.  I also see a FIXED_POINT danger with the summation of four
> mults overflowing the 32 bit before the shift.
>
> I can fix this by accumulating each term into a long, but if the code scales
> the x[],y[] vectors to avoid this problem I could use parallel 16x16
> multiply/adds.
What do you mean here?
> You can see this problem with the following test case.
>
> for (i=0;i<40;i++)
> {
> 	x[i]=-16384;
> 	y[i]=-32768;
> }
The value -32768 is not supposed to happen in vectors sent to
inner_prod.
> sum0=inner_prod(x, y, 40);
> fprintf(stderr,"inner_prod0(%8d).\n",sum0);
	Jean-Marc

Jerry Trantow

2006-Feb-04 09:38 UTC


[Speex-dev] Speex inner_prod(), normalize, C64 MIPS

Ok, I hadn't verified inner product was called with values scaled to
<+-16384.  That would make it safe to do a 32 bit add of the intermediate
terms. I have implemented the 40-bit accumulator.
>> by the shift.  I also see a FIXED_POINT danger with the summation of four
>> mults overflowing the 32 bit before the shift.
>>
>> I can fix this by accumulating each term into a long, but if the code scales
>> the x[],y[] vectors to avoid this problem I could use parallel 16x16
>> multiply/adds.
> What do you mean here?

The C64x has a _dotp2() instruction that does two 16x16 multiplies and adds
the products together.  Since the values are scaled to 16384, I can add the
results of the two _dotp2()s together before the long add without worrying
about overflow.  I didn't understand that inner_prod() was always passed
scaled vectors.  That's the danger of optimizing routines without knowing
how they are called.
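
The overflow argument can be checked with a little arithmetic. In the earlier test case each product is -16384 * -32768 = 2^29, so four of them sum to 2^31, one more than a signed 32-bit value can hold. With both inputs bounded by 16384 in magnitude, each product is at most 2^28 and four of them sum to at most 2^30, which fits. The sketch below uses a hypothetical 64-bit stand-in for _dotp2() (the names dotp2_wide, fits_int32 and four_products are made up for illustration) so the true sum is visible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for _dotp2(): two 16x16 products added together.
   The result is kept in 64 bits so that a value too large for a 32-bit
   register can be detected instead of silently wrapping. */
static int64_t dotp2_wide(int16_t x0, int16_t y0, int16_t x1, int16_t y1)
{
    return (int64_t)x0 * y0 + (int64_t)x1 * y1;
}

/* Returns 1 if v is representable as a signed 32-bit integer. */
static int fits_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Sum of two dotp2 results (four products) for constant inputs x and y. */
static int64_t four_products(int16_t x, int16_t y)
{
    return dotp2_wide(x, y, x, y) + dotp2_wide(x, y, x, y);
}

/* The two cases discussed in the thread. */
static void check_examples(void)
{
    /* x = -16384, y = -32768: 4 * 2^29 = 2^31 does not fit in 32 bits. */
    assert(!fits_int32(four_products(-16384, -32768)));
    /* Both inputs bounded by 16384: 4 * 2^28 = 2^30 fits. */
    assert(fits_int32(four_products(-16384, -16384)));
}
```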

I split a norm_shift() out of your normalize16().  This function can also be
used twice in pitch_gain_search_3tap().  Are there any other places that
would benefit from this optimized routine?

/*
	Returns number of shifts to normalize a 32 bit vector to 
	[-16384,+16384).
*/
static inline int norm_shift(const spx_sig_t *x, spx_sig_t max_scale, int len)
{
    int sig_shift_ti=31;	/* Start at 31, the largest value _norm() returns for a 32 bit input. */
    int i;

	#warn Using the optimized normalize16() function.
    /*
        Directly find min(_norm(x[i])) rather than searching for
        max(abs(x[i])) and taking _norm.
    */
    #pragma MUST_ITERATE(24,184,4)
    for (i=0;i<len;i++)
    {
        sig_shift_ti=min(sig_shift_ti,_norm(x[i]));	
    }
    sig_shift_ti=max(0,_norm(max_scale-1)-sig_shift_ti);
    /*
        Return the shift value.
    */
    return(sig_shift_ti);
}	//	norm_shift().	
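
To make the behaviour of norm_shift() easier to follow off-target, here is a portable sketch. It assumes _norm() returns the number of redundant sign bits of a 32-bit value (0 to 31); norm_emul() and norm_shift_portable() are hypothetical names used only for this illustration, and plain int32_t replaces spx_sig_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for the C64x _norm() intrinsic:
   number of redundant sign bits in a 32-bit value (0..31). */
static int norm_emul(int32_t v)
{
    /* Fold negative values onto non-negative ones so that only
       leading zeros below the sign bit need to be counted. */
    uint32_t u = (v < 0) ? ~(uint32_t)v : (uint32_t)v;
    int n = 0;
    while (n < 31 && !(u & (UINT32_C(1) << (30 - n))))
        n++;
    return n;
}

/* Number of right shifts that brings every x[i] into
   [-max_scale, +max_scale), assuming max_scale is a power of two. */
static int norm_shift_portable(const int32_t *x, int32_t max_scale, int len)
{
    int min_norm = 31;
    int i, shift;
    assert(max_scale > 0);
    for (i = 0; i < len; i++) {
        int n = norm_emul(x[i]);
        if (n < min_norm)
            min_norm = n;   /* the largest magnitude has the fewest sign bits */
    }
    shift = norm_emul(max_scale - 1) - min_norm;
    return shift > 0 ? shift : 0;
}
```

For example, with max_scale=16384 and a vector whose largest element is 1000000, norm_emul(16383) is 17 and norm_emul(1000000) is 11, so the shift is 6 and 1000000 >> 6 = 15625 falls inside the target range.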


PS.  Here are the C64x MIPS vs Complexity results for the original code.  I
have been able to reduce the complexity 1 encoder to 15.7 MIPS.

Encoder
Complexity   Original 32   Original 16
1            31.2          29.6
2            41.7          39.8
3            51.4          49.0
4            61.6          -
5            -             -
6            -             -
7            -             93.1
8            -             -
9            -             120.8


Jerry J. Trantow
Applied Signal Processing, Inc.
jtrantow@ieee.org

Jean-Marc Valin

2006-Feb-04 18:42 UTC


[Speex-dev] Speex inner_prod(), normalize, C64 MIPS

On Saturday, 4 February 2006 at 11:38 -0600, Jerry Trantow wrote:
> Ok, I hadn't verified inner product was called with values scaled to
> <+-16384.  That would make it safe to do a 32 bit add of the intermediate
> terms. I have implemented the 40-bit accumulator.
Actually, if you have a 40-bit accumulator, you can just do a loop on
"accumulator += *x++ * *y++" without worrying about anything.
> The C64x has a _dotp2() instruction that does two 16x16 multiplies and adds
> the products together.  Since the values are scaled to 16384, I can add the
> results of the two _dotp2()s together before the long add without worrying
> about overflow.  
Why would you do that instead of just accumulating directly?
> I split a norm_shift() out of your normalize16().  This function can also be
> used twice in pitch_gain_search_3tap().  Are there any other places that
> would benefit from this optimized routine?
Not sure I see what it does exactly...
> PS.  Here are the C64x MIPS vs Complexity results for the original code.  I
> have been able to reduce the complexity 1 encoder to 15.7 MIPS.
> 
> Encoder
> Complexity   Original 32   Original 16
> 1            31.2          29.6
> 2            41.7          39.8
> 3            51.4          49.0
> 4            61.6          -
> 5            -             -
> 6            -             -
> 7            -             93.1
> 8            -             -
> 9            -             120.8
Could you explain what this means and what the 15.7 MIPS value means?
And what bit-rate?

	Jean-Marc
