thr3ads.net - Speex dev - [Speex-dev] Resampler saturation, blackfin performance [Jun 2009]

Stephane Lesage

2009-Jun-14 21:31 UTC

[Speex-dev] Resampler saturation, blackfin performance

> -----Original Message-----
> From: Jean-Marc Valin [mailto:jean-marc.valin at usherbrooke.ca]
> Sent: Sunday, 14 June 2009 20:46
> To: Stephane Lesage
> Cc: speex-dev at xiph.org
> Subject: Re: [Speex-dev] Resampler saturation
> 
> Just to make sure I understand, the two patches you sent are 
> two different ways to fix the problem, with the only 
> difference being that resample.patch converts the "unrolled 
> by four" loop into a plain one that's easier on DSPs, right?
Yes exactly, plus a little explanation in comments.
I really have no idea of the performance difference on x86. But I think gcc/msvc
can unroll.
Up to you. Anyway I can OVERRIDE_INNER_PRODUCT_SINGLE.


Talking about performance (still using generic version with VDSP compiler):
1. I got a pretty good boost by using a scratch buffer in SRAM.
2. Wideband encode + decode takes 79.1 + 7.2 MIPS on my BF536 (400/133 MHz)
3. Profiler says:
vq_nbest                  33.05%
vq_nbest_sign             11.12%
filter_mem16               4.14%
inner_prod                 4.07%
iir_mem16                  2.75%
qmf_synth                  2.32%
lsp_to_lpc                 2.32%
open_loop_nbest_pitch      1.41%
compute_impulse_response   1.37%
qmf_decomp                 1.28%
lpc_to_lsp                 1.26%
fir_mem16                  1.16%
speex_bits_pack            1.07%
speex_bits_unpack_unsigned 0.86%
compute_rms16              0.79%

4. I'm using the echo canceller + preprocessor,
and I'd really like to improve performance here:
- I would like to use ADI's FFT, but it's limited to powers of 2.
  Is it safe to enable "Round ps_size down to the nearest power of two"
  in the preprocessor?
  Can we do the same trick with the echo canceller for window_size?
- Are there buffers that could be placed in scratch memory?
  (I don't see any speex_scratch_alloc in there.)
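
For reference, rounding a size down to the nearest power of two can be sketched as below. This is only an illustration of the arithmetic; the function name round_down_pow2 is hypothetical and the code is not taken from the Speex preprocessor.

```c
/* Hypothetical helper: largest power of two that is <= n.
 * Returns 0 for n < 1. Not taken from the Speex sources. */
static int round_down_pow2(int n)
{
    int p = 1;
    if (n < 1)
        return 0;
    /* Compare against n / 2 so that p * 2 can never overflow. */
    while (p <= n / 2)
        p *= 2;
    return p;
}
```

With this helper, a size of 160 would round down to 128, while 256 would stay at 256.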

-- 
Stéphane LESAGE
ATEIS International

Jean-Marc Valin

2009-Jun-14 23:16 UTC

[Speex-dev] Resampler saturation, blackfin performance

Stephane Lesage wrote:
>> -----Original Message-----
>> From: Jean-Marc Valin [mailto:jean-marc.valin at usherbrooke.ca]
>> Sent: Sunday, 14 June 2009 20:46
>> To: Stephane Lesage
>> Cc: speex-dev at xiph.org
>> Subject: Re: [Speex-dev] Resampler saturation
>>
>> Just to make sure I understand, the two patches you sent are 
>> two different ways to fix the problem, with the only 
>> difference being that resample.patch converts the "unrolled 
>> by four" loop into a plain one that's easier on DSPs, right?
> 
> Yes exactly, plus a little explanation in comments.
> I really have no idea of the performance difference on x86. But I think
> gcc/msvc can unroll.
> Up to you. Anyway I can OVERRIDE_INNER_PRODUCT_SINGLE.
OK, I guess I'll apply resample.patch considering that we already have
an SSE version (the split in four was for SSE anyway).
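
To make the two loop shapes under discussion concrete, here is a minimal sketch of a plain inner product next to one unrolled by four. It is not the resampler code from the patch; the function names are hypothetical, and the unrolled variant assumes len is a multiple of 4.

```c
#include <stdint.h>

/* Plain loop: one multiply-accumulate per iteration. A DSP compiler
 * can usually map this onto a hardware loop directly. */
static int32_t inner_prod_plain(const int16_t *x, const int16_t *y, int len)
{
    int32_t sum = 0;
    for (int i = 0; i < len; i++)
        sum += (int32_t)x[i] * y[i];
    return sum;
}

/* Unrolled by four: the shape that suits a 4-wide SIMD unit.
 * Assumption: len is a multiple of 4. */
static int32_t inner_prod_unrolled4(const int16_t *x, const int16_t *y, int len)
{
    int32_t sum = 0;
    for (int i = 0; i < len; i += 4) {
        int32_t part = 0;
        part += (int32_t)x[i]     * y[i];
        part += (int32_t)x[i + 1] * y[i + 1];
        part += (int32_t)x[i + 2] * y[i + 2];
        part += (int32_t)x[i + 3] * y[i + 3];
        sum += part;
    }
    return sum;
}
```

Both functions return the same value for the same input as long as the sums stay within 32 bits; only the loop structure differs.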
> Talking about performance (still using generic version with VDSP compiler):
You'll likely see a difference just by optimising the MULT16_32_Q15() macro.
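
As an illustration of what a 16-bit by 32-bit Q15 multiply computes, here is a sketch with a 64-bit reference and a split form that only needs 16x16 multiplies. The function names are hypothetical and this is not copied from the Speex headers. It assumes right shifts of negative values are arithmetic, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Reference: full 64-bit product, then drop the 15 fractional bits. */
static int32_t mult16_32_q15_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 15);
}

/* Split form: treat b as (hi * 65536 + lo), with hi signed and lo
 * unsigned, so that each partial product fits in 32 bits.
 * Assumption: >> on negative values is an arithmetic shift. */
static int32_t mult16_32_q15_split(int16_t a, int32_t b)
{
    int32_t hi = b >> 16;      /* signed upper 16 bits   */
    int32_t lo = b & 0xffff;   /* unsigned lower 16 bits */
    return (int32_t)a * hi * 2 + (((int32_t)a * lo) >> 15);
}
```

The split form is the kind of expression that a target-specific version would replace with a native fractional multiply where the hardware has one.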
> 1. I got a pretty good boost by using a scratch buffer in SRAM.
Normally, all the data should end up in L1 cache, so it's surprising
that you see a difference with using SRAM. Are you sure your cache isn't
configured as write-through?
> 2. Wideband encode + decode takes 79.1 + 7.2 MIPS on my BF536 (400/133 MHz)
> 3. Profiler says:
> vq_nbest                  33.05%
> vq_nbest_sign             11.12%
You should be able to get a big boost in performance just by optimising
the N=1 case for vq_nbest() and vq_nbest_sign().
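
A minimal sketch of what an N=1 special case could look like: with a single best candidate, the search only has to track a running minimum instead of maintaining a sorted list. The function name vq_best1 and its signature are hypothetical, and the score used here (half the codeword energy minus the correlation with the input) is an assumption about how the search ranks entries.

```c
#include <stdint.h>

/* Hypothetical N=1 codebook search. Returns the index of the entry
 * with the smallest score and stores that score in *best_dist.
 * Assumed score: E[i] / 2 - <in, codebook[i]>. */
static int vq_best1(const int16_t *in, const int16_t *codebook,
                    int len, int entries, const int32_t *E,
                    int32_t *best_dist)
{
    int best = 0;
    int32_t min_dist = INT32_MAX;

    for (int i = 0; i < entries; i++) {
        int32_t corr = 0;
        for (int j = 0; j < len; j++)
            corr += (int32_t)in[j] * codebook[i * len + j];

        int32_t dist = (E[i] >> 1) - corr;
        if (dist < min_dist) {   /* no insertion sort needed for N=1 */
            min_dist = dist;
            best = i;
        }
    }
    *best_dist = min_dist;
    return best;
}
```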
> filter_mem16               4.14%
If you look at the Blackfin-optimised version, it actually uses a
different algorithm (that does 2 MACs/cycle) for this one (assuming you
place the arrays in two banks, which I don't do yet).
> inner_prod                 4.07%
Again, the Blackfin-optimised version does it with 2 MACs/cycle.
> iir_mem16                  2.75%
> qmf_synth                  2.32%
> lsp_to_lpc                 2.32%
> open_loop_nbest_pitch      1.41%
> compute_impulse_response   1.37%
> qmf_decomp                 1.28%
> lpc_to_lsp                 1.26%
> fir_mem16                  1.16%
> speex_bits_pack            1.07%
> speex_bits_unpack_unsigned 0.86%
> compute_rms16              0.79%
> 
> 4. I'm using the echo canceller + preprocessor,
> and I'd really like to improve performance here:
> - I would like to use ADI's FFT, but it's limited to powers of 2.
>   Is it safe to enable "Round ps_size down to the nearest power of two"
>   in the preprocessor?
It should be (unless I broke it!). Otherwise, nothing prevents you from
doing all that processing on power-of-two frames and then doing a bit of
buffering for the codec.
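
The buffering mentioned here can be sketched as a small sample FIFO: the power-of-two processing stage pushes its blocks in, and the codec pops frames of its own size out. All names (SampleFifo, fifo_push, fifo_pop) and the capacity are hypothetical; this is an illustration, not code from Speex.

```c
#include <stdint.h>
#include <string.h>

#define FIFO_CAPACITY 1024   /* assumed capacity, in samples */

typedef struct {
    int16_t data[FIFO_CAPACITY];
    int count;               /* number of samples currently stored */
} SampleFifo;

static void fifo_init(SampleFifo *f)
{
    f->count = 0;
}

/* Append n samples. Returns 0 on success, -1 if they do not fit. */
static int fifo_push(SampleFifo *f, const int16_t *src, int n)
{
    if (n < 0 || f->count + n > FIFO_CAPACITY)
        return -1;
    memcpy(f->data + f->count, src, (size_t)n * sizeof(int16_t));
    f->count += n;
    return 0;
}

/* Remove the n oldest samples. Returns 0 on success, -1 if fewer
 * than n samples are available (nothing is consumed in that case). */
static int fifo_pop(SampleFifo *f, int16_t *dst, int n)
{
    if (n < 0 || f->count < n)
        return -1;
    memcpy(dst, f->data, (size_t)n * sizeof(int16_t));
    memmove(f->data, f->data + n, (size_t)(f->count - n) * sizeof(int16_t));
    f->count -= n;
    return 0;
}
```

For example, with 128-sample processing blocks and 160-sample codec frames (illustrative sizes), the caller would push each processed block and then pop a frame whenever fifo_pop succeeds. A ring buffer would avoid the memmove, at the cost of slightly more index bookkeeping.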
> Can we do the same trick with the echo canceller for window_size?
If you want to use the echo canceller with a power-of-two FFT, the frame
size really needs to be a power of two.
> - Are there buffers that could be placed in scratch memory?
>   (I don't see any speex_scratch_alloc in there.)
I don't understand your question.

	Jean-Marc
