Stephane Lesage
2009-Jun-14 21:31 UTC
[Speex-dev] Resampler saturation, blackfin performance
> -----Original Message-----
> From: Jean-Marc Valin [mailto:jean-marc.valin at usherbrooke.ca]
> Sent: Sunday, 14 June 2009 20:46
> To: Stephane Lesage
> Cc: speex-dev at xiph.org
> Subject: Re: [Speex-dev] Resampler saturation
>
> Just to make sure I understand, the two patches you sent are
> two different ways to fix the problem, with the only
> difference being that resample.patch converts the "unrolled
> by four" loop into a plain one that's easier on DSPs, right?

Yes, exactly, plus a little explanation in the comments.
I really have no idea what the performance difference is on x86, but I think gcc/msvc can unroll the loop themselves.
Up to you. Either way, I can always use OVERRIDE_INNER_PRODUCT_SINGLE.

Talking about performance (still using the generic version with the VDSP compiler):

1. I got a pretty good boost by using a scratch buffer in SRAM.
2. Wideband encode + decode takes 79.1 + 7.2 MIPS on my BF536 at 400/133 MHz.
3. The profiler says:
   vq_nbest                     33.05%
   vq_nbest_sign                11.12%
   filter_mem16                  4.14%
   inner_prod                    4.07%
   iir_mem16                     2.75%
   qmf_synth                     2.32%
   lsp_to_lpc                    2.32%
   open_loop_nbest_pitch         1.41%
   compute_impulse_response      1.37%
   qmf_decomp                    1.28%
   lpc_to_lsp                    1.26%
   fir_mem16                     1.16%
   speex_bits_pack               1.07%
   speex_bits_unpack_unsigned    0.86%
   compute_rms16                 0.79%
4. I'm using the echo canceller + preprocessor, and I'd really like to improve performance here:
   - I would like to use ADI's FFT, but it's limited to powers of 2. Is it safe to enable "Round ps_size down to the nearest power of two" in the preprocessor? Can we do the same trick with the echo canceller for window_size?
   - Are there buffers that could be placed in scratch memory? (I don't see any speex_scratch_alloc in there.)

-- 
Stéphane LESAGE
ATEIS International
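For readers following item 4: rounding a size down to the nearest power of two, which is what a power-of-two-only FFT needs, can be sketched as below. This is only an illustration under stated assumptions, not the code from the Speex preprocessor; `round_down_pow2` is a hypothetical helper name that does not exist in the library.

```c
#include <assert.h>

/* Hypothetical helper (not part of the Speex API):
 * returns the largest power of two that is <= n, for n >= 1. */
static int round_down_pow2(int n)
{
    int p = 1;
    assert(n >= 1);
    /* Keep doubling p while the next power of two still fits in n.
     * Comparing against n / 2 avoids overflowing p for large n. */
    while (p <= n / 2)
        p *= 2;
    return p;
}
```

With a 320-sample frame (20 ms of wideband audio) this gives 256, and with a 160-sample frame it gives 128, so a noticeable part of the frame falls outside the power-of-two size and has to be handled by whatever code consumes the rounded value.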
Jean-Marc Valin
2009-Jun-14 23:16 UTC
[Speex-dev] Resampler saturation, blackfin performance
Stephane Lesage wrote:
>> -----Original Message-----
>> From: Jean-Marc Valin [mailto:jean-marc.valin at usherbrooke.ca]
>> Sent: Sunday, 14 June 2009 20:46
>> To: Stephane Lesage
>> Cc: speex-dev at xiph.org
>> Subject: Re: [Speex-dev] Resampler saturation
>>
>> Just to make sure I understand, the two patches you sent are
>> two different ways to fix the problem, with the only
>> difference being that resample.patch converts the "unrolled
>> by four" loop into a plain one that's easier on DSPs, right?
>
> Yes exactly, plus a little explanation in comments.
> I really have no idea of the performance difference on x86. But I think gcc/msvc can unroll.
> Up to you. Anyway I can OVERRIDE_INNER_PRODUCT_SINGLE.

OK, I guess I'll apply resample.patch, considering that we already have an SSE version (the split in four was for SSE anyway).

> Talking about performance (still using generic version with VDSP compiler):

You'll likely see a difference just by optimising the MULT16_32_Q15() macro.

> 1. I got a pretty good boost by using a scratch buffer in SRAM.

Normally, all the data should end up in L1 cache, so it's surprising that you see a difference from using SRAM. Are you sure your cache isn't configured as write-through?

> 2. Wideband Encode+Decode takes 79.1 + 7.2 MIPS on my BF536 400/133 MHz
> 3. Profiler says:
> vq_nbest 33.05%
> vq_nbest_sign 11.12%

You should be able to get a big boost in performance just by optimising the N=1 case for vq_nbest() and vq_nbest_sign().

> filter_mem16 4.14%

If you look at the Blackfin-optimised version, it actually uses a different algorithm (one that does 2 MACs/cycle) for this one (assuming you place the arrays in two banks, which I don't do yet).

> inner_prod 4.07%

Again, the Blackfin-optimised version does it with 2 MACs/cycle.

> iir_mem16 2.75%
> qmf_synth 2.32%
> lsp_to_lpc 2.32%
> open_loop_nbest_pitch 1.41%
> compute_impulse_response 1.37%
> qmf_decomp 1.28%
> lpc_to_lsp 1.26%
> fir_mem16 1.16%
> speex_bits_pack 1.07%
> speex_bits_unpack_unsigned 0.86%
> compute_rms16 0.79%
>
> 4. I'm using the echo-canceller + preprocessor,
> I'd really like to improve performance here:
> - I would like to use ADI's FFT, but it's limited to powers of 2,
> is it safe to enable "Round ps_size down to the nearest power of two" in the preproc ?

It should be (unless I broke it!). Otherwise, nothing prevents you from doing all that processing on power-of-two frames and then doing a bit of buffering for the codec.

> can we do the same trick with the echo-canceller for window_size ?

If you want to use the echo canceller with a power-of-two FFT, the frame size really needs to be a power of two.

> - are there buffers that could be placed in scratch memory ?
> (I don't see any speex_scratch_alloc in there)

I don't understand your question.

Jean-Marc
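The N=1 special case suggested above for vq_nbest() can be illustrated with a small sketch. It is not the library's code and does not reproduce its signature: `vq_best1` is a hypothetical name, the types are plain `int16_t`/`int32_t` rather than Speex's own, and it assumes that `E[i]` holds the non-negative energy of codeword `i` and that values are small enough for the 32-bit accumulator not to overflow. When only the single best entry is wanted, the sorted insertion into an N-best list disappears and the search reduces to tracking one minimum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical N=1 codebook search (not the Speex API).
 * Returns the index of the codeword minimising E[i]/2 - <in, c_i>,
 * which ranks entries the same way as the squared distance when
 * E[i] is the energy of codeword i. The winning distance is
 * written to *best_dist. */
static int vq_best1(const int16_t *in, const int16_t *codebook,
                    int len, int entries, const int32_t *E,
                    int32_t *best_dist)
{
    int i, j, best = 0;
    int32_t min_dist = 0;

    assert(len > 0 && entries > 0);
    for (i = 0; i < entries; i++) {
        int32_t dot = 0;
        int32_t dist;

        /* Correlation between the input and codeword i. */
        for (j = 0; j < len; j++)
            dot += (int32_t)in[j] * codebook[i * len + j];

        /* E[i] is assumed non-negative, so the shift is a plain halving. */
        dist = (E[i] >> 1) - dot;

        /* Single comparison instead of maintaining a sorted N-best list. */
        if (i == 0 || dist < min_dist) {
            min_dist = dist;
            best = i;
        }
    }
    *best_dist = min_dist;
    return best;
}
```

On a DSP, the inner loop is the part that would map onto multiply-accumulate hardware; the sketch only shows the control-flow simplification for N=1.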