Jean-Marc Valin wrote:>>I have the encoder and decoder running now and have verified that the >>encoder is bit-exact wrt to the fixed-point code running on x86 for the >>same 30-second audio sample. Encode and decode together run in >>real-time for 8KHz data, complexity=3, on 120MHz C5509 when code and >>data are all in on-chip SRAM. I have not tested the wideband codec yet. >> >> > >Cool! Just curious, how much of the DSP does it take to do that? > >It took all of the DSP at that speed. The C5509 will go up to 200MHz, so I guess it's 2/3 of the DSP. We're going to profile the code to see if we can speed it up.> >I looked at the patch. Seems mostly OK, though there are some minor >issues. First, I don't think it's right to say spx_int_32_t=long except >for alpha. This will likely cause problems on Itanium, Power64, SPARC64 >and probably x86-64. I suggest either an explicit check for C55xx (and >later others) or a test in configure.ac. > >Also, I'm not sure I get your use of "+#ifdef UNUSED" in the SpeexBits >code. Finally, there's a few changes which I think should be removed, >like for speex_alloc, speex_warning, ... same as the "#if 0" for >wideband and other debug stuff. Could you clean the patch and send it >back to me so I can apply it? > >I'll clean up the patch as suggested. I meant to remove all the extraneous changes but I guess I missed some. A test in configure.ac for sizeof(long) makes more sense than the ifdef I put in there. Jamey
> It took all of the DSP at that speed. The C5509 will go up to 200MHz, > so I guess it's 2/3 of the DSP. We're going to profile the code to see > if we can speed it up.I would think it's possible to make it much faster. The first thing I'd look is the fixed-point macros, like MUL16_16 and make sure they don't do something much more complicated than what they're supposed to do (e.g. compute on 32 bit operands), because there's no way to tell C to do things like "multiply two 16-bit values that will have a result in 32 bits". Also, Speex uses lots of 16x32 multiplications, which your DSP may do more efficiently in other ways. Of course, after that, I can also point out several routines that would benefit the most from optimizations (mainly the filters). Jean-Marc
Jean-Marc Valin wrote:>>It took all of the DSP at that speed. The C5509 will go up to 200MHz, >>so I guess it's 2/3 of the DSP. We're going to profile the code to see >>if we can speed it up. >> >> > >I would think it's possible to make it much faster. The first thing I'd >look is the fixed-point macros, like MUL16_16 and make sure they don't >do something much more complicated than what they're supposed to do >(e.g. compute on 32 bit operands), because there's no way to tell C to >do things like "multiply two 16-bit values that will have a result in 32 >bits". >I have most of these right. TI's C compiler generates a 16x16=>32 multiply from the following idiom: int32_t z = (int32_t)(int16_t)x * (int32_t)(int16_t)y; But I see there are still some calls to the 32x32 multiply routine in one of the filters. It looks like the 16x32 multiplies to which you refer below.>Also, Speex uses lots of 16x32 multiplications, which your DSP >may do more efficiently in other ways. Of course, after that, I can also >point out several routines that would benefit the most from >optimizations (mainly the filters). > >That would be helpful, if there are some key ones to start on. Jamey
Jean-Marc Valin wrote:>Well, I guess the first thing to look is whether your DSP can actually >do either 16x32=>48 or 16x32=>32 (keeping the MSBs), which is what the >smulwb does on ARM. If that's the case, you can gain a lot of speed (use >one instruction for 16x32 instead of three). Otherwise, replacing the >32x32 multiplies by 16x16 is probably a good thing. > >One thing I've noticed so far in the filter_mem2 code is the calls to SATURATE(x, 805306368). 805306368 is 0x30000000. I was expecting that to be on a bit boundary, say 0x3fffffff? In which case the arithmetic saturation logic could be used. Jamey
> One thing I've noticed so far in the filter_mem2 code is the calls to > SATURATE(x, 805306368). 805306368 is 0x30000000. I was expecting that > to be on a bit boundary, say 0x3fffffff? In which case the arithmetic > saturation logic could be used.I don't think it would make that big of a difference, since the saturation is outside of the inner loop. If it's that critical, you could probably remove the saturation completely and just make sure your signals are scaled properly (i.e. not too close to saturation). Jean-Marc