Jean-Marc Valin
2006-Apr-22 06:56 UTC
[Speex-dev] Major internal changes, TI DSP build change
> >I fixed it in svn. Could you check that?
>
> Now all platforms match again. Note that the measured SNR for this test
> sample is lower than with the broken code (10.87 vs 11.10), but of course
> this is no way to judge the real quality.

SNR, especially on a single sample, can be very misleading. Yet, could you just check that the DSP results match what you get on a PC?

> >Does the C55 have a 32x16 multiplier or do you mean it handles my
> >emulation of it well?
>
> It has two ALUs with 17x17 bit MACs, and it has an instruction that does
> this:
> ACy = M40(rnd((ACx >> #16) + (uns(Xmem) * uns(Ymem))))
>
> I never quite understood this, so I went off and looked at the manuals. It
> can multiply the low half in one cycle, then shift and add it to the high
> half in a second cycle. And, in a tight loop the parallel ALUs would allow
> one 32x16 multiply per cycle.

Just one thing I'd like to understand. Did you do some tricks and/or assembly to implement the MULT16_32_Q* routines with these instructions, or does the compiler figure them out by itself?

> The C54x cannot do this, and uses library calls for 32x16 multiplies.

Why is that? By default all the 32x16 multiplies are computed using only 16x16 multiplies (see fixed_generic.h).

> The
> changes that you have made since 1.1.8 are most dramatic for the 54x, which
> dropped from 184 (unusable in real time, the fastest parts are 160 MHz) to
> 79 MIPS. The C55x dropped from 41.5 to 29.4 MIPS (mixed 16/32 bit
> capability), and the C6x dropped slightly from 36 to 34.5 MIPS (32bit
> machine).

Glad it makes such a difference. I'm just surprised that the C6x complexity is that high.

Jean-Marc
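For readers following the remark about fixed_generic.h, here is a minimal sketch of how a 32x16 Q15 multiply can be built from 16x16 multiplies only. It is an illustration of the general technique, not a copy of the Speex source: the function names (`mult16_16`, `mult16_16su`, `mult16_32_q15`, `self_check`) are made up for this example, and it assumes that `>>` on a negative signed value is an arithmetic shift, which holds on common compilers but is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 16x16 -> 32 multiply, the primitive a 16-bit DSP provides. */
static int32_t mult16_16(int16_t a, int16_t b) {
    return (int32_t)a * (int32_t)b;
}

/* Signed 16 x unsigned 16 -> 32 multiply, used for the low half of b. */
static int32_t mult16_16su(int16_t a, uint16_t b) {
    return (int32_t)a * (int32_t)b;
}

/* (a * b) >> 15 using two 16x16 multiplies.
 * b is split into a signed high half and an unsigned low half:
 *   a*b = a*hi*65536 + a*lo
 * so (a*b) >> 15 = a*hi*2 + ((a*lo) >> 15).
 * The caller must keep the true result within 32 bits. */
static int32_t mult16_32_q15(int16_t a, int32_t b) {
    int16_t hi = (int16_t)(b >> 16);      /* assumes arithmetic shift */
    uint16_t lo = (uint16_t)(b & 0xffff);
    return mult16_16(a, hi) * 2 + (mult16_16su(a, lo) >> 15);
}

/* Compare against a 64-bit reference on a few hand-picked values. */
static void self_check(void) {
    static const int16_t as[] = { -32768, -3, 0, 1, 16384, 32767 };
    static const int32_t bs[] = { -100000, -1, 0, 5, 65536, 1000000000 };
    for (unsigned i = 0; i < sizeof as / sizeof as[0]; i++) {
        for (unsigned j = 0; j < sizeof bs / sizeof bs[0]; j++) {
            int32_t ref = (int32_t)(((int64_t)as[i] * bs[j]) >> 15);
            assert(mult16_32_q15(as[i], bs[j]) == ref);
        }
    }
}
```

The split costs two multiplies, one add and two shifts, which is in line with the instruction counts discussed later in the thread.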
Jim Crichton
2006-Apr-22 19:50 UTC
[Speex-dev] Major internal changes, TI DSP build change
Jean-Marc,

>> >I fixed it in svn. Could you check that?
>>
>> Now all platforms match again. Note that the measured SNR for this test
>> sample is lower than with the broken code (10.87 vs 11.10), but of course
>> this is no way to judge the real quality.
>
> SNR, especially on a single sample, can be very misleading. Yet, could
> you just check that the DSP results match what you get on a PC?

I do not have a build environment for a PC. I have been using the 6-second test file male.wav from the Speex site for my simulations, in case someone else wants to run the audio through the encoder and decoder at 8kbps, complexity 1. I might be able to get a coworker to do this, but not any time soon.

>> >Does the C55 have a 32x16 multiplier or do you mean it handles my
>> >emulation of it well?
>>
>> It has two ALUs with 17x17 bit MACs, and it has an instruction that does
>> this:
>> ACy = M40(rnd((ACx >> #16) + (uns(Xmem) * uns(Ymem))))
>>
>> I never quite understood this, so I went off and looked at the manuals.
>> It can multiply the low half in one cycle, then shift and add it to the
>> high half in a second cycle. And, in a tight loop the parallel ALUs would
>> allow one 32x16 multiply per cycle.
>
> Just one thing I'd like to understand. Did you do some tricks and/or
> assembly to implement the MULT16_32_Q* routines with these instructions
> or does the compiler figure them out by itself?

No, I have done no assembly work on any of these DSPs. It has been a few years since I did assembly work on any DSP, and it does not look like I will need to for my applications. I just found the above instruction in the instruction set reference manual, and it seems perfect for 16x32 multiplies. When I look at the assembler output for filter.c, I do not see this instruction used, probably because there is always some shift in the result (like MULT16_32_Q15, which takes 6 instructions to implement: two multiplies, two adds, a shift, and a store). So, never mind.

>> The C54x cannot do this, and uses library calls for 32x16 multiplies.
>
> Why is that? By default all the 32x16 multiplies are computed using only
> 16x16 multiplies (see fixed_generic.h).

Once again, I spoke too soon. I saw the library calls when I first tested the C54x last year, but I do not see them now. I am using a later version of the TI compiler, and there could be some different compile options.

>> The changes that you have made since 1.1.8 are most dramatic for the 54x,
>> which dropped from 184 (unusable in real time, the fastest parts are
>> 160 MHz) to 79 MIPS. The C55x dropped from 41.5 to 29.4 MIPS (mixed
>> 16/32 bit capability), and the C6x dropped slightly from 36 to 34.5 MIPS
>> (32bit machine).
>
> Glad it makes such a difference. I'm just surprised that the C6x
> complexity is that high.

There was a post from Jerry Trantow on 4-Feb saying that he had cut the C6x MIPS about in half with some assembly optimization (do you know if he planned to submit these?). Because this is a very parallel machine, it is not an assembly language for the faint of heart.

- Jim
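The two-cycle sequence described above (multiply the low half, then shift and add to the high-half product) can be modelled in plain C as follows. This is only a sketch of the arithmetic as described in this message, not a model of the actual C55x instruction: it ignores the `rnd` and `M40` parts of the formula, uses a 64-bit integer as a stand-in for the accumulator, and assumes an arithmetic right shift on negative values. The function names are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) >> 16 in two multiply steps, with a wide accumulator. */
static int32_t mult16_32_q16_two_step(int16_t a, int32_t b) {
    uint16_t lo = (uint16_t)(b & 0xffff);  /* unsigned low half of b */
    int16_t hi = (int16_t)(b >> 16);       /* signed high half of b */

    /* Step 1: accumulator = a * low half. */
    int64_t acc = (int64_t)a * lo;

    /* Step 2: shift the accumulator down by 16 and add a * high half. */
    acc = (acc >> 16) + (int64_t)a * hi;

    /* |a*b| >> 16 is at most 2^30, so the result always fits in 32 bits. */
    return (int32_t)acc;
}

/* Compare against a direct 64-bit product on a few values. */
static void check_two_step(void) {
    static const int16_t as[] = { -32768, -1, 0, 7, 12345, 32767 };
    static const int32_t bs[] = { INT32_MIN, -65537, -1, 0, 65536, INT32_MAX };
    for (unsigned i = 0; i < sizeof as / sizeof as[0]; i++) {
        for (unsigned j = 0; j < sizeof bs / sizeof bs[0]; j++) {
            int32_t ref = (int32_t)(((int64_t)as[i] * bs[j]) >> 16);
            assert(mult16_32_q16_two_step(as[i], bs[j]) == ref);
        }
    }
}
```

Note that this yields a fixed shift of 16, which is consistent with the observation that macros needing a different shift (such as Q15) do not map directly onto the instruction.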
Jean-Marc Valin
2006-Apr-23 16:33 UTC
[Speex-dev] Major internal changes, TI DSP build change
> I do not have a build environment for a PC. I have been using the 6-second
> test file male.wav from the Speex site for my simulations, if someone else
> wants to run the audio through the encoder and decoder at 8kbps, complexity
> 1. I might be able to get a coworker to do this, but not any time soon.

Could you send me (don't post on the list) the male.wav as encoded at quality 8, complexity 3, and the decoded version?

> No, I have done no assembly work on any of these DSPs. It has been a few
> years since I did assembly work on any DSP, and it does not look like I will
> need to for my applications. I just found the above instruction in the
> instruction set reference manual, and it seems perfect for 16x32 multiplies.
> When I look at the assembler output for filter.c, I do not see this
> instruction used, probably because there is always some shift in the result
> (like MULT16_32_Q15, which takes 6 instructions to implement: two
> multiplies, two adds, a shift, and a store). So, never mind.

If the TI compiler supports gcc-like inline assembly (i.e. with constraints), then it would be possible to get a performance boost simply by defining MULT16_32_Q15 and the like to use these instructions. See fixed_arm5e.h to see what I mean.

> Once again, I spoke too soon. I saw the library calls when I first tested
> the C54x last year, but I do not see them now. I am using a later version
> of the TI compiler, and there could be some different compile options.

The definition of MULT16_16 was changed at some point to make it clear that the operands are 16-bit. Seems like it solved the problem.

> There was a post from Jerry Trantow on 4-Feb that he had cut the C6x MIPS
> about in half with some assembly optimization (do you know if he planned to
> submit these?).

Don't know. Jerry, any news?

> Because this is a very parallel machine, it is not an
> assembly language for the faint of heart.

Well, I'll be attending a workshop on C6x programming, so we'll see how ugly (or not) it is. The Blackfin has a similar long instruction word architecture and the assembly was rather easy to use.

Jean-Marc
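As a rough illustration of the override idea mentioned above (redefining MULT16_32_Q15 for a specific platform), the pattern could look like the sketch below. Everything except the macro name is an assumption: `USE_FAST_MULT`, `mult16_32_q15_fast`, `scale` and `check_scale` are hypothetical, and the "fast" version uses a plain 64-bit multiply as a placeholder where a real port would put inline assembly or a compiler intrinsic. The actual contents of fixed_arm5e.h are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Generic definition: 32x16 Q15 multiply from two 16x16 multiplies.
 * Assumes arithmetic right shift on negative values. */
#define MULT16_32_Q15(a, b) \
    ((int32_t)(((int32_t)(a) * (int32_t)((b) >> 16)) * 2 + \
               (((int32_t)(a) * (int32_t)((b) & 0xffff)) >> 15)))

/* Hypothetical build switch; a real port would key off the target. */
#define USE_FAST_MULT 1

#ifdef USE_FAST_MULT
/* Platform override: replace the generic macro with a faster routine.
 * The 64-bit multiply below is only a stand-in for a hardware
 * instruction reached through inline assembly or an intrinsic. */
#undef MULT16_32_Q15
static inline int32_t mult16_32_q15_fast(int16_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 15);
}
#define MULT16_32_Q15(a, b) mult16_32_q15_fast((a), (b))
#endif

/* Calling code is unchanged whichever definition is active. */
static int32_t scale(int16_t gain_q15, int32_t x) {
    return MULT16_32_Q15(gain_q15, x);
}

/* Both definitions should agree on these inputs. */
static void check_scale(void) {
    assert(scale(16384, 1000) == 500);   /* 0.5 * 1000 */
    assert(scale(-16384, 3) == -2);      /* floor(-1.5) */
}
```

The point of the pattern is that the codec source keeps calling the same macro, so a per-platform header can swap in a faster implementation without touching filter.c or other callers.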