Hello Vorbis folks, I'm one of the FFTW authors (www.fftw.org), and a few days ago I was playing with our codelet generator for fun and modified it to spit out hard-coded MDCTs of small sizes. The code (at jdj.mit.edu/~stevenj/mdct_128nr.c) for 256 samples (128 outputs) seems to be almost twice as fast as the Vorbis MDCT code for that size on my 2.2GHz P-IV (gcc 3.2.2 and flags "-O1 -mcpu=pentium4 -fomit-frame-pointer -fstrict-aliasing -malign-double"), in single precision. I guess the other size that is important to you is 2048 samples; for that size you definitely don't want a hard-coded routine (the linked 256-sample codelet is just inside the crossover-point for hard-coding, in our experience). There, you probably want a recursive/looping algorithm, albeit probably only two stages (e.g. one radix-32 step). FFTW 3 contains code for a DCT-IV of arbitrary sizes (it works by pre/post-processing a real-input FFT of half the size) that can be fairly trivially adapted to the MDCT (which is just a DCT-IV with some input aliasing). I tried that this evening (see attached mdct.[ch]), but it's only about 30% faster than the Vorbis MDCT for 2048 samples, although the advantage increases for larger sizes (e.g. 60% faster for 128k samples). It could be made substantially more efficient by generating special purpose codelets to avoid separate pre/post-processing passes...we know our DCT-IV code is not optimal. (It also doesn't use SIMD.) The above two codes compute an unwindowed MDCT, and give the same results as the one in your mdct.c, but I can also easily make one with a window function built in (to avoid the separate pass/loads). I'm not sure how much you care about MDCT performance (what fraction of CPU time is it?), but I thought you might find this interesting anyway. Regards, Steven G. Johnson PS. Hi Monty! You probably don't remember me, but I knew you from ESG at MIT when we were undergrads (I think you were a year ahead of me?), around '92. I remember you talking back then about founding a company around some audio processing software you were working on (some kind of programmable filtering tool? I forget). PPS. I wrote up a few notes on the MDCT for Wikipedia (http://www.wikipedia.org/wiki/Modified_discrete_cosine_transform), with some help from Frank Klemm of LAME regarding their use in various codecs. Maybe someone here will find the page useful (or have corrections). PPPS. I'll be out of town, probably with no email, until Wednesday. -------------- next part -------------- A non-text attachment was scrubbed... Name: mdct.c Type: text/x-csrc Size: 7280 bytes Desc: mdct.c Url : http://lists.xiph.org/pipermail/vorbis-dev/attachments/20030531/cf90b139/mdct-0001.c -------------- next part -------------- A non-text attachment was scrubbed... Name: mdct.h Type: text/x-chdr Size: 1276 bytes Desc: mdct.h Url : http://lists.xiph.org/pipermail/vorbis-dev/attachments/20030531/cf90b139/mdct-0001.h
Hi Steven, I can't speak for Xiph, obviously, but as a supporter of vorbis and their work I think it's wonderful that you are bringing this to the community. It sounds initally as though your generator could have a direct effect on improving vorbis usability, something that will aid its adoption. Thanks for bringing this to the party. M. --- "Steven G. Johnson" <stevenj@ab-initio.mit.edu> wrote:> Hello Vorbis folks...*snip* __________________________________ Do you Yahoo!? Yahoo! Calendar - Free online calendar with sync to Outlook(TM). http://calendar.yahoo.com --- >8 ---- List archives: http://www.xiph.org/archives/ Ogg project homepage: http://www.xiph.org/ogg/ To unsubscribe from this list, send a message to 'vorbis-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
On Sunday 01 June 2003 13:43, Steven G. Johnson wrote:> Hello Vorbis folks, > > I'm one of the FFTW authors (www.fftw.org), and a few days ago I was > playing with our codelet generator for fun and modified it to spit out > hard-coded MDCTs of small sizes. The code (at > jdj.mit.edu/~stevenj/mdct_128nr.c) for 256 samples (128 outputs) seems to > be almost twice as fast as the Vorbis MDCT code for that size on my 2.2GHz > P-IV (gcc 3.2.2 and flags "-O1 -mcpu=pentium4 -fomit-frame-pointer > -fstrict-aliasing -malign-double"), in single precision. > > I guess the other size that is important to you is 2048 samples; for that > size you definitely don't want a hard-coded routine (the linked 256-sample > codelet is just inside the crossover-point for hard-coding, in our > experience). There, you probably want a recursive/looping algorithm, > albeit probably only two stages (e.g. one radix-32 step). FFTW 3 contains > code for a DCT-IV of arbitrary sizes (it works by pre/post-processing a > real-input FFT of half the size) that can be fairly trivially adapted to > the MDCT (which is just a DCT-IV with some input aliasing). I tried that > this evening (see attached mdct.[ch]), but it's only about 30% faster than > the Vorbis MDCT for 2048 samples, although the advantage increases for > larger sizes (e.g. 60% faster for 128k samples). It could be made > substantially more efficient by generating special purpose codelets to > avoid separate pre/post-processing passes...we know our DCT-IV code is not > optimal. (It also doesn't use SIMD.) > > The above two codes compute an unwindowed MDCT, and give the same results > as the one in your mdct.c, but I can also easily make one with a window > function built in (to avoid the separate pass/loads). > > I'm not sure how much you care about MDCT performance (what fraction of > CPU time is it?), but I thought you might find this interesting anyway.Steven, Since nobody else has answered you, I thought I should say something. MDCT performance is insignificant on encode, but takes a substantial (not sure what percentage, but it's non-trivial) amount of time on decode. However, decode is not really a performance problem on 'desktop-class' cpus - it only really matters a lot for embedded use (either things like consoles - some people are quite interested in increasing performance on the PS2, for example - or portable players, though in the latter case floating point hardware is an unheard-of luxury, so this isn't directly relevent here). We can't use any of your code as it stands anyway (at least the one you attached, I haven't checked FFTW generally) because it's GPLed. Obvioulsy relicensing FFTW to xiph.org's bsd-like license isn't an option, but I'm not sure about what the licensing status of the generated code (as, for example, the code you gave a link to: jdj.mit.edu/~stevenj/mdct_128nr.c) is. Would it be usable for us? Mike --- >8 ---- List archives: http://www.xiph.org/archives/ Ogg project homepage: http://www.xiph.org/ogg/ To unsubscribe from this list, send a message to 'vorbis-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
> From: Michael Smith [mailto:msmith@xiph.org] > To: vorbis-dev@xiph.org > Subject: Re: [vorbis-dev] faster mdct's > > On Sunday 01 June 2003 13:43, Steven G. Johnson wrote: > > Hello Vorbis folks, > > > > I'm one of the FFTW authors (www.fftw.org), and a few days ago I was > > playing with our codelet generator for fun and modified it > > to spit out hard-coded MDCTs of small sizes. The code (at > > jdj.mit.edu/~stevenj/mdct_128nr.c) for 256 samples (128 > > outputs) seems to be almost twice as fast as the Vorbis MDCT code > > for that size on my 2.2GHz > > P-IV (gcc 3.2.2 and flags "-O1 -mcpu=pentium4 -fomit-frame-pointer > > -fstrict-aliasing -malign-double"), in single precision.....> > I'm not sure how much you care about MDCT performance (what > > fraction of CPU time is it?), but I thought you might find > > this interesting anyway.> Steven, > > Since nobody else has answered you, I thought I should say something. > > MDCT performance is insignificant on encode, but takes a > substantial (not sure what percentage, but it's non-trivial) > amount of time on decode. However, decode is not really a > performance problem on 'desktop-class' cpus - it only > really matters a lot for embedded use (either things like > consoles - some people are quite interested in increasing > performance on the PS2, for example - or portable players, > though in the latter case floating point hardware is > an unheard-of luxury, so this isn't directly relevent here).Still, it does have an application if the code generator could be modified to use integer-only instructions on other architectures (e.g ARM, or a DSP like 56k). Saving 50% in the MDCT would translate to having something like (ball park) 20% more battery life on portables, for example. My only concern there is the size of code generated - portables tend to have little memory available both for code and data. Block sizes 256 and 2048 are the most common but not the only ones encountered, so technically you would need an MDCT for all sizes from 64 to 8192. It would be interesting if you could avoid this by generating the code on the fly, but then the code generator would have to be smaller than the total of all MDCTs :) So, how hard would it be to get your code generator integer-only and non-x86? It would certainly save a hell of a lot of hand optimisation. - John Ripley --- >8 ---- List archives: http://www.xiph.org/archives/ Ogg project homepage: http://www.xiph.org/ogg/ To unsubscribe from this list, send a message to 'vorbis-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
Michael Smith wrote:> We can't use any of your code as it stands anyway (at least the one > you attached, I haven't checked FFTW generally) because it's GPLed. > Obvioulsy relicensing FFTW to xiph.org's bsd-like license isn't an > option, but I'm not sure about what the licensing status of the > generated code (as, for example, the code you gave a link to: > jdj.mit.edu/~stevenj/mdct_128nr.c) is. Would it be usable for us?>From MIT's perspective, the generated code ("codelets") is notcopyrighted. (One can put a compilation copyright on a collection of generated codelets, I think, but if someone regenerates one then they are unencumbered.) So, you can do whatever you want with mdct_128nr.c. Regarding the other code that I attached, for arbitrary sizes, it's possible that we could arrange a waiver to allow that particular file to be licensed under a BSD license. The version I attached links to the rest of FFTW, of course, but all it really needs for a single size like 2048 is a couple of codelets; it could easily be hacked to call those (generated, non-copyrighted) codelets by hand. Also, as I mentioned, if you really care about performance for n=2048, the right thing is to modify the generator to spit out special-purpose recursive/looping MDCT code directly, so as to avoid the pre-/post-processing passes. This would avoid the copyright issue, since you'd be working almost entirely with generated code. However, I don't have time to do this work with the generator right now, myself. John Ripley wrote:> Michael Smith wrote: > > MDCT performance is insignificant on encode, but takes a > > substantial (not sure what percentage, but it's non-trivial) > > amount of time on decode. However, decode is not really a > > performance problem on 'desktop-class' cpus - it only > > really matters a lot for embedded use > > Still, it does have an application if the code generator could be > modified to use integer-only instructions on other architectures (e.g > ARM, or a DSP like 56k). Saving 50% in the MDCT would translate to > having something like (ball park) 20% more battery life on portables, > for example. My only concern there is the size of code generated - > portables tend to have little memory available both for code and data.We've talked about modifying the generator to spit out integer code; basically this means adding some extra information to the symbolic nodes in the generator to keep track of the decimal-point shift, and shouldn't be terribly difficult. (At least to do straightforward shifting, like the your current integer MDCT, from what I recall when I looked at it. You can save a few more bits if you make some assumptions about the statistics of the signal, but I guess it's not worth it if you have 32 bits to play with.) You're right that the code-size tradeoff may well be different on an ARM or whatever runs these portables. To handle n=2048, however, one would ideally want to modify the generator to support the recursive case as I mentioned above, so you could do this for n=256 as well in order to use smaller codelets. I'm a little too hosed this month to attempt an integer transplant on the generator right now, but maybe my co-author Matteo is interested (he wrote the generator, anyway, so he can hack it much more easily). (Not that being hosed has prevented me from goofing off with FFTW before. =)> Block sizes 256 and 2048 are the most common but not the only ones > encountered, so technically you would need an MDCT for all sizes from > 64 to 8192. It would be interesting if you could avoid this by > generating the code on the fly, but then the code generator would have > to be smaller than the total of all MDCTs :)The code generator is probably too slow for this, especially on an embedded machine. =) My thinking was that you could use generated/optimized code for the important common cases, and stick with your generic code for other cases. Of course, once the generator is modified to handle n > 256 efficiently (by generating the appropriate recursive/looped cases), it's not a problem to generate all 8 sizes from 64 to 8192.> So, how hard would it be to get your code generator integer-onlyNot too difficult; no more than a few days, I would expect.> and non-x86? It would certainly save a hell of a lot of hand optimisation.It's already non-x86, although we've talked about generating assembly instead of C. (It turns out that for FFT-like algorithms one can find an asymptotically optimal, and pretty darn good in practice, schedule regardless of the number of registers, so the C code is pretty portable. Then, of course, you have to trick the compiler into not screwing it up, which is becoming increasingly difficult. ^_^ ) Steven PS. Please cc me; I can read the mailing-list archives online, but it's easier to respond to email. --- >8 ---- List archives: http://www.xiph.org/archives/ Ogg project homepage: http://www.xiph.org/ogg/ To unsubscribe from this list, send a message to 'vorbis-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.