Looking through the archives I have seen talk of making CPU-specific optimisations for Vorbis, a la MMX/3DNow!/SSE. The feeling I gather is to wait until something is working well in C before committing to any kind of specific optimisation.

What if the often used and needed DSP functions were identified and standardised DSP functionality were written for Vorbis? This would separate the basically non-changing core signal processing functions (IIR, FIR and DCT/FFT) and allow them to be optimised with MMX and so on without fear of upsetting the other code that's in a state of flux at the moment. Wouldn't this be nice? ;)

If anyone is interested in heading down this path with me please let me know!

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
Jason Hecker wrote:
> Looking through the archives I have seen talk of making CPU-specific optimisations for Vorbis, a la MMX/3DNow!/SSE. The feeling I gather is to wait until something is working well in C before committing to any kind of specific optimisation. What if the often used and needed DSP functions were identified and standardised DSP functionality were written for Vorbis? This would separate the basically non-changing core signal processing functions (IIR, FIR and DCT/FFT) and allow them to be optimised with MMX and so on without fear of upsetting the other code that's in a state of flux at the moment. Wouldn't this be nice? ;)
>
> If anyone is interested in heading down this path with me please let me know!

Sure I'm interested :-)

For optimizing the current all-C version, I hand-unrolled a lot of the critical loops; this will come in handy for doing a vector-op version. Note, however, that the filters and fast transforms are not the most time-critical parts.

Ciao,

Segher
> (but damned fast!) Does OggVorbis go off and do everything using single precision floating point? How does this affect the truncation of viable bits after various multiplications (ie accuracy after a very big number is multiplied by a very small number)? I attended a DSP workshop last year by

Actually, *adding* many small numbers to a large number is what endangers you most in floating point, while multiplication, in general, preserves your significant digits very well. In fact, the relative accuracy of a floating point multiplication is independent of the relative order of magnitude of the operands (except in limit cases such as overflow and underflow). This is one of the advantages of floating point over fixed point (which is not to say fixed point is never appropriate).

Cheers,
Tim W.
____________________________________________________________
Timothy Wayper <timmy@wunderbear.com>
Wunderbear Software <http://www.wunderbear.com>
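A small sketch of the two cases, assuming IEEE 754 single and double precision with round-to-nearest (the function names `add_f32` and `mul_rel_error` are made up for this example). The first function shows a small addend being lost entirely; the second measures the relative error of a single precision product against a double precision reference:

```c
#include <assert.h>
#include <float.h>

/* Adds `small` to `big` in single precision. The volatile store forces
 * the result to be rounded to float even if the compiler keeps
 * intermediates in wider registers. */
float add_f32(float big, float small)
{
    volatile float r = big + small;
    return r;
}

/* Relative error of the single precision product a*b, measured against
 * the double precision product. The double product of two floats is
 * exact: two 24-bit significands give at most 48 bits, which fits in
 * the 53-bit significand of a double. */
double mul_rel_error(float a, float b)
{
    volatile float p = a * b;
    double exact = (double)a * (double)b;
    double err;

    assert(exact != 0.0);
    err = ((double)p - exact) / exact;
    return err < 0.0 ? -err : err;
}
```

With these, `add_f32(16777216.0f, 1.0f)` returns 16777216.0f unchanged (2^24 + 1 is not representable in single precision), while `mul_rel_error(3.0e20f, 7.0e-20f)` stays below `FLT_EPSILON` even though the operands are some 40 orders of magnitude apart.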
> > Why not in assembly? The GCC extensions won't necessarily work across platforms (i.e. with the Metrowerks compiler) while it's already accepted that assembly doesn't... And (to my mind) it's easier to separate two similar assembly files than C files. Besides, most PPC
>
> Just use some #ifdef's, no big deal. Or two separate src files, you'll need them for asm as well.

#ifdefs would only be needed to work around bugs in the compilers. I say 'bugs' since the C extensions for Altivec are defined by Motorola and should be the same across all compilers. I have used MrC (a bit) and gcc and both are the same. I haven't used the MW compiler for Altivec, though.

> MPW (MrC) does a great job (yeah, I did only one test, sorry). Btw, is there a fused multiply-add in AltiVec? That would make it an absolute ROCKER!

Altivec has lots of cool instructions (C binding names in parentheses):

    vmaddfp                 -- result = a*b + c
    vnmsubfp                -- result = - (a*b - c) = c - a*b
    vrefp (vec_re)          -- result =~ 1/a
    vrsqrtefp (vec_rsqrte)  -- result =~ 1/sqrt(a)
    vperm                   -- result = a|b permuted by c
    vexptefp (vec_expte)    -- result =~ 2^a
    vlogefp (vec_loge)      -- result =~ log2(a)
    vcfsx/vcfux (vec_ctf)   -- result = (float)i / 2^n (although, sadly, n >= 0)

I have used all of these to great effect in other apps. Some of the estimate instructions are _very_ useful when you don't need IEEE exact results. Even when you do need really accurate results you can often find a refinement algorithm that will produce better results given a good starting estimate and still be way faster than a libm call (like Newton-Raphson refinement for 1/sqrt as shown on page 4-18 of the Altivec PEM).

> If I understand correctly, the gcc extensions consist mainly of new datatypes (like, floats4 or whatever they call it), such that

'vector float', 'vector unsigned long', 'vector bool', 'vector unsigned char', etc.

> floats4 a, b, c;
> c = a + b;

    vector float a, b, c;
    // vec_add is a polymorphic function that will select the right
    // instruction based on the arguments and result type
    c = vec_add(a, b);

> will do a vector addition. This is a quite natural thing to do, and doesn't take much effort to program, while the compiler will probably outsmart about every asm programmer (if enough work is put into the compiler).

Probably, but it will probably be hard for the compiler to do some optimizations. For example, if your C code has needless conversions back and forth between ints and floats, the compiler really doesn't know whether you meant to lose precision or whether you are just being silly.

If you take it down to the C Altivec bindings then you get some of the best of both worlds. You know the general instruction flow and can see where you have a lot of instructions. But you don't have to worry about exact instruction selection, register assignment (which can be a real bear when you have 32 floats, 32 ints and 32 vectors to worry about), or instruction ordering for pipelining.

-tim
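The Newton-Raphson refinement for 1/sqrt mentioned above can be sketched in plain scalar C. This is an illustration only: it is not AltiVec code, `rsqrt_refine` is a hypothetical name, the starting estimate is passed in by the caller (standing in for what a hardware estimate instruction would return), and whether the PEM writes the step in exactly this form is not confirmed here. The two lines are arranged as a "c - a*b" followed by an "a*b + c", the shapes that negative-multiply-subtract and multiply-add instructions provide:

```c
#include <assert.h>

/* One Newton-Raphson step for y ~= 1/sqrt(x):
 *     y' = y + 0.5 * y * (1 - x * y * y)
 * `estimate` is a rough starting value supplied by the caller. */
float rsqrt_refine(float x, float estimate)
{
    float t;

    assert(x > 0.0f);
    t = 1.0f - (x * estimate) * estimate;      /* c - a*b shape */
    return estimate + (0.5f * estimate) * t;   /* a*b + c shape */
}
```

For example, starting from the 2% low estimate 0.49 for 1/sqrt(4), one step gives roughly 0.4997 and a second step gives roughly 0.4999997. The error is approximately squared on each step, which is why a decent hardware estimate plus one or two steps is often enough.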
> This sounds great! Are these extensions well-thought out? Where can I get-em? I'll look at mot.com, of course...

DSP Kernels (Complex FIR, Real FIR, Real Delayed LMS FIR)
http://motorola.com/SPS/PowerPC/AltiVec/CodeMain.html

Altivec PIM (Programming Interface Manual -- the C bindings)
http://a1008.g.akamai.net/7/1008/787/66cefa0933a341/www.motorola.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivecpim.pdf

Altivec PEM (Programming Environments Manual -- the assembly level docs)
http://a2016.g.akamai.net/7/2016/787/5087c1b5def3b1/www.motorola.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivec_pem.pdf

> So these will presumably still work when more than 4 floats fit in a reg? How do they do this? Or is it fixed at 4? In that case, vector is a misnomer, should be vector4 OSLT.

That's a good point. I don't know if they ever plan on extending the number of elements per vector. They might just decide to extend the number of vector registers and number of pipelines :)

> According to ANSI C, you want to lose precision.

Well, yes. That is what the compiler has to assume. But the author may have just been silly when writing the code. I've seen lots of cases of this when optimizing for PPC (since it is really bad at doing int/float conversions). Needless casts between int/float can cost a lot.

-tim
> They say they will in the near future support 16 operations at once. I don't think they will be able to do four separate operations at once, so they most likely will widen the registers

That'd sure be cool.

> Is just int->float bad, or float->int as well? I ask this, because I was pleasantly surprised today, because my G3 was like 10 times faster than my Athlon (and that one was _way_ faster than the P-III) in converting an array of float to an array of int (in plain stupid C code).

    // cc -O3 -S -static float.c
    int floatToInt(float f);
    float intToFloat(int i);

    int main(void)
    {
        floatToInt(1.0);
        intToFloat(1);
        return 0;
    }

    int floatToInt(float f)
    {
        return (int)f;
    }

    float intToFloat(int i)
    {
        return (float)i;
    }

This produces the following assembly for the two functions:

    _floatToInt:
        fctiwz f0,f1
        stfd f0,-8(r1)
        lwz r3,-4(r1)
        blr

        .double 0r4.50360177485414400000e15
        .text
        .align 2
        .globl _intToFloat
    _intToFloat:
        lis r0,0x4330
        lis r9,ha16(LC0)
        la r9,lo16(LC0)(r9)
        lfd f0,0(r9)
        xoris r11,r3,0x8000
        stw r11,-4(r1)
        stw r0,-8(r1)
        lfd f1,-8(r1)
        fsub f1,f1,f0
        frsp f1,f1
        blr

As you can see, float->int isn't too bad. If you need the result in a register, you waste two memory operations because RISC machines usually don't move data directly between functional units.

On the other hand, int->float is abominable. The case shown above makes it look a bit worse than it has to be, since a bunch of the operations can be hoisted outside any potential loop (loading the address of the constant and initializing the first word of the double temporary on the stack). Sadly, even in a loop, gcc doesn't hoist the first store outside the loop, so you get three memory operations plus two float operations per iteration instead of two memory ops and two float ops.

This is one nice thing about Altivec -- it has a very fast path for both int->float and float->int.

> More generally, maybe all of the audience can help: what are the weakest points of all the various processors Vorbis will be deployed on?

Speaking from my experience trying to optimize Quake3 for Mac OS X, I find:

  - Memory bandwidth
  - Int->Float conversion

to be the two worst problems on the PPC. Memory bandwidth probably isn't as big an issue for Vorbis as for Quake, but it might still have some effect for lookup tables that don't fit in cache. This effect can be reduced by using the data cache touch instructions when possible. The int->float conversion problems go away if you can use Altivec to do it (i.e., you have an array of ints and you need an array of floats and they are all in the right positions, etc.)

-tim
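For readers puzzling over the `_intToFloat` assembly in this message, here is a portable C sketch of what it appears to be doing, assuming IEEE 754 doubles (`int_to_float_magic` is a hypothetical name, and this is a reading of the listing, not compiler documentation). The integer, with its sign bit flipped, is placed in the low word of a double whose high word is 0x43300000, giving the value 2^52 + 2^31 + i. Subtracting the constant 2^52 + 2^31 = 4503601774854144, which is the `.double` in the listing, leaves i exactly:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the int->float sequence gcc emitted on PPC, in portable C. */
float int_to_float_magic(int32_t i)
{
    /* High word 0x43300000 is the exponent of 2^52 (lis r0,0x4330);
     * low word is i with the sign bit flipped (xoris r11,r3,0x8000). */
    uint64_t bits = ((uint64_t)0x43300000u << 32)
                  | (uint64_t)((uint32_t)i ^ 0x80000000u);
    double d;

    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);           /* the stw/stw/lfd round trip */

    /* fsub removes the bias 2^52 + 2^31; the cast plays the role of frsp. */
    return (float)(d - 4503601774854144.0);
}
```

The round trip through memory is the expensive part: the integer has to be stored and reloaded as a double because there is no direct move between the integer and floating point register files.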
> Plan is, to rewrite critical parts of the C code, so to be more natural to rewrite using a (possible assembler) vector implementation. Nothing more; it's just simple unrolling && re-rolling. And the usual putting-more-subroutines-into-one-and-refactoring-it-completely-different, of course. Make the code more natural to the machine, instead of to the programmer. I think Monty will hate me ;-) (If not now, soon he will).

It seems like we should have a general framework for this. That is, say that we have a routine foo() that can be optimized various ways. It would be good to have a runtime switch to enable different optimizations for testing. For example, you might have foo_ppc() and foo_ppc_altivec(), where the 'ppc' version only takes advantage of instructions available on all ppc machines while the ppc_altivec version uses PPC7440 specific instructions. Likewise you might have x86, x86_sse, x86_mmx, x86_3dnow, etc.

So, it would be nice to compile in anything that is compilable on the target and have a runtime switch to select a particular optimization path (and possibly one to select the 'best' one for the current platform automatically).

One benefit of this approach would be to make it easier to compare the results of the C version with a particular optimization. Another would be that Monty, or whoever else, can modify the C version and not worry about the optimized version so much (leaving that to the maintainer of that function), and people can still compile stuff and just select the 'use only C' optimization to always (well, usually :) get valid results.

-tim
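A minimal sketch of such a runtime switch, assuming a table of named function pointers. All names here (`scale`, `select_path`, the path names) are hypothetical and not part of any Vorbis API; the "c_unrolled" entry is only a portable stand-in for a real platform-specific path such as a foo_ppc_altivec():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*scale_fn)(float *data, size_t n, float gain);

/* Plain C reference path: always compiled in. */
static void scale_c(float *data, size_t n, float gain)
{
    size_t i;
    for (i = 0; i < n; i++)
        data[i] *= gain;
}

/* Stand-in for an optimized path. */
static void scale_c_unrolled(float *data, size_t n, float gain)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     *= gain;
        data[i + 1] *= gain;
        data[i + 2] *= gain;
        data[i + 3] *= gain;
    }
    for (; i < n; i++)
        data[i] *= gain;
}

struct path { const char *name; scale_fn fn; };

/* Everything compilable on the target is listed here; platform-specific
 * entries would be added under #ifdef. */
static const struct path paths[] = {
    { "c",          scale_c },
    { "c_unrolled", scale_c_unrolled },
};

static scale_fn current = scale_c;   /* default: 'use only C' */

/* Select a path by name. Returns 0 on success, or -1 (leaving the
 * current path unchanged) if that path is not compiled in. */
int select_path(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof paths / sizeof paths[0]; i++) {
        if (strcmp(paths[i].name, name) == 0) {
            current = paths[i].fn;
            return 0;
        }
    }
    return -1;
}

/* Public entry point: callers never see which path is active. */
void scale(float *data, size_t n, float gain)
{
    current(data, n, gain);
}
```

A caller would run `select_path("c_unrolled")` once at startup (or fall back to "c" if it returns -1) and then call `scale()` as usual. The cost is one indirect call per invocation, which is small as long as each call processes a whole block of samples.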
> I don't think this will be too hard at all to set up. Leave the function names the same, no need for blah_mmx() or blah_3dnow(), rather have directories for each CPU with the equivalent functions in them and have the linker link the right *.o files.
>
> I think this might be easier than farting about with macros, or tables that register the functions with pointers at runtime and so on.

This would work, obviously, and might be a very tiny bit faster, but it would make it harder to compare results between different versions. It would also mean you would have to do a lot more binary releases. If all the functions are present at link time and selectable manually or automatically, then you can ship ONE x86 binary instead of 5 or whatever.

-tim
> > One benefit of this approach would be to make it easier to compare the results of the C version with a particular optimization.
>
> What would be the difference?

Well, ideally the output would be the same. This would give developers a much easier time testing, since they wouldn't have to rebuild to switch to a different optimization path.

> > (leaving that to the maintainer of that function), and people can still compile stuff and just select the 'use only C'
>
> Leaving stuff to the maintainer of a particular function, will make progress of Vorbis as slow as the slowest of the developers (at least for some people).

No, actually, it would make development faster, since Monty or other codec developers wouldn't have to worry so much (or at all) about their changes impacting optimized work -- they could just commit them and send out mail letting the optimized path people know something has changed. When release time comes around, any optimized paths that don't work would simply be disabled (since obviously they aren't getting enough support) rather than being shipped broken.

-tim
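The comparison described here could look something like the following sketch, which runs the C reference and a candidate path on the same input and reports whether they agree within a tolerance. Everything in it is hypothetical (`path_matches`, the `square_*` kernels, the `MAX_BLOCK` limit); `square_broken` exists only to show a path that would be disabled at release time:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCK 256   /* arbitrary block size limit for this sketch */

typedef void (*kernel_fn)(const float *in, float *out, size_t n);

/* C reference path. */
void square_c(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Stand-in for a working optimized path. */
void square_opt(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        float x = in[i];
        out[i] = x * x;
    }
}

/* Stand-in for an optimized path that has fallen out of sync. */
void square_broken(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * in[i] + 0.25f;
}

/* Returns 1 if every output of `cand` is within `tol` of the reference
 * output, 0 otherwise. */
int path_matches(kernel_fn ref, kernel_fn cand,
                 const float *in, size_t n, float tol)
{
    float want[MAX_BLOCK], got[MAX_BLOCK];
    size_t i;

    assert(n <= MAX_BLOCK);
    ref(in, want, n);
    cand(in, got, n);
    for (i = 0; i < n; i++) {
        float d = got[i] - want[i];
        if (d < 0.0f)
            d = -d;
        if (d > tol)
            return 0;
    }
    return 1;
}
```

A nonzero tolerance is usually needed in practice, since a vectorized path may legitimately sum or round in a different order than the C version.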