Nils Pipenbrinck
2007-Dec-30 14:01 UTC
[theora-dev] Patch: fragment reconstruction MMX for GCC
Hi again,

I measured my fragment reconstructions against the compiler output from GCC and, well, the new code performs better, so I brushed up my gcc inline assembler skills and made a port.

Code is here: http://torus.untergrund.net/code/mmxfrag.c

All routines perform much better now. Inter2 alone got a speedup of a factor of 5 on Pentium-M. Athlon CPUs execute roughly 3 times faster. I haven't had the chance to benchmark Core 2, though. It would be nice to hear if the code compiles on 64-bit Intel.

Regarding the MSVC patch I made a couple of days ago: I found out how to get the macro magic working with MSVC. The IDCT has already been ported and now looks exactly like the GCC version. I hope we get the maintainability issues solved that way. When I've done the loop filter that way, I'll try to resubmit my patch.

Nils
On Sun, Dec 30, 2007 at 11:01:18PM +0100, Nils Pipenbrinck wrote:

> I measured my fragment reconstructions against the compiler output from
> GCC and well - the new codes perform better, so I brushed up my gcc
> inline assembler skills and made a port.

Cool!

> All routines perform much better now. Inter2 alone got a speedup of
> factor 5 on Pentium-M. Athlon CPU's execute roughly 3 times faster.
> Hadn't had the chance to benchmark core2 though. It would be nice to
> hear if the code compiles on 64bit intel.

gcc 4.1.3 on x86_64:

gcc -DHAVE_CONFIG_H -I. -I.. -I../include -I../lib -I../lib/dec -I../lib/enc -Wall -Wno-parentheses -O3 -fforce-addr -fomit-frame-pointer -finline-functions -funroll-loops -MT mmxfrag.lo -MD -MP -MF .deps/mmxfrag.Tpo -c dec/x86/mmxfrag.c -fPIC -DPIC -o .libs/mmxfrag.o
dec/x86/mmxfrag.c: In function 'oc_frag_recon_inter2_mmx':
dec/x86/mmxfrag.c:197: error: memory input 6 is not directly addressable
dec/x86/mmxfrag.c:197: error: memory input 7 is not directly addressable
make[2]: *** [mmxfrag.lo] Error 1

-r
Timothy B. Terriberry
2007-Dec-30 18:46 UTC
[theora-dev] Patch: fragment reconstruction MMX for GCC
Nils Pipenbrinck wrote:

> All routines perform much better now. Inter2 alone got a speedup of
> factor 5 on Pentium-M. Athlon CPU's execute roughly 3 times faster.
> Hadn't had the chance to benchmark core2 though. It would be nice to
> hear if the code compiles on 64bit intel.

Awesome. I've committed your code, with some modifications, in r14336. It tests identical to the old code on both x86-64 and x86-32.

There were two primary problems with the code as it stood. The first was specific to x86-64: you have to cast the strides to longs so that they are placed in 64-bit registers instead of 32-bit registers, or you can't use them in indexing instructions with 64-bit pointers.

The second was specific to x86-32: when -fPIC is used and -fomit-frame-pointer is not, x86-32 gets just _five_ general purpose registers (%eax, %ecx, %edx, %esi, and %edi). All of your routines used six. This is the cause of the oft-reported problem that the encoder asm will not compile in debug mode.

Fortunately, it's relatively easy to eliminate a register from each routine. oc_frag_recon_intra_mmx can get away with one fewer offset, and letting gcc handle the looping in oc_frag_recon_inter[2]_mmx allows it to unroll the loop when -funroll-loops is enabled, eliminating the need for a counter register. Without -funroll-loops, it will handle the register spill itself. On x86-64, there's obviously an extra register available, so it's not a problem.

I also eliminated the "safe" version of oc_frag_recon_inter2_mmx that handled the case when the strides differ, because it never occurs, and I don't foresee a situation when we'd want it to.

Also note that with -fomit-frame-pointer, gcc requires an extra register if you use any "m" arguments, because it can't track how %esp changes inside your asm block, so it can't generate a reference that is guaranteed to work if you start mucking around with the stack. That's what led to the errors Ralph reported. Eliminating that version solved the problem, but getting down to 5 registers would've solved it also.
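
For illustration, here is a minimal sketch of the two fixes described above. It is not the code committed in r14336: copy_8x8_mmx and copy_8x8_demo are hypothetical names, and the routine only copies an 8x8 block of bytes. It shows the strides being cast to long before they are used as index registers, and the row loop being left to gcc so that the asm block needs no counter register and stays at four general purpose registers. The sketch assumes gcc (or a compatible compiler) and falls back to memcpy on non-x86 targets.

```c
#include <assert.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
# define SKETCH_USE_MMX 1
#else
# define SKETCH_USE_MMX 0
#endif

/* Hypothetical helper, NOT the routine from r14336: copies an 8x8 block
   of bytes, two rows per iteration. */
static void copy_8x8_mmx(unsigned char *dst, int dst_stride,
                         const unsigned char *src, int src_stride) {
  int i;
  /* The loop lives in C, so the asm block needs no counter register;
     gcc may unroll it or spill the counter as it sees fit. */
  for (i = 0; i < 4; i++) {
#if SKETCH_USE_MMX
    __asm__ __volatile__(
      "movq (%[src]),%%mm0\n\t"
      "movq (%[src],%[sstride]),%%mm1\n\t"
      "movq %%mm0,(%[dst])\n\t"
      "movq %%mm1,(%[dst],%[dstride])\n\t"
      :
      : [src]"r"(src), [dst]"r"(dst),
        /* Cast to long so each stride lands in a pointer-sized register
           (64-bit on LP64 x86-64, 32-bit on x86-32). An int operand
           would be printed as a 32-bit register, which cannot index a
           64-bit pointer. */
        [sstride]"r"((long)src_stride),
        [dstride]"r"((long)dst_stride)
      : "memory", "mm0", "mm1"
    );
#else
    /* Portable fallback for non-x86 targets. */
    memcpy(dst, src, 8);
    memcpy(dst + dst_stride, src + src_stride, 8);
#endif
    src += 2 * src_stride;
    dst += 2 * dst_stride;
  }
#if SKETCH_USE_MMX
  /* Leave MMX state so later x87 floating-point code works. */
  __asm__ __volatile__("emms");
#endif
}

/* Hypothetical self-check: copies between buffers with different
   strides and verifies the rows and the untouched padding. */
static int copy_8x8_demo(void) {
  unsigned char src[8 * 16];
  unsigned char dst[8 * 24];
  int i;
  for (i = 0; i < (int)sizeof(src); i++) src[i] = (unsigned char)i;
  memset(dst, 0xFF, sizeof(dst));
  copy_8x8_mmx(dst, 24, src, 16);
  for (i = 0; i < 8; i++) {
    assert(memcmp(dst + 24 * i, src + 16 * i, 8) == 0);
    assert(dst[24 * i + 8] == 0xFF);
  }
  return 1;
}
```

Calling copy_8x8_demo() exercises the routine with source stride 16 and destination stride 24. Note that the cast to long relies on long being pointer-sized, which holds for the Linux x86-32 and x86-64 ABIs but not for 64-bit Windows.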