thr3ads.net - theora dev - [theora-dev] Patch: fragment reconstruction MMX for GCC [Dec 2007]

Nils Pipenbrinck

2007-Dec-30 14:01 UTC

[theora-dev] Patch: fragment reconstruction MMX for GCC

Hi again,

I measured my fragment reconstruction routines against the compiler output
from GCC and, well, the new code performs better, so I brushed up my GCC
inline assembler skills and made a port.

Code is here: http://torus.untergrund.net/code/mmxfrag.c

All routines perform much better now. Inter2 alone got a factor-of-5
speedup on Pentium-M. On Athlon CPUs the code runs roughly 3 times faster.
I haven't had the chance to benchmark Core 2, though. It would be nice to
hear whether the code compiles on 64-bit Intel.


Regarding the MSVC patch I made a couple of days ago:

I found out how to get the macro magic working with MSVC. The IDCT has
already been ported and now looks exactly like the GCC version. I hope
that solves the maintainability issues. Once I've done the loop filter
the same way, I'll try to resubmit my patch.

Nils

Ralph Giles

2007-Dec-30 17:45 UTC

On Sun, Dec 30, 2007 at 11:01:18PM +0100, Nils Pipenbrinck wrote:
> I measured my fragment reconstructions against the compiler output from 
> GCC and well - the new codes perform better, so I brushed up my gcc 
> inline assembler skills and made a port.
Cool!
> All routines perform much better now. Inter2 alone got a speedup of 
> factor 5 on Pentium-M. Athlon CPU's execute roughly 3 times faster. 
> Hadn't had the chance to benchmark core2 though. It would be nice to 
> hear if the code compiles on 64bit intel.
gcc 4.1.3 on x86_64:

 gcc -DHAVE_CONFIG_H -I. -I.. -I../include -I../lib -I../lib/dec 
-I../lib/enc -Wall -Wno-parentheses -O3 -fforce-addr 
-fomit-frame-pointer -finline-functions -funroll-loops -MT mmxfrag.lo 
-MD -MP -MF .deps/mmxfrag.Tpo -c dec/x86/mmxfrag.c  -fPIC -DPIC -o 
.libs/mmxfrag.o
dec/x86/mmxfrag.c: In function 'oc_frag_recon_inter2_mmx':
dec/x86/mmxfrag.c:197: error: memory input 6 is not directly addressable
dec/x86/mmxfrag.c:197: error: memory input 7 is not directly addressable
make[2]: *** [mmxfrag.lo] Error 1

 -r

Timothy B. Terriberry

2007-Dec-30 18:46 UTC

Nils Pipenbrinck wrote:
> All routines perform much better now. Inter2 alone got a speedup of
> factor 5 on Pentium-M. Athlon CPU's execute roughly 3 times faster.
> Hadn't had the chance to benchmark core2 though. It would be nice to
> hear if the code compiles on 64bit intel.
Awesome. I've committed your code, with some modifications, in r14336. It
tests identical to the old code on both x86-64 and x86-32.

There were two primary problems with the code as it stood. The first was
specific to x86-64: you have to cast the strides to longs so that they
are placed in 64-bit registers instead of 32-bit registers, or you can't
use them in indexing instructions with 64-bit pointers.
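
To illustrate the stride cast, here is a minimal sketch. It is not taken
from mmxfrag.c or r14336: the helper name copy_two_rows and its operand
names are made up for the example, and it assumes GCC-style extended asm
on a target where long is pointer-sized (as on Linux x86-64 and x86-32).
The cast also sign-extends, so a negative stride still indexes correctly.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not from mmxfrag.c: copies 8 bytes from each of two
   consecutive rows of src to dst. Both buffers use the same stride. */
static void copy_two_rows(unsigned char *dst, const unsigned char *src,
                          int stride) {
  /* Rows are 8 bytes wide, so they must not overlap. */
  assert(stride >= 8 || stride <= -8);
#if defined(__GNUC__) && defined(__MMX__)
  __asm__ __volatile__(
    "movq (%[src]),%%mm0\n\t"
    "movq (%[src],%[stride]),%%mm1\n\t"
    "movq %%mm0,(%[dst])\n\t"
    "movq %%mm1,(%[dst],%[stride])\n\t"
    "emms\n\t"
    :
    /* The (long) cast makes the compiler pick a pointer-sized register for
       the stride, so it can serve as an index next to a 64-bit base. */
    : [dst] "r"(dst), [src] "r"(src), [stride] "r"((long)stride)
    : "memory", "mm0", "mm1"
  );
#else
  /* Portable fallback so the sketch still builds without MMX. */
  memcpy(dst, src, 8);
  memcpy(dst + stride, src + stride, 8);
#endif
}
```

Passing the plain int instead would give the assembler a 32-bit index
register next to a 64-bit base register on x86-64, which it rejects.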

The second was specific to x86-32: when -fPIC is used and
-fomit-frame-pointer is not, x86-32 gets just _five_ general purpose
registers (%eax, %ecx, %edx, %esi, and %edi). All of your routines used
six. This is the cause of the oft-reported problem that the encoder asm
will not compile in debug mode.

Fortunately, it's relatively easy to eliminate a register from each
routine. oc_frag_recon_intra_mmx can get away with one fewer offset, and
letting gcc handle the looping in oc_frag_recon_inter[2]_mmx allows it
to unroll the loop when -funroll-loops is enabled, eliminating the need
for a counter register. Without -funroll-loops, it will handle the
register spill itself. On x86-64, there's obviously an extra register
available, so it's not a problem.
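
As a rough sketch of what "letting gcc handle the looping" can look like
(this is not the committed code; copy_block_8x8 is a hypothetical routine
written only for illustration), the asm block below handles a single row
and uses just two general purpose registers, while the counter and the
pointer updates stay in C where the compiler can unroll or spill them:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical routine, not from mmxfrag.c: copies an 8x8 block of bytes.
   The loop lives in C; the asm statement only moves one 8-byte row. */
static void copy_block_8x8(unsigned char *dst, const unsigned char *src,
                           int stride) {
  int i;
  /* Rows are 8 bytes wide, so they must not overlap. */
  assert(stride >= 8 || stride <= -8);
  for (i = 0; i < 8; i++) {
#if defined(__GNUC__) && defined(__MMX__)
    __asm__ __volatile__(
      "movq (%[src]),%%mm0\n\t"
      "movq %%mm0,(%[dst])\n\t"
      :
      : [dst] "r"(dst), [src] "r"(src)
      : "memory", "mm0"
    );
#else
    /* Portable fallback so the sketch still builds without MMX. */
    memcpy(dst, src, 8);
#endif
    /* Pointer stepping is plain C, so no asm operand is needed for the
       stride or for a loop counter. */
    dst += stride;
    src += stride;
  }
#if defined(__GNUC__) && defined(__MMX__)
  /* Leave MMX state once, after the last row. */
  __asm__ __volatile__("emms\n\t");
#endif
}
```

With -funroll-loops the compiler is free to unroll the eight iterations;
without it, any spill of the counter is the compiler's job rather than
something the asm has to reserve a register for.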

I also eliminated the "safe" version of oc_frag_recon_inter2_mmx that
handled the case when the strides differ, because it never occurs, and I
don't foresee a situation when we'd want it to.

Also note that with -fomit-frame-pointer, gcc requires an extra register
if you use any "m" arguments, because it can't track how %esp changes
inside your asm block, so it can't generate a reference that is
guaranteed to work if you start mucking around with the stack. That's
what led to the errors Ralph reported. Eliminating the "safe" version
solved the problem, but getting down to 5 registers would've solved it also.
