Hi folks, FYI:
I've finally run some benchmarks comparing inline-assembler and
intrinsic-based MMX code.
So far I've only applied the changes to the fragment reconstruction
functions; the IDCT and loop filter have not been ported yet.
Nevertheless here are some numbers:
As a baseline I'll take the current version from the trunk with all 
inline assembler functions enabled. Lower values mean lower performance.
    All functions with inline-asm:           100%
    inter_mmx replaced by C-function:         93%
    no mmx at all:                            60%
    all oc_frag functions intrinsic based:    98%
As you can see, the current bugfix for Mozilla takes just a 7%
performance hit. IMHO that's something we could live with. The
intrinsic-based approach is nearly as good as the handwritten code, and
it compiles with gcc as well as VS.net (haven't tried it under Linux
yet, but will do so...). The gcc-generated code is even a tad better
than the VS.net one.
There is, btw, a difference between VS.net whole-program optimization
and simple per-translation-unit optimization, but the performance
difference is so small that it's nearly lost in the measurement noise.
Moving the MMX intrinsic functions into the mmxstate.c file and
declaring them as static inline made a bigger difference (still
negligible).
Cheers,
  Nils
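To make the intrinsic-based approach concrete, here is a minimal sketch
of what an intra fragment reconstruction could look like with the MMX
intrinsics from <mmintrin.h>. It is not the code that was benchmarked:
the function name frag_recon_intra_sketch, its signature, and the
row-major 8x8 residue layout are assumptions made for illustration. It
adds a bias of 128 to each 16-bit residue and packs the result to
unsigned bytes with saturation, and it is declared static inline as
described above.

```c
#include <assert.h>
#include <mmintrin.h>   /* MMX intrinsics: _mm_adds_pi16, _mm_packs_pu16, ... */
#include <stdint.h>
#include <string.h>

/* Hypothetical name and signature, for illustration only.
   Writes an 8x8 block of pixels: dst[y*stride + x] = clamp(residue + 128). */
static inline void frag_recon_intra_sketch(unsigned char *dst, int stride,
                                           const int16_t residue[64])
{
    const __m64 bias = _mm_set1_pi16(128);   /* 128 in each 16-bit lane */
    int row;

    assert(dst != NULL && residue != NULL);

    for (row = 0; row < 8; row++) {
        __m64 lo, hi, packed;

        /* Load 8 residues (two groups of four 16-bit values). memcpy avoids
           alignment and aliasing assumptions about the buffers. */
        memcpy(&lo, residue + row * 8, sizeof lo);
        memcpy(&hi, residue + row * 8 + 4, sizeof hi);

        /* Signed saturating add of the bias (paddsw). */
        lo = _mm_adds_pi16(lo, bias);
        hi = _mm_adds_pi16(hi, bias);

        /* Pack to 8 unsigned bytes, clamping to 0..255 (packuswb). */
        packed = _mm_packs_pu16(lo, hi);

        memcpy(dst, &packed, sizeof packed);
        dst += stride;
    }

    /* Clear the MMX state so later floating-point code works (emms). */
    _mm_empty();
}
```

Because the same source goes through both compilers, any speed
difference then comes from how each one allocates registers and
schedules these instructions, which is what the numbers above measure.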
Timothy B. Terriberry
2009-Feb-11  14:39 UTC
[theora-dev] Benchmarks Inline-ASM vs. Intrinsics
Nils Pipenbrinck wrote:
> I've just applied the changes to the fragment reconstruction functions
> as writing the IDCT and loopfilter have not been ported yet.
> Nevertheless here are some numbers:

Keep in mind that oc_frag_recon_* together account for less than 6% of
decoding time, so a 2% overall slowdown means a 33% slowdown in those
functions (and similarly, about a 700% slowdown for the C version of
oc_frag_recon_inter_mmx). The cost of the iDCTs is somewhat larger (8%
of the total, or so), so a similar slowdown there will bring an even
larger drop in total performance (and there should not be any cache
misses to mask gcc's inefficiencies in the iDCTs, unlike the recon
functions).

Still, even having said that, I was expecting on the order of a 100%
slowdown, so this is at least somewhat encouraging. What version of gcc
did you use?
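
The estimate above can be written out as a small calculation. This is
only a sketch of the arithmetic, not anything from the codec itself;
the helper names local_slowdown and local_slowdown_from_speed are
hypothetical. The first divides the overall slowdown by the share of
decoding time the affected functions take; the second starts from a
relative speed (such as the 98% in the benchmark) and converts it to
extra run time first.

```c
#include <assert.h>

/* Hypothetical helper: linear approximation.
   overall_slowdown: fraction of total performance lost (0.02 for 2%).
   time_share: fraction of decoding time spent in the affected functions. */
static double local_slowdown(double overall_slowdown, double time_share)
{
    assert(time_share > 0.0 && time_share <= 1.0);
    return overall_slowdown / time_share;
}

/* Hypothetical helper: same idea, starting from relative speed.
   A relative speed r means total run time grows by (1/r - 1); that extra
   time is attributed entirely to the affected functions. */
static double local_slowdown_from_speed(double relative_speed,
                                        double time_share)
{
    assert(relative_speed > 0.0);
    assert(time_share > 0.0 && time_share <= 1.0);
    return (1.0 / relative_speed - 1.0) / time_share;
}
```

With a 2% overall slowdown and a 6% time share, local_slowdown gives
roughly 0.33, matching the 33% figure; the second variant gives a
slightly higher value of about 0.34 for a relative speed of 0.98. Under
the same linear model, a 7% overall slowdown maps to 700% only if the
function's share is about 1%; that share is inferred here, not stated
in the thread.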
Ralph Giles wrote:
>
> So these benchmarks are for gcc output on Windows? Can you benchmark
> the MSVC output too?
>

The numbers are from VS.net 2008.

I just checked if the intrinsics compile under GCC as well.

  Nils
On Wed, Feb 11, 2009 at 12:56 PM, Nils Pipenbrinck <n.pipenbrinck at cubic.org> wrote:
> The numbers are from VS.net 2008.
>
> I just checked if the intrinsics compile under GCC as well.

Ah, ok. Thanks for clarifying.

 -r
For completeness' sake: The intrinsics compile on Linux as well.
I made the same benchmarks as before, but this time with GCC 4.2.4 on
Linux (Ubuntu):
    All functions with inline-asm:           100%
    inter_mmx replaced by C-function:         92%
    no mmx at all:                            54%
    all oc_frag functions intrinsic based:    99%
The reference performance (with mmx inline asm) differs by a percent
between vs.net and gcc.
Cheers,
  Nils