thr3ads.net - theora dev - [theora-dev] Benchmarks Inline-ASM vs. Intrinsics [Feb 2009]

Nils Pipenbrinck

2009-Feb-11 13:47 UTC

[theora-dev] Benchmarks Inline-ASM vs. Intrinsics

Hi folks, FYI:

I've finally made some benchmarks for inline-assembler versus intrinsic 
based mmx code.

I've only applied the changes to the fragment reconstruction functions,
as the IDCT and loop filter have not been ported yet.
Nevertheless, here are some numbers:

As a baseline I'll take the current version from the trunk with all 
inline assembler functions enabled. Lower values mean lower performance.

    All functions with inline-asm:           100%
    inter_mmx replaced by C-function:         93%
    no mmx at all:                            60%
    all oc_frag functions intrinsic based:    98%
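
For readers who want to reproduce figures like these: the thread does not say how the percentages were derived, so the following is only a sketch under the assumption that each figure is the baseline run time divided by the variant's run time. The function names are hypothetical.

```c
#include <assert.h>

/* Relative performance in percent (assumed definition):
   100 means as fast as the baseline, lower values mean slower. */
double relative_performance(double baseline_seconds, double variant_seconds) {
  assert(baseline_seconds > 0.0 && variant_seconds > 0.0);
  return 100.0 * baseline_seconds / variant_seconds;
}

/* Inverse view: how many times longer the variant runs than the baseline
   for a given relative performance figure. */
double time_ratio(double performance_percent) {
  assert(performance_percent > 0.0);
  return 100.0 / performance_percent;
}
```

Under that assumption, a build that needs 10 seconds where the baseline needs 6 scores 60%, and a 60% score means the run takes roughly 1.67 times as long as the baseline.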


As you can see the current bugfix for mozilla just takes a 7% 
performance hit. Imho that's something we could live with. The intrinsic 
based approach is nearly as good as the handwritten code, and it 
compiles with gcc as well as VS.net (haven't tried it under linux yet, 
but will do so...). The gcc generated code is even a tad better than the 
vs.net one.

There is, btw., a difference between VS.net whole program optimization and
simple per translation unit optimization, but the performance difference
is so small that it's nearly lost in the measurement noise. Moving the
mmx intrinsic functions into the mmxstate.c file and declaring them as
static inline made a bigger difference (still negligible).

Cheers,
  Nils
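
To illustrate what "intrinsic based" means here, the following is a minimal sketch of an intra fragment reconstruction (add a bias of 128 to each of 64 residuals, clamp to 0..255, store an 8x8 block) written once in plain C and once with MMX intrinsics declared static inline. It is not the code that was benchmarked: the names frag_recon_intra_c, frag_recon_intra_mmx and frag_recon_intra, their signatures, and the __MMX__ guard are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Plain C reference: bias each residual by 128 and clamp to 0..255. */
static void frag_recon_intra_c(unsigned char *dst, int stride,
                               const int16_t *residue) {
  int x, y;
  for (y = 0; y < 8; y++) {
    for (x = 0; x < 8; x++) {
      int v = residue[y * 8 + x] + 128;
      dst[x] = (unsigned char)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
    dst += stride;
  }
}

#if defined(__MMX__)
/* Same operation with MMX intrinsics, one 8-pixel row per iteration.
   The 16-bit add wraps, so residuals are assumed to stay well inside
   the int16_t range (true for the small values used here). */
static inline void frag_recon_intra_mmx(unsigned char *dst, int stride,
                                        const int16_t *residue) {
  const __m64 bias = _mm_set1_pi16(128);
  int y;
  for (y = 0; y < 8; y++) {
    __m64 lo, hi, row;
    memcpy(&lo, residue + y * 8, sizeof(lo));     /* pixels 0..3 */
    memcpy(&hi, residue + y * 8 + 4, sizeof(hi)); /* pixels 4..7 */
    lo = _mm_add_pi16(lo, bias);
    hi = _mm_add_pi16(hi, bias);
    row = _mm_packs_pu16(lo, hi); /* saturate to unsigned bytes */
    memcpy(dst, &row, sizeof(row));
    dst += stride;
  }
  _mm_empty(); /* leave MMX state so later x87 code is safe */
}
#endif

/* Hypothetical dispatcher: uses the intrinsic version when the compiler
   targets MMX, the C version otherwise. */
void frag_recon_intra(unsigned char *dst, int stride,
                      const int16_t *residue) {
  assert(dst != NULL && residue != NULL);
#if defined(__MMX__)
  frag_recon_intra_mmx(dst, stride, residue);
#else
  frag_recon_intra_c(dst, stride, residue);
#endif
}
```

The same source builds with both gcc and MSVC-style compilers that ship mmintrin.h, which is the portability argument made above; the compiler, not the programmer, picks registers and schedules the instructions.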

Timothy B. Terriberry

2009-Feb-11 14:39 UTC

Nils Pipenbrinck wrote:
> I've just applied the changes to the fragment reconstruction functions 
> as writing the IDCT and loopfilter have not been ported yet. 
> Nevertheless here are some numbers:

Keep in mind that oc_frag_recon_* together account for less than 6% of
decoding time, so a 2% overall slowdown means a 33% slowdown in those
functions (and similarly, about a 700% slowdown for the C version of
oc_frag_recon_inter_mmx). The cost of the iDCTs is somewhat larger (8%
of the total, or so), so a similar slowdown there will bring an even
larger drop in total performance (and there should not be any cache
misses to mask gcc's inefficiencies in the iDCTs, unlike the recon
functions).
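
The arithmetic behind these figures can be sketched as follows. This is only an illustration of the reasoning, assuming that a slowdown in one group of functions adds to total run time in proportion to that group's share of it; the helper names are hypothetical.

```c
#include <assert.h>

/* Slowdown of one group of functions implied by an overall slowdown,
   when that group accounts for `fraction` of the baseline run time. */
double local_slowdown(double overall, double fraction) {
  assert(fraction > 0.0 && fraction <= 1.0);
  return overall / fraction;
}

/* Overall slowdown caused by slowing down a group of functions that
   accounts for `fraction` of the baseline run time by `local`. */
double total_slowdown(double local, double fraction) {
  assert(fraction >= 0.0 && fraction <= 1.0);
  return local * fraction;
}
```

With the numbers quoted above, local_slowdown(0.02, 0.06) gives about 0.33, i.e. the 33% figure, and applying that same 33% to the iDCTs at 8% of the total, total_slowdown(0.33, 0.08), gives roughly 2.7% overall, which is the "even larger drop" being warned about.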

Still, even having said that, I was expecting on the order of a 100%
slowdown, so this is at least somewhat encouraging. What version of gcc
did you use?

Nils Pipenbrinck

2009-Feb-11 20:56 UTC

Ralph Giles wrote:
> So these benchmarks are for gcc output on Windows? Can you benchmark
> the MSVC output too?

The numbers are from VS.net 2008.

I just checked if the intrinsics compile under GCC as well.

Nils

Ralph Giles

2009-Feb-11 21:41 UTC

On Wed, Feb 11, 2009 at 12:56 PM, Nils Pipenbrinck
<n.pipenbrinck at cubic.org> wrote:
> The numbers are from VS.net 2008.
>
> I just checked if the intrinsics compile under GCC as well.
Ah, ok. Thanks for clarifying.

 -r

Nils Pipenbrinck

2009-Feb-11 22:13 UTC

For completeness' sake: the intrinsics compile on Linux as well. I ran the
same benchmarks as before, but this time with GCC 4.2.4 on Linux (Ubuntu):

    All functions with inline-asm:           100%
    inter_mmx replaced by C-function:         92%
    no mmx at all:                            54%
    all oc_frag functions intrinsic based:    99%


The reference performance (with mmx inline asm) differs by a percent 
between vs.net and gcc.

Cheers,
  Nils
