thr3ads.net - theora dev - [Theora-dev] Theora, MMX and optimisation [Apr 2005]

denpo

2005-Apr-11 16:47 UTC

[Theora-dev] Theora, MMX and optimisation

Hi everyone,
I have just landed on planet Theora. As a game programmer, I was looking
for a free video format/codec, and Theora became the obvious choice.
However, I experienced rather poor performance (at least from a game
programming point of view).
After a couple of profiling runs I discovered, as previously discussed in a
post I found via Google, that the bottleneck is in the ogg library. An
insane share of the CPU time was consumed by a single function:
oggpackB_read.
This was even more visible because I'm compiling the MMX branch.

So, starting from this MMX branch, I rewrote the GCC
assembly code as Visual C++ inline assembly. While I was at it, I tweaked it to
maximize instruction pairing.
Then I rewrote oggpackB_read to reduce the number of tests, put
some assembly in it, and made specialized versions of it. Just like
the existing oggpackB_read1, I now have 8-, 16-, 24- and 32-bit versions.
Then I replaced all the Theora calls that read a fixed number of bytes
with these versions.
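
For readers following along, here is a minimal sketch of what a specialized fixed-width read could look like next to a generic one. It is not denpo's code and not libogg's implementation: the `bitreader` struct and the `br_read` / `br_read8` names are hypothetical, and only the MSb-first bit order is taken from the oggpackB_* family.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state; this is not libogg's oggpack_buffer. */
typedef struct {
    const uint8_t *data;   /* packet bytes */
    size_t         size;   /* packet length in bytes */
    size_t         bitpos; /* next bit to read, counted from the MSb of data[0] */
} bitreader;

/* Generic path: read 1..32 bits, MSb first, one bit per loop iteration.
   Returns 0 on success, -1 if the read would run past the end. */
static int br_read(bitreader *br, int bits, uint32_t *out)
{
    assert(bits >= 1 && bits <= 32);
    if (br->bitpos + (size_t)bits > br->size * 8) return -1;
    uint32_t v = 0;
    for (int i = 0; i < bits; i++) {
        size_t p = br->bitpos + (size_t)i;
        v = (v << 1) | ((br->data[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    br->bitpos += (size_t)bits;
    *out = v;
    return 0;
}

/* Specialized path: exactly 8 bits. No loop and a single alignment test;
   the value comes from one byte when aligned, from two bytes otherwise. */
static int br_read8(bitreader *br, uint32_t *out)
{
    if (br->bitpos + 8 > br->size * 8) return -1;
    size_t   byte  = br->bitpos >> 3;
    unsigned shift = (unsigned)(br->bitpos & 7);
    uint32_t v     = br->data[byte];
    if (shift) {
        /* Unaligned: the bounds check above guarantees byte + 1 is valid. */
        v = ((v << shift) | (uint32_t)(br->data[byte + 1] >> (8 - shift))) & 0xFFu;
    }
    br->bitpos += 8;
    *out = v;
    return 0;
}
```

The 16-, 24- and 32-bit variants would follow the same pattern with more bytes. Whether they actually beat the generic function depends on the compiler and the call sites, so they are worth profiling rather than assuming.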

I haven't yet had time to write a detailed report on the speed gain, but
I'd say that it is noticeable.

I now have a couple of questions to submit to the community. Here they are:

-My hack is probably crude. I saw a couple of byte-related functions in
ogg... Any hints?

-The biggest problem with these functions is that you can't assume
the data to be read is byte-aligned. However, I noticed that some
data is always byte-aligned. Is that normal, or is it just my
sample video? Being able to assume so would provide a big boost.

-What should I do with my version, beyond shipping the sources with
our released games? I'd love to share my work on Theora, but my
version so far is somewhat broken: I didn't write the big-endian
counterparts of my functions, and I replaced the assembly language (I read
somewhere that this goes plainly against the Theora guidelines). What would be
the best way to make my PC/Visual Studio/MMX/tweaked theora/ogg version
available to others?

Long live Theora!
Denpo


Note: the CPU-consuming IDCT functions seem, by their structure, perfect
candidates for SSE therapy. Has this never been done before?

Timothy B. Terriberry

2005-Apr-11 19:08 UTC


denpo wrote:
> After a couple of profiling runs I discovered, as previously discussed in a
> post I found via Google, that the bottleneck is in the ogg library. An
Switching to libogg2 could also help with this, as it does not need to
copy any of the packet data around like libogg1 does. The reference
implementation contains compiler switches to use libogg2 (though I think
using them would necessitate using Tremor instead of libvorbis for
Vorbis decoding, since I don't think libvorbis has been ported to
libogg2 yet).
> -My hack is probably crude. I saw a couple of byte-related functions in
> ogg... Any hints?
The only code that currently takes any advantage of byte-alignment is the
writecopy functions, and they're not even used in the reference library.
> -The biggest problem with these functions is that you can't assume
> the data to be read is byte-aligned. However, I noticed that some
> data is always byte-aligned. Is that normal, or is it just my
> sample video? Being able to assume so would provide a big boost.
The values in the header packets are byte-aligned for the most part, but
those need be read only once. The values in the data packets are not
normally byte-aligned to any significant degree.
> -What should I do with my version, beyond shipping the sources with
> our released games? I'd love to share my work on Theora, but my
> version so far is somewhat broken: I didn't write the big-endian
> counterparts of my functions, and I replaced the assembly language (I read
> somewhere that this goes plainly against the Theora guidelines). What would be
> the best way to make my PC/Visual Studio/MMX/tweaked theora/ogg version
> available to others?
oggpackB_read and friends _are_ the big-endian versions... the others
are little-endian.
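
To make the distinction concrete: in libogg the "B" refers to bit-packing order (oggpackB_* packs most-significant-bit first, oggpack_* packs least-significant-bit first), not to the byte order of the host CPU. The sketch below reads the same bytes both ways. The function names `read_lsb_first` and `read_msb_first` are hypothetical, the code is not taken from libogg, and it performs no bounds checking, so the caller must guarantee the bits exist.

```c
#include <stddef.h>
#include <stdint.h>

/* LSb-first order (the oggpack_* convention): the first bit read is bit 0
   of the current byte, and it becomes bit 0 of the result. */
static uint32_t read_lsb_first(const uint8_t *data, size_t bitpos, int bits)
{
    uint32_t v = 0;
    for (int i = 0; i < bits; i++) {
        size_t p = bitpos + (size_t)i;
        v |= (uint32_t)((data[p >> 3] >> (p & 7)) & 1u) << i;
    }
    return v;
}

/* MSb-first order (the oggpackB_* convention): the first bit read is bit 7
   of the current byte, and it becomes the top bit of the result. */
static uint32_t read_msb_first(const uint8_t *data, size_t bitpos, int bits)
{
    uint32_t v = 0;
    for (int i = 0; i < bits; i++) {
        size_t p = bitpos + (size_t)i;
        v = (v << 1) | (uint32_t)((data[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    return v;
}
```

With the byte 0xB4 (binary 10110100), a 3-bit read at position 0 gives 5 (binary 101) MSb-first and 4 (binary 100) LSb-first. Both readers behave the same on any host, which is why no separate "big endian counterpart" is needed for a different CPU.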

I haven't been following the MMX branch of the reference implementation
in svn, but I would guess that at a minimum we'd want:
1) Confirmation that the output with your "tweaks" is bit-identical to
the unpatched reference decoder,
2) Backports of the tweaks to GCC's AT&T-style assembly (as an aside to
others, is maintaining two versions of all the optimized functions, one
for each compiler, really a good idea? Would porting to a stand-alone
assembler like nasm be worth the effort and extra (though optional)
dependency?)
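
For point 1, a bit-identical check can be as simple as comparing the decoded planes of each frame from both decoders. The following is a minimal sketch under stated assumptions: the `plane` struct and `planes_identical` are hypothetical (not part of the Theora API), strides are assumed positive, and only the visible width of each row is compared, so differences in row padding between the two builds do not count as mismatches.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one decoded plane (Y, Cb or Cr). */
typedef struct {
    const unsigned char *data;   /* top-left pixel */
    int                  width;  /* visible pixels per row */
    int                  height; /* number of rows */
    int                  stride; /* bytes between row starts, assumed > 0 */
} plane;

/* Returns 1 if both planes hold exactly the same pixels, 0 otherwise.
   Padding bytes between width and stride are ignored on purpose. */
static int planes_identical(const plane *a, const plane *b)
{
    if (a->width != b->width || a->height != b->height) return 0;
    for (int y = 0; y < a->height; y++) {
        const unsigned char *ra = a->data + (size_t)y * (size_t)a->stride;
        const unsigned char *rb = b->data + (size_t)y * (size_t)b->stride;
        if (memcmp(ra, rb, (size_t)a->width) != 0) return 0;
    }
    return 1;
}
```

Running this over all three planes of every frame of a few test streams, with the patched and unpatched decoders side by side, would give the confirmation asked for above.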

Given those, I'd suggest posting patches to the mailing list. Ideally,
there'd be separate patches for the VC++ ports of the existing asm and
for your libogg modifications, as the latter have a much smaller chance
of actually being integrated into a release.
> Note: the CPU-consuming IDCT functions seem, by their structure, perfect
> candidates for SSE therapy. Has this never been done before?
I believe the vp32 sources contain an SSE2 implementation, but this has
not been forward-ported to Theora, to my knowledge. VP3HoSwiYo posted a
forward-port of vp32's MMX implementation to this mailing list (you can
search for it with Google). I don't believe it was ever officially
incorporated into the theora-mmx branch.

You also might want to consider looking at the experimental decoder
(http://svn.xiph.org/experimental/derf/theora-exp/). This is where I've
been trying to focus future optimization efforts. It now sports some
(gcc-only) MMX optimizations thanks to Rudolf Marek, though notably not
for the iDCT or loop filter yet. But, it also has many algorithmic
optimizations, including a significant reduction in the number of calls
to oggpackB_read (by reading more than one bit at a time when possible).
In addition it supports a striped decode mode, which allows you to blit
decoded data to the display (and do color conversion or what have you)
as soon as it is available, while it's still in cache. It hasn't yet
been ported to libogg2 as the libogg2 API is not quite ready for a
public release yet, but I don't believe such a port would be difficult.
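
As an illustration of the "more than one bit at a time" idea, the sketch below fetches six one-bit flags with a single read instead of six separate calls. It is not code from theora-exp: the `bitwindow` struct, the `bw_*` functions and `read_flags6` are hypothetical, and the 64-bit window with byte-wise refill is just one common way to make multi-bit reads cheap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSb-first reader that keeps unread bits in a 64-bit window. */
typedef struct {
    const uint8_t *ptr;    /* next byte not yet loaded into the window */
    const uint8_t *end;    /* one past the last byte of the packet */
    uint64_t       window; /* unread bits, left-aligned (next bit is bit 63) */
    int            avail;  /* number of valid bits in window */
} bitwindow;

static void bw_init(bitwindow *bw, const uint8_t *data, size_t size)
{
    bw->ptr = data;
    bw->end = data + size;
    bw->window = 0;
    bw->avail = 0;
}

/* Top up the window one byte at a time while a whole byte still fits. */
static void bw_refill(bitwindow *bw)
{
    while (bw->avail <= 56 && bw->ptr < bw->end) {
        bw->window |= (uint64_t)*bw->ptr++ << (56 - bw->avail);
        bw->avail += 8;
    }
}

/* Read 1..32 bits, MSb first. Returns 0 on success, -1 on overrun
   or on an unsupported bit count. */
static int bw_read(bitwindow *bw, int bits, uint32_t *out)
{
    if (bits < 1 || bits > 32) return -1;
    if (bw->avail < bits) {
        bw_refill(bw);
        if (bw->avail < bits) return -1;
    }
    *out = (uint32_t)(bw->window >> (64 - bits));
    bw->window <<= bits;
    bw->avail -= bits;
    return 0;
}

/* One 6-bit read replaces six 1-bit reads; flags[0] is the first bit. */
static int read_flags6(bitwindow *bw, uint8_t flags[6])
{
    uint32_t v;
    if (bw_read(bw, 6, &v) != 0) return -1;
    for (int i = 0; i < 6; i++) flags[i] = (uint8_t)((v >> (5 - i)) & 1u);
    return 0;
}
```

The saving comes from paying the bounds check and the call overhead once per group of bits rather than once per bit. This only applies where the bitstream syntax fixes how many bits follow.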

I think that covers all your options for further optimization. Obviously
which direction you go depends on your own schedule constraints and
project requirements.

Ivan Popov

2005-Apr-12 06:47 UTC


On Tue, Apr 12, 2005 at 06:41:34AM +0200, denpo wrote:
> > VP3HoSwiYo posted a
> > forward-port of vp32's MMX implementation to this mailing list (you can
> > search for it with Google). I don't believe it was ever officially
> > incorporated into the theora-mmx branch.
> Yup, found it (
> http://lists.xiph.org/pipermail/theora-dev/2004-August/002242.html ),
> unfortunately the link is dead. Since I didn't know the list was so
> responsive, I figured this was old news and had to find another way to
> reach my goals. So I guess someone still has this patch somewhere.
> I'd be pleased to get a copy.
There is another (working) link:
http://kyoto.cool.ne.jp/vp3/developers/theora-a3-MMXd.zip
(http://lists.xiph.org/pipermail/theora-dev/2004-August/002348.html)

Regards,
--
Ivan
using the opportunity to remind the list that the corresponding optimized
libtheora is available at konvalo.org, along with mplayer.theora-mmx which
seems to be the fastest decoding tool available for theora right now anywhere.

(it is compiled and set up to work on any Linux 2.6 distro, as well as
on FreeBSD 5.3 and even on NetBSD 2.0, on Intel 32-bit CPUs)

You may look at
/coda/konvalo.org/sw/pm/1/TOP/t/theora/V/svn20041118/L/2/NOTES
to see how the library was compiled.
