Rodolphe Ortalo
2003-May-08 16:33 UTC
[theora-dev] MMX and extended-MMX acceleration patch for encoding
Hello,

attached is a gzipped patch file to the lib/mcomp.c source file of theora (as of the current AnonCVS version) that implements MMX and extended-MMX optimizations in the most frequently used functions of the encoder (as shown by gprof).

This is more a proof of concept than a real request for inclusion into the source tree. My personal intent was more to look deeper into the MMX instruction set and/or GCC and/or Theora than a real need for performance improvements. :-) Plus the fact that, apparently, I still have difficulties with the mathematics of video compression and could not do more than grunt work on this kind of code... :-)) Of course, some of you may find this interesting, so I just want to share the results.

I have introduced 4-5 new (inline) functions that correspond to the core compute-intensive operations of the encoder (as found experimentally with gprof). These are in fact wrappers that allow switching between implementation variants, and I have implemented several variants: C, MMX assembly and MMXEXT assembly (something like SSE maybe, recent extensions apparently, found in PIII and Athlon).

The preprocessor directives HAVE_MMX and HAVE_MMXEXT select at compile time which code actually gets used. So, use CFLAGS="-DHAVE_MMX" to get the MMX implementation, and CFLAGS="-DHAVE_MMX -DHAVE_MMXEXT" to get the MMXEXT implementation (which uses both MMX and extMMX instructions). The wrappers also allow back-to-back testing of the C and assembly implementations (very useful for testing). Use -DTEST_MMX to compile this checking code in (in this case, both variants are called each time, so do not expect performance improvements when doing double work... :-).

I have observed between 10% and 30% improvements in encoding speed using these assembly implementations. The MMXEXT implementations offer the most impressive improvements (on a PIII or Athlon CPU, some functions like the sum of absolute differences can be done via a single extMMX instruction), but MMX too shows improvements.
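The wrapper scheme described above can be sketched roughly as follows. This is not the code from the patch: the function names (sad8_c, sad8_mmxext, sad8) and the 8-byte block size are hypothetical, and the accelerated variant uses the compiler intrinsic _mm_sad_pu8 from <xmmintrin.h> as a stand-in for the inline assembly the patch actually uses. Only the macro names HAVE_MMXEXT and TEST_MMX come from the message; the extra __SSE__ check is an assumption added so the sketch falls back to plain C on targets without the extension.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Enable the accelerated path only when requested AND the target supports it.
 * (The __SSE__ guard is an assumption of this sketch, not part of the patch.) */
#if defined(HAVE_MMXEXT) && defined(__SSE__)
#include <xmmintrin.h>
#define USE_MMXEXT 1
#else
#define USE_MMXEXT 0
#endif

/* Reference C variant: sum of absolute differences over 8 bytes. */
static unsigned sad8_c(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

#if USE_MMXEXT
/* Extended-MMX variant: one PSADBW does the whole 8-byte sum.
 * Written with intrinsics here; the patch uses __asm__() instead. */
static unsigned sad8_mmxext(const unsigned char *a, const unsigned char *b)
{
    __m64 va, vb;
    unsigned r;
    memcpy(&va, a, 8);               /* unaligned-safe loads */
    memcpy(&vb, b, 8);
    r = (unsigned)_mm_cvtsi64_si32(_mm_sad_pu8(va, vb));
    _mm_empty();                     /* leave MMX state (EMMS) */
    return r;
}
#endif

/* Wrapper: selects the variant at compile time; with TEST_MMX it also
 * runs the C variant and checks that both agree. */
static unsigned sad8(const unsigned char *a, const unsigned char *b)
{
#if USE_MMXEXT
    unsigned fast = sad8_mmxext(a, b);
#ifdef TEST_MMX
    assert(fast == sad8_c(a, b));    /* back-to-back check, double work */
#endif
    return fast;
#else
    return sad8_c(a, b);
#endif
}
```

Built without any flag, sad8() is just the C loop; with -DHAVE_MMXEXT -DTEST_MMX every call pays for both variants, which matches the remark that no speedup should be expected in test mode.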
Globally, I'd say that one could expect a 15% improvement, but this should be assessed with longer testing and different test sets. For testing, I have only used the test files published on the theora.org web site some time ago. I include at the end of this mail some time measurements on various computers, using this test set. I do not know the impact of these modifications on the player.

Note too that these implementations should also probably be validated with respect to accuracy. I have tried very hard not to introduce any arithmetic error when using assembly but, in some cases, C-based and MMX-based results differ (e.g. the integer average of two values via the MMX instruction set adds 1 to the intermediate result before division by 2), so encoding the same data via the C or MMX functions does not produce the same ogg file. But both files seem correct visually, at least from what I saw, and they do not differ significantly in size. Maybe others on the list would like to run these optimizations on bigger test sets and compare the C-based and assembly-based variants with more numerical techniques, to finally assess the performance improvement and validate my code.

All in all, it seems to me that it was worth the effort (and also that one should not make this kind of effort too often), but feel free to disagree.

Final note: I used __asm__() GCC assembly directives, so the code should compile easily with many versions of GCC (I used 2.95 I think, the default one on Debian 3.0 in fact). [Btw, note that you need to use some level of optimization with GCC (I used the default ones) for the inline functions I added to actually be inlined and not incur a penalty...]

Recently GCC 3.2 introduced new compiler builtins for MMX and vector operations.
I have not used them because GCC 3.2 is very recent (and I do not have it installed yet), but I looked at them and I think my implementation should be easy to translate to use the builtins instead of inline assembly, in 6 or 12 months (and then, maybe GCC will give us better loop unrolling, register allocation and additional perf. or simpler code. Maybe...).

Do not hesitate to react and give impressions, see you,

Rodolphe

Some results (for the test file published on the Theora site):

Athlon XP 2200+                  real         user         sys
* Normal quality
  MMX-ext optimization           0m2.483s     0m2.450s     0m0.030s
  MMX optimization               0m3.075s     0m3.020s     0m0.050s
  No optimization                0m3.524s     0m3.490s     0m0.040s
* High quality (-v 9)
  MMX-ext optimization           0m3.155s     0m3.090s     0m0.070s
  MMX optimization               0m4.316s     0m4.260s     0m0.050s
  No optimization                0m5.131s     0m5.080s     0m0.060s

K6-3 450MHz                      real         user         sys
* Normal quality (no opt)
  MMX optimization               0m13.880s    0m13.590s    0m0.210s
  No opt                         0m17.418s    0m16.850s    0m0.240s
* High quality (-v 9)
  MMX optimization               0m17.810s    0m17.360s    0m0.270s
  No opt                         0m23.945s    0m23.510s    0m0.240s
* Highest quality (-v 10)
  MMX optimization               0m18.100s    0m17.850s    0m0.130s
  No opt                         0m24.082s    0m23.590s    0m0.230s

PIII 800MHz                      real         user         sys
* Normal quality (no opt)
  MMX-ext optimization           0m7.741s     0m6.410s     0m0.120s
  MMX optimization               0m8.618s     0m7.230s     0m0.100s
  No opt                         0m9.645s     0m8.020s     0m0.150s

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mmx-mcomp.patch.gz
Type: application/octet-stream
Size: 5111 bytes
Desc: GZIPed patch file
Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20030509/01908450/mmx-mcomp.patch-0001.obj
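The accuracy remark in the message above (the MMX-style integer average adds 1 before dividing by 2) can be illustrated with a small sketch. The function names are hypothetical, and it is an assumption that the C code in question uses a plain truncating average; the message only says that the two results can differ. The rounding form (a + b + 1) >> 1 is what the packed-average instructions of the extended instruction set compute per byte.

```c
#include <assert.h>

/* Truncating average, as plain C integer arithmetic would give
 * (assumed here to be what the C variant does). */
static unsigned char avg_trunc(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b) >> 1);
}

/* Rounding average: 1 is added to the sum before the shift. */
static unsigned char avg_round(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);
}

/* Count the (a, b) byte pairs for which the two averages disagree.
 * They disagree exactly when a + b is odd. */
static int count_mismatches(void)
{
    int a, b, n = 0;
    for (a = 0; a < 256; a++)
        for (b = 0; b < 256; b++)
            if (avg_trunc((unsigned char)a, (unsigned char)b) !=
                avg_round((unsigned char)a, (unsigned char)b))
                n++;
    return n;
}
```

The two forms differ by exactly 1 whenever a + b is odd, i.e. for half of all byte pairs, which is enough to make the encoder output differ bit for bit even though each individual error is tiny.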
Jay Sprenkle
2003-May-09 08:00 UTC
[theora-dev] MMX and extended-MMX acceleration patch for encoding
--- Rodolphe Ortalo <ortalo@laas.fr> wrote:
> Hello,
>
> attached is a gzipped patch file to the lib/mcomp.c source file of theora
> (as of AnonCVS current version) that implements MMX and extended-MMX
> optimizations in the most frequently used functions of the encoder
> (as shown by gprof).

Wow! I'm impressed you got that much improvement from it. Impressive work!

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'theora-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
Dan Miller
2003-May-09 09:06 UTC
[theora-dev] MMX and extended-MMX acceleration patch for encoding
Just as a point of reference, the original VP3 package was able to encode 320x240, 30 fps material in real time on a 1 GHz processor.

___
Dan Miller (++,)
Founder, On2 Technologies
Rodolphe Ortalo
2003-May-11 15:09 UTC
[theora-dev] MMX and extended-MMX acceleration patch for encoding
Hello,

I've just tried to play with GCC 3.2 (which includes C builtin functions and types to use MMX instructions) and I've noticed that, with this new version of GCC, my original MMX patch breaks theora. Apparently, GCC 3 handles some type casts differently and, when using this compiler, the theora encoding process is incorrect. (The primary symptom is that file sizes increase suddenly.) I do not know if this is due only to my patch modifications (the additional C functions in particular) or if such problems occur with the original theora code currently in CVS too.

So, my MMX patch should probably be used with a lot of care with GCC 3, and probably only with GCC 2.95. I am currently trying to change it anyway to work with the GCC 3.2 MMX-oriented extensions.

In all cases, I strongly suggest that theora developers also try to use GCC 3.2 with the original C code (and probably also with -Werror and -pedantic) to see if this new compiler does not reveal old typecast subtleties... :-)

Rodolphe