On 31.5.2013 13:04, Miroslav Lichvar wrote:
> On Wed, May 29, 2013 at 04:08:57PM +0200, Martijn van Beurden wrote:
>> I was surprised to see that the Windows compile on wine actually
>> outperformed the native Linux one. Probably GCC 4.6 optimized a little
>> better or something very weird is going on in wine, I don't know. The
>> assembly optimizations work very well on encoding, but actually slow
>> things down when decoding. The difference is not very large however.
> In a quick test with a pre 4.8 gcc on a Core 2 CPU I see a small
> improvement in decoding speed with assembly optimizations turned on,
> but I think the difference used to be larger. Perhaps the compilers
> got better or MMX is slower relative to normal code on current CPUs.
>
> Disabling the FLAC__bitreader_read_rice_signed_block_asm_ia32_bswap
> function seems to help a bit. (there is an #if disabling the function
> with comment "OPT: not clearly faster, needs more testing" in the
> src/libFLAC/stream_decoder.c file)
>
> Here is the relative decoding speed with -5 and -8:
>                         -5       -8
> no asm                  99.0%    97.0%
> asm                    100.0%   100.0%
> asm (no ia32_bswap)    102.7%   102.7%
>
> I think we should drop that assembly function as the C
> version seems to be faster now.
>
> Can anyone confirm this?
>
> Thanks,
>

I can confirm. I see a 10% speed improvement with that change on a Core i7.
Decoding a 1h18min38.133s long test file encoded with FLAC -8 takes
7.656s with the normal asm optimizations (speed: 616.266x realtime) and
6.937s with that tiny change (speed: 680.140x realtime).
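The realtime figures in the message above can be checked with a short calculation. The following is only a sketch: the function and variable names are made up for illustration, and the only inputs are the file duration and the two decode times quoted in the message.

```python
# Sanity check of the realtime factors quoted in the message.
# All names here are illustrative; only the three timings come from the thread.

def realtime_factor(audio_seconds, decode_seconds):
    """How many seconds of audio are decoded per second of wall-clock time."""
    return audio_seconds / decode_seconds

# Test file duration: 1h18min38.133s
audio = 1 * 3600 + 18 * 60 + 38.133

asm_time = 7.656       # seconds, normal asm optimizations
no_bswap_time = 6.937  # seconds, ia32_bswap asm function disabled

with_asm = realtime_factor(audio, asm_time)
without_bswap = realtime_factor(audio, no_bswap_time)

# Relative gain in decoding speed from disabling the asm function
speedup = asm_time / no_bswap_time - 1

print(f"asm:               {with_asm:.3f}x realtime")
print(f"asm (no bswap):    {without_bswap:.3f}x realtime")
print(f"speed improvement: {speedup:.1%}")
```

The results agree with the reported figures when the comma in the original numbers is read as a decimal separator, and the gain works out to roughly 10%.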
On 1.6.2013 14:24, Janne Hyvärinen wrote:
> I can confirm. I see 10% speed improvement with that change on Core i7.
> Decoding a 1h18min38.133s long test FLAC -8 encoded file takes with
> normal asm optimizations 7.656s (speed: 616.266x realtime) and with that
> tiny change 6.937s (speed: 680.140x realtime).
>

I noticed a side effect of this change. Encoding got a bit slower, at
least when MD5 checksumming is enabled.
On Sat, Jun 01, 2013 at 02:33:55PM +0300, Janne Hyvärinen wrote:
> On 1.6.2013 14:24, Janne Hyvärinen wrote:
> > I can confirm. I see 10% speed improvement with that change on Core i7.
> > Decoding a 1h18min38.133s long test FLAC -8 encoded file takes with
> > normal asm optimizations 7.656s (speed: 616.266x realtime) and with that
> > tiny change 6.937s (speed: 680.140x realtime).

Thanks for the testing.

> I noticed a side effect for this change. Encoding got a bit slower at
> least when md5 checksumming is enabled.

That's odd. How much slower was the encoding? Could it be caused by an
increase in the size of the function (only with -funroll-loops?), so
that it no longer fits in the cache during encoding? It might be good
to use -funroll-loops only with some files; IIRC it helped
stream_encoder.c the most.

-- 
Miroslav Lichvar