On 28-05-13 20:09, Janne Hyvärinen wrote:
> On Windows the 32-bit NASM enabled compiles are always fastest. If you
> can run 32-bit code on your Linux box you should compile with assembly
> optimizations.

That depends on the way you define speed. For decoding this doesn't seem to be true.

I reran my tests; it took a little longer because I couldn't believe the results I got. However, they are perfectly reproducible (on my system at least), so I guess I'll have to believe them. The linked PDFs show first a test with the average of 5 CDs and second the graph of only one of those 5. It is clearly visible that the 'speed ranking' for each compression setting matches very closely between the two, so the accuracy is probably pretty high. I did this comparison on Kubuntu 12.10 64-bit.

http://www.icer.nl/misc_stuff/All tracks.pdf
http://www.icer.nl/misc_stuff/Coldplay - Parachutes.pdf

I was surprised to see that the Windows compile on wine actually outperformed the native Linux one. Probably GCC 4.6 optimized a little better or something very weird is going on in wine, I don't know. The assembly optimizations work very well on encoding, but actually slow things down when decoding. The difference is not very large, however.

Anyway, I think I'm convinced now that my lossless codec comparison was valid and I can keep running codecs through wine. I should probably run all of them through wine just for the sake of clarity.

On Wed, May 29, 2013 at 04:08:57PM +0200, Martijn van Beurden wrote:
> I was surprised to see that the Windows compile on wine actually
> outperformed the native Linux one. Probably GCC 4.6 optimized a little
> better or something very weird is going on in wine, I don't know. The
> assembly optimizations work very well on encoding, but actually slow
> things down when decoding. The difference is not very large however.

In a quick test with a pre 4.8 gcc on a Core 2 CPU I see a small improvement in decoding speed with assembly optimizations turned on, but I think the difference used to be larger. Perhaps the compilers got better or MMX is slower relative to normal code on current CPUs.

Disabling the FLAC__bitreader_read_rice_signed_block_asm_ia32_bswap function seems to help a bit. (There is an #if disabling the function with the comment "OPT: not clearly faster, needs more testing" in the src/libFLAC/stream_decoder.c file.)

Here is the relative decoding speed with -5 and -8:

                          -5       -8
no asm                   99.0%    97.0%
asm                     100.0%   100.0%
asm (no ia32_bswap)     102.7%   102.7%

I think we should drop that assembly function as the C version seems to be faster now.

Can anyone confirm this?

Thanks,

-- 
Miroslav Lichvar

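The percentages in the table above are relative speeds with the asm build taken as 100%. As a rough sketch of that normalisation, assuming each build's wall-clock decode time was measured and that speed is taken as the inverse of time: the function name and the timings below are hypothetical placeholders, not measurements from this thread.

```python
def relative_speed(times, baseline):
    """Return each build's decoding speed as a percentage of the baseline build.

    Speed is inversely proportional to wall-clock time, so a build that
    needs less time than the baseline scores above 100%.
    """
    base = times[baseline]
    return {name: 100.0 * base / t for name, t in times.items()}


# Hypothetical decode times in seconds (illustrative only, not from the thread).
example_times = {"no asm": 10.10, "asm": 10.00, "asm (no ia32_bswap)": 9.74}

for name, pct in relative_speed(example_times, "asm").items():
    print(f"{name:<22}{pct:6.1f}%")
```

Normalising against one build makes results comparable across compression settings even when the absolute decode times differ.
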
On 31.5.2013 13:04, Miroslav Lichvar wrote:
> On Wed, May 29, 2013 at 04:08:57PM +0200, Martijn van Beurden wrote:
>> I was surprised to see that the Windows compile on wine actually
>> outperformed the native Linux one. Probably GCC 4.6 optimized a little
>> better or something very weird is going on in wine, I don't know. The
>> assembly optimizations work very well on encoding, but actually slow
>> things down when decoding. The difference is not very large however.
> In a quick test with a pre 4.8 gcc on a Core 2 CPU I see a small
> improvement in decoding speed with assembly optimizations turned on,
> but I think the difference used to be larger. Perhaps the compilers
> got better or MMX is slower relative to normal code on current CPUs.
>
> Disabling the FLAC__bitreader_read_rice_signed_block_asm_ia32_bswap
> function seems to help a bit. (there is an #if disabling the function
> with comment "OPT: not clearly faster, needs more testing" in the
> src/libFLAC/stream_decoder.c file)
>
> Here is the relative decoding speed with -5 and -8:
>                           -5       -8
> no asm                   99.0%    97.0%
> asm                     100.0%   100.0%
> asm (no ia32_bswap)     102.7%   102.7%
>
> I think we should drop that assembly function as the C
> version seems to be faster now.
>
> Can anyone confirm this?
>
> Thanks,
>

I can confirm. I see a 10% speed improvement with that change on a Core i7. Decoding a 1h18min38.133s long test file encoded with FLAC -8 takes 7.656s with the normal asm optimizations (speed: 616.266x realtime) and 6.937s with that tiny change (speed: 680.140x realtime).
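
For reference, the realtime figures quoted above follow from dividing the audio duration by the decode time. A small sketch of that arithmetic using only the numbers given in this message; the function name is made up for illustration.

```python
def realtime_factor(audio_seconds, decode_seconds):
    """How many seconds of audio are decoded per second of wall-clock time."""
    return audio_seconds / decode_seconds


# Test file length from the message: 1h18min38.133s.
audio = 1 * 3600 + 18 * 60 + 38.133

before = realtime_factor(audio, 7.656)   # normal asm optimizations
after = realtime_factor(audio, 6.937)    # with the ia32_bswap function disabled

gain_percent = 100.0 * (after / before - 1.0)
print(f"before: {before:.3f}x  after: {after:.3f}x  gain: {gain_percent:.1f}%")
```

The gain works out to a little over 10%, which matches the rounded figure in the message.
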