Martijn van Beurden
2021-Jun-24 07:17 UTC
[flac-dev] Autocorrelation precision insufficient
Hi all,

Recently I've been investigating various ways to improve FLAC compression, and now I've stumbled upon quite a small change with large implications. Flake, an alternative compressor using the FLAC format, has always provided better compression than FLAC. I've found out why: Flake uses doubles (64-bit floating point) for calculating autocorrelation values, while FLAC uses regular floats (32-bit floating point).

The largest problem with implementing this is that the intrinsics routines (for SSE and VSX) have to be rewritten. I've done quite a bit of testing and comparing, see the next two PDFs.

http://www.audiograaf.nl/misc_stuff/double-autoc-with-sse2-intrinsics.pdf
http://www.audiograaf.nl/misc_stuff/double-autoc-with-sse2-intrinsics-per-track.pdf

There are four lines, all going from setting -4 as the rightmost (fastest) point through -5, -6, -7 to -8 as the leftmost (slowest) point.

- the dark blue line is current git
- the green line is current git, but with the SSE intrinsics for autocorrelation calculation disabled
- the light blue line is calculating autocorrelation in doubles instead of real
- the red line is calculating autocorrelation in doubles, but with new SSE2 intrinsics routines

As you can see in the PDFs, the overall gain for setting -4 is large (0.3 percentage points, or 0.5%) with minimal slowdown. The gain shrinks and the slowdown grows as the setting increases.

The -per-track PDF shows that the gain is highly dependent on the kind of audio that is being compressed. Tracks with strong tonal components, like piano music (14 and 15), benefit the most. Orchestral music (2, 6, 10 and 9) and electronic music (4 and 13) benefit in varying degrees. Music with much more noisy content, like metal (3, 5 and 12), sees (almost) no benefit. However, in the tracks that do benefit, gains can be large. Track 15, which is piano music, sees a gain of 2.2 percentage points (5%) for setting -4 and 1 percentage point (2%) for -8.
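As a rough illustration of the precision difference described above, here is a minimal sketch in C. It is not the actual libFLAC or Flake code: the function names autoc_float and autoc_double are hypothetical, and the loops are the plain textbook form of autocorrelation without windowing or intrinsics. The only difference between the two routines is the type of the accumulator.

```c
#include <stddef.h>

/* Hypothetical sketch: autocorrelation with a 32-bit float accumulator.
 * autoc[lag] = sum over i of data[i] * data[i - lag], for lag < max_lag. */
void autoc_float(const float *data, size_t n, unsigned max_lag, float *autoc)
{
    for (unsigned lag = 0; lag < max_lag; lag++) {
        float sum = 0.0f;
        for (size_t i = lag; i < n; i++)
            sum += data[i] * data[i - lag];
        autoc[lag] = sum;
    }
}

/* Same computation, but products and the running sum are kept in
 * 64-bit doubles, so far fewer low-order bits are rounded away. */
void autoc_double(const float *data, size_t n, unsigned max_lag, double *autoc)
{
    for (unsigned lag = 0; lag < max_lag; lag++) {
        double sum = 0.0;
        for (size_t i = lag; i < n; i++)
            sum += (double)data[i] * (double)data[i - lag];
        autoc[lag] = sum;
    }
}
```

For example, with the input {4096, 1, 1, 1, 1} the double routine returns 16777220 for lag 0. On platforms that evaluate float arithmetic in single precision, the float routine stays at 16777216, because a 32-bit float cannot represent 2^24 + 1 and each added 1 is rounded away. Real audio blocks lose precision in the same way, only less visibly.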
Code is here: https://github.com/ktmf01/flac/tree/autoc-sse2

Before I send a pull request, I'd like to discuss a choice that has to be made. I see a few options:

- Don't switch to autoc[] as doubles, keep current speed and ignore possible compression gain.
- Switch to autoc[] as doubles, but keep current intrinsics routines. This means some platforms (with only SSE but not SSE2, or with VSX) will get less compression, but won't see a large slowdown.
- Switch to autoc[] as doubles, but remove current SSE and disable VSX intrinsics for someone to update them later (I don't have any POWER8 or POWER9 hardware to test). This means all platforms will get the same compression, but some (with only SSE but not SSE2, or with VSX) will see a large slowdown.

Thanks in advance for your replies and comments on this.

Kind regards,
Martijn van Beurden
Martijn van Beurden
2021-Jun-25 07:48 UTC
[flac-dev] Autocorrelation precision insufficient
On Thu, 24 Jun 2021 at 09:17, Martijn van Beurden <mvanb1 at gmail.com> wrote:

> - Switch to autoc[] as doubles, but remove current SSE and disable VSX
> intrinsics for someone to update them later (I don't have any POWER8
> or POWER9 hardware to test). This means all platforms will get the
> same compression, but some (with only SSE but not SSE2 or with VSX)
> will see a large slowdown.

I see now that besides the routines with SSE intrinsics (which I rewrote into SSE2) and with VSX intrinsics (which I don't have hardware for), there is also an open pull request for routines with ARM intrinsics. I am willing and able to rewrite those if this change is accepted and merged. I have access to ARMv8 with a 32-bit OS, ARMv8 with a 64-bit OS and ARMv6, and I might be able to get hold of ARMv7 hardware.