It is relatively easy to convert some SSE2/3/4 code into AVX2: just
use AVX2 intrinsics in place of the SSE ones and keep the logic of the
functions unchanged.
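
To illustrate the kind of conversion meant here, below is a minimal
sketch. It is not code from libFLAC: the function names and the
first-order difference it computes are made up for the example, and the
target attribute and __builtin_cpu_supports() are GCC/Clang extensions
assumed so that it builds without extra compiler flags. The loop logic
is the same in both versions; only the register width and the intrinsic
prefix change.

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  /* SSE2 and AVX2 intrinsics */

/* Hypothetical example: residual[i] = data[i] - data[i-1] for 1 <= i < n.
 * residual[0] is left untouched. Inputs are assumed small enough that the
 * subtraction cannot overflow (true for 16/24-bit audio samples). */

/* Scalar reference version. */
static void residual_order1_c(const int32_t *data, int32_t *residual, size_t n)
{
    for (size_t i = 1; i < n; i++)
        residual[i] = data[i] - data[i - 1];
}

/* SSE2 version: 4 samples per iteration. */
__attribute__((target("sse2")))
static void residual_order1_sse2(const int32_t *data, int32_t *residual, size_t n)
{
    size_t i = 1;
    for (; i + 4 <= n; i += 4) {
        __m128i cur  = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i prev = _mm_loadu_si128((const __m128i *)(data + i - 1));
        _mm_storeu_si128((__m128i *)(residual + i), _mm_sub_epi32(cur, prev));
    }
    for (; i < n; i++)  /* scalar tail */
        residual[i] = data[i] - data[i - 1];
}

/* AVX2 version: same logic, 8 samples per iteration. */
__attribute__((target("avx2")))
static void residual_order1_avx2(const int32_t *data, int32_t *residual, size_t n)
{
    size_t i = 1;
    for (; i + 8 <= n; i += 8) {
        __m256i cur  = _mm256_loadu_si256((const __m256i *)(data + i));
        __m256i prev = _mm256_loadu_si256((const __m256i *)(data + i - 1));
        _mm256_storeu_si256((__m256i *)(residual + i), _mm256_sub_epi32(cur, prev));
    }
    for (; i < n; i++)  /* scalar tail */
        residual[i] = data[i] - data[i - 1];
}
```

The AVX2 function must only be called on a CPU that supports it, for
example behind a check such as __builtin_cpu_supports("avx2").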
Unfortunately my CPU doesn't have AVX2. But today I managed to briefly
test the AVX2 code on an i5 Haswell CPU. I wasn't able to run the
full test suite on Haswell, but the new code seems to work correctly.
The results of a quick performance test are:
16-bit WAV encoding: ~20% speed increase
24-bit WAV encoding: ~40% speed increase
The speed increase isn't impressive for 16-bit input, and this code
requires Haswell. But it's still a speed improvement, at the cost of
another increase in the size of the executable files (by 20-30 kB).
What do you think?
Also, the new code requires AVX CPU/OS support detection code to be added
to cpu.c. I'd like to simplify cpu.c slightly further before this, for
example by removing the 3DNow code, because it's hardly relevant these days.
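
For reference, the detection usually takes three steps: check the
OSXSAVE and AVX bits in CPUID leaf 1, read XCR0 with XGETBV to confirm
that the OS saves the XMM and YMM state, and only then check the AVX2
bit in CPUID leaf 7. A rough sketch follows. It assumes GCC or Clang on
x86 (for <cpuid.h> and the inline asm), and the struct and function
names are hypothetical, not what cpu.c uses or will use.

```c
#include <stddef.h>
#include <stdint.h>
#include <cpuid.h>  /* GCC/Clang wrappers around the CPUID instruction */

/* Hypothetical result type for this sketch. */
typedef struct {
    int avx;   /* CPU has AVX and the OS saves YMM state */
    int avx2;  /* the above, plus the CPU has AVX2 */
} avx_support;

/* Read the low 32 bits of XCR0. Only valid when CPUID reports OSXSAVE. */
static uint32_t read_xcr0_low(void)
{
    uint32_t eax, edx;
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    (void)edx;
    return eax;
}

static avx_support detect_avx(void)
{
    avx_support s = {0, 0};
    unsigned eax, ebx, ecx, edx;
    unsigned max_leaf = __get_cpuid_max(0, NULL);

    if (max_leaf < 1)
        return s;

    /* Leaf 1: ECX bit 27 = OSXSAVE, ECX bit 28 = AVX. */
    __cpuid(1, eax, ebx, ecx, edx);
    if (!((ecx >> 27) & 1) || !((ecx >> 28) & 1))
        return s;

    /* XCR0 bit 1 = SSE (XMM) state, bit 2 = AVX (YMM) state; both needed. */
    if ((read_xcr0_low() & 0x6) != 0x6)
        return s;
    s.avx = 1;

    /* Leaf 7, subleaf 0: EBX bit 5 = AVX2. */
    if (max_leaf >= 7) {
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        s.avx2 = (ebx >> 5) & 1;
    }
    return s;
}
```

The XCR0 check matters because a CPU can advertise AVX while the OS does
not preserve the YMM registers across context switches, in which case
AVX code must not be used.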