I'm trying to improve Opus performance on an AMD Geode CPU, which has
only limited SIMD support: MMX and AMD's 3DNow!, but not SSE2 or later.
Without optimizations I can only encode 16-bit audio at 16 kHz with
complexity up to 2-3 without underruns.
I tried compiling with the SSE2/SSE4 optimizations, but all I got was a
crash with SIGILL. So I looked into the optimized code and found that a
good starting point was the dot product. I inserted an MMX
implementation of it, which gained a bit of performance.
Then I looked at the xcorr function in its simplest form, which loops
over the lags and computes a dot product for each one, and replaced that
dot product with a call to the MMX version. This way I can go up to
complexity 3-4 without underruns.
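
To make this concrete, here is a minimal sketch of the approach (not my
actual patch). The names dot_prod_i16 and xcorr_i16 are made up for
illustration, I'm assuming 16-bit fixed-point samples, and the MMX path
is only compiled when __MMX__ is defined (e.g. gcc -mmmx); otherwise a
plain C loop is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: dot product of two int16 vectors of length n.
 * The sum is accumulated in 32 bits, so it can wrap on large inputs. */
int32_t dot_prod_i16(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t sum = 0;
    size_t i = 0;
#if defined(__MMX__)
    __m64 acc = _mm_setzero_si64();
    for (; i + 4 <= n; i += 4) {
        __m64 vx, vy;
        /* memcpy avoids alignment and aliasing assumptions. */
        memcpy(&vx, x + i, sizeof vx);
        memcpy(&vy, y + i, sizeof vy);
        /* pmaddwd: four 16x16 products, added pairwise into two
         * 32-bit lanes, then accumulated. */
        acc = _mm_add_pi32(acc, _mm_madd_pi16(vx, vy));
    }
    /* Horizontal sum of the two 32-bit lanes. */
    sum = _mm_cvtsi64_si32(acc)
        + _mm_cvtsi64_si32(_mm_srli_si64(acc, 32));
    /* Clear the MMX state so later x87 float code works. */
    _mm_empty();
#endif
    /* Scalar tail (or the whole loop when MMX is unavailable). */
    for (; i < n; i++)
        sum += (int32_t)x[i] * y[i];
    return sum;
}

/* Hypothetical xcorr in its simplest form: one dot product per lag.
 * y must hold at least len + nlags - 1 samples. */
void xcorr_i16(const int16_t *x, const int16_t *y, int32_t *out,
               size_t len, size_t nlags)
{
    for (size_t lag = 0; lag < nlags; lag++)
        out[lag] = dot_prod_i16(x, y + lag, len);
}
```

Note that in this form the horizontal sum and the emms (_mm_empty) are
paid once per lag, which is part of why I think it is far from optimal.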
Since this is far from optimal, I'm looking for other places that would
benefit significantly from SIMD parallelization.
Can you point some out? I was thinking about the FIR/IIR filter
implementations, but I'm afraid the overhead of using MMX would offset
the gain, since the filters are probably quite short.
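
To illustrate the concern: a direct-form FIR computes one short dot
product per output sample, so any fixed per-call cost (setup, horizontal
sum, emms) would be paid once per sample. A scalar sketch, with a
made-up name and buffer layout, not the actual Opus routine:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direct-form FIR (illustrative name and layout).
 * `in` holds ord-1 history samples followed by n new samples;
 * out[i] = sum over k of h[k] * in[i + ord - 1 - k]. */
void fir_i16(const int16_t *in, const int16_t *h, int32_t *out,
             size_t n, size_t ord)
{
    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;
        /* The inner loop is a dot product of only `ord` terms. */
        for (size_t k = 0; k < ord; k++)
            acc += (int32_t)h[k] * in[i + ord - 1 - k];
        out[i] = acc;
    }
}
```

If ord is small, the inner loop would only run a handful of MMX
iterations per sample, so the fixed cost could dominate unless it can be
amortized over the whole block.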
Of course I can share the MMX code, even though it is not yet cleanly
integrated into the source.
Thank you in advance,
Matteo