Because I want to use vorbis to stream multiple tracks of
compressed audio in a game, I care a bit about its performance.
So today, once I had the basic functionality I needed, I
started benchmarking. The app I have basically just uses
vorbisfile to decode a compressed stream as 44 kHz 16-bit PCM;
it's a Win32 app compiled with Visual C++ 6.
I found that playback of one music stream took about 11% of
my Athlon 500's CPU, which seemed like a lot.
Of course, I was not worried about this, as real
hardcore optimizations should not be done until the system
is finalized. But I wanted to get a good feel for where the
CPU time is spent, so I profiled the app.
I found one alarming thing -- 37% of the CPU time was being
spent in hypot(), in _vlpc_de_helper in lpc.c. Apparently the
Visual C++ version of hypot is hideously slow... it was spending
all its time setting up rounding modes and calling all kinds of
funky math functions, ensuring some form of numeric
stability that we don't need. So I changed the hypot to
a simple sqrt(a*a + b*b) and the code became drastically
faster (as the compiler then just uses the chip's fsqrt
instruction; hypot was calling some software sqrt thing).
I don't know whether gcc has similar issues but someone should
check this.
It also seemed like a lot of time was being spent in vorbisfile
converting float to int (as seen in the earlier discussion) and
writing the samples to the output buffer.
I made two basic changes here. One was using the fast integer
conversion code that I posted last week (but inserted into
vorbisfile, not in the dct code as was being pondered earlier).
The other change was, I checked whether the endianness of the
machine was the same as the output endianness that the user
asks for. If so, I use a loop that outputs each sample as a
single 'short', which eliminates the bit shifting and masking
and half the write operations. Because the user is almost
always going to want the numbers back in their native
endianness, this is an effective optimization.
With these changes in place, my app runs almost twice as fast
as before. And looking at the way things are set up, I'm
pretty confident I could get at least another factor of 2 or 3
speed improvement out of the existing code when it comes time
to Optimize For Real.
I also noticed that, though some people are building on win32,
there doesn't seem to be an official vc6 project checked into the
tree. I offer to check one in and maintain it if that is
acceptable (I'm going to be maintaining a vc6 project anyway, so I
might as well share the effort).
Should I consider submitting a patch with these optimizations in
it? The sqrt(a*a + b*b) seems like a no-brainer to include;
the other ones are more questionable, and it just depends on how
much we care about performance at this stage. What is the
procedure for submitting a patch? Do I just email it to Monty?
Also, someone was talking earlier about assembly-language
optimization of the code at some point in the future. One thing we
have learned in game development is that there is not much point
to writing assembly code for modern processors; you gain hardly
anything beyond what you can achieve from C. Even in the cases
of special instruction sets (like Katmai or the 3DNow instructions)
there are C macros that you can use to do the right thing, most of
the time. Speaking of which, 3DNow has an excellent fast inverse
square root instruction that would be just the sauce for replacing
that 1./hypot in _vlpc_de_helper.
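To illustrate the "intrinsics from C" point, here is a hedged
sketch of an approximate 1/sqrt(a*a + b*b). It uses the SSE
(Katmai) intrinsic _mm_rsqrt_ss rather than the 3DNow instruction,
and falls back to plain C where SSE is not available. The function
name and the HAVE_SSE_RSQRT macro are made up for this example, and
it assumes a compiler that provides <xmmintrin.h>.

```c
#include <math.h>

#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1   /* hypothetical feature macro */
#endif

/* Approximate 1 / sqrt(a*a + b*b) in single precision.
   Assumes a*a + b*b > 0; a zero input is not handled. */
float approx_inv_magnitude(float a, float b)
{
    float x = a * a + b * b;
#ifdef HAVE_SSE_RSQRT
    /* Hardware estimate, good to roughly 12 bits. */
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    /* One Newton-Raphson step to refine the estimate. */
    return y * (1.5f - 0.5f * x * y * y);
#else
    /* Portable fallback when SSE is not available. */
    return 1.0f / sqrtf(x);
#endif
}
```

Whether the reduced precision is acceptable in the LPC code would
need to be checked by listening tests or by comparing decoder
output; the sketch only shows the mechanics.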
-Jonathan.
--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/