Sorry that it has been a while since the last AltiVec patch. I noticed
something interesting along the way, and so it remains unfinished...

On the PPC, even with the AltiVec optimizations, almost a quarter of the
time is spent in FLAC__stream_encoder_process(). I finally discovered that
this is because of all the integer-to-float conversions. Aside from being
exceptionally slow on the G4, they cause a ton of load/store rejects on
the 970, making matters even worse there.

Since the single-precision float conversion is much more efficient in
AltiVec, I have hacked the FLAC__lpc_compute_autocorrelation_altivec()
function to take the integer signal directly, so the real signal is never
computed at all. Is this OK? It doesn't seem to affect anything else,
though I admit it is ugly...
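
To illustrate the idea of feeding the integer signal straight into the
autocorrelation, here is a minimal scalar sketch in C. It is not the
patch itself: autocorr_from_int() is a hypothetical name, and the code
only assumes the usual definition autoc[l] = sum of data[i] * data[i-l].
In an AltiVec version the conversion would presumably be done four
samples at a time with vec_ctf() instead of the scalar casts shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of libFLAC: autocorrelation computed
 * directly from the integer signal. Each sample is converted to float
 * at the point of use, so no separate float copy of the signal is
 * filled in beforehand. */
static void autocorr_from_int(const int32_t *data, unsigned len,
                              unsigned lag, float *autoc)
{
    assert(lag > 0 && lag <= len);

    for (unsigned l = 0; l < lag; l++) {
        float sum = 0.0f;
        /* Sum of products of the signal with itself shifted by l. */
        for (unsigned i = l; i < len; i++)
            sum += (float)data[i] * (float)data[i - l];
        autoc[l] = sum;
    }
}
```

For the samples { 1, 2, 3, 4 } and lag 3 this gives 30, 20 and 11 for
lags 0, 1 and 2.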

Anyway, the overall improvement is about 5% at -8 and 15% at defaults.
In both cases, with this hack, the AltiVec version is now about 45%
faster.

What's left of a default encode is shown below. :) It seems that most of
the remaining time is consumed by the Rice coding...

25.7%  25.7%  flac         FLAC__bitbuffer_write_raw_uint32
11.0%  11.0%  flac         FLAC__bitbuffer_write_rice_signed
10.8%  10.8%  flac         FLAC__MD5Accumulate
 6.9%   6.9%  flac         set_partitioned_rice_
 6.9%   6.9%  flac         FLAC__stream_encoder_process
 6.8%   6.8%  flac         find_best_partition_order_
 5.6%   5.6%  flac         FLAC__MD5Transform
 4.8%   4.8%  flac         FLAC__fixed_compute_best_predictor_altivec
 4.2%   4.2%  flac         format_input
 2.9%   2.9%  flac         FLAC__lpc_compute_autocorrelation_altivec
 2.0%   2.0%  flac         FLAC__fixed_compute_residual_altivec
 1.9%   1.9%  flac         FLAC__crc16
 1.8%   1.8%  mach_kernel  ml_set_interrupts_enabled
 1.5%   1.5%  flac         FLAC__lpc_compute_residual_from_qlp_coefficients_altivec
 1.2%   1.2%  flac         FLAC__lpc_compute_residual_from_qlp_coefficients_16bit_altivec

For fun, I wrote a fast signed Rice implementation, though I have yet to
adapt it to the bitbuffer.
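
For readers unfamiliar with signed Rice coding, the following is a plain
(not fast) sketch in C of what such a routine has to produce. It is not
the implementation mentioned above. The bitwriter type and the functions
put_bits(), fold_signed() and write_rice_signed() are hypothetical
stand-ins, not the libFLAC bitbuffer API, and the code assumes the common
convention of folding the sign into the low bit and writing the quotient
as zero bits terminated by a one bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit writer, for illustration only. */
typedef struct {
    uint8_t *buf;     /* output bytes, must start out zeroed */
    size_t   cap;     /* capacity of buf in bytes */
    size_t   bitpos;  /* number of bits written so far */
} bitwriter;

/* Append the low nbits of value, most significant bit first. */
static void put_bits(bitwriter *bw, uint32_t value, unsigned nbits)
{
    assert(nbits <= 32 && bw->bitpos + nbits <= bw->cap * 8);
    while (nbits--) {
        if ((value >> nbits) & 1u)
            bw->buf[bw->bitpos >> 3] |= (uint8_t)(0x80u >> (bw->bitpos & 7));
        bw->bitpos++;
    }
}

/* Fold a signed value to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4. */
static uint32_t fold_signed(int32_t v)
{
    return ((uint32_t)v << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}

/* Rice code with parameter k: the quotient (u >> k) in unary as zero
 * bits followed by a one bit, then the k low-order bits of u. */
static void write_rice_signed(bitwriter *bw, int32_t v, unsigned k)
{
    assert(k < 32);
    uint32_t u = fold_signed(v);
    uint32_t q = u >> k;

    while (q >= 32) {           /* long runs of zeros, 32 at a time */
        put_bits(bw, 0, 32);
        q -= 32;
    }
    put_bits(bw, 1, q + 1);     /* q zero bits, then the stop bit */
    put_bits(bw, u & ((1u << k) - 1u), k);
}
```

With k = 2, the value -3 folds to 5 and is written as the four bits
0101; the value 4 folds to 8 and is written as the five bits 00100.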

Also, for those interested, I came across a very nice arithmetic coding
implementation at:

http://www.cipr.rpi.edu/~said/FastAC.html

With a very crude adaptive model, it comes fairly close to the
partitioned Rice scheme, though I'm betting it would be considerably
faster, and a lot simpler. Perhaps it is worth some more investigation;
it really is elegant compared to the others I've seen. (Unfortunately, it
is written in the hideous language that is C++, but thankfully uses a
fairly reasonable subset of it.)

Chris