Hi All,
I've been fooling around with NVIDIA CUDA (the new architecture for
parallel general-purpose computing on their latest video cards
such as the GeForce 8800 series). In my opinion, these devices
promise to be the perfect acceleration platform for FLAC. They
offer massive SIMD-type parallelism for floating-point processing,
with two available kinds of batching (thread blocks vs. grids)
mapping very nicely to FLAC's operations within audio frames vs.
across frames.
For example, compute_autocorrelation() parallelizes very nicely
using a thread block. The "readable version" of the loop in lpc.c is:
while (lag--) {
    for (i = lag, d = 0.0; i < data_len; i++)
        d += data[i] * data[i-lag];
    autoc[lag] = d;
}
Instead of stepping through the "while" loop sequentially, we
launch 'lag' threads, each computing one autoc[] entry for its own
index on its own processor. The per-thread code would look
something like this:
for (i = threadIdx.x, d = 0.0; i < data_len; i++)
    d += data[i] * data[i-threadIdx.x];
autoc[threadIdx.x] = d;
Of course, this only scales up to the hardware limit on the number of
threads per block.
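To make the idea concrete, here is a minimal sketch of what such a
kernel and its host-side launch could look like (function and variable
names like autocorr_kernel are my own, not FLAC's or CUDA's, and I use
plain float in place of FLAC__real; error checking omitted):

/* Sketch only: one thread per lag value, all in a single block. */
__global__ void autocorr_kernel(const float *data, int data_len, float *autoc)
{
    int lag = threadIdx.x;                /* this thread's lag */
    float d = 0.0f;
    for (int i = lag; i < data_len; i++)
        d += data[i] * data[i - lag];
    autoc[lag] = d;
}

/* Host side: copy the window over, run 'max_lag' threads in one block,
   copy the results back. */
void autocorr_on_gpu(const float *data, int data_len, int max_lag, float *autoc)
{
    float *d_data, *d_autoc;
    cudaMalloc((void **)&d_data, data_len * sizeof(float));
    cudaMalloc((void **)&d_autoc, max_lag * sizeof(float));
    cudaMemcpy(d_data, data, data_len * sizeof(float), cudaMemcpyHostToDevice);

    autocorr_kernel<<<1, max_lag>>>(d_data, data_len, d_autoc);

    cudaMemcpy(autoc, d_autoc, max_lag * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_autoc);
}

One obvious cost here is the host/device copies; batching more work per
launch (see below) would help amortize them.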
In addition to this kind of parallelism, grids of blocks (which do
not share memory, unlike threads within the same block) can be used
to process several audio frames at once. This is somewhat tricky,
given the explicitly stream-oriented API and some CUDA peculiarities.
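Roughly, a batched version could use blockIdx to pick the frame and
threadIdx to pick the lag. A sketch, again with made-up names, assuming
the windows of num_frames frames have been packed into one contiguous
device buffer:

/* Sketch: one block per frame, one thread per lag value.
   'frames' holds gridDim.x windows of data_len samples each, packed
   back to back; 'autoc' holds gridDim.x * blockDim.x results. */
__global__ void autocorr_batch_kernel(const float *frames, int data_len, float *autoc)
{
    const float *data = frames + blockIdx.x * data_len;  /* this block's frame */
    int lag = threadIdx.x;
    float d = 0.0f;
    for (int i = lag; i < data_len; i++)
        d += data[i] * data[i - lag];
    autoc[blockIdx.x * blockDim.x + lag] = d;
}

/* Launched as:
   autocorr_batch_kernel<<<num_frames, max_lag>>>(d_frames, data_len, d_autoc); */

The host side would presumably need to buffer several frames' worth of
input before each launch.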
With the ~330 GFLOPS the current cards offer, I'd expect quite a
significant acceleration.
Does anyone find this interesting?
Josh: do you think this would be worth including in the FLAC codebase
when implemented?
-- boris