OK, into the nitty-gritty, albeit a high-level version. If it sounds
like I'm glossing over important details, you're right. This discusses
only the basic DSP; precise coding, framing, sync, etc., will be in
another mail.
Vorbis is a hybrid transform-domain general-purpose audio encoder,
like MPEG in some respects (it is rooted in much of the same basic
theory). For the most part, the similarity comes from the fact that
(like nearly all modern audio codecs) Vorbis codes primarily in the
MDCT domain, using nominal 50% overlap with a y=sin(PI/2*sin^2(x))
window (the window is somewhat unusual). The current Vorbis code and
working spec support switching between two block sizes, both of which
must result in power-of-two sized windowed blocks. From here, the
details diverge.
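For the window-shape curious, the curve above is cheap to generate
directly; a minimal C sketch (the function name and the idea of
precomputing into a lookup table are my own illustration, not the
actual Vorbis source):

  #include <math.h>

  /* Fill window[0..n-1] with the sin(PI/2 * sin^2()) shape described
     above.  A sketch only; the real encoder also has to handle the
     transitions between the two block sizes, which this ignores. */
  static void make_window(double *window, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          double s = sin((i + 0.5) / n * M_PI);  /* inner sine: 0..1..0 */
          window[i] = sin(0.5 * M_PI * s * s);   /* outer sine shaping  */
      }
  }

(M_PI is a common libm extension; define it yourself if your math.h
lacks it.)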
(Sidetrack note that will make more sense on a second reading):
The LAME pages mention that variable blocksize may well be patented
(I can demonstrate prior art in Ogg back to '94, but I expect that's
way too late ;-) In fact, we might as well get used to the fact right
now that practically anything we want to do with an audio stream
beyond addition is probably patented, and I'm bracing to find out
that addition is too. The patents are bogus, of course, but the
legal route of saying so tends to be pricey.
I'll not stray too far, but I'll mention that preliminary
experiments using envelope pre-clamping alone to control pre-echo
(described later) produce results apparently as good as absurdly
small blocks. Although I thought of this a while ago, I only got to
try it recently because it seemed like a wild shot in the
dark. Unexpectedly, the results were very good. It's possible we
will be able to get away with a fixed-blocksize encoder with no
quality penalty!
Unlike MPEG/AAC, there is no subbanding of the time domain data before
the MDCT; my own experiments last year indicated that a good window
function and a 'monolithic' MDCT produce transform domain data just
as good as subbanding, with much less complexity. Of course, the price
is that you need to make sure to have a *damned* fast MDCT at that
blocksize...
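For reference, the textbook O(n^2) MDCT looks like the sketch below;
it's only here to pin down the transform being discussed, and any real
implementation needs an FFT-based O(n log n) formulation instead.
Names are illustrative:

  #include <math.h>

  /* Direct-form MDCT: n windowed input samples in[] (n a power of
     two) produce n/2 coefficients in out[].  Painfully slow; for
     demonstration only. */
  static void slow_mdct(const double *in, double *out, int n)
  {
      int k, j;
      for (k = 0; k < n / 2; k++) {
          double acc = 0.;
          for (j = 0; j < n; j++)
              acc += in[j] * cos(M_PI / (2 * n) *
                                 (2 * j + 1 + n / 2) * (2 * k + 1));
          out[k] = acc;
      }
  }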
From the raw MDCT coefficient values, one then uses whatever
psychoacoustics one wishes to generate a spectral envelope curve (I
call it a 'floor') and quantized MDCT residue coefficients.
Multiplying the floor and the residue coefficients, element by
element, reconstructs (a quantized version of) the original MDCT
spectrum. (This floor curve need not be generated solely from
analysis of the MDCT domain data, obviously, although the current
code does it this way.)
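The decode side of that relationship is about as simple as it sounds;
a sketch with hypothetical names, not the actual Vorbis API:

  /* Reconstruct the quantized spectrum: floor curve times residue,
     coefficient by coefficient. */
  static void reconstruct_spectrum(const double *floor_curve,
                                   const double *residue,
                                   double *spectrum, int n)
  {
      int i;
      for (i = 0; i < n; i++)
          spectrum[i] = floor_curve[i] * residue[i];
  }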
The trick is that the floor curve tends to have formant-like features
due to the rolloff properties of masking tones. These can be encoded
into a low-order IIR LPC filter (20-30 poles, simplified direct-form
II; basic speech compression stuff). Speech codecs (and, I might add,
TwinVQ [VQF]) encode 20-pole systems into 20-28 bits. With a little
effort, a standard algorithm like Levinson-Durbin can be used to
generate a filter with a frequency response within a few percent of
the desired curve across the spectrum.
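For reference, the textbook Levinson-Durbin recursion is short; this
sketch assumes you've already reduced the desired curve to
autocorrelation values r[0..m] (e.g. via an inverse transform of the
squared curve), and it is not lifted from the Vorbis source:

  /* Levinson-Durbin: solve for m LPC coefficients from
     autocorrelation values r[0..m].  Returns the prediction error. */
  static double levinson_durbin(const double *r, double *lpc, int m)
  {
      double err = r[0];
      int i, j;
      for (i = 0; i < m; i++) {
          /* reflection coefficient for this order */
          double k = -r[i + 1];
          for (j = 0; j < i; j++)
              k -= lpc[j] * r[i - j];
          k /= err;
          lpc[i] = k;
          /* update the lower-order coefficients in place */
          for (j = 0; j < i / 2; j++) {
              double tmp = lpc[j];
              lpc[j] += k * lpc[i - 1 - j];
              lpc[i - 1 - j] += k * tmp;
          }
          if (i & 1)
              lpc[j] += lpc[j] * k;
          err *= 1.0 - k * k;
      }
      return err;
  }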
Unlike speech codecs and TwinVQ, Vorbis never applies the LPC
filter. It is used solely as a compact model to represent the
spectral envelope curve. The curve, at any point, can be efficiently
computed directly from the LPC coefficients.
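Evaluating the curve straight from the coefficients is just evaluating
the all-pole filter's magnitude response on the unit circle; a sketch
(the gain parameter and function name are mine):

  #include <math.h>

  /* Magnitude of gain / (1 + sum lpc[k] z^-(k+1)) at angular
     frequency w in [0, PI], using the sign convention of the
     Levinson-Durbin sketch above. */
  static double lpc_magnitude(const double *lpc, int m,
                              double gain, double w)
  {
      double re = 1., im = 0.;
      int k;
      for (k = 0; k < m; k++) {
          re += lpc[k] * cos((k + 1) * w);
          im -= lpc[k] * sin((k + 1) * w);
      }
      return gain / sqrt(re * re + im * im);
  }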
With a high-precision 'floor', the majority of the residual MDCT
coefficients are -1, 0, and 1. The floor curve can be used to
represent even high-power pure tones with a residue coefficient of
magnitude 1, but only if the peak is in isolation. That is, the
encoder can get away with anything it likes, as long as it understands
that the curve has a maximum, known resolution.
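The forward direction is the same idea run backwards: divide out the
floor and round. A sketch (it assumes the floor is strictly positive;
real quantization choices would be psychoacoustically driven):

  #include <math.h>

  /* Divide the spectrum by the floor and round to small integers;
     with a good floor, most values land in -1..1. */
  static void quantize_residue(const double *spectrum,
                               const double *floor_curve,
                               int *residue, int n)
  {
      int i;
      for (i = 0; i < n; i++)
          residue[i] = (int)rint(spectrum[i] / floor_curve[i]);
  }

(rint() is C99/POSIX; substitute your own rounding on older
compilers.)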
As in modern speech compression, the LPC filter is encoded in LSP
form. The residual coefficients are coded through a generic codeword
backend that uses a supplied codebook. The codebook may operate on
single scalars or vectors, and be lossless (direct) or lossy.
Basically, it's a generalization that handles scalar/vector Huffman
coding and VQ in the same model.
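To make the scalar/vector generalization concrete, here's a minimal
nearest-neighbor VQ search; the structure is hypothetical and says
nothing about the eventual Vorbis codebook format. The scalar,
lossless case is the same machinery with dimension 1 and exact
entries:

  #include <float.h>

  typedef struct {
      int entries;          /* number of codewords   */
      int dim;              /* elements per codeword */
      const double *table;  /* entries * dim values  */
  } codebook;

  /* Return the index of the codeword nearest v[0..dim-1]; that index
     would then be Huffman-coded into the stream. */
  static int vq_encode(const codebook *b, const double *v)
  {
      int best = 0, i, j;
      double best_dist = DBL_MAX;
      for (i = 0; i < b->entries; i++) {
          double dist = 0.;
          for (j = 0; j < b->dim; j++) {
              double e = b->table[i * b->dim + j] - v[j];
              dist += e * e;
          }
          if (dist < best_dist) { best_dist = dist; best = i; }
      }
      return best;
  }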
The points this doesn't cover are channel coupling (e.g., mid/side
stereo) and the exact codebook mechanism used to code the LSP
coefficients and MDCT residual. The biggest reason is that large
chunks of both are unresolved and I need *a lot* of input. I know of a
number of things that won't work or are the wrong thing to do, but
that doesn't mean I've yet settled on something that's right.
Envelope preclamping:
First, a caveat. This idea is simple and ugly. One's first reaction
tends to be, "hey, that's really simple!" The second is, "oh, that
can't possibly work." The strange thing is that it does, and it
doesn't even foul up psychoacoustics too badly. OK, enough hedging.
We're all familiar with pre-echo: large dynamic changes in the middle
of a block tend to smear and spread throughout the block when the MDCT
coefficients are quantized, and the result is audible as the familiar
pre-echo. I won't go into implications and details here; I think you
all know them. If not, speak up and several of us can explain in
unison :-)
In addition to just using shorter blocks, AAC also uses a clever
little technique called 'temporal noise shaping', which exploits the
fact that a sharp event in the time domain tends to produce an
oscillation in the frequency (MDCT) domain, a series of decaying
exponentials that encodes well into an LPC-like filter. It's a cute
idea, but complicated (and computationally expensive).
Vorbis simply normalizes the time-domain envelope before the MDCT.
Each block is scanned for a dynamic range increase over a
predetermined threshold; the envelope of the time domain signal is
normalized from the point of the change to the end of the block
(current code normalizes to an even +/- 6dB). The envelope is cheap
to compute, easy to encode efficiently (most blocks don't need it, so
we don't encode an envelope in 98% of the blocks) and simple. My
tests with 'castanets.wav' get better results than LAME or
Fraunhofer (IMHO ;-) Needless to say, I was a bit surprised.
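To show how little machinery is involved, here's a crude sketch of
the clamping idea; the segment size, threshold, and exact scaling are
all stand-ins for whatever tuning the real encoder settles on:

  #include <math.h>

  /* Track a running peak level across short segments; when a segment
     jumps past level*threshold, scale the rest of the block back
     down.  The scale factor would be encoded so the decoder can
     restore the envelope after the inverse MDCT. */
  static void preclamp(double *block, int n, int seg, double threshold)
  {
      double level = 0.;
      int i, j;
      for (i = 0; i + seg <= n; i += seg) {
          double peak = 0.;
          for (j = 0; j < seg; j++) {
              double a = fabs(block[i + j]);
              if (a > peak) peak = a;
          }
          if (level > 0. && peak > level * threshold) {
              double scale = level * threshold / peak;
              for (j = i; j < n; j++)
                  block[j] *= scale;
              peak *= scale;
          }
          if (peak > level) level = peak;
      }
  }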
OK, encoder summary:
1) Separate the time domain data into windowed blocks of two potential
   sizes.
2) Preclamp the windowed block if we so choose; the envelope is
   encoded in its entirety for each block, and overlapping sections do
   not necessarily need to use the same envelope. Encode the
   envelope.
3) Forward MDCT.
4) Deduce the desired spectral envelope curve (the 'floor'). This, of
course, is the step that includes 'major magic.'
5) Generate a set of LPC coefficients that encode the curve.
   Transform the LPC coefficients into LSP coefficients (which
   tolerate quantization more easily). Encode the LSP coefficients.
6) Remove the floor from the MDCT coefficients and quantize; the
   result is the 'residue'. On the decode side, the original spectrum
   is reconstructed by multiplying the floor by the residue. Encode
   the residue.
7) Repeat.
Next... bitstream level structuring....
Monty