Ok, let me first explain why 5 ms matters, even if the samples are zeros, in my particular application.

I am working on a speech synthesis system. The basic idea is to concatenate pre-recorded phonemes or words into longer sentences, so any missing or extra samples, even as short as 5-10 ms, cause very noticeable discontinuities.

I want to use Speex to compress/decompress that pre-recorded material, but I'm concerned about the extra zeros that might be padded at both ends. For the zero padding in the last frame, I know how to remove it after decoding. But I am a little confused by the lookahead at the beginning: the sample code in the manual doesn't use lookahead, while speexenc.c does. I'd like to know what difference it makes (a snippet showing what I mean is in the P.S. below).

Let me plug in some numbers. I am using wideband mode, where the frame size is 320 samples. Say I take the first frame of an audio buffer, i.e. the first 320 samples, and feed them into the encoder. Then after decoding, do I get all 320 samples back, or only a portion of them with some zero padding at the very beginning?

Thank you.

On Oct 31, 2006, at 12:38 PM, Jean-Marc Valin wrote:

>> In my application, even 5 ms (110 samples at 22 kHz) matters.
>
> 1) If 5 ms matters, I don't recommend Ogg (and I definitely hope you're
> not running Windows!)
> 2) 22 kHz is *not* recommended. Use 16 kHz instead
>
>> So what
>> should I do to avoid discarding samples at the beginning?
>
> Why are they so precious? They're *zeros* (or nearly).
>
>> 1. Turning off lookahead?
>> 2. Padding zeros at the beginning.
>
> Or you can always just play them if they're so precious. Ah, the sound
> of 5 ms worth of zeros...
>
> Jean-Marc
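P.S. To make the question concrete, here is a minimal sketch of how I query the codec (compiled against libspeex; wideband mode assumed throughout):

    #include <speex/speex.h>
    #include <stdio.h>

    int main(void)
    {
        void *enc = speex_encoder_init(&speex_wb_mode);
        int frame_size, lookahead;

        /* Frame size: samples consumed per call to speex_encode_int();
           I expect 320 in wideband mode. */
        speex_encoder_ctl(enc, SPEEX_GET_FRAME_SIZE, &frame_size);

        /* Lookahead: the encoder's algorithmic delay, in samples. My
           question is whether these show up as (near-)zero samples at
           the start of the decoded output, and whether I should skip
           them. */
        speex_encoder_ctl(enc, SPEEX_GET_LOOKAHEAD, &lookahead);

        printf("frame size = %d, lookahead = %d\n", frame_size, lookahead);

        speex_encoder_destroy(enc);
        return 0;
    }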
Andras Kadinger
2006-Oct-31 15:41 UTC
[Speex-dev] 2 questions, frame size and SPEEX_GET_LOOKAHEAD
[At the risk of educating you about something you might already know]

Natural speech in most human languages changes gradually from one phoneme to the next. Concatenating phonemes from a fixed, pre-recorded, inflexible set would give rise to abrupt changes between them (in both phoneme quality and pitch), making the resulting speech hard to understand and/or uncomfortable to listen to.

Most flexible (unlimited-vocabulary) unit-concatenation speech synthesizers therefore use some strategy to blend the pieces of speech together, usually in both pitch and phoneme quality (a toy sketch of the simplest form of this idea is in the P.S. below). One conceptually simple and therefore popular approach is to store "diphones" - phoneme transitions: e.g. the second half of "a" and the first half of "p" from the hypothetical word "apa". Since phonemes tend to reach their "most recognizable" state in the middle, cutting and splicing them together around that point should minimize the amount of discontinuity.

Obviously, if you concatenate speech from larger units (words, phrases, or even sentences), ensuring acoustic continuity becomes less and less of an issue, but you specifically mention phonemes.

So unless you want to use Speex merely to (re)implement unit storage for a speech synthesizer that already handles these issues, I suggest you take a look at the available literature on speech synthesis. Wikipedia seems to be a reasonable starting point: http://en.wikipedia.org/wiki/Speech_synthesis

Jia Pu wrote:
> Say I take the first frame of an audio buffer, i.e. the first 320
> samples, and feed them into the encoder. Then after decoding, do I get
> all 320 samples, or a portion of the 320 samples with some zero padding
> at the very beginning?
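P.S. Since you mention discontinuities: the simplest possible form of the blending I described is a plain linear crossfade at the splice point. This is only a toy sketch (real synthesizers also match pitch and smooth spectra, and the function below is hypothetical):

    #include <stddef.h>

    /* Splice two recorded units a[] and b[] with a linear crossfade of
       `fade` samples around the join (assumes fade <= na and fade <= nb).
       Returns the number of samples written to out[]. */
    static size_t splice(const short *a, size_t na,
                         const short *b, size_t nb,
                         size_t fade, short *out)
    {
        size_t i, n = 0;

        /* Copy a[] except its last `fade` samples, which overlap b[]. */
        for (i = 0; i < na - fade; i++)
            out[n++] = a[i];

        /* Crossfade: ramp a[] down while ramping b[] up. */
        for (i = 0; i < fade; i++) {
            float w = (float)i / (float)fade;  /* 0 -> 1 across the fade */
            out[n++] = (short)((1.0f - w) * a[na - fade + i] + w * b[i]);
        }

        /* Copy the remainder of b[]. */
        for (i = fade; i < nb; i++)
            out[n++] = b[i];

        return n;   /* na + nb - fade samples in total */
    }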
Hi Andras,

Thanks for the comments. Yes, I am aware of those issues; I probably should have been more precise in my use of terms. In my project, the unit collection is actually a mixture of diphones and words. It seems to me, though, that these synthesizer-specific issues are irrelevant to my question about Speex. As you said, I am merely using Speex as a storage method. All I ask is to get samples back as close to the original recording as possible after encoding and decoding (the round trip I have in mind is sketched in the P.S. below). Blending, cross-fading, pitch adjustment - these signal-processing issues are not a concern at this stage.

On Oct 31, 2006, at 3:40 PM, Andras Kadinger wrote:

> Most flexible (unlimited-vocabulary) unit-concatenation speech
> synthesizers therefore use some strategy to blend the pieces of speech
> together, usually in both pitch and phoneme quality.
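P.S. For concreteness, here is the round trip I have in mind, as a sketch. The part where I drop the first `lookahead` samples is exactly the part I am unsure about - I'm assuming the decoded stream is delayed by the encoder's lookahead (as speexenc.c seems to imply), so please correct me if that's wrong:

    #include <speex/speex.h>
    #include <string.h>

    #define WB_FRAME  320    /* wideband frame size */
    #define MAX_BYTES 1024   /* plenty for one encoded wideband frame */

    /* Hypothetical helper, for illustration only: encode n_in samples,
       decode them again, and write the trimmed result to out[].
       ASSUMPTION: the decoded stream is delayed by exactly `lookahead`
       samples, so that many are dropped from the front and the input is
       flushed with zero frames at the end to recover the tail. */
    static int speex_roundtrip(const spx_int16_t *in, int n_in,
                               spx_int16_t *out)
    {
        void *enc = speex_encoder_init(&speex_wb_mode);
        void *dec = speex_decoder_init(&speex_wb_mode);
        SpeexBits bits;
        char bytes[MAX_BYTES];
        int frame_size, lookahead, total, n_out = 0, i;

        speex_encoder_ctl(enc, SPEEX_GET_FRAME_SIZE, &frame_size);
        speex_encoder_ctl(enc, SPEEX_GET_LOOKAHEAD, &lookahead);
        speex_bits_init(&bits);

        /* Decoded samples needed to recover all of the input. */
        total = n_in + lookahead;

        for (i = 0; i < total; i += frame_size) {
            spx_int16_t frame[WB_FRAME], decoded[WB_FRAME];
            int n = (i < n_in)
                  ? ((n_in - i < frame_size) ? n_in - i : frame_size)
                  : 0;
            int nbytes, j;

            /* Zero-pad the last partial frame; frames entirely past the
               end of the input are all-zero "flush" frames. */
            memset(frame, 0, sizeof(frame));
            if (n > 0)
                memcpy(frame, in + i, n * sizeof(*frame));

            speex_bits_reset(&bits);
            speex_encode_int(enc, frame, &bits);
            nbytes = speex_bits_write(&bits, bytes, MAX_BYTES);

            speex_bits_read_from(&bits, bytes, nbytes);
            speex_decode_int(dec, &bits, decoded);

            /* Keep decoded sample i+j only if it falls in the window
               [lookahead, n_in + lookahead); under the assumption above
               it corresponds to input sample (i + j) - lookahead. */
            for (j = 0; j < frame_size; j++)
                if (i + j >= lookahead && i + j < total)
                    out[n_out++] = decoded[j];
        }

        speex_bits_destroy(&bits);
        speex_encoder_destroy(enc);
        speex_decoder_destroy(dec);
        return n_out;   /* == n_in if the assumption holds */
    }

If that assumption holds, the function returns exactly n_in samples, which is what I need to avoid discontinuities at the splice points.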