Ok, let me first explain why 5 ms matters, even if the samples are zeros, in my particular application.

I am working on a speech synthesis system. The basic idea is to concatenate pre-recorded phonemes or words into longer sentences, so any missing or extra samples, even as short as 5-10 ms, cause very noticeable discontinuities.

I want to use Speex to compress/decompress that pre-recorded material, but I'm concerned about the extra zeros that might be padded at both ends. For the zero padding in the last frame, I know how to remove it after decoding. But I am a little confused by the lookahead at the beginning: the sample code in the manual doesn't use lookahead, while speexenc.c does. I'd like to know what difference it makes (a snippet showing what I mean is in the P.S. below).

Let me plug in some numbers. I am using wideband mode, where the frame size is 320 samples. Say I take the first frame of an audio buffer, i.e. the first 320 samples, and feed them into the encoder. Then after decoding, do I get all 320 samples back, or only a portion of them with some zero padding at the very beginning?

Thank you.

On Oct 31, 2006, at 12:38 PM, Jean-Marc Valin wrote:

>> In my application, even 5 ms (110 samples at 22 kHz) matters.
>
> 1) If 5 ms matters, I don't recommend Ogg (and I definitely hope you're
> not running Windows!)
> 2) 22 kHz is *not* recommended. Use 16 kHz instead
>
>> So what
>> should I do to avoid discarding samples at the beginning?
>
> Why are they so precious? They're *zeros* (or nearly).
>
>> 1. Turning off lookahead?
>> 2. Padding zeros at the beginning.
>
> Or you can always just play them if they're so precious. Ah, the sound
> of 5 ms worth of zeros...
>
> Jean-Marc
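P.S. To make the question concrete, here is a minimal sketch of how I query the codec (compiled against libspeex; wideband mode assumed throughout):

    #include <speex/speex.h>
    #include <stdio.h>

    int main(void)
    {
        void *enc = speex_encoder_init(&speex_wb_mode);
        int frame_size, lookahead;

        /* Frame size: samples consumed per call to speex_encode_int();
           I expect 320 in wideband mode. */
        speex_encoder_ctl(enc, SPEEX_GET_FRAME_SIZE, &frame_size);

        /* Lookahead: the encoder's algorithmic delay, in samples. My
           question is whether these show up as (near-)zero samples at
           the start of the decoded output, and whether I should skip
           them. */
        speex_encoder_ctl(enc, SPEEX_GET_LOOKAHEAD, &lookahead);

        printf("frame size = %d, lookahead = %d\n", frame_size, lookahead);

        speex_encoder_destroy(enc);
        return 0;
    }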
Andras Kadinger
2006-Oct-31 15:41 UTC
[Speex-dev] 2 questions, frame size and SPEEX_GET_LOOKAHEAD
[At the risk of educating you about something you might already know]

Natural speech in most human languages changes gradually from one phoneme to the next. Concatenating phonemes from a fixed, pre-recorded, inflexible set would give rise to abrupt changes between them (in both phoneme quality and pitch), making the resulting speech hard to understand and/or uncomfortable to listen to.

Most flexible (unlimited-vocabulary) unit-concatenation speech synthesizers therefore use some strategy to blend the pieces of speech together, usually in both pitch and phoneme quality (a toy sketch of the simplest form of this idea is in the P.S. below). One conceptually simple and therefore popular approach is to store "diphones" - phoneme transitions: e.g. the second half of "a" and the first half of "p" from the hypothetical word "apa". Since phonemes tend to reach their "most recognizable" state in the middle, cutting and splicing them together around that point should minimize the amount of discontinuity.

Obviously, if you concatenate speech from larger units (words, phrases, or even sentences), ensuring acoustic continuity becomes less and less of an issue, but you specifically mention phonemes.

So unless you want to use Speex merely to (re)implement unit storage for a speech synthesizer that already handles these issues, I suggest you take a look at the available literature on speech synthesis. Wikipedia seems to be a reasonable starting point: http://en.wikipedia.org/wiki/Speech_synthesis

Jia Pu wrote:
> Say I take the first frame of an audio buffer, i.e. the first 320
> samples, and feed them into the encoder. Then after decoding, do I get
> all 320 samples, or a portion of the 320 samples with some zero padding
> at the very beginning?
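P.S. Since you mention discontinuities: the simplest possible form of the blending I described is a plain linear crossfade at the splice point. This is only a toy sketch (real synthesizers also match pitch and smooth spectra, and the function below is hypothetical):

    #include <stddef.h>

    /* Splice two recorded units a[] and b[] with a linear crossfade of
       `fade` samples around the join (assumes fade <= na and fade <= nb).
       Returns the number of samples written to out[]. */
    static size_t splice(const short *a, size_t na,
                         const short *b, size_t nb,
                         size_t fade, short *out)
    {
        size_t i, n = 0;

        /* Copy a[] except its last `fade` samples, which overlap b[]. */
        for (i = 0; i < na - fade; i++)
            out[n++] = a[i];

        /* Crossfade: ramp a[] down while ramping b[] up. */
        for (i = 0; i < fade; i++) {
            float w = (float)i / (float)fade;  /* 0 -> 1 across the fade */
            out[n++] = (short)((1.0f - w) * a[na - fade + i] + w * b[i]);
        }

        /* Copy the remainder of b[]. */
        for (i = fade; i < nb; i++)
            out[n++] = b[i];

        return n;   /* na + nb - fade samples in total */
    }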
Hi Andras,

Thanks for the comments. Yes, I am aware of those issues; I probably should have been more precise in my use of terms. In my project, the unit collection is actually a mixture of diphones and words. It seems to me, though, that these synthesizer-specific issues are irrelevant to my question about Speex. As you said, I am merely using Speex as a storage method. All I ask is to get samples back as close to the original recording as possible after encoding and decoding (the round trip I have in mind is sketched in the P.S. below). Blending, cross-fading, pitch adjustment - these signal-processing issues are not a concern at this stage.

On Oct 31, 2006, at 3:40 PM, Andras Kadinger wrote:

> Most flexible (unlimited-vocabulary) unit-concatenation speech
> synthesizers therefore use some strategy to blend the pieces of speech
> together, usually in both pitch and phoneme quality.
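P.S. For concreteness, here is the round trip I have in mind, as a sketch. The part where I drop the first `lookahead` samples is exactly the part I am unsure about - I'm assuming the decoded stream is delayed by the encoder's lookahead (as speexenc.c seems to imply), so please correct me if that's wrong:

    #include <speex/speex.h>
    #include <string.h>

    #define WB_FRAME  320    /* wideband frame size */
    #define MAX_BYTES 1024   /* plenty for one encoded wideband frame */

    /* Hypothetical helper, for illustration only: encode n_in samples,
       decode them again, and write the trimmed result to out[].
       ASSUMPTION: the decoded stream is delayed by exactly `lookahead`
       samples, so that many are dropped from the front and the input is
       flushed with zero frames at the end to recover the tail. */
    static int speex_roundtrip(const spx_int16_t *in, int n_in,
                               spx_int16_t *out)
    {
        void *enc = speex_encoder_init(&speex_wb_mode);
        void *dec = speex_decoder_init(&speex_wb_mode);
        SpeexBits bits;
        char bytes[MAX_BYTES];
        int frame_size, lookahead, total, n_out = 0, i;

        speex_encoder_ctl(enc, SPEEX_GET_FRAME_SIZE, &frame_size);
        speex_encoder_ctl(enc, SPEEX_GET_LOOKAHEAD, &lookahead);
        speex_bits_init(&bits);

        /* Decoded samples needed to recover all of the input. */
        total = n_in + lookahead;

        for (i = 0; i < total; i += frame_size) {
            spx_int16_t frame[WB_FRAME], decoded[WB_FRAME];
            int n = (i < n_in)
                  ? ((n_in - i < frame_size) ? n_in - i : frame_size)
                  : 0;
            int nbytes, j;

            /* Zero-pad the last partial frame; frames entirely past the
               end of the input are all-zero "flush" frames. */
            memset(frame, 0, sizeof(frame));
            if (n > 0)
                memcpy(frame, in + i, n * sizeof(*frame));

            speex_bits_reset(&bits);
            speex_encode_int(enc, frame, &bits);
            nbytes = speex_bits_write(&bits, bytes, MAX_BYTES);

            speex_bits_read_from(&bits, bytes, nbytes);
            speex_decode_int(dec, &bits, decoded);

            /* Keep decoded sample i+j only if it falls in the window
               [lookahead, n_in + lookahead); under the assumption above
               it corresponds to input sample (i + j) - lookahead. */
            for (j = 0; j < frame_size; j++)
                if (i + j >= lookahead && i + j < total)
                    out[n_out++] = decoded[j];
        }

        speex_bits_destroy(&bits);
        speex_encoder_destroy(enc);
        speex_decoder_destroy(dec);
        return n_out;   /* == n_in if the assumption holds */
    }

If that assumption holds, the function returns exactly n_in samples, which is what I need to avoid discontinuities at the splice points.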