> Hello,
Hi :)
First off, could you try to set your email client to break long lines before
transmitting? In my (somewhat outdated) pine, the lines appear VERY long
when I try to reply, making it hard to read :)
Minor detail though, I should probably fix pine. Some day.
> The way you describe how the jitter buffer should be implemented makes me
> wonder: How does the jitter buffer work when there is no transmission?
> Let's say my "output" thread gets a speex frame from the jitter buffer
> every 20ms. What happens when no frame has arrived on the socket? No
> frames at all for a pretty long time (i.e. many seconds).
> This is my case because I chose not to transmit any sound data when speech
> was not recognized (the speech probability from the preprocessor is so
> sweet! Thanks Jean-Marc!). Yes, I know, I'm cheap on bandwidth, but
> that's on purpose... :(
What happens is this:
On the first _get where there are no valid frames (because you stopped
transmitting from the other end), the jitter buffer will tell the decoder to
just decode the last frame again. On the next one, it tells the decoder to
extrapolate from the last frame, and on the next one after that to extrapolate
even more. This goes on until 25 packets are missed, at which point the jitter
buffer resets the decoder and stops extrapolating.
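The miss-handling behaviour described above can be sketched as a small state
machine. This is only an illustration of the logic, not the actual Speex
jitter buffer code; the 25-packet limit comes from the text, but all names
(jitter_get, MAX_MISSED, the action enum) are made up here:

```c
/* Sketch of the miss-handling logic described above -- not the actual
 * Speex jitter buffer code. The 25-packet limit matches the text;
 * everything else (names, return values) is invented for illustration. */

enum decoder_action {
    DECODE_FRAME,   /* a valid frame arrived: decode it normally    */
    REPEAT_LAST,    /* first miss: decode the last frame again      */
    EXTRAPOLATE,    /* further misses: extrapolate from last frame  */
    RESET_DECODER   /* 25 misses: reset decoder, stop extrapolating */
};

#define MAX_MISSED 25

struct jitter_state { int missed; };

enum decoder_action jitter_get(struct jitter_state *js, int frame_arrived)
{
    if (frame_arrived) {
        js->missed = 0;
        return DECODE_FRAME;
    }
    js->missed++;
    if (js->missed == 1)
        return REPEAT_LAST;       /* first _get with no valid frame */
    if (js->missed >= MAX_MISSED)
        return RESET_DECODER;     /* give up extrapolating */
    return EXTRAPOLATE;           /* extrapolate a bit more each time */
}
```

So from the decoder's point of view, a silent sender just looks like a long
run of lost packets until the reset kicks in.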
> I read Mumble's source code (v0.3.2) to see how you do it, and I found
> this comment:
> // Ideally, we'd like to go DTX (discontinuous transmission)
> // if we didn't detect speech. Unfortunately, the jitter
> // buffer on the receiving end doesn't cope with that
> // very well.
Ah, this is a completely outdated comment, as I found a way to make it work
well :)
What I do is append one bit to each speex packet which indicates if this is
an "end of transmission". If it is, I manually tell the jitter buffer to
reset immediately and stop extrapolating, because I know no more packets
will be forthcoming.
If this "end of transmission" packet should be lost, no harm is done,
because
all that happens is that the codec extrapolates a bit, meaning you get a few
hundred ms of alien sounds :)
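In code, the terminator idea might look something like the sketch below.
The real Mumble wire format may differ; the text says one bit, but this
sketch appends a whole flag byte to keep it simple, and both function names
are invented here:

```c
/* Sketch of the "end of transmission" flag idea -- not Mumble's actual
 * wire format. We append one flag byte to each payload (the text says
 * one bit; a byte keeps the sketch simple). On receive, a set flag
 * means: reset the jitter buffer now instead of extrapolating. */
#include <string.h>

/* Copy payload plus terminator flag into out; returns packet length. */
size_t pack_frame(const unsigned char *payload, size_t len,
                  int end_of_transmission, unsigned char *out)
{
    memcpy(out, payload, len);
    out[len] = end_of_transmission ? 1 : 0;
    return len + 1;
}

/* Strip the flag; returns payload length and sets *eot. */
size_t unpack_frame(const unsigned char *pkt, size_t pkt_len, int *eot)
{
    *eot = pkt[pkt_len - 1] != 0;
    return pkt_len - 1;
}
```

When *eot comes back set, the receiver would call the jitter buffer's reset
function (Speex's public API has a jitter buffer reset for this) and put the
decoder back in its initial state.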
In an ideal world, you'd like to use Speex DTX mode, which puts the decoder
in "generate comfort noise" mode and also transfers one packet every 400ms
(I think) to update the noise profile, but if you use the denoiser of the
preprocessor then comfort noise == silence.
> I have not implemented the jitter buffer yet, but I wonder if I should?
> I was thinking about holding the first few sound frames before playing
> them. That way, I introduce a delay, which should remove the jitter.
> Moreover, since I'm not transmitting when not speaking, the delay does
> not sum up to get pretty long in the end.
This will work, but will introduce latency in your transmission. This sort of
buffering is very common in streaming media, such as shoutcasts and video
streams, as they are unidirectional and it doesn't matter if there's a 2
second delay between sending and receiving time. For bidirectional speech,
you want latency at an absolute minimum.
Why?
Humans start speaking when the other side isn't speaking. Let's take the
extreme case and say there's 10 seconds of delay. If you both start talking
at the same time, it'll be 10 seconds before you hear that the other end is
also talking, 10 more seconds to notice that he stopped, and then 10 seconds
before he hears you say "go ahead". 10 seconds is extreme, but this effect
is quite noticeable even at 500ms of total latency.
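Put as arithmetic, resolving a "both started talking" collision costs about
three one-way trips. This is only a back-of-the-envelope model (it ignores
human reaction time), matching the three 10-second steps above:

```c
/* Back-of-the-envelope model of the talk-over effect: with a one-way
 * delay d, a collision where both sides start talking takes roughly
 * 3*d to resolve -- hear the collision, hear the other side stop, and
 * have your "go ahead" arrive. Human reaction time is ignored. */
int collision_resolution_ms(int one_way_delay_ms)
{
    int hear_collision = one_way_delay_ms; /* you hear him talking too */
    int hear_stop      = one_way_delay_ms; /* you hear that he stopped */
    int go_ahead       = one_way_delay_ms; /* he hears your "go ahead" */
    return hear_collision + hear_stop + go_ahead;
}
```

With the 10-second example this gives a 30-second turnaround; even 500ms of
one-way delay costs a second and a half per collision.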