Jean-Marc Valin wrote:
>> Heh. I guess after playing with different jitter buffers long
>> enough, I've realized that there are always situations that you
>> haven't properly accounted for when designing one.
>
> For example? :-)

I have a bunch of examples listed on the wiki page where I had written
initial specifications:
http://www.voip-info.org/tiki-index.php?page=Asterisk%20new%20jitterbuffer

In particular (I'm not really sure, because I don't thoroughly
understand it yet), I don't think your jitterbuffer handles:

DTX: discontinuous transmission.
clock skew: (see discussion, though)
shrink buffer length quickly during silence

>> I think the only difficult part here that you do is dealing with
>> multiple frames per packet, without that information being available
>> to the jitter buffer. If the jitter buffer can be told when a packet
>> is added that the packet contains X ms of audio, then the jitter
>> buffer won't have a problem handling this.
>
> That's always a solution, but I'm not sure it's the best. In my
> current implementation, the application doesn't even have to care
> about the fact that there may (or may not) be more than one frame per
> packet.

That may be OK when the jitterbuffer is only used right before the
audio layer, but I'm still not sure how I can integrate this
functionality in the places I want to put the jitterbuffer.

>> This is something I've encountered in trying to make a particular
>> asterisk application properly handle IAX2 frames which contain
>> either 20ms or 40ms of speex data. For a CBR case, where the bitrate
>> is known, this is fairly easy to do, especially if the frames _do_
>> always end on byte boundaries. For a VBR case, it is more difficult,
>> because it doesn't look like there's a way to just parse the speex
>> bitstream and break it up into the constituent 20ms frames.
>
> It would be possible, but unnecessarily messy.

I looked at nb_celp.c, and it seems that it would be pretty messy.
I'd need to implement a lot of the actual codec just to be able to
determine the number of frames in a packet. I think the easiest thing
for me is to just stick to one frame per "thing" as far as the
jitterbuffer is concerned, and then handle additional framing for
packets at a higher level.

Even if we use the "terminator" submode (i.e.
speex_bits_pack(&encstate->bits, 15, 5); ), it seems hard to find that
in the bitstream, no?

>> For example, I will be running this in front of a conferencing
>> application. This conferencing application handles participants,
>> each of which can use a different codec. Often, we "optimize" the
>> path through the conferencing application by passing the encoded
>> stream straight through to listeners when there is only one speaker,
>> and the speaker and participant use the same codec(*). In this case,
>> I want to pass back the actual encoded frame, and also the
>> information about what to do with it, so that I can pass along the
>> frame to some participants, and decode (and possibly transcode) it
>> for others.
>
> It's another topic here, but why do you actually want to de-jitter
> the stream if you're going to resend it encoded? Why not just
> redirect the packets as they arrive and let the last jitter buffer
> handle everything? That'll be both simpler and better (slightly lower
> latency, slightly less frame dropping/interpolation).

Because we need to synchronize multiple speakers in the conference: on
the incoming side, each incoming "stream" has its own timebase,
timestamps, and jitter. If we just passed that through (even if we
adjusted the timebases), the different jitter characteristics of each
speaker would create chaos for listeners, and they'd end up with
overlapping frames, etc.

-SteveK
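A duration-aware insertion call like the one discussed above might look
roughly like this. This is only an illustrative sketch: jb_put,
jb_frame, and all the field names are hypothetical, not the actual
speex_jb or Asterisk jitterbuffer API.

```c
#include <string.h>

/* Hypothetical jitter-buffer entry; names and sizes are illustrative. */
typedef struct {
    long timestamp;            /* start time of the packet, in ms */
    long duration;             /* how much audio the packet holds, in ms */
    int  len;                  /* payload length in bytes */
    unsigned char data[1500];
} jb_frame;

#define JB_MAX 64

typedef struct {
    jb_frame frames[JB_MAX];
    int count;
} jb;

/* The caller tells the buffer how many ms the packet contains, so the
 * buffer never has to parse the codec bitstream itself. */
int jb_put(jb *b, const unsigned char *data, int len,
           long timestamp, long duration)
{
    if (b->count >= JB_MAX || len > (int)sizeof(b->frames[0].data))
        return -1;
    jb_frame *f = &b->frames[b->count++];
    f->timestamp = timestamp;
    f->duration = duration;
    f->len = len;
    memcpy(f->data, data, len);
    return 0;
}

/* Total buffered audio in ms, regardless of frames-per-packet. */
long jb_buffered_ms(const jb *b)
{
    long total = 0;
    for (int i = 0; i < b->count; i++)
        total += b->frames[i].duration;
    return total;
}
```

With this shape, a packet carrying two 20ms Speex frames is inserted
once with duration 40, and the buffer's accounting stays correct
without any codec knowledge.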
> In particular, (I'm not really sure, because I don't thoroughly
> understand it yet) I don't think your jitterbuffer handles:
>
> DTX: discontinuous transmission.

That is dealt with by the codec, at least for Speex. When it stops
receiving packets, it already knows whether it's in DTX/CNG mode.

> clock skew: (see discussion, though)

Clock skew is one of the main things I was trying to solve. Actually,
in the way my jitter buffer is implemented, it's not even aware of the
difference between a clock skew and a (linear) change in network
latency.

> shrink buffer length quickly during silence

Everything is there for that, but I'm not yet looking at
silence/non-silence.

> That may be OK when the jitterbuffer is only used right before the
> audio layer, but I'm still not sure how I can integrate this
> functionality in the places I want to put the jitterbuffer.

I guess we'll need to discuss that in further detail.

> I looked at nb_celp.c, and it seems that it would be pretty messy.
> I'd need to implement a lot of the actual codec just to be able to
> determine the number of frames in a packet.

No, it's one step above nb_celp.c; all you need to implement is 8
functions (init, destroy, process and ctl, for both encode and
decode). It can be done fairly easily. Look at modes.c perhaps. The
only struct that needs to be filled is SpeexMode. Even then, I'm
willing to add an even simpler layer if necessary.

> I think the easiest thing for me is to just stick to one frame per
> "thing" as far as the jitterbuffer is concerned, and then handle
> additional framing for packets at a higher level.

Right now, my jitter buffer assumes a fixed amount of time per frame
(but not per packet). I'm not sure if that's possible.

> Even if we use the "terminator" submode (i.e.
> speex_bits_pack(&encstate->bits, 15, 5); ), it seems hard to find
> that in the bitstream, no?

Well, you just have to know the number of bits for each mode (that's
already in the mode struct, since I use it to skip wideband in some
cases) and do some jumping.

> Because we need to synchronize multiple speakers in the conference:
> On the incoming side, each incoming "stream" has its own timebase,
> timestamps, and jitter. If we just passed that through (even if we
> adjusted the timebases), the different jitter characteristics of each
> speaker would create chaos for listeners, and they'd end up with
> overlapping frames, etc.

Assuming the clocks aren't synchronized (and skewed), I don't see what
you're gaining by doing that on (presumably) a server in the middle
instead of directly at the other end.

	Jean-Marc
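The "know the number of bits for each mode and do some jumping"
approach could be sketched as below. The submode_bits table is a
placeholder, NOT the actual Speex values (the real per-submode sizes
live in the SpeexMode/SpeexSubmode structs), and the sketch ignores
wideband extension frames and in-band requests, which a real walker
would also have to skip.

```c
/* Placeholder: bits per 20ms frame (header included) for each
 * narrowband submode. These numbers are illustrative only; the real
 * ones must be taken from the mode struct. 0 = unknown/unused. */
static const int submode_bits[16] = {
    5, 43, 119, 160, 220, 300, 364, 492,
    79, 0, 0, 0, 0, 0, 0, 0   /* submode 15 is the terminator */
};

/* Read n bits starting at bit offset pos (MSB first). */
static unsigned read_bits(const unsigned char *buf, int pos, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 1) | ((buf[(pos + i) / 8] >> (7 - (pos + i) % 8)) & 1);
    return v;
}

/* Count 20ms frames in a packet by hopping from header to header.
 * Each frame starts with a 5-bit header: 1 wideband flag + 4-bit
 * submode (this is what speex_bits_pack(&bits, 15, 5) writes for the
 * terminator). Returns -1 on anything this sketch can't handle. */
int count_frames(const unsigned char *buf, int len_bytes)
{
    int pos = 0, frames = 0, total = len_bytes * 8;
    while (pos + 5 <= total) {
        unsigned header = read_bits(buf, pos, 5);
        unsigned submode = header & 15;
        if ((header >> 4) != 0)       /* wideband frame: not handled */
            return -1;
        if (submode == 15)            /* terminator: stop cleanly */
            break;
        if (submode_bits[submode] == 0)
            return -1;                /* submode this sketch doesn't know */
        pos += submode_bits[submode]; /* jump to the next header */
        frames++;
    }
    return frames;
}
```

The point of the sketch is only the control flow: no codec state is
needed, just the per-submode sizes and the 5-bit header format.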
Jean-Marc Valin wrote:
>> In particular, (I'm not really sure, because I don't thoroughly
>> understand it yet) I don't think your jitterbuffer handles:
>>
>> DTX: discontinuous transmission.
>
> That is dealt with by the codec, at least for Speex. When it stops
> receiving packets, it already knows whether it's in DTX/CNG mode.

[skipping a bunch of implementation details of the speex_jb, that I
haven't studied enough to respond to accurately; I'll get back to
them]

I guess I have to look in more depth. So, if I send packets at 20, 40,
60, 80, then stop until 200, 220, 240, 260, won't this jb get
confused? Or is it relying on in-band signalling of speex, so it will
ask speex to predict what it thinks are lost frames, and speex would
know that the interpolation should be silence (or CNG)?

>> Because we need to synchronize multiple speakers in the conference:
>> On the incoming side, each incoming "stream" has its own timebase,
>> timestamps, and jitter. If we just passed that through (even if we
>> adjusted the timebases), the different jitter characteristics of
>> each speaker would create chaos for listeners, and they'd end up
>> with overlapping frames, etc.
>
> Assuming the clocks aren't synchronized (and skewed), I don't see
> what you're gaining by doing that on (presumably) a server in the
> middle instead of directly at the other end.

In the conference app, every (frame time), I need to:

1) Determine who is presently speaking; for some clients, we use
   remote VAD and DTX. For some clients, we do VAD locally.
2) Notify an external application about changes in speaking.
3) Send the appropriate frames to each participant, encoded properly
   for each.

For the one-speaker case, all participants except the speaker get the
frame. If the participant and speaker use the same codec, we just send
the same frame to them. If they don't, we transcode the frame for the
new codec type (we reuse the transcoded frame for each participant
with the same codec).
For the two- (or more) speaker case, each speaker gets the other
speaker's frame (transcoded if needed), and we mix and re-encode the
summation of each speaker for all the others.

In the application we're using, there can be a _lot_ of jitter (not
just the 200ms worth that your jitterbuffer seems to account for, but
1 second or more), and if we don't dejitter first, we can easily end
up with cases where:

a) We send out subsequent frames for different speakers with
   overlapping timestamps.
b) Different speakers have different clock skews, and over time, these
   will become very significant. In this case, as speakers change,
   listeners will see this as a _huge_ jitter (i.e. many seconds'
   worth).

-SteveK
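One way to avoid the overlapping-timestamp problem described above is
to restamp each de-jittered stream onto the server's clock before
mixing or forwarding, so that listeners always see one monotonic
timebase no matter which speaker is active. This is only an
illustrative sketch with made-up names, not the actual conference
code.

```c
/* Hypothetical per-speaker timebase state for the mixer. */
typedef struct {
    long server_epoch;   /* server time when this stream started */
    long speaker_epoch;  /* the speaker's own first timestamp */
    int  started;
} timebase;

/* Map a speaker's timestamp onto the server clock. Residual clock
 * skew between the two clocks is left for the de-jittering stage to
 * absorb; this function only removes the constant offset. */
long rebase(timebase *tb, long speaker_ts, long server_now)
{
    if (!tb->started) {
        tb->server_epoch  = server_now;
        tb->speaker_epoch = speaker_ts;
        tb->started = 1;
    }
    return tb->server_epoch + (speaker_ts - tb->speaker_epoch);
}
```

With one timebase struct per incoming stream, frames handed to
listeners are all stamped on the same clock, so a speaker change can
never show up as a multi-second timestamp jump.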
[sorry for the loss of proper attributions, this is from two messages]

[Me]
> This is something I've encountered in trying to make a particular
> asterisk application properly handle IAX2 frames which contain
> either 20ms or 40ms of speex data. For a CBR case, where the bitrate
> is known, this is fairly easy to do, especially if the frames _do_
> always end on byte boundaries. For a VBR case, it is more difficult,
> because it doesn't look like there's a way to just parse the speex
> bitstream and break it up into the constituent 20ms frames.

[Jean-Marc]
> It would be possible, but unnecessarily messy.

[Me]
> I looked at nb_celp.c, and it seems that it would be pretty messy.
> I'd need to implement a lot of the actual codec just to be able to
> determine the number of frames in a packet.

[Jean-Marc]
> No, it's one step above nb_celp.c; all you need to implement is 8
> functions (init, destroy, process and ctl, for both encode and
> decode). It can be done fairly easily. Look at modes.c perhaps. The
> only struct that needs to be filled is SpeexMode. Even then, I'm
> willing to add an even simpler layer if necessary.

I'm revisiting this issue now. It looks like it would help to have the
ability to:

1) Take a look at some speex data, and return the number of samples it
contains. This would go here in asterisk, for example:

asterisk/channels/chan_iax2.c:

static int get_samples(struct ast_frame *f)
{
	int samples = 0;
	switch (f->subclass) {
	case AST_FORMAT_SPEEX:
		samples = 160;	/* XXX Not necessarily true XXX */
		break;
	case AST_FORMAT_G723_1:
		samples = 240;	/* XXX Not necessarily true XXX */
		break;
	case AST_FORMAT_ILBC:
		samples = 240 * (f->datalen / 50);
		break;
	case AST_FORMAT_GSM:
		samples = 160 * (f->datalen / 33);
		break;
	case AST_FORMAT_G729A:
		samples = 160 * (f->datalen / 20);
		break;
	[...]
}

In this case, though, chan_iax2.c doesn't necessarily know if the
speex codec is loaded...
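For the CBR case mentioned above, the sample count can be derived from
the payload length the same way the GSM and G.729 arms of get_samples
do it. A hedged sketch: speex_cbr_samples is a hypothetical helper
(not a real Speex or Asterisk function), frame_bytes would have to
come from knowing which CBR mode was negotiated, and 160 samples
corresponds to one 20ms narrowband frame at 8kHz.

```c
/* Hypothetical CBR-only helper: if every 20ms frame in the payload
 * occupies the same, known number of bytes, the sample count falls
 * straight out of the payload length. Returns -1 when the payload is
 * not a whole number of frames (e.g. VBR data or corruption). */
int speex_cbr_samples(int datalen, int frame_bytes)
{
    if (frame_bytes <= 0 || datalen % frame_bytes != 0)
        return -1;
    return 160 * (datalen / frame_bytes);
}
```

This covers exactly the "CBR, byte-aligned frames" case the quoted
text calls easy; the VBR case still needs bitstream parsing or an API
like the one proposed below by the list.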
And later, it might also be useful to have an API which takes a bunch
of SpeexBits, and gives the caller a way to split up the SpeexBits
into separate 20ms frames. [The first API could be a subset of this.]

The main API would be:

int speex_decode_bits(SpeexBits *inBits, SpeexBits *outBits);

inBits is the SpeexBits containing the bits we're interested in.
outBits may be NULL. If not NULL, and inBits contains valid frames,
they are written, one frame per call, to outBits. It would return the
same values as speex_decode(_int).

SpeexBits inBits, outBits;
void *state;

initialize:
	state = speex_decoder_init(&speex_nb_mode);

process:
	speex_bits_read_from(&inBits, inBuf, inlen);
	for (i = 0; ; i += 160) {
		if (speex_decode_bits(&inBits, &outBits))
			break;
		/* do something with outBits, if you want */
	}

(i) now contains the number of samples contained in inBuf.

I think this is the simplest, most sensible API, no?

-SteveK