Jean-Marc Valin wrote:>>OK, I'm actually about ready to start working on this now. >> >>If people in the speex community are interested in working with me on >>this, I can probably start with the speex buffer, but I imagine >>there's going to be a lot more work needed to get this where I'd like >>it to go. >> >> > >And where would you like it to go? ;-) > >Heh. I guess after playing with different jitter buffers long enough, I've realized that there's always situations that you haven't properly accounted for when designing one.>>At the API level, It seems pretty easy to make the speex >>implementation become speex-independent. Instead of having >>speex_jitter_get call any particular speex_decode or speex_read_bits, >>etc functions, it could instead just return the "thing" it got, and a >>flag. I.e. >> >> > >It's not as simple as it may look -- otherwise that's what I would have >done. These are some of the things that you can't do easily if you "just >return the thing": >- Allow more than one frame per packet, especially if the frames don't >end on a byte boundary >- Let the jitter buffer drop/interpolate frames during silence periods >- Anything that requires the jitter buffer to know about what is being >decoded. > >I think the only difficult part here that you do is dealing with multiple frames per packet, without that information being available to the jitter buffer. If the jitter buffer can be told when a packet is added that the packet contains Xms of audio, then the jitter buffer won't have a problem handling this. This is something I've encountered in trying to make a particular asterisk application handle properly IAX2 frames which contain either 20ms of 40ms of speex data. For a CBR case, where the bitrate is known, this is fairly easy to do, especially if the frames _do_ always end on byte boundaries. For a VBR case, it is more difficult, because it doesn't look like there's a way to just parse the speex bitstream and break it up into the constituent 20ms frames. The problem isn't so much that the jb can't return the right thing, but that internally it can't know if it just passed back a packet that contained 40ms of data or 20ms of data, so later it can't know if it's lost a frame or not. The other things can be handled based on the return value of the _get method: dropping frames, interpolating, etc.>>We could then have a second-level API wrapper around this for speex >>which would then call speex_decode, etc, as necessary. >> >>Basically, I'd like the jitterbuffer to do all the history, length, >>etc calculations, but not actually know anything about what it's >>managing. >> >> > >I would suggest the opposite. You can just think of the current >implementation as being callback-based. If you look at the >implementation of speex_decode, it merely looks for a callback function >in a struct (mode definition). It would not be very hard to provide >similar (callback structures) wrappers for other codecs. I'm willing to >modify the current implementation to make that easier (though it's >already not very hard). > >I can see how you'd do that, but I don't think that would work for me. I really don't want the jitterbuffer to handle decoding at all, because in some cases, I want to dejitter the stream, but not decode it. For example, I will be running this in front of a conferencing application. This conferencing application handles participants, each of which can use a different codec. Often, we "optimize" the path through the conferencing application by passing the encoded stream straight-through to listeners when there is only one speaker, and the speaker and participant use the same codec(*). In this case, I want to pass back the actual encoded frame, and also the information about what to do with it, so that I can pass along the frame to some participants, and decode (and possibly transcode) it for others. (*) (Yes, I do understand that this violates the expectations of the codecs, but so far, it seems to work well for GSM, and for speex, since the changes generally occur surrounded by silence, and there is a huge benefit in scalability, and also in clarity because there is no generational loss; the only drawback is that there is _sometimes_ some distortion when the contexts change).>>In asterisk and iaxclient (my project), the things I'd pass into the >>jitterbuffer would be different kinds of structures. Some of these >>may be audio frames, some might be control frames (DTMF, etc) that we >>want synchronized with the audio stream, etc. In the future, we'd >>also want to have video frames thrown in there, which would need to be >>synchronized. >> >> > >I'm not sure of the best way to do that. Audio has different constraints >as video when you're doing jitter buffering. For example, it's much >easier (in terms of perceptual degradation) to skip frames with video >than with audio, which means that the algorithm to handle that optimally >may be quite different. Don't you think? > >Yes. While I don't plan on doing video in the first pass, the API for this would be that when you pass in your "thing", you also pass along: a) The timestamp for that "thing" b) Some flags for the "thing". This might be a "stream number" for the thing, or it might just be a flag saying this "thing" is audio, or this "thing" is a control frame which must never be dropped, etc, or this is a video frame (and maybe a keyframe), etc. A jitterbuffer for an audio/video system needs to be integrated, of course, so that the audio and video are synchronized. In my first implementation, I haven't decided if I want to do more than video, but presently, in asterisk, all frames go through the jitterbuffer, and it makes some sense to include, e.g. DTMF frames in there, HANGUP frames, etc. (imagine leaving a voicemail, if DTMF or HANGUP isn't jitterbuffered, and your have a _lot_ of jitter, you could lose part of your message. If DTMF frames don't go through there, they may be processed out-of-order, etc.).>>So, I guess my questions (for Jean-Marc mostly, but others as well): >> >>1) Is it OK with you to add this extra abstraction layer to your >>jitter buffer? >> >> > >I think there might be better ways to abstract the codec out of that >(callbacks and all). > >callbacks or flags, I think, would work the same way, except for the above. I think it could work for your use as well, though, if you had some way to tell the jb out-of-band how long the "packet" you sent along was (how many milliseconds). -SteveK -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.xiph.org/pipermail/speex-dev/attachments/20041116/95e9f96a/attachment.htm
> Heh. I guess after playing with different jitter buffers long enough, > I've realized that there's always situations that you haven't properly > accounted for when designing one.For example? :-)> I think the only difficult part here that you do is dealing with > multiple frames per packet, without that information being available > to the jitter buffer. If the jitter buffer can be told when a packet > is added that the packet contains Xms of audio, then the jitter buffer > won't have a problem handling this.That's always a solution, but I'm not sure it's the best. In my current implementation, the application doesn't even have to care about the fact that there may (or may not) be more than one frame per packet.> This is something I've encountered in trying to make a particular > asterisk application handle properly IAX2 frames which contain either > 20ms of 40ms of speex data. For a CBR case, where the bitrate is > known, this is fairly easy to do, especially if the frames _do_ always > end on byte boundaries. For a VBR case, it is more difficult, because > it doesn't look like there's a way to just parse the speex bitstream > and break it up into the constituent 20ms frames.It would be possible, but unnecessarily messy.> The problem isn't so much that the jb can't return the right thing, > but that internally it can't know if it just passed back a packet that > contained 40ms of data or 20ms of data, so later it can't know if it's > lost a frame or not.exactly.> The other things can be handled based on the return value of the _get > method: dropping frames, interpolating, etc.I guess...> I can see how you'd do that, but I don't think that would work for me. > I really don't want the jitterbuffer to handle decoding at all, > because in some cases, I want to dejitter the stream, but not decode > it.In that case, your callback can just send the encoded stream somewhere else, it doesn't have to actually decode anything.> For example, I will be running this in front of a conferencing > application. This conferencing application handles participants, each > of which can use a different codec. Often, we "optimize" the path > through the conferencing application by passing the encoded stream > straight-through to listeners when there is only one speaker, and the > speaker and participant use the same codec(*). In this case, I want > to pass back the actual encoded frame, and also the information about > what to do with it, so that I can pass along the frame to some > participants, and decode (and possibly transcode) it for others.It's another topic here, but why do you actually want to de-jitter the stream if you're going to resend encoded. Why not just redirect the packets as they arrive and let the last jitter buffer handle everything. That'll be both simpler and better (slightly lower latency, slightly less frame dropping/interpolation). Jean-Marc
Jean-Marc Valin wrote:>>Heh. I guess after playing with different jitter buffers long enough, >>I've realized that there's always situations that you haven't properly >>accounted for when designing one. >> >> > >For example? :-) > >I have a bunch of examples listed on the wiki page where I had written initial specifications: http://www.voip-info.org/tiki-index.php?page=Asterisk%20new%20jitterbuffer In particular, (I'm not really sure, because I don't thorougly understand it yet) I don't think your jitterbuffer handles: DTX: discontinuous transmission. clock skew: (see discussion, though) shrink buffer length quickly during silence>>I think the only difficult part here that you do is dealing with >>multiple frames per packet, without that information being available >>to the jitter buffer. If the jitter buffer can be told when a packet >>is added that the packet contains Xms of audio, then the jitter buffer >>won't have a problem handling this. >> >> > >That's always a solution, but I'm not sure it's the best. In my current >implementation, the application doesn't even have to care about the fact >that there may (or may not) be more than one frame per packet. > >That may be OK when the jitterbuffer is only used right before the audio layer, but I'm still not sure how I can integrate this functionality in the places I want to put the jitterbuffer.> > >>This is something I've encountered in trying to make a particular >>asterisk application handle properly IAX2 frames which contain either >>20ms of 40ms of speex data. For a CBR case, where the bitrate is >>known, this is fairly easy to do, especially if the frames _do_ always >>end on byte boundaries. For a VBR case, it is more difficult, because >>it doesn't look like there's a way to just parse the speex bitstream >>and break it up into the constituent 20ms frames. >> >> > >It would be possible, but unnecessarily messy. > >I looked at nb_celp.c, and it seems that it would be pretty messy. I'd need to implement a lot of the actual codec just to be able to determine the number of frames in a packet. I think the easiest thing for me is to just stick to one frame per "thing" as far as the jitterbuffer is concerned, and then handle additional framing for packets at a higher level. Even if we use the "terminator" submode (i.e. speex_bits_pack(&encstate->bits, 15, 5); ), it seems hard to find that in the bitstream, no?>>For example, I will be running this in front of a conferencing >>application. This conferencing application handles participants, each >>of which can use a different codec. Often, we "optimize" the path >>through the conferencing application by passing the encoded stream >>straight-through to listeners when there is only one speaker, and the >>speaker and participant use the same codec(*). In this case, I want >>to pass back the actual encoded frame, and also the information about >>what to do with it, so that I can pass along the frame to some >>participants, and decode (and possibly transcode) it for others. >> >> > >It's another topic here, but why do you actually want to de-jitter the >stream if you're going to resend encoded. Why not just redirect the >packets as they arrive and let the last jitter buffer handle everything. >That'll be both simpler and better (slightly lower latency, slightly >less frame dropping/interpolation). > >Because we need to synchronize multiple speakers in the conference: On the incoming side, each incoming "stream" has it's own timebase and timestamps, and jitter. If we just passed that through (even if we adjusted the timebases), the different jitter characteristics of each speaker would create chaos for listeners, and they'd end up with overlapping frames, etc.. -SteveK -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.xiph.org/pipermail/speex-dev/attachments/20041117/eaa15f34/attachment.htm