> FYI: The below is just my interpretation of the code, I might be wrong.

Most of it is right. Actually, would you mind if I use part of your
email for documenting the jitter buffer in the manual?

> Each time a new packet arrives, the jitter buffer calculates how far ahead
> or behind the "current" timestamp it is; this is called arrival_margin.
> The "current" timestamp is simply the last frame successfully decoded.

Minor detail, it's the last played (whether it was successfully decoded
or not).

> It maintains a list of bins for margins, this is short and longterm
> margin.
> Think of the bins like this:
> -60ms -40ms -20ms 0ms +20ms +40ms +60ms
> when a packet arrives, the margin matching its arrival_margin is
> increased, so if this packet was 40ms after the current timestamp, the
> 40ms bin would be increased. If this packet arrived 60ms too late (and
> hence is useless), the -60ms bin would increase.

Right.

> early_ratio_XX is the sum of all the positive bins.
> late_ratio_XX is the sum of all the negative bins.

Right. And only the packets that are "just in time" don't get counted in
any ratio.

> The difference between _long and _short is just how fast they change.
>
> If a packet has timestamp outside the bins, it's not used for calculation.
>
> Now, clearly, if early_ratio is high and late_ratio is very low, the
> buffer is buffering more than it needs to; it will skip a frame to reduce
> latency. Alternately, if late_ratio is even marginally above 0, more
> buffering is needed, and it duplicates a frame. This decision is done when
> decoding.

Right.

> Depending on your chosen transmission method, during network hiccups
> you'll either have lost packets or they'll come in a burst when the
> network conditions restore themselves. In either case, after missing 20
> packets or so the jitter buffer will prepare to "reset", and its new
> current timestamp will be the timestamp on whatever packet arrives. It
> will also hold decoding until at least buffer_size frames have arrived.

Right, except it will only actually reset when receiving the first new
packet.

> Since it sounds like you're using reliable transmission (packets are not
> lost), what will happen is that there's a whole stream of packets suddenly
> arriving, and they'll fill up the buffer much much faster than it's
> emptied. In fact, you're likely to fill it so fast the buffer runs out of
> room, meaning the first few packets get dropped to make room for the
> later ones. However, as the current timestamp was set to the first
> arriving packet, the decoder won't find the packet it's looking for,
> meaning the jitter buffer will soon reset again.

I'm not sure here what will happen. Normally, you'd want to make the
buffer larger than what you expect to have in it. In that case, the
jitter buffer would likely drop frames until it catches up.

> So no, it doesn't "catch up", it tries to keep latency to an absolute
> minimum whatever the circumstances, so most of the late frames will be
> dropped.

Yes. Actually, the best way to handle that would be to (eventually)
change the code to drop frames in silence or low-energy periods.

> To achieve the effect you're describing, you'd need to increase
> SPEEX_JITTER_MAX_BUFFER_SIZE to the longest delay you're expecting, and
> then inside the block on line 231 (which says)
> if (late_ratio_short + ontime_ratio_short < .005 && late_ratio_long +
> ontime_ratio_long < .01 && early_ratio_short > .8)
> .. add something that multiplies all the margins with 0.75 or so at the
> end. This will force the jitter buffer to only skip 1 frame at a time and
> wait a bit before it skips the next one.

Don't think it's necessary since there's already some code that shifts
the histogram whenever I skip or interpolate a packet. This means that
if the packets are on average 20 ms in advance when we drop a frame,
then they will be considered all "on time" (0 ms) after that.
	Jean-Marc

-- 
Jean-Marc Valin <Jean-Marc.Valin@USherbrooke.ca>
Université de Sherbrooke
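[Editor's sketch] The bins and ratios discussed in the message above can be illustrated with a small C model. This is not the Speex source. The names MarginHist, hist_add and should_skip, the 7-bin layout, and the exponentially decaying update are all assumptions, chosen only because they reproduce the properties stated in the thread: the bins sum to at most 1.0, the short-term and long-term histograms differ only in how fast they change, and packets outside the bins are ignored. Only the skip test is taken from the condition quoted from the code.

```c
#define NBINS      7    /* assumed layout: -60 ms .. +60 ms */
#define ONTIME_BIN 3    /* index of the 0 ms bin */
#define BIN_MS     20   /* assumed bin width */

/* Hypothetical histogram of arrival margins; not the Speex struct. */
typedef struct {
    float bins[NBINS];
    float rate;         /* larger rate = faster-changing ("short") histogram */
} MarginHist;

void hist_init(MarginHist *h, float rate)
{
    for (int i = 0; i < NBINS; i++)
        h->bins[i] = 0.0f;
    h->rate = rate;
}

/* margin_ms > 0 means early, < 0 means late (integer division truncates
 * toward zero). Returns 0 if the margin falls outside the bins, in which
 * case the packet is not used for the statistics. */
int hist_add(MarginHist *h, int margin_ms)
{
    int idx = margin_ms / BIN_MS + ONTIME_BIN;
    if (idx < 0 || idx >= NBINS)
        return 0;
    /* Assumed update rule: decay every bin, then credit the matching one.
     * The total never exceeds 1.0 and is below 1.0 shortly after a reset. */
    for (int i = 0; i < NBINS; i++)
        h->bins[i] *= 1.0f - h->rate;
    h->bins[idx] += h->rate;
    return 1;
}

float late_ratio(const MarginHist *h)
{
    float sum = 0.0f;
    for (int i = 0; i < ONTIME_BIN; i++)
        sum += h->bins[i];
    return sum;
}

float ontime_ratio(const MarginHist *h)
{
    return h->bins[ONTIME_BIN];
}

float early_ratio(const MarginHist *h)
{
    float sum = 0.0f;
    for (int i = ONTIME_BIN + 1; i < NBINS; i++)
        sum += h->bins[i];
    return sum;
}

/* The skip condition as quoted in the thread. */
int should_skip(const MarginHist *s, const MarginHist *l)
{
    return late_ratio(s) + ontime_ratio(s) < .005f
        && late_ratio(l) + ontime_ratio(l) < .01f
        && early_ratio(s) > .8f;
}
```

With a short-term rate of 0.02 and a long-term rate of 0.005 (both assumed values), a long run of packets arriving 40 ms early drives early_ratio towards 1.0 in both histograms and should_skip becomes true. A single late packet is then enough to push the short-term late ratio over the .005 threshold and turn it off again.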
>> FYI: The below is just my interpretation of the code, I might be wrong.
>
> Most of it is right. Actually, would you mind if I use part of your
> email for documenting the jitter buffer in the manual?

It would be my pleasure :)

>> early_ratio_XX is the sum of all the positive bins.
>> late_ratio_XX is the sum of all the negative bins.
>
> Right. And only the packets that are "just in time" don't get counted in
> any ratio.

Well.. they're counted in the ontime_ratio_long and _short, right?

One thing that might be worth mentioning: the sum of all the margins will
never be higher than 1.0, so a test for early_ratio_short > 0.7 means
(roughly) that 70% or more of the packets in the last short-term time
period were early.

>> Depending on your chosen transmission method, during network hiccups
>> you'll either have lost packets or they'll come in a burst when the
>> network conditions restore themselves. In either case, after missing 20
>> packets or so the jitter buffer will prepare to "reset", and its new
>> current timestamp will be the timestamp on whatever packet arrives. It
>> will also hold decoding until at least buffer_size frames have arrived.
>
> Right, except it will only actually reset when receiving the first new
> packet.

That's what I meant with "will be the timestamp on whatever packet
arrives". .. Could be clearer though, I totally agree.

>> Since it sounds like you're using reliable transmission (packets are not
>> lost), what will happen is that there's a whole stream of packets suddenly
>> arriving, and they'll fill up the buffer much much faster than it's
>> emptied. In fact, you're likely to fill it so fast the buffer runs out of
>> room, meaning the first few packets get dropped to make room for the
>> later ones. However, as the current timestamp was set to the first
>> arriving packet, the decoder won't find the packet it's looking for,
>> meaning the jitter buffer will soon reset again.
>
> I'm not sure here what will happen. Normally, you'd want to make the
> buffer larger than what you expect to have in it. In that case, the
> jitter buffer would likely drop frames until it catches up.

There's a problem with increasing the buffer size, btw: you need to change
the header, which means you need to recompile both speex and your
application. So changing the maximum number of buffered packets means you
can't share libspeex.dll/.so with other applications.

>> To achieve the effect you're describing, you'd need to increase
>> SPEEX_JITTER_MAX_BUFFER_SIZE to the longest delay you're expecting, and
>> then inside the block on line 231 (which says)
>> if (late_ratio_short + ontime_ratio_short < .005 && late_ratio_long +
>> ontime_ratio_long < .01 && early_ratio_short > .8)
>> .. add something that multiplies all the margins with 0.75 or so at the
>> end. This will force the jitter buffer to only skip 1 frame at a time and
>> wait a bit before it skips the next one.
>
> Don't think it's necessary since there's already some code that shifts
> the histogram whenever I skip or interpolate a packet. This means that
> if the packets are on average 20 ms in advance when we drop a frame,
> then they will be considered all "on time" (0 ms) after that.

Yes, but assume that after a long steady period, your network latency
suddenly drops by 100ms. (100ms is excessive, but I see 60ms quite
frequently from users on DSL/Cable connections who also do a bit of P2P
on the same line)

What happens now is that the +100ms bin starts increasing steadily,
and suddenly it's enough to skip a frame.
A frame is skipped, and the histogram gets shifted.
On the next call to _get(), it's now the +80ms bin that has that high
value, and the ratio is still more than high enough to skip a frame.
A frame is skipped, and the histogram gets shifted.
Repeat for +60, +40 and +20. In short, over the time it takes to decode
5 frames, we're also skipping 5 frames, which means you have 100ms of
audio that sounds weird.

It works well for me though, I prefer that sudden network jumps result in
an audible "jump" in dialogue rather than users not being sure that
latency is at an absolute minimum.

Come to think of it, it might actually be better if it just skipped 5
frames at once. Might be doable by shifting the histogram, and if it still
meets the criteria, keep skipping and shifting it until it doesn't meet
the criteria anymore. More work though, and less clear code.
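[Editor's sketch] The "keep skipping and shifting until it doesn't meet the criteria anymore" idea from the message above can be tried out on a toy histogram. Everything here is hypothetical and much simpler than the real jitter buffer. It uses a single short-term histogram with an assumed 11 bins of 20 ms covering -100 ms to +100 ms. The function names are invented. The shift simply moves every bin one step towards "late" and discards whatever falls off the late end. Only the thresholds (.005 and .8) come from the condition quoted in the thread.

```c
#define NBINS      11   /* assumed layout: -100 ms .. +100 ms, 20 ms per bin */
#define ONTIME_BIN 5    /* index of the 0 ms bin */

/* Sum of bins[from] .. bins[to - 1]. */
static float sum_range(const float *bins, int from, int to)
{
    float sum = 0.0f;
    for (int i = from; i < to; i++)
        sum += bins[i];
    return sum;
}

/* Short-term part of the quoted skip test: almost nothing late or on time,
 * and most packets early. */
int wants_skip(const float *bins)
{
    float late_plus_ontime = sum_range(bins, 0, ONTIME_BIN + 1);
    float early = sum_range(bins, ONTIME_BIN + 1, NBINS);
    return late_plus_ontime < .005f && early > .8f;
}

/* Skipping one frame advances the current timestamp by one frame, so every
 * recorded margin becomes 20 ms smaller: each bin moves one step towards
 * "late". Assumption: mass that falls off the late end is discarded. */
void shift_after_skip(float *bins)
{
    for (int i = 0; i < NBINS - 1; i++)
        bins[i] = bins[i + 1];
    bins[NBINS - 1] = 0.0f;
}

/* Skip as many frames as the test allows in one go and return the count.
 * The loop ends after at most NBINS shifts, because by then no early mass
 * is left in the histogram. */
int skip_until_on_time(float *bins)
{
    int skipped = 0;
    while (wants_skip(bins)) {
        shift_after_skip(bins);
        skipped++;
    }
    return skipped;
}
```

Starting with nearly all of the weight in the +100 ms bin, the loop skips five frames and leaves that weight in the 0 ms bin. That is the same total as the one-frame-per-call behaviour described above, only done in a single step.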
> > Most of it is right. Actually, would you mind if I use part of your
> > email for documenting the jitter buffer in the manual?
>
> It would be my pleasure :)

Thanks. Whenever I have some time to update the manual I'll put that in.

> >> early_ratio_XX is the sum of all the positive bins.
> >> late_ratio_XX is the sum of all the negative bins.
> >
> > Right. And only the packets that are "just in time" don't get counted in
> > any ratio.
>
> Well.. they're counted in the ontime_ratio_long and _short, right?

Right. It's there so I know how many late packets I'll have if I drop a
frame.

> One thing that might be worth mentioning: the sum of all the margins will
> never be higher than 1.0, so a test for early_ratio_short > 0.7 means
> (roughly) that 70% or more of the packets in the last short-term time
> period were early.

Note that the sum can be <1 if the buffer had a reset recently.

> > I'm not sure here what will happen. Normally, you'd want to make the
> > buffer larger than what you expect to have in it. In that case, the
> > jitter buffer would likely drop frames until it catches up.
>
> There's a problem with increasing the buffer size, btw: you need to change
> the header, which means you need to recompile both speex and your
> application. So changing the maximum number of buffered packets means you
> can't share libspeex.dll/.so with other applications.

I agree, which is why making the buffer dynamic is on the TODO list.

> Yes, but assume that after a long steady period, your network latency
> suddenly drops by 100ms. (100ms is excessive, but I see 60ms quite
> frequently from users on DSL/Cable connections who also do a bit of P2P
> on the same line)
> What happens now is that the +100ms bin starts increasing steadily,
> and suddenly it's enough to skip a frame.
> A frame is skipped, and the histogram gets shifted.
> On the next call to _get(), it's now the +80ms bin that has that high
> value, and the ratio is still more than high enough to skip a frame.
> A frame is skipped, and the histogram gets shifted.
> Repeat for +60, +40 and +20. In short, over a period to decode 5 frames,
> we're also skipping 5 frames, which means you have 100ms of audio that
> sounds weird.

Yes. And the fix would simply be to wait for silence periods (e.g.
between words) before dropping frames. It's also on the TODO list.

> Come to think of it, it might actually be better if it just skipped 5
> frames at once. Might be doable by shifting the histogram, and if it still
> meets the criteria, keep skipping and shifting it until it doesn't meet
> the criteria anymore. More work though, and less clear code.

I can probably do that after the drop during silence.

	Jean-Marc

-- 
Jean-Marc Valin <Jean-Marc.Valin@USherbrooke.ca>
Université de Sherbrooke