Hey, sorry to hijack this thread, but I just remembered a request I wanted to make to the speex devs.

I tried using the activity detector, but I just couldn't get it working well. I ended up using my own, which I think just considered voice on if the level passed a certain threshold (I know, pretty primitive). I also tried one that checked for a signal, i.e. whether the strongest frequency was above a threshold. I don't remember what function it used, but it was very simple, not an FFT, more like an autocorrelation or something, and it didn't work any better than loudness detection. So I would like to use speex's.

Anyway, my request is, can you build in a pre and post buffer into the VAD? In mine, if I detect voice any time between now and say a quarter second later, I start sending, and then I wait a half second or whatever after I stop detecting. You pretty much have to have this, or people start getting anxious talking over an internet stream. They have to enunciate expressions like "ya probably" because the ya isn't detected, only the probably. By sending a bit of padding around the detection, it also prevents the detector from dropping out mid-sentence. It takes it from being a screaming contest over a walkie talkie, to a normal telephone conversation.

You might be reluctant to do this, because you have to add in some state information instead of just focusing on the current buffer, but the quality improvement is enormous. I'd just like to be able to pass a pre and post value to the VAD in milliseconds, defaulting to either 0 or values similar to what I quoted above. And I realize this can add some delay, but even detecting a single extra syllable makes a world of difference.

Well, thanks for your time,

--Zack

> Anyway, my request is, can you build in a pre and post buffer into the
> VAD? In mine, if I detect voice any time between now and say a quarter
> second later, I start sending, and then I wait a half second or whatever
> after I stop detecting. You pretty much have to have this, or people
> start getting anxious talking over an internet stream. They have to
> enunciate expressions like "ya probably" because the ya isn't detected,
> only the probably. By sending a bit of padding around the detection, it
> also prevents the detector from dropping out mid-sentence. It takes it
> from being a screaming contest over a walkie talkie, to a normal
> telephone conversation.
>
> You might be reluctant to do this, because you have to add in some state
> information instead of just focusing on the current buffer, but the
> quality improvement is enormous. I'd just like to be able to pass a pre
> and post value to the VAD in milliseconds, defaulting to either 0 or
> values similar to what I quoted above. And I realize this can add some
> delay, but even detecting a single extra syllable makes a world of
> difference.

If you like to buffer speech, just do it. There's no reason you need to have the buffer in the VAD itself.

Jean-Marc

On Feb 15, 2008, at 2:09 PM, Jean-Marc Valin wrote:

>> Anyway, my request is, can you build in a pre and post buffer into the
>> VAD? In mine, if I detect voice any time between now and say a quarter
>> second later, I start sending, and then I wait a half second or whatever
>> after I stop detecting. You pretty much have to have this, or people
>> start getting anxious talking over an internet stream. They have to
>> enunciate expressions like "ya probably" because the ya isn't detected,
>> only the probably. By sending a bit of padding around the detection, it
>> also prevents the detector from dropping out mid-sentence. It takes it
>> from being a screaming contest over a walkie talkie, to a normal
>> telephone conversation.
>>
>> You might be reluctant to do this, because you have to add in some state
>> information instead of just focusing on the current buffer, but the
>> quality improvement is enormous. I'd just like to be able to pass a pre
>> and post value to the VAD in milliseconds, defaulting to either 0 or
>> values similar to what I quoted above. And I realize this can add some
>> delay, but even detecting a single extra syllable makes a world of
>> difference.
>
> If you like to buffer speech, just do it. There's no reason you need to
> have the buffer in the VAD itself.

Well, what I am actually asking for is to have the VAD report detection a little before and after the sound. This is pretty easy conceptually, but it really messes up the code if every single user has to re-create it. I faked it by just queuing up a FIFO of buffers: if the VAD detects something, I begin sending from a little before that point, and then I continue sending for a little after. I know this is old hat for the devs here, but it is a pretty complex thing for Joe Blow, and honestly I don't see many people ever doing it.
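
For anyone following along, here is a minimal sketch of that FIFO idea. It is written in Python for brevity and is not code from speex or from any application mentioned in this thread. The class name `PaddedVad`, its parameters, and the 20 ms frame length are all assumptions made for illustration; `voiced` stands for whatever per-frame yes/no decision your detector gives you.

```python
from collections import deque


class PaddedVad:
    """Hypothetical wrapper that pads per-frame VAD decisions.

    pre_ms:   how much audio before the first voiced frame to send
    post_ms:  how long to keep sending after the last voiced frame
    frame_ms: assumed frame duration (20 ms is only an example value)
    """

    def __init__(self, pre_ms=250, post_ms=500, frame_ms=20):
        # Round up so the padding is never shorter than requested.
        self.pre_frames = -(-pre_ms // frame_ms)
        self.post_frames = -(-post_ms // frame_ms)
        # FIFO of the most recent frames that have not been sent.
        self.history = deque(maxlen=self.pre_frames)
        # Number of unvoiced frames we are still willing to send.
        self.hangover = 0

    def push(self, frame, voiced):
        """Feed one frame; return the list of frames to send now."""
        if voiced:
            # Flush the buffered pre-roll, then the current frame.
            out = list(self.history) + [frame]
            self.history.clear()
            self.hangover = self.post_frames
            return out
        if self.hangover > 0:
            # Still inside the post-detection padding window.
            self.hangover -= 1
            return [frame]
        # Silence: remember the frame in case speech starts soon.
        self.history.append(frame)
        return []
```

Note that in this sketch the pre-roll frames go out in a burst at the moment speech is detected, so the receiver has to absorb them (for example in its jitter buffer). The alternative is to delay the whole stream by `pre_ms`, which is the added delay mentioned earlier in the thread.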

People may get the perception that they really have to enunciate to get speex to work, but that's due more to the VAD's approach than to any limitation in speex itself.

--Zack