Hello all:

I am interested in using Speex for an application that streams audio from a (noisy) source, so I am interested in VAD and DTX operation. However, after browsing the archives of this list, I note that a number of people have not been satisfied with the operation of the VAD algorithm in Speex. This leads me to a few questions:

- Is there a reference somewhere (other than the source itself) that explains how the latest VAD algorithm works?

- Is it possible to obtain the VAD status of a Speex stream asynchronously? The current API seems to imply that some kind of polling is required to determine the voice/non-voice status.

- Does the VAD algorithm implement syllabic/sonorant rate detection, as has been implemented many times in analog circuitry, and is described in this (and other) papers?
  http://people.csail.mit.edu/jrg/2005/IS05_schutte.pdf

- Over what time period is VAD done? Is it done on a frame by frame basis or over some longer period?

Thank you,
-- 
Larry Gadallah, VE6VQ/W7
lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB F11B 7DB9 C79A EA7E B25B

> - Is there a reference somewhere (other than the source itself) that
> explains how the latest VAD algorithm works?

Read the source, Luke :-) (sorry)

> - Is it possible to obtain the VAD status of a Speex stream
> asynchronously? The current API seems to imply that some kind of
> polling is required to determine the voice/non-voice status.

Don't understand your question. Also which VAD are you talking about? The one in the encoder or the one in the preprocessor?

> - Does the VAD algorithm implement syllabic/sonorant rate detection,
> as has been implemented many times in analog circuitry, and is
> described in this (and other) papers?
> http://people.csail.mit.edu/jrg/2005/IS05_schutte.pdf

As far as I understand, the paper you reference above isn't applicable to the problem here. Basically, we have to decide whether we have speech or silence based only on 20 ms of audio (and the past). If we could "look into the future" of the signals, things would be much easier.

> - Over what time period is VAD done? Is it done on a frame by frame
> basis or over some longer period?

It *has* to be done frame by frame, otherwise you add latency, which isn't acceptable.

Jean-Marc

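To make the "current frame plus the past" constraint concrete, here is a minimal sketch in C of a causal, frame-by-frame detector. It is not the Speex VAD and is not taken from the Speex source: it is a plain energy-versus-noise-floor detector with a short hangover. The names `ToyVad`, `toy_vad_init` and `toy_vad_process`, the 8 kHz sample rate, and all thresholds are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_SIZE 160      /* 20 ms at an assumed 8 kHz sample rate */
#define HANGOVER_FRAMES 5   /* arbitrary: hold "speech" for 100 ms after activity */

/* All state carried between frames; nothing here looks ahead. */
typedef struct {
    double noise_floor;     /* running estimate of background energy */
    int    hangover;        /* frames left to keep reporting speech */
    int    initialized;     /* 0 until the first frame has been seen */
} ToyVad;

void toy_vad_init(ToyVad *v)
{
    v->noise_floor = 0.0;
    v->hangover = 0;
    v->initialized = 0;
}

/* Mean-square energy of one frame. */
static double frame_energy(const int16_t *frame, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)frame[i] * (double)frame[i];
    return acc / (double)n;
}

/* Returns 1 for speech, 0 for non-speech, using only the current
 * frame and state accumulated from earlier frames. */
int toy_vad_process(ToyVad *v, const int16_t *frame, size_t n)
{
    assert(v != NULL && frame != NULL && n > 0);
    double e = frame_energy(frame, n);

    if (!v->initialized) {
        /* Assume the very first frame is background noise. */
        v->noise_floor = e;
        v->initialized = 1;
        return 0;
    }

    /* Active if about 6 dB above the floor, plus a small absolute
     * offset so a near-zero floor does not trigger on tiny signals. */
    int active = e > 4.0 * v->noise_floor + 100.0;

    if (active) {
        v->hangover = HANGOVER_FRAMES;
        return 1;
    }

    /* Adapt the noise floor only on frames that are not active. */
    v->noise_floor = 0.95 * v->noise_floor + 0.05 * e;

    if (v->hangover > 0) {
        v->hangover--;
        return 1;           /* bridge short pauses between syllables */
    }
    return 0;
}
```

The caller would invoke `toy_vad_process` once per 20 ms frame and get the decision back immediately, so the detector adds no latency beyond the frame itself. The price is the one described above: without lookahead, a detector of this kind tends to react late to speech onsets and needs a hangover to avoid clipping word endings.
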
Hello Jean-Marc et al:

On 07/06/07, Jean-Marc Valin <jean-marc.valin@usherbrooke.ca> wrote:

> > - Is there a reference somewhere (other than the source itself) that
> > explains how the latest VAD algorithm works?
>
> Read the source, Luke :-) (sorry)

Okay. I had to ask :-)

> > - Is it possible to obtain the VAD status of a Speex stream
> > asynchronously? The current API seems to imply that some kind of
> > polling is required to determine the voice/non-voice status.
>
> Don't understand your question. Also which VAD are you talking about?
> The one in the encoder or the one in the preprocessor?

Either one. The question is: If we treat the software like a black box, and we feed in PCM audio, we get Speex encoded data out. Where is the information that indicates whether the encoded data contains speech or not? The API has a "get VAD status", but it seems like that might only indicate whether VAD is currently enabled. Perhaps the VAD status is contained somewhere in the data frames?

> > - Does the VAD algorithm implement syllabic/sonorant rate detection,
> > as has been implemented many times in analog circuitry, and is
> > described in this (and other) papers?
> > http://people.csail.mit.edu/jrg/2005/IS05_schutte.pdf
>
> As far as I understand, the paper you reference above isn't applicable
> to the problem here. Basically, we have to decide whether we have speech
> or silence based only on 20 ms of audio (and the past). If we could
> "look into the future" of the signals, things would be much easier.
>
> > - Over what time period is VAD done? Is it done on a frame by frame
> > basis or over some longer period?
>
> It *has* to be done frame by frame, otherwise you add latency, which
> isn't acceptable.

Okay. What I was trying to determine was whether or not the speech detection was done with something more sophisticated than frame energy. As you said above, I'll have to look at the sources. For many systems, sonorant energy rate detection is used to detect voice, even under very poor SNR conditions.

Cheers,
-- 
Larry Gadallah, VE6VQ/W7
lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB F11B 7DB9 C79A EA7E B25B