Hello Jean-Marc et al:

On 07/06/07, Jean-Marc Valin <jean-marc.valin@usherbrooke.ca> wrote:
> > - Is there a reference somewhere (other than the source itself) that
> > explains how the latest VAD algorithm works?
>
> Read the source, Luke :-) (sorry)

Okay. I had to ask :-)

> > - Is it possible to obtain the VAD status of a Speex stream
> > asynchronously? The current API seems to imply that some kind of
> > polling is required to determine the voice/non-voice status.
>
> Don't understand your question. Also which VAD are you talking about?
> The one in the encoder or the one in the preprocessor?

Either one. The question is: If we treat the software like a black
box, and we feed in PCM audio, we get Speex encoded data out. Where is
the information that indicates whether the encoded data contains
speech or not? The API has a "get VAD status", but it seems like that
might only indicate whether VAD is currently enabled. Perhaps the VAD
status is contained somewhere in the data frames?

> > - Does the VAD algorithm implement syllabic/sonorant rate detection,
> > as has been implemented many times in analog circuitry, and is
> > described in this (and other) papers?
> > http://people.csail.mit.edu/jrg/2005/IS05_schutte.pdf
>
> As far as I understand, the paper you reference above isn't applicable
> to the problem here. Basically, we have to decide whether we have speech
> or silence based only on 20 ms of audio (and the past). If we could
> "look into the future" of the signals, things would be much easier.
>
> > - Over what time period is VAD done? Is it done on a frame by frame
> > basis or over some longer period?
>
> It *has* to be done frame by frame, otherwise you add latency, which
> isn't acceptable.

Okay. What I was trying to determine was whether or not the speech
detection was done with something more sophisticated than frame
energy. As you said above, I'll have to look at the sources. For many
systems, sonorant energy rate detection is used to detect voice, even
under very poor SNR conditions.

Cheers,
--
Larry Gadallah, VE6VQ/W7                     lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB F11B 7DB9 C79A EA7E B25B
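The "get VAD status" control mentioned above is presumably the SPEEX_GET_VAD
encoder control, which only reports whether VAD is enabled on the encoder,
not a per-frame speech/non-speech decision. A minimal sketch of those
controls (illustrative only, not taken from this thread):

    #include <speex/speex.h>
    #include <stdio.h>

    int main(void)
    {
        /* Narrowband encoder; speex_nb_mode is provided by libspeex. */
        void *enc = speex_encoder_init(&speex_nb_mode);
        spx_int32_t vad = 1;

        /* Enable VAD on the encoder. */
        speex_encoder_ctl(enc, SPEEX_SET_VAD, &vad);

        /* SPEEX_GET_VAD reads back the enable flag, not a speech/no-speech
         * decision for any particular frame. */
        vad = 0;
        speex_encoder_ctl(enc, SPEEX_GET_VAD, &vad);
        printf("VAD enabled: %d\n", (int)vad);

        speex_encoder_destroy(enc);
        return 0;
    }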
> Either one. The question is: If we treat the software like a black
> box, and we feed in PCM audio, we get Speex encoded data out. Where is
> the information that indicates whether the encoded data contains
> speech or not? The API has a "get VAD status", but it seems like that
> might only indicate whether VAD is currently enabled. Perhaps the VAD
> status is contained somewhere in the data frames?

Look at the return value of either speex_encode() or speex_preprocess_run().

> Okay. What I was trying to determine was whether or not the speech
> detection was done with something more sophisticated than frame
> energy. As you said above, I'll have to look at the sources. For many
> systems, sonorant energy rate detection is used to detect voice, even
> under very poor SNR conditions.

I *do* use more than the frame energy. I use the pitch and (IIRC) one or
two other things. However, it's still *very* hard to do any sort of good
detection based only on 20 ms. Give me 1 second of latency and it would
be *much* easier -- though completely useless.

	Jean-Marc
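To make that pointer concrete: with VAD enabled, speex_preprocess_run()
returns 1 when it classifies the 20 ms frame as speech and 0 otherwise, so
the per-frame decision is simply its return value. A minimal sketch,
assuming the libspeex preprocessor API of this era (the frame size and the
all-zero test frame are only illustrative):

    #include <speex/speex_preprocess.h>
    #include <stdio.h>

    #define FRAME_SIZE 160            /* 20 ms at 8 kHz */

    int main(void)
    {
        SpeexPreprocessState *st = speex_preprocess_state_init(FRAME_SIZE, 8000);
        spx_int32_t on = 1;
        speex_preprocess_ctl(st, SPEEX_PREPROCESS_SET_VAD, &on);  /* enable VAD */

        /* One 20 ms frame of audio; all zeros here as a stand-in for real PCM. */
        spx_int16_t frame[FRAME_SIZE] = {0};

        /* Return value is the per-frame VAD decision: 1 = speech, 0 = not. */
        int is_speech = speex_preprocess_run(st, frame);
        printf("%s\n", is_speech ? "speech frame" : "silence/noise frame");

        speex_preprocess_state_destroy(st);
        return 0;
    }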
Hello Jean-Marc:

On 08/06/07, Jean-Marc Valin <jean-marc.valin@usherbrooke.ca> wrote:
> > Either one. The question is: If we treat the software like a black
> > box, and we feed in PCM audio, we get Speex encoded data out. Where is
> > the information that indicates whether the encoded data contains
> > speech or not? The API has a "get VAD status", but it seems like that
> > might only indicate whether VAD is currently enabled. Perhaps the VAD
> > status is contained somewhere in the data frames?
>
> Look at the return value of either speex_encode() or speex_preprocess_run().

OK. Thanks.

> > Okay. What I was trying to determine was whether or not the speech
> > detection was done with something more sophisticated than frame
> > energy. As you said above, I'll have to look at the sources. For many
> > systems, sonorant energy rate detection is used to detect voice, even
> > under very poor SNR conditions.
>
> I *do* use more than the frame energy. I use the pitch and (IIRC) one or
> two other things. However, it's still *very* hard to do any sort of good
> detection based only on 20 ms. Give me 1 second of latency and it would
> be *much* easier -- though completely useless.

While I can agree with this if you are dealing with real-time,
full-duplex links, for my application (non-real-time, half-duplex) the
latency has no effect at all. Do you know of anyone else who has
implemented some post-processing software to provide more "exotic"
speech detection, even at the expense of increased latency?

Cheers,
--
Larry Gadallah, VE6VQ/W7                     lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB F11B 7DB9 C79A EA7E B25B
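For what it's worth, one simple way to spend that extra latency in an
offline, half-duplex setup (purely an illustrative sketch, not something
Speex itself provides) is to keep the raw per-frame flags returned by
speex_preprocess_run() and smooth them afterwards, for example with a
majority vote over a window of roughly one second; the window length and
threshold below are arbitrary choices:

    #include <stddef.h>

    #define WINDOW 50   /* 50 x 20 ms frames = roughly 1 second of lookahead */

    /* Majority-vote smoothing of raw per-frame VAD flags (0/1). Frame i is
     * marked as speech if most frames in the window centred on it were
     * flagged as speech by the frame-by-frame detector. */
    void smooth_vad(const int *raw, int *smoothed, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            size_t lo = (i > WINDOW / 2) ? i - WINDOW / 2 : 0;
            size_t hi = (i + WINDOW / 2 < n) ? i + WINDOW / 2 : n - 1;
            size_t votes = 0;
            for (size_t j = lo; j <= hi; j++)
                votes += (raw[j] != 0);
            smoothed[i] = (2 * votes > hi - lo + 1);  /* simple majority */
        }
    }

Anything more elaborate, such as the sonorant-rate detection mentioned
earlier, could be slotted into the same post-processing pass, since in a
non-real-time application both the raw flags and the original audio are
still available after the fact.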