Hello Jean-Marc et al:

On 07/06/07, Jean-Marc Valin <jean-marc.valin@usherbrooke.ca> wrote:
> > - Is there a reference somewhere (other than the source itself) that
> > explains how the latest VAD algorithm works?
>
> Read the source, Luke :-) (sorry)

Okay. I had to ask :-)

> > - Is it possible to obtain the VAD status of a Speex stream
> > asynchronously? The current API seems to imply that some kind of
> > polling is required to determine the voice/non-voice status.
>
> Don't understand your question. Also which VAD are you talking about?
> The one in the encoder or the one in the preprocessor?

Either one. The question is: If we treat the software like a black
box, and we feed in PCM audio, we get Speex encoded data out. Where is
the information that indicates whether the encoded data contains
speech or not? The API has a "get VAD status", but it seems like that
might only indicate whether VAD is currently enabled. Perhaps the VAD
status is contained somewhere in the data frames?

> > - Does the VAD algorithm implement syllabic/sonorant rate detection,
> > as has been implemented many times in analog circuitry, and is
> > described in this (and other) papers?
> > http://people.csail.mit.edu/jrg/2005/IS05_schutte.pdf
>
> As far as I understand, the paper you reference above isn't applicable
> to the problem here. Basically, we have to decide whether we have speech
> or silence based only on 20 ms of audio (and the past). If we could
> "look into the future" of the signals, things would be much easier.
>
> > - Over what time period is VAD done? Is it done on a frame by frame
> > basis or over some longer period?
>
> It *has* to be done frame by frame, otherwise you add latency, which
> isn't acceptable.

Okay. What I was trying to determine was whether or not the speech
detection was done with something more sophisticated than frame
energy. As you said above, I'll have to look at the sources. For many
systems, sonorant energy rate detection is used to detect voice, even
under very poor SNR conditions.

Cheers,
--
Larry Gadallah, VE6VQ/W7                     lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB F11B 7DB9 C79A EA7E B25B
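The "get VAD status" control mentioned above is presumably the SPEEX_GET_VAD
encoder control, which only reports whether VAD is enabled on the encoder,
not a per-frame speech/non-speech decision. A minimal sketch of those
controls (illustrative only, not taken from this thread):

    #include <speex/speex.h>
    #include <stdio.h>

    int main(void)
    {
        /* Narrowband encoder; speex_nb_mode is provided by libspeex. */
        void *enc = speex_encoder_init(&speex_nb_mode);
        spx_int32_t vad = 1;

        /* Enable VAD on the encoder. */
        speex_encoder_ctl(enc, SPEEX_SET_VAD, &vad);

        /* SPEEX_GET_VAD reads back the enable flag, not a speech/no-speech
         * decision for any particular frame. */
        vad = 0;
        speex_encoder_ctl(enc, SPEEX_GET_VAD, &vad);
        printf("VAD enabled: %d\n", (int)vad);

        speex_encoder_destroy(enc);
        return 0;
    }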
> Either one. The question is: If we treat the software like a black
> box, and we feed in PCM audio, we get Speex encoded data out. Where is
> the information that indicates whether the encoded data contains
> speech or not? The API has a "get VAD status", but it seems like that
> might only indicate whether VAD is currently enabled. Perhaps the VAD
> status is contained somewhere in the data frames?

Look at the return value of either speex_encode() or speex_preprocess_run().

> Okay. What I was trying to determine was whether or not the speech
> detection was done with something more sophisticated than frame
> energy. As you said above, I'll have to look at the sources. For many
> systems, sonorant energy rate detection is used to detect voice, even
> under very poor SNR conditions.

I *do* use more than the frame energy. I use the pitch and (IIRC) one or
two other things. However, it's still *very* hard to do any sort of good
detection based only on 20 ms. Give me 1 second of latency and it would
be *much* easier -- though completely useless.

	Jean-Marc
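To make that pointer concrete: with VAD enabled, speex_preprocess_run()
returns 1 when it classifies the 20 ms frame as speech and 0 otherwise, so
the per-frame decision is simply its return value. A minimal sketch,
assuming the libspeex preprocessor API of this era (the frame size and the
all-zero test frame are only illustrative):

    #include <speex/speex_preprocess.h>
    #include <stdio.h>

    #define FRAME_SIZE 160            /* 20 ms at 8 kHz */

    int main(void)
    {
        SpeexPreprocessState *st = speex_preprocess_state_init(FRAME_SIZE, 8000);
        spx_int32_t on = 1;
        speex_preprocess_ctl(st, SPEEX_PREPROCESS_SET_VAD, &on);  /* enable VAD */

        /* One 20 ms frame of audio; all zeros here as a stand-in for real PCM. */
        spx_int16_t frame[FRAME_SIZE] = {0};

        /* Return value is the per-frame VAD decision: 1 = speech, 0 = not. */
        int is_speech = speex_preprocess_run(st, frame);
        printf("%s\n", is_speech ? "speech frame" : "silence/noise frame");

        speex_preprocess_state_destroy(st);
        return 0;
    }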
Hello Jean-Marc:

On 08/06/07, Jean-Marc Valin <jean-marc.valin@usherbrooke.ca> wrote:
> > Either one. The question is: If we treat the software like a black
> > box, and we feed in PCM audio, we get Speex encoded data out. Where is
> > the information that indicates whether the encoded data contains
> > speech or not? The API has a "get VAD status", but it seems like that
> > might only indicate whether VAD is currently enabled. Perhaps the VAD
> > status is contained somewhere in the data frames?
>
> Look at the return value of either speex_encode() or speex_preprocess_run().

OK. Thanks.

> > Okay. What I was trying to determine was whether or not the speech
> > detection was done with something more sophisticated than frame
> > energy. As you said above, I'll have to look at the sources. For many
> > systems, sonorant energy rate detection is used to detect voice, even
> > under very poor SNR conditions.
>
> I *do* use more than the frame energy. I use the pitch and (IIRC) one or
> two other things. However, it's still *very* hard to do any sort of good
> detection based only on 20 ms. Give me 1 second of latency and it would
> be *much* easier -- though completely useless.

While I can agree with this if you are dealing with real-time,
full-duplex links, for my application (non-real-time, half-duplex) the
latency has no effect at all. Do you know of anyone else who has
implemented some post-processing software to provide more "exotic"
speech detection, even at the expense of increased latency?

Cheers,
--
Larry Gadallah, VE6VQ/W7                     lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB F11B 7DB9 C79A EA7E B25B
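For what it's worth, one simple way to spend that extra latency in an
offline, half-duplex setup (purely an illustrative sketch, not something
Speex itself provides) is to keep the raw per-frame flags returned by
speex_preprocess_run() and smooth them afterwards, for example with a
majority vote over a window of roughly one second; the window length and
threshold below are arbitrary choices:

    #include <stddef.h>

    #define WINDOW 50   /* 50 x 20 ms frames = roughly 1 second of lookahead */

    /* Majority-vote smoothing of raw per-frame VAD flags (0/1). Frame i is
     * marked as speech if most frames in the window centred on it were
     * flagged as speech by the frame-by-frame detector. */
    void smooth_vad(const int *raw, int *smoothed, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            size_t lo = (i > WINDOW / 2) ? i - WINDOW / 2 : 0;
            size_t hi = (i + WINDOW / 2 < n) ? i + WINDOW / 2 : n - 1;
            size_t votes = 0;
            for (size_t j = lo; j <= hi; j++)
                votes += (raw[j] != 0);
            smoothed[i] = (2 * votes > hi - lo + 1);  /* simple majority */
        }
    }

Anything more elaborate, such as the sonorant-rate detection mentioned
earlier, could be slotted into the same post-processing pass, since in a
non-real-time application both the raw flags and the original audio are
still available after the fact.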