> Jean-Marc Valin wrote:
>> Andy Ross wrote:
>>> Not knowing how VAD works, I can't say for sure.
>> There are many ways to implement a VAD.
>
> I meant "not knowing how speex's VAD works", of course, not VAD
> in general. If you would stop interpreting everything I say in
> the least charitable manner, this might be going more smoothly
> than it is.
Well, that's what you wrote :-) But sorry for misinterpreting. It's hard
to know what knowledge someone has from an email.
> (Tom was right, by the way, I was presuming that DENOISE was the
> feature at fault, but he's convinced me that I should really be
> looking at VAD)
Indeed, you are doing a form of VAD -- and I totally agree that both VADs
I have in Speex (there's one in the encoder and one in the preprocessor)
suck to a certain extent.
> What I implemented is, in the strict sense, called "squelch". As
> an algorithm, it predates digital signal processing by several
> decades. Here is a quick definition and overview if you aren't
> familiar with analog stuff: http://en.wikipedia.org/wiki/Squelch
>
> It is not a sexy algorithm, and you won't find many academic
> papers on it. But (and Tom's post bears this out) it has the
> distinct advantage of *outperforming* Speex's internal tools (the
> ones designed to solve the same problem!) for at least two use
> cases. Like I said, I bet you that most or all of your
> production users are doing this in one form or another.
I wasn't aware of the name, but I was aware of the algorithm. I'm also
aware that it annoys the hell out of people as soon as there's a bit of
background noise, because then they think the conversation has been cut
every time the algo switches the sound off.
> Seriously: take your laptop and run a quick speex transcoding
> with AGC enabled. Speak a little, then type something quickly,
> then speak a little more. Play it back. Now do the same thing
> with Skype, or Teamspeak, or Ventrilo, or pretty much any
> production VoIP application. Why does speex encode the typing
> while nothing else does? I don't have source code to those
> products, but I'm all but certain it's because they include a
> squelch feature in their preprocessing.
You're mixing up AGC and VAD now. The difficulty in the AGC is knowing
the level of the speech, which is quite hard if you're not yet sure
you've heard any speech at all. The AGC has improved in svn, but it
might need some more tweaking. In any case, it's totally independent of
the VAD and of whether you implement that with squelch.
> If (1) you consider VAD and denoise are important enough to
> address in the codec library,
They are, and they're both pretty hard to get right 100% of the time.
VAD especially is next to impossible to do on very short windows (but
trivial on a 1-second window). At some point, I actually did manage to
get the preprocessor VAD (sort of) robust to keyboard clicks, but then
it failed in other scenarios, so I had to settle for a sort of
"intermediate" tuning.
> (2) there is a simpler algorithm
> that works better in at least some circumstances, and
I'm sure there are, but Squelch isn't one of them. Just for fun, try
adding white noise at 20 dB SNR and listen to the result with Squelch.
Not pretty, is it?
> (3) most of
> your users (re)write it themselves anyway, shouldn't you at least
> *consider* adding a squelch feature? Or at least fixing/retuning
> VAD so it works as well?
Adding Squelch isn't planned, but improving the VAD is definitely on the
TODO list. You could also test the preprocessor VAD and see how it
works. I suggest you try both 1.2beta1 and svn, though I expect 1.2beta1
would work better for the VAD.
Jean-Marc