First, I'd just like to thank the Speex community, and Jean-Marc especially, for their great work.

I'm developing a VoIP library that uses IAX (the Asterisk protocol) as its network protocol. I've been putting off integrating Speex for a while, as things have been working pretty well so far with GSM. (For those interested, the code is at iaxclient.sourceforge.net.) However, Google has recently picked up the new signal processing stuff that has been added to Speex, and that has been very interesting to me.

I've been working with the speex-preprocessing stuff so far, with relatively impressive results. It's been a lot of fun, and I'd like to give my (subjective) feedback here:

1) AGC: This seems to work pretty well in all cases. I had previously hacked in the "compander" filter from sox for a similar effect. What I've noticed is that speex_preprocess's AGC has no "knobs", and it seems to use an attack/decay that is a lot faster than what I had chosen for the sox compander, but it works pretty well nonetheless. I think your choices may have been better. It's amazing how little difference I can hear now regardless of how I have my microphone gain set, from about 10% to 90% gain.

2) VAD: I never had a good VAD implementation in the library; I had a user-configurable audio energy threshold that did this, plus a hokey algorithm where I did a pretty naive estimate of the noise floor, and then considered anything 5 dB above that to be speech. This worked OK, but since I never updated my "noise floor" estimate, it was easily broken if additional noise was added at any time (e.g. the user raised their microphone gain). Here, I have gone in and adjusted some knobs:

    /* if (st->speech_prob > .35 || (st->last_speech < 20 && st->speech_prob > .1)) */
    if (st->speech_prob > .30 || (st->last_speech < 20 && st->speech_prob > .07))

to make it more sensitive, because I was getting some missed speech, and some dropouts.
The dropouts were especially troubling, because they caused a big degradation in speech in some cases. The second parameter helped a bit in this case, but I think there might be a smarter implementation yet -- like immediately lowering the threshold once speech is detected, and then raising it gradually based on the previous probabilities?

I had also experimented with the 3GPP AMR VAD code (which is, of course, copyrighted) to see how it compares. The AMR VAD was still better than Speex's, but Speex's is still pretty good.

3) Denoising: This option was the most interesting. Previously, using an omnidirectional microphone, like the one in a notebook, to pick up speech gave a really poor SNR; conversation was possible, but it was quite annoying. With the Speex denoising filter, it comes through really clear, pretty much as good as if one were using a headset and a directional mic right next to the speaker. Combined with AGC, this was very effective. It does have lots of "interesting" things that it does, however:

a) The most interesting thing it does is that it sometimes also de-voices speech. I.e. if you say "aaaaaaa" into the filter, after about 3 seconds, your voice just disappears :). I thought this was interesting, and I wanted to see how smart it was, so instead of a single vowel sound, I tried repeating vowel-consonant pairs, like "badumpbadumpbadump", and if I was consistent enough with that, I could make them mostly disappear as well. This was lots of fun. What it points out, though, is that denoising and, say, singing, won't get along very well at all! I'm also wondering if it could be used to cancel out a boring speaker :)

b) There are some "musical" artifacts left over. They're not huge, but I did notice them as voices faded out, etc. I'm guessing this is the de-noising, but I was using denoise + AGC at the time, so I'm not sure; if AGC is just scaling, then I guess it must be the denoise.
I'll probably add options to my UI to individually control the different filters, which will make evaluation easier.

Finally, echo cancellation. I haven't actually been able to get the echo canceller to do anything really useful for me. I'm currently using it something like this:

    ec = speex_echo_state_init(160, 500); /* in ms */
    ...
    #if defined(SPEEX_EC)
    {
        /* convert buffers to float, echo cancel, convert back */
        float finBuffer[160], foutBuffer[160], fcancBuffer[160];
        int i;
        for (i = 0; i < 160; i++)
        {
            finBuffer[i] = virtualInBuffer[i];
            foutBuffer[i] = ((short *)outputBuffer)[i];
        }
        //fprintf(stderr, "echo cancelling virtual mono frame\n");
        speex_echo_cancel(ec, finBuffer, foutBuffer, fcancBuffer, NULL);
        for (i = 0; i < 160; i++)
        {
            virtualInBuffer[i] = (short)(fcancBuffer[i]);
        }
    }
    #endif

I've also tried to use it the same way, but scaling my short samples into the range -1 < n < 1 (dividing/multiplying by 32767). When I scaled, the echo canceller seemed to have no effect at all. When I don't scale, all kinds of strange things happen :).

[As I write this, I've been trying some more things. First thing I realized is that [duh] my "frames" in this audio driver are 10ms, not 20 ms, so they're only samples long. So, it's no wonder the echo canceller didn't do anything, because each frame it was given was 10ms of real stuff, and 10ms of garbage :). After I fixed that blunder, I got the echo canceller to do _things_ but not actually cancel echo. Mostly, it introduced additional echo.]

So, my questions are:

1) How should I call the echo canceller with frames of short samples?

2) Could the apparent "no effect" be due to also later using the preprocessor on the frames? I.e. if the echo canceller is only reducing the echo by 20 dB or so, the AGC will later bring it right back. Is this the reason for the noise array? Should it work at all without that code (which I've read isn't quite complete yet)?
[I haven't tried to use that yet, because the library architecture currently has the echo canceller down in the audio driver, where it gets well-correlated input/output buffers, and the preprocessing is much higher, in the audio-device independent layer, where it only has input buffers -- so it will be a bit of work to try this out].

For echo cancellation, there's a couple of situations where users might introduce echo:

1) Stupid Windows audio driver/card setups, where it is really difficult and non-obvious, or in some cases seemingly impossible, to cause them to _not_ capture playout sound. This should be a relatively easy echo to cancel, but it's quite annoying to have to do that.

2) Acoustic echo. The normal cases, where people are using open-air loudspeakers and microphones, as well as the degenerate case, which is Apple PowerBooks, where the microphone is actually embedded _in_ the left speaker enclosure. This is what I've been testing with so far, actually. Apple's iChat AV kinda cheats in this regard a bit; they seem to only play outgoing audio from the right speaker on PowerBooks. I think they also have some API to tell if the user has put in a set of headphones, so they play through both left and right in that case.

Since I expect that one of my primary use cases for the library will involve using the application in a multi-user conference, echo will be a killer. For now, this can be alleviated by using "push to talk", but it would be nice if it were feasible to have a completely automatic setup, with VAD and echo cancellation.

Thanks again for your great work, and any comments out there on my experiences and problems.

-SteveK

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'speex-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
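Editorial sketch: the conversion loops in the message above cast the canceller's float output straight back to short. In C, converting a float whose value does not fit in the target integer type is undefined behavior, so one cautious option is to saturate first. The helper names below are hypothetical and are not part of Speex; the sketch assumes float buffers that carry samples in the unscaled 16-bit range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Speex function): widen 16-bit samples to
 * float without rescaling, so values keep their original magnitude. */
static void shorts_to_floats(const short *in, float *out, size_t n)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = (float)in[i];
}

/* Hypothetical helper: round to nearest and saturate to the 16-bit range,
 * so an overshooting filter output cannot overflow the cast. */
static short float_to_short_sat(float x)
{
    if (x >= 32767.0f)
        return 32767;
    if (x <= -32768.0f)
        return -32768;
    return (short)(x < 0.0f ? x - 0.5f : x + 0.5f);
}

/* Hypothetical helper: convert a whole frame back to 16-bit samples. */
static void floats_to_shorts(const float *in, short *out, size_t n)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = float_to_short_sat(in[i]);
}
```

With helpers like these, the two for-loops around the echo canceller call would each become a single call, and a frame that overshoots full scale would clip instead of wrapping.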
Jean-Marc Valin
2004-Aug-06 15:01 UTC
[speex-dev] Re: Preprocessing and Echo Cancellation Notes.
> 1) AGC: This seems to work pretty well in all cases. I had previously
> hacked-in the "compander" filter from sox for a similar effect. What
> I've noticed is that speex_preprocess's AGC has no "knobs", and it
> seems to use an attack/decay that is a lot faster than what I had
> chosen from the sox compander, but it works pretty well nonetheless. I
> think your choices may have been better. It's amazing how little
> difference I can hear now regardless of how I have my microphone gain
> set, from about 10% to 90% gain.

Well, good to know that it works.

> 2) VAD: I never had a good VAD implementation in the library; I had a
> user-configurable audio energy threshold that did this, plus, I had a
> hokey algorithm where I did a pretty naive estimate of the noise floor,
> and then considered anything 5dB above that to be speech. This worked
> OK, but since I never updated my "noise floor" estimate, it was easily
> broken if there was additional noise added at any time (i.e. the user
> raised their microphone gain). Here, I have gone in and adjusted some
> knobs here:
>
> /* if (st->speech_prob> .35 || (st->last_speech < 20 &&
> st->speech_prob>.1)) */
> if (st->speech_prob> .30 || (st->last_speech < 20 &&
> st->speech_prob>.07))

Well, the tuning always depends on what you're trying to achieve. Currently, the VAD is mostly tuned to make sure it doesn't start transmitting noise.

> to make it more sensitive, because I was getting some missed speech,
> and some dropouts. The dropouts were especially troubling, because
> they caused a big degradation in speech in some cases. The second
> parameter helped a bit in this case, but I think there might be a
> smarter implementation yet -- like immediately lowering the threshold
> once speech is detected, and then raising it gradually based on the
> previous probabilities?

There are probably lots of improvements that can be done...

> I had also experimented with the 3GPP AMR VAD code (which is, of
> course, copyrighted) to see how it compares, and it was still better
> than speex, but speex is still pretty good.

Well, if this VAD was able to beat the 3GPP VAD, then some people would probably lose their job :)

> a) The most interesting thing it does is sometimes it also de-voices
> speech. I.e. if you say "aaaaaaa" into the filter, after about 3
> seconds, your voice just disappears :). I thought this was
> interesting, and I wanted to see how smart it was, so instead of a
> single vowel sound, I tried repeating vowel-consonant pairs, like
> "badumpbadumpbadump", and if I was consistent enough with that, I could
> make them mostly disappear as well. This was lots of fun. What it
> points out, though, is that denoising and, say, singing, won't go along
> very well at all! I'm also wondering if it could be used to cancel out
> a boring speaker :)

Well, what you observe is the effect of noise adaptation. If (in general) a signal is stationary, there's no real way to differentiate it from noise... One easy way to solve the problem, though, is simply to increase the time over which the signal needs to be stationary to be considered as noise.

> b) There are some "musical" artifacts left over. They're not huge,
> but I did notice them as voices faded out, etc. I'm guessing this is
> de-noising, but I was using denoise + AGC at the time, so I'm not sure;
> if AGC is just scaling, then I guess it must be the denoise. I'll
> probably add options to my UI to individually control the different
> filters, which will make evaluation easier.

Musical noise is something that most (all?) denoisers have to different degrees.

> Finally, echo cancellation. I haven't actually been able to get the
> echo canceller to do anything really useful for me. I'm currently
> using it something like this:
>
> ec = speex_echo_state_init(160, 500); /* in ms */

Actually, the second parameter is in samples, since there's no way to tell the sampling rate.

> I've also tried to use it the same way, but scaling my short samples
> into the range -1< n < 1 (dividing/multiplying by 32767).

The right range is +- 32768. Actually, in the CVS version, all inputs and outputs are now short, so that solves the problem.

> 1) How should I call the echo canceller with frames of short samples?

Not sure I understand the question?

> 2) Could the apparent "no effect" be due to also later using the
> preprocessor on the frames? I.e. if the echo canceller is only
> reducing the echo by -20 db or something, the AGC will later bring it
> right back. Is this the reason for the noise array? Should it work at
> all without that code (that I've read isn't quite complete yet?). [I
> haven't tried to use that yet, because the library architecture
> currently has the echo canceller down in the audio driver, where it
> gets well-correlated input/output buffers, and the preprocessing is
> much higher, in the audio-device independent layer, where it only has
> input buffers -- so it will be a bit of work to try this out].

I'm not sure what the problem is. First, you need to know that the echo canceller is still in an experimental state. The theory of echo cancellation is rather simple, but the implementation is not. For example, in order to get good results, you need a good crosstalk detector. The current one kind of sucks.

One more thing. In your example, you have a 500 sample (not ms) filter length. However, if the input/output offset introduced by your card is larger than that (or of the same order), then you won't have any cancellation at all.

Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
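Editorial sketch: since the filter length passed to speex_echo_state_init is in samples, a tail length expressed in milliseconds has to be converted using the sampling rate. The helper below is hypothetical (it is not a Speex function), and the 8000 Hz rate used in the note after it is an inference from the 160-sample, 20 ms frames described in the thread, not something the messages state outright.

```c
#include <assert.h>

/* Hypothetical helper (not a Speex function): convert an echo-tail length
 * in milliseconds to a length in samples at the given sampling rate. */
static int tail_ms_to_samples(int tail_ms, int sample_rate_hz)
{
    assert(tail_ms >= 0 && sample_rate_hz > 0);
    return (int)(((long)tail_ms * sample_rate_hz) / 1000);
}

/* Hypothetical helper: the reverse conversion, to see how much time a
 * given filter length in samples actually covers. */
static double samples_to_ms(int samples, int sample_rate_hz)
{
    assert(samples >= 0 && sample_rate_hz > 0);
    return 1000.0 * samples / sample_rate_hz;
}
```

Under the 8000 Hz assumption, a 500 ms tail is tail_ms_to_samples(500, 8000) = 4000 samples, whereas a filter length of 500 samples covers only samples_to_ms(500, 8000) = 62.5 ms, which a sound card's input/output offset could plausibly exceed.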
Steve Kann
2004-Aug-06 15:01 UTC
[speex-dev] Re: Preprocessing and Echo Cancellation Notes.
On Nov 9, 2003, at 1:00 AM, Jean-Marc Valin wrote:

>> 2) VAD: I never had a good VAD implementation in the library; I had a
>> user-configurable audio energy threshold that did this, plus, I had a
>> hokey algorithm where I did a pretty naive estimate of the noise floor,
>> and then considered anything 5dB above that to be speech. This worked
>> OK, but since I never updated my "noise floor" estimate, it was easily
>> broken if there was additional noise added at any time (i.e. the user
>> raised their microphone gain). Here, I have gone in and adjusted some
>> knobs here:
>>
>> /* if (st->speech_prob> .35 || (st->last_speech < 20 &&
>> st->speech_prob>.1)) */
>> if (st->speech_prob> .30 || (st->last_speech < 20 &&
>> st->speech_prob>.07))
>
> Well, the tuning always depends on what you're trying to achieve.
> Currently, the VAD is mostly tuned to make sure it doesn't start
> transmitting noise.

Right. I'm actually planning on using it in two places:

1) In the VoIP client, where it is used mostly to lower upstream bandwidth (which is generally more expensive than downstream).

2) In the input to a conferencing mixer. In this case, some clients will be coming in via VoIP, and so I will not double-process them (i.e. they'll run VAD at the client), but some clients will be coming to us via the PSTN, in which case we will do VAD on them. The VAD is helpful here, because (a) the job of the conferencing mixer is greatly simplified if we only have to mix signals from actual speakers, and (b) noise is additive, so if we can just eliminate signals from non-speakers, we have less noise sent to everyone.

In either case, though, I'd rather err on the side of getting false positives, rather than false negatives.

>> to make it more sensitive, because I was getting some missed speech,
>> and some dropouts. The dropouts were especially troubling, because
>> they caused a big degradation in speech in some cases. The second
>> parameter helped a bit in this case, but I think there might be a
>> smarter implementation yet -- like immediately lowering the threshold
>> once speech is detected, and then raising it gradually based on the
>> previous probabilities?
>
> There's probably lots of improvements that can be done...

Yep. Also, since all three preprocessing functions are integrated, they can take advantage of each other's improvements.

>> I had also experimented with the 3GPP AMR VAD code (which is, of
>> course, copyrighted) to see how it compares, and it was still better
>> than speex, but speex is still pretty good.
>
> Well, if this VAD was able to beat the 3GPP VAD, then some people would
> probably lose their job :)

Hmm, or maybe they'd just use your VAD instead of theirs, and then get big bonuses for getting such great work done in such a short period of time!

>> a) The most interesting thing it does is sometimes it also de-voices
>> speech. I.e. if you say "aaaaaaa" into the filter, after about 3
>> seconds, your voice just disappears :). I thought this was
>> interesting, and I wanted to see how smart it was, so instead of a
>> single vowel sound, I tried repeating vowel-consonant pairs, like
>> "badumpbadumpbadump", and if I was consistent enough with that, I
>> could make them mostly disappear as well. This was lots of fun. What
>> it points out, though, is that denoising and, say, singing, won't go
>> along very well at all! I'm also wondering if it could be used to
>> cancel out a boring speaker :)
>
> Well, what you observe is the effect of noise adaptation. If (in
> general) a signal is stationary, there's no real way to differentiate
> it from noise... One easy way to solve the problem though is simply to
> increase the time over which the signal needs to be stationary to be
> considered as noise.

Right. I understood that much. I think it might also be possible to tune the bands in which denoising happens; i.e. don't remove (or don't completely remove) signals likely to be vowels (maybe a range from 500 Hz to 1500 Hz or something?). It might be something interesting to play with.

>> b) There are some "musical" artifacts left over. They're not huge,
>> but I did notice them as voices faded out, etc. I'm guessing this is
>> de-noising, but I was using denoise + AGC at the time, so I'm not
>> sure; if AGC is just scaling, then I guess it must be the denoise.
>> I'll probably add options to my UI to individually control the
>> different filters, which will make evaluation easier.
>
> Musical noise is something that most (all?) denoisers have at different
> degrees.

Yes, that's what I've read, which is why I at least knew the correct term for it :) I guess I should read up on the techniques for reducing their perception, and see what can be done.

>> Finally, echo cancellation. I haven't actually been able to get the
>> echo canceller to do anything really useful for me. I'm currently
>> using it something like this:
>>
>> ec = speex_echo_state_init(160, 500); /* in ms */
>
> Actually, the second parameter is in samples, since there's no way to
> tell the sampling rate.
>
>> I've also tried to use it the same way, but scaling my short samples
>> into the range -1< n < 1 (dividing/multiplying by 32767).
>
> The right range is +- 32768. Actually in the CVS version, all inputs
> and outputs are now short, so it solves the problem.
>
>> 1) How should I call the echo canceller with frames of short samples?
>
> Not sure I understand the question?

You've already answered it, I think, with the range answer above.

>> 2) Could the apparent "no effect" be due to also later using the
>> preprocessor on the frames? I.e. if the echo canceller is only
>> reducing the echo by -20 db or something, the AGC will later bring it
>> right back. Is this the reason for the noise array? Should it work
>> at all without that code (that I've read isn't quite complete yet?).
>> [I haven't tried to use that yet, because the library architecture
>> currently has the echo canceller down in the audio driver, where it
>> gets well-correlated input/output buffers, and the preprocessing is
>> much higher, in the audio-device independent layer, where it only has
>> input buffers -- so it will be a bit of work to try this out].
>
> I'm not sure what's the problem. First, you need to know that the echo
> canceller is still in experimental state. The theory of echo
> cancellation is rather simple, but the implementation is not. For
> example, in order to get good results, you need a good crosstalk
> detector. The current one kind of sucks. One thing too. In your
> example, you have a 500 sample (not ms) filter length. However, if the
> input/output offset introduced by your card is larger than that (or in
> the same order), then you won't have any cancellation at all.

I'll play with it some more, using the correct parameters, and let you know how I fare.

Thanks for your quick response, even on the weekend!

-SteveK
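Editorial sketch: the idea floated in this thread, dropping the decision threshold as soon as speech is detected and then raising it gradually, can be illustrated as follows. This is not the Speex preprocessor's logic (which, per the quoted snippet, uses a fixed pair of thresholds and a last_speech counter); the struct, the function names, and the linear recovery step are all assumptions made for illustration, and the recovery here depends only on elapsed non-speech frames rather than on previous probabilities.

```c
#include <assert.h>

/* Hypothetical state for a threshold-with-hangover speech decision. */
typedef struct {
    float high;     /* threshold used after a long stretch of non-speech */
    float low;      /* threshold used right after speech was detected */
    float recover;  /* how much the threshold rises per non-speech frame */
    float current;  /* threshold applied to the next frame */
} VadHyst;

static void vad_hyst_init(VadHyst *v, float high, float low, float recover)
{
    assert(v != 0 && low <= high && recover > 0.0f);
    v->high = high;
    v->low = low;
    v->recover = recover;
    v->current = high;
}

/* Returns 1 if the frame is classified as speech, 0 otherwise.
 * speech_prob is assumed to be a per-frame speech probability in [0, 1]. */
static int vad_hyst_update(VadHyst *v, float speech_prob)
{
    int is_speech = speech_prob > v->current;
    if (is_speech) {
        /* Speech detected: become more permissive immediately. */
        v->current = v->low;
    } else {
        /* No speech: creep back toward the strict threshold. */
        v->current += v->recover;
        if (v->current > v->high)
            v->current = v->high;
    }
    return is_speech;
}
```

Initialised with the thread's tuned values as vad_hyst_init(&v, 0.30f, 0.07f, 0.01f), a frame needs a probability above 0.30 to start a speech segment, but a following frame only needs to exceed 0.07, which leans toward false positives rather than dropouts.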