>> Generate a test signal (10+x sine waves per frame), where x increases by
>> one for each iteration, and wraps around at 100.
>
> Testing with sine waves is usually not a good idea. If you intend on
> cancelling speech, then test with speech.

Ok, I tested more extensively with both music and two-way speech. More on
this below.

>> However, when peeking at the values, it seems that the weights for
>> frame 0 (newest) are very low.
>
> Peeking at the values tells you nothing unless you do the inverse FFT and
> all, so you can see them in the time domain. Even then, it's not that
> useful.

Actually, computing the "power spectrum" for each frame of W shows how large
an amount of the original signal at time offset j the echo canceller thinks
should be removed from the current input frame. If you compute W*X for each
j and ifft, you'll get the original signal with each frequency component
scaled and time-shifted according to what W was (for that j).

Anyway, I did some proper testing. I took my headset, bent the microphone
arm so it's resting inside the .. uh.. whatever you call that large muffler
thing that goes around your ear. This is an important testcase, as a lot of
our users have complained about hearing echo that is propagated at the
remote end either directly through the air from the "speaker" to the
microphone (common with open headsets), and with closed headsets we see echo
propagated mechanically down the arm of the microphone.

Playing regular pop music (Garbage: Push It), things work out well, and the
canceller ends up with semi-stable weights, almost entirely in the (j==M-1)
bin (0-20 ms delay, which is quite natural). It's the same with normal
speech as long as it's spoken reasonably fast.

I see some "banding" of the output; it seems there's more output signal (and
more to cancel) in the 1-3 kHz and 5-6 kHz area, but I blame that on the
headphones; they're cheap.

However, when switching to AC/DC: Big Gun, we see and hear a large residual
echo from the opening electric guitar. This seems to be a result of a
semi-stable sound that lasts more than 20 ms; the canceller finds a
correlation in 4-5 time bins instead of just one. We could reproduce the
same result by playing a human voice saying "aaaaaaaaaa" without variation
in pitch; the weights for those frequency bins would increase for all the
time slots in W.

Now, people don't say "aaaaaaaaaaaaa" all that often, but they do play music
that has a few "long" sounds, and saying "aaaaanyway" is enough to trigger
this.

Next test: what happens if the user has an external (physical) on-off
switch? Same setup, playing Big Gun as loud as it gets. Apart from the
problems with the opening guitar everything is good, and we see the weights
set as they should be and things are cancelled out.

So, I switch the mic off externally with the switch. Input becomes
practically zero, so the weights readjust to zero as well. Turn the
microphone back on and the echo canceller doesn't adapt. That is, no echo
cancellation, and the weights all stay at their zero values.

This can happen quite frequently, so it would be nice if the echo canceller
could deal with this situation without a complete reset.

Now, when trying to visualize the weights to see a bit of what was going on,
I also computed the phase for each frequency bin. When looking just at the
phase, I can see a very clear and distinct pattern of going from -pi to +pi
in the areas where I know there is echo (specifically, the lower 7 kHz of
j==M-1), and what looks like random noise for the rest.

Do you have any idea where this pattern originates from, and more
importantly, could it be used as additional conditioning of W? (i.e. if the
phase doesn't match the pattern, reduce the amplitude as it's a false
match.)
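For reference, this is roughly how I'm extracting the per-tap power and
phase I'm talking about above. The packed real/imag layout is my guess from
reading power_spectrum(), so the indexing may well be off by one, and the
names here are mine, not from the code:

#include <math.h>

/* Dump power and phase for one tap Wj of the weight array (length N).
   Assumed packing: Wj[0] = DC (real only), Wj[2*k-1]/Wj[2*k] = real/imag
   of bin k for k = 1..N/2-1, Wj[N-1] = Nyquist (real only).
   power[] and phase[] need N/2+1 elements. */
void dump_tap(const float *Wj, int N, float *power, float *phase)
{
    int k;
    power[0]   = Wj[0]*Wj[0];         phase[0]   = 0.0f;
    power[N/2] = Wj[N-1]*Wj[N-1];     phase[N/2] = 0.0f;
    for (k = 1; k < N/2; k++) {
        float re = Wj[2*k-1];
        float im = Wj[2*k];
        power[k] = re*re + im*im;     /* what I plot as the "power spectrum" */
        phase[k] = atan2f(im, re);    /* what I plot as the phase, in (-pi, pi] */
    }
}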
> Actually, computing the "power spectrum" for each frame of W shows
> how large an amount of the original signal at time offset j the
> echo canceller thinks should be removed from the current input frame.

Careful when looking at W because of how the real and imaginary parts
are packed in the array.

> If you compute W*X for each j and ifft, you'll get the
> original signal with each frequency component scaled and time-shifted
> according to what W was (for that j).

Yes, that's the Y/y signal in the code (see the sketch further down in this
message).

> Anyway, I did some proper testing. I took my headset, bent the microphone
> arm so it's resting inside the .. uh.. whatever you call that large
> muffler thing that goes around your ear. This is an important testcase, as
> a lot of our users have complained about hearing echo that is propagated
> at the remote end either directly through the air from the "speaker" to
> the microphone (common with open headsets), and with closed headsets we
> see echo propagated mechanically down the arm of the microphone.

If you hold that in your hand, you're probably making it harder than for a
real scenario, because any movement causes the echo path to change.

> Playing regular pop music (Garbage: Push It), things work out well, and
> the canceller ends up with semi-stable weights, almost entirely in the
> (j==M-1) bin (0-20 ms delay, which is quite natural). It's the same with
> normal speech as long as it's spoken reasonably fast.

Fine.

> I see some "banding" of the output; it seems there's more output signal
> (and more to cancel) in the 1-3 kHz and 5-6 kHz area, but I blame that on
> the headphones; they're cheap.

Not sure what you mean, but it doesn't seem to be a problem.

> However, when switching to AC/DC: Big Gun, we see and hear a large
> residual echo from the opening electric guitar. This seems to be a result
> of a semi-stable sound that lasts more than 20 ms; the canceller finds a
> correlation in 4-5 time bins instead of just one. We could reproduce the
> same result by playing a human voice saying "aaaaaaaaaa" without variation
> in pitch; the weights for those frequency bins would increase for all the
> time slots in W.
>
> Now, people don't say "aaaaaaaaaaaaa" all that often, but they do play
> music that has a few "long" sounds, and saying "aaaaanyway" is enough to
> trigger this.

Can you send a pair of files so I can run testecho on them?

> Next test: what happens if the user has an external (physical) on-off
> switch? Same setup, playing Big Gun as loud as it gets. Apart from the
> problems with the opening guitar everything is good, and we see the
> weights set as they should be and things are cancelled out.
>
> So, I switch the mic off externally with the switch. Input becomes
> practically zero, so the weights readjust to zero as well. Turn the
> microphone back on and the echo canceller doesn't adapt. That is, no echo
> cancellation, and the weights all stay at their zero values.
>
> This can happen quite frequently, so it would be nice if the echo
> canceller could deal with this situation without a complete reset.

That can be predicted from the code. It's sort of hard to fix without
hurting accuracy for the general case. I'll have to think about it.
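Coming back to the Y/y point above, here is roughly what that
multiply-accumulate looks like: for each tap j you do a complex multiply of
the stored far-end spectrum X_j with W_j (same kind of packed layout) and
accumulate over the taps. This is only an illustration of the structure, not
code lifted from mdf.c, and the actual packing there may differ slightly:

/* Illustrative echo-estimate spectrum: Y = sum over j of W_j * X_j.
   W and X each hold M blocks of N packed values. */
void estimate_echo(const float *W, const float *X, float *Y, int N, int M)
{
    int j, k;
    for (k = 0; k < N; k++)
        Y[k] = 0.0f;
    for (j = 0; j < M; j++) {
        const float *Wj = W + j*N;
        const float *Xj = X + j*N;
        Y[0]   += Wj[0]*Xj[0];          /* DC bin, real only */
        Y[N-1] += Wj[N-1]*Xj[N-1];      /* Nyquist bin, real only */
        for (k = 1; k < N/2; k++) {
            float wr = Wj[2*k-1], wi = Wj[2*k];
            float xr = Xj[2*k-1], xi = Xj[2*k];
            Y[2*k-1] += wr*xr - wi*xi;  /* real part of the product */
            Y[2*k]   += wr*xi + wi*xr;  /* imaginary part of the product */
        }
    }
    /* an inverse FFT of Y then gives the time-domain echo estimate y */
}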
> Now, when trying to visualize the weights to see a bit of what was going
> on, I also computed the phase for each frequency bin. When looking just at
> the phase, I can see a very clear and distinct pattern of going from -pi
> to +pi in the areas where I know there is echo (specifically, the lower
> 7 kHz of j==M-1),

What you see is a "linear phase", which is the frequency equivalent of a
delay in the time domain. So basically, the phase you see is just the
representation of where the "main impulse" is in the time domain version of
W (i.e. the time offset between the two signals you sent to the AEC).

> and what looks like random noise for the rest. Do you have any idea where
> this pattern originates from, and more importantly, could it be used as
> additional conditioning of W? (i.e. if the phase doesn't match the
> pattern, reduce the amplitude as it's a false match.)

A random phase is expected. I don't see much useful info you can get from
that.

	Jean-Marc
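P.S. If you want to see the delay/linear-phase equivalence directly, here is
a throwaway program (toy N and delay, nothing to do with the actual filter;
compile with -lm): a pure delay of D samples gives bin k a phase of
-2*pi*k*D/N, i.e. a straight ramp that wraps around at +/- pi, which is
exactly the pattern you describe.

#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 64   /* toy transform size */
#define D 5    /* toy delay in samples */

int main(void)
{
    double x[N] = {0};
    int k, n;
    x[D] = 1.0;                       /* an impulse delayed by D samples */

    for (k = 0; k <= N/2; k++) {      /* naive DFT, fine for a toy size */
        double re = 0.0, im = 0.0;
        for (n = 0; n < N; n++) {
            re += x[n] * cos(2.0*M_PI*k*n/N);
            im -= x[n] * sin(2.0*M_PI*k*n/N);
        }
        /* the phase falls linearly with k and wraps at +/- pi */
        printf("bin %2d  phase %+.4f\n", k, atan2(im, re));
    }
    return 0;
}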
>> Actually, computing the "power spectrum" for each frame of W shows
>> how large an amount of the original signal at time offset j the
>> echo canceller thinks should be removed from the current input frame.
>
> Careful when looking at W because of how the real and imaginary parts
> are packed in the array.

Err. Ok, as I got it, bin 0 has its amplitude in W[0], bin i (for i = 1 to
N/2-1) has its real part in W[2*i-1] and its imaginary part in W[2*i], and
finally the Nyquist amplitude is in W[N-1]. I took this from how
power_spectrum() computes, so I might be off :)

>> Anyway, I did some proper testing. I took my headset, bent the microphone
>> arm so it's resting inside the .. uh.. whatever you call that large
>> muffler thing that goes around your ear. This is an important testcase,
>> as a lot of our users have complained about hearing echo that is
>> propagated at the remote end either directly through the air from the
>> "speaker" to the microphone (common with open headsets), and with closed
>> headsets we see echo propagated mechanically down the arm of the
>> microphone.
>
> If you hold that in your hand, you're probably making it harder than for
> a real scenario, because any movement causes the echo path to change.

Actually, with maximum volume (which I used to make sure the echo really
dominated over the noise), it's quite loud, so I left it in the corner.

>> Now, people don't say "aaaaaaaaaaaaa" all that often, but they do play
>> music that has a few "long" sounds, and saying "aaaaanyway" is enough to
>> trigger this.
>
> Can you send a pair of files so I can run testecho on them?

I'll need to add support for saving audio to my program, so I can give you
the "actual" sampled loudspeaker and mic files, and I'll also need to get
hold of a test person again. (I know a friend of a friend who has an
exceptionally clear voice; my own "aaaaaa" is far too muddy to cause this.)
I'll try to get this done this week, but it might be delayed until after
Christmas.

>> This can happen quite frequently, so it would be nice if the echo
>> canceller could deal with this situation without a complete reset.
>
> That can be predicted from the code. It's sort of hard to fix without
> hurting accuracy for the general case. I'll have to think about it.

An idea might be to let the noise cancellation "feed back" into the echo
canceller. If, after noise cancellation, there's nothing left at all, then
stop adapting the echo canceller.
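Roughly along these lines (completely made-up names and threshold, nothing
from the actual speex API, just the shape of the idea):

/* Hypothetical gate on adaptation: if the near-end frame is essentially
   silent (e.g. the mic was switched off), skip the weight update for that
   frame instead of letting W decay towards zero.  The decision could just
   as well come from the noise suppressor's output instead of raw energy. */

#define MUTE_THRESHOLD 1e-4f   /* made-up "dead microphone" level */

static int should_adapt(const float *mic_frame, int n)
{
    float energy = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        energy += mic_frame[i]*mic_frame[i];
    return (energy / n) > MUTE_THRESHOLD;
}

/* In the per-frame processing, something like:
 *
 *     if (should_adapt(mic_frame, frame_size))
 *         update_weights(...);    // the normal W update
 *     // else: leave W untouched for this frame
 */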
>> Now, when trying to visualize the weights to see a bit of what was going
>> on, I also computed the phase for each frequency bin. When looking just
>> at the phase, I can see a very clear and distinct pattern of going from
>> -pi to +pi in the areas where I know there is echo (specifically, the
>> lower 7 kHz of j==M-1),
>
> What you see is a "linear phase", which is the frequency equivalent of a
> delay in the time domain. So basically, the phase you see is just the
> representation of where the "main impulse" is in the time domain version
> of W (i.e. the time offset between the two signals you sent to the AEC).

Ah, yes. I'm reading up on my DFT now. Amazing how much stuff you can
forget.

>> and what looks like random noise for the rest. Do you have any idea where
>> this pattern originates from, and more importantly, could it be used as
>> additional conditioning of W? (i.e. if the phase doesn't match the
>> pattern, reduce the amplitude as it's a false match.)
>
> A random phase is expected. I don't see much useful info you can get from
> that.

Well, from what I can see in this testcase, it's only "random" where there
is no correlation. For example, in the 20-40 ms timeslot the amplitude can
spike a bit (such as on those "aaaaaa"), but the phase is still random,
whereas in the 0-20 ms slot it's very regular.

My thought was to use the "regularity" of the phase shift as an indication
of a good match. So, if arg(W[i+1])-arg(W[i]) == arg(W[i])-arg(W[i-1]), we
know it's a steady increase, so it's probably a good match. It's quite
hackish, and probably not grounded in any solid scientific basis, but it's
an idea for dealing better with the specific kind of echo I see here.

Then again, it will likely fail horribly if you have two echoes: one delayed
by 5 ms with equal amplitude, and another delayed by 15 ms with a much lower
amplitude. I have no idea what the "phase diagram" will look like then.
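To make that a bit more concrete, this is the sort of thing I'm imagining.
It's a pure sketch with made-up names, and the +/- pi wrap-around is
probably where it gets tricky:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Wrap an angle difference into (-pi, pi]. */
static float wrap_phase(float d)
{
    while (d >   M_PI) d -= 2.0f*M_PI;
    while (d <= -M_PI) d += 2.0f*M_PI;
    return d;
}

/* Hypothetical "how linear is the phase" measure for one tap of W.
   phase[] holds arg(W) per bin; a true delay gives a constant step from
   bin to bin, so the second difference should be near zero.  Returns a
   value in [0,1]; 1 means a perfectly regular (linear) phase. */
static float phase_regularity(const float *phase, int nbins)
{
    float err = 0.0f;
    int i;
    for (i = 1; i < nbins - 1; i++) {
        float step1 = wrap_phase(phase[i]   - phase[i-1]);
        float step2 = wrap_phase(phase[i+1] - phase[i]);
        err += fabsf(wrap_phase(step2 - step1));
    }
    err /= (nbins - 2);             /* mean deviation, 0..pi */
    return 1.0f - err/(float)M_PI;  /* map to a 0..1 "confidence" */
}

The weight (or its update) for that tap could then be scaled down when the
score is low, but as I said above, I have no idea how this behaves with two
overlapping echoes.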