Hi, I'm still working on visualizing the echo canceller, but I discovered something that might be interesting. During testing, I did this:

- Generate a test signal (10+x sine waves per frame), where x increases by one for each iteration and wraps around at 100.
- Set the speaker signal for the frame to the test signal.
- Add 0.5*test signal to the mic signal.

When watching the power graph (visualized from ps in the preprocessor), I see a large spike starting at 10 sines and moving up, then wrapping around. It is slowly diminished, but never goes away. It's also much more diminished while "moving" (slowly increasing frequency), and much less so at the wraparound point. This was with a tail of 5*framesize (M=5).

However, if I set the tail to M=1, the filter seems to adapt much more quickly, and also gives much better results; the moving sine is now almost gone. Odd.

Next test: I delayed the signal added to the mic by one frame and set M=2. It still adapts, but does so much more slowly. Good.

Next test: delay the signal 3 frames, keep M=2. Complete deterioration of state; the output is just noise, and the preprocessor starts spitting out NaN values for loudness and Zlast.

Repeat with M=5 (mic still delayed 3 frames). It adapts, but does not completely cancel as it did earlier, and has very little cancellation at the "edges" (when the sine wraps from 110 sines/frame back to 10 sines/frame).

Repeat with M=5 and the mic delayed 8 frames. No cancellation, as expected, since the delay now exceeds the tail length.

So... next step, I skimmed through the "Multidelay Block Frequency Domain Adaptive Filter" paper, which I understand mdf.c is based on. If I understand this correctly:

- It keeps the frequency domain of the last M frames (in the X array).
- The "output" (the signal to cancel?) is computed by taking the last M frequency-domain frames, multiplying each frequency band by a weight, summing them together, and taking the inverse FFT. The weights are stored in W.
- W is updated through some magic.
If I got that right, then for the 'mic delayed by 3 frames' case, I'd expect W[0] to W[3*N] to be 0 (or close to it), then W[3*N] to W[4*N] to be 0.5, and the rest 0.

First off, it seems W is stored 'backwards'. The first values are for the oldest frame, ok :)

However, when peeking at the values, it seems that the weights for frame 0 (newest) are very low. For frame 1, they are slightly positive. For frame 2, they are fairly low, except in the specific range of my test signal, where they range from somewhat positive (around 0.25) to somewhat negative (-0.25). For frame 3, they are positive all around, around the 0.5 area, but higher in the frequency bands of my test signal. For frame 4, they are very low, except in the range of the test signal, where they are slightly negative. For frame 5, they are low, but positive. For the rest of the frames, the weights alternate between "slightly positive" and "slightly negative" -- odd-index frames are positive, even-index frames are negative.

If I delay the signal by 4 frames instead, it wants to use indexes 2, 4 and 6 (with emphasis on 4), with the negatives in frames 3 and 5 (and less so in all other odd-index frames). Looking at the negative weights closest in time to the actual echo, I see they are more negative near the "edges" of my test signal, so it seems they're an artifact of trying to cope with the fact that my signal jumps in frequency every 2 seconds.

If I manually force W to be 0 all over, and 0.5 for the real parts of the 4th delayed frame, echo cancellation is perfect. If I initialize W to the "perfect" value, it stays more or less at that level, though it does adapt away from it ever so slightly in the frequency bands where there are no components at all in the "speaker" signal.

.. So my question is, why doesn't W adapt to the perfect values? Is there something that can be done to tune the adaptation?
> Generate a test signal (10+x sine waves per frame), where x increases by
> one for each iteration, and wraps around at 100.

Testing with sine waves is usually not a good idea. If you intend on cancelling speech, then test with speech.

> First off, it seems W is stored 'backwards'. The first values are for the
> oldest frame, ok :)

Right.

> However, when peeking at the values, it seems that the weights for
> frame 0 (newest) are very low.

Peeking at the values tells you nothing unless you do the inverse FFT and all, so you can see them in the time domain. Even then, it's not that useful.

> If I initialize W to the "perfect" value, it stays more or less at that
> level, though it does adapt away from it ever so slightly in the
> frequency bands where there are no components at all in the "speaker"
> signal.

Normal.

> .. So my question is, why doesn't W adapt to the perfect values? Is there
> something that can be done to tune the adaptation?

Why? Because it's not perfect. What can be done? More tuning and research into better adaptation algorithms. This is not a simple topic.

	Jean-Marc
>> Generate a test signal (10+x sine waves per frame), where x increases by
>> one for each iteration, and wraps around at 100.
>
> Testing with sine waves is usually not a good idea. If you intend on
> cancelling speech, then test with speech.

Ok, I tested more extensively with both music and two-way speech. More on this below.

>> However, when peeking at the values, it seems that the weights for
>> frame 0 (newest) are very low.
>
> Peeking at the values tells you nothing unless you do the inverse FFT and
> all, so you can see them in the time domain. Even then, it's not that
> useful.

Actually, computing the "power spectrum" for each frame of W shows how large an amount of the original signal at time offset j the echo canceller thinks should be removed from the current input frame. If you compute W*X for each j and IFFT, you'll get the original signal with each frequency component scaled and time-shifted according to what W was (for that j).

Anyway, I did some proper testing. I took my headset and bent the microphone arm so it's resting inside the .. uh.. whatever you call that large muffler thing that goes around your ear. This is an important test case, as a lot of our users have complained about hearing echo that is propagated at the remote end, either directly through the air from the "speaker" to the microphone (common with open headsets), or, with closed headsets, mechanically down the arm of the microphone.

Playing regular pop music (Garbage: Push It), things work out well, and the canceller ends up with semi-stable weights, almost entirely in the (j==M-1) bin (0-20 ms delay, which is quite natural). It's the same with normal speech as long as it's spoken reasonably fast. I see some "banding" of the output; it seems there's more output signal (and more to cancel) in the 1-3 kHz and 5-6 kHz areas, but I blame that on the headphones; they're cheap.
However, when switching to AC/DC: Big Gun, we see and hear a large residual echo from the opening electric guitar. This seems to be a result of a semi-stable sound that lasts more than 20 ms; the canceller finds a correlation in 4-5 time bins instead of just one. We could reproduce the same result by playing a human voice saying "aaaaaaaaaa" without variation in pitch; the weights for those frequency bins would increase for all the time slots in W. Now, people don't say "aaaaaaaaaaaaa" all that often, but they do play music that has a few "long" sounds, and saying "aaaaanyway" is enough to trigger this.

Next test: what happens if the user has an external (physical) on-off switch? Same setup, playing Big Gun as loud as it gets. Apart from the problems with the opening guitar, everything is good; we see the weights set as they should be, and things are cancelled out. So I switch the mic off externally with the switch. Input becomes practically zero, so the weights readjust to zero as well. Turn the microphone back on, and the echo canceller doesn't adapt. That is, no echo cancellation, and the weights all stay at their zero values. This can happen quite frequently, so it would be nice if the echo canceller could deal with this situation without a complete reset.

Now, when trying to visualize the weights to see a bit of what was going on, I also computed the phase for each frequency bin. When looking just at the phase, I can see a very clear and distinct pattern of going from -pi to +pi in the areas where I know there is echo (specifically, the lower 7 kHz of j==M-1), and what looks like random noise for the rest. Do you have any idea where this pattern originates from, and more importantly, could it be used as additional conditioning of W? (i.e.: if the phase doesn't match the pattern, reduce the amplitude, as it's a false match.)