Hi, I'm still working on visualizing the echo canceller, but I discovered something that might be interesting. During testing, I did this:

- Generate a test signal (10+x sine waves per frame), where x increases by one for each iteration and wraps around at 100.
- Set the speaker signal for the frame to the test signal.
- Add 0.5*test signal to the mic signal.

When watching the power graph (visualized from ps in the preprocessor), I see a large spike starting at 10 sines and moving up, then wrapping around. It is slowly diminished, but never goes away. It's also much more diminished while "moving" (slowly increasing frequency), and much less so at the wraparound point. This was with a tail of 5*framesize (M=5).

However, if I set the tail to M=1, the filter seems to adapt much more quickly, and also gives much better results; the moving sine is now almost gone. Odd.

Next test: I delayed the signal added to the mic by one frame and set M=2. It still adapts, but does so much more slowly. Good.

Next test: delay the signal 3 frames, keep M=2. Complete deterioration of state; the output is just noise, and the preprocessor starts spitting out NaN values for loudness and Zlast.

Repeat with M=5 (mic still delayed 3 frames). It adapts, but does not completely cancel as it did earlier, and has very little cancellation at the "edges" (when the sine wraps from 110 sines/frame back to 10 sines/frame).

Repeat with M=5 and the mic delayed 8 frames. No cancellation, as expected, since the delay now exceeds the tail length.

So... next step, I skimmed through the "Multidelay Block Frequency Domain Adaptive Filter" paper, which I understand mdf.c is based on. If I understand this correctly:

- It keeps the frequency domain of the last M frames (in the X array).
- The "output" (the signal to cancel?) is computed by taking the last M frequency-domain frames, multiplying each frequency band by a weight, summing them together, and taking the inverse FFT. The weights are stored in W.
- W is updated through some magic.
If I got that right, then for the 'mic delayed by 3 frames' case, I'd expect W[0] to W[3*N] to be 0 (or close to it), then W[3*N] to W[4*N] to be 0.5, and the rest 0.

First off, it seems W is stored 'backwards'. The first values are for the oldest frame, ok :)

However, when peeking at the values, it seems that the weights for frame 0 (newest) are very low. For frame 1, they are slightly positive. For frame 2, they are fairly low, except in the specific range of my test signal, where they range from somewhat positive (around 0.25) to somewhat negative (-0.25). For frame 3, they are positive all around, around the 0.5 area, but higher in the frequency bands of my test signal. For frame 4, they are very low, except in the range of the test signal, where they are slightly negative. For frame 5, they are low, but positive. For the rest of the frames, the weights alternate between "slightly positive" and "slightly negative" -- odd-index frames are positive, even-index frames are negative.

If I delay the signal by 4 frames instead, it wants to use indexes 2, 4 and 6 (with emphasis on 4), with the negatives in frames 3 and 5 (and less so in all other odd-index frames). Looking at the negative weights closest in time to the actual echo, I see they are more negative near the "edges" of my test signal, so it seems they're an artifact of trying to cope with the fact that my signal jumps in frequency every 2 seconds.

If I manually force W to be 0 all over, and 0.5 for the real parts of the 4th delayed frame, echo cancellation is perfect. If I initialize W to the "perfect" value, it stays more or less at that level, though it does adapt away from it ever so slightly in the frequency bands where there are no components at all in the "speaker" signal.

.. So my question is, why doesn't W adapt to the perfect values? Is there something that can be done to tune the adaptation?
> Generate a test signal (10+x sine waves per frame), where x increases by
> one for each iteration, and wraps around at 100.

Testing with sine waves is usually not a good idea. If you intend on cancelling speech, then test with speech.

> First off, it seems W is stored 'backwards'. The first values are for the
> oldest frame, ok :)

Right.

> However, when peeking at the values, it seems that the weights for
> frame 0 (newest) are very low.

Peeking at the values tells you nothing unless you do the inverse FFT and all, so you can see them in the time domain. Even then, it's not that useful.

> If I initialize W to the "perfect" value, it stays more or less at that
> level, though it does adapt away from it ever so slightly in the
> frequency bands where there are no components at all in the "speaker"
> signal.

Normal.

> .. So my question is, why doesn't W adapt to the perfect values? Is there
> something that can be done to tune the adaptation?

Why? Because it's not perfect. What can be done? More tuning and research into better adaptation algorithms. This is not a simple topic.

	Jean-Marc
>> Generate a test signal (10+x sine waves per frame), where x increases by
>> one for each iteration, and wraps around at 100.
>
> Testing with sine waves is usually not a good idea. If you intend on
> cancelling speech, then test with speech.

Ok, I tested more extensively with both music and two-way speech. More on this below.

>> However, when peeking at the values, it seems that the weights for
>> frame 0 (newest) are very low.
>
> Peeking at the values tells you nothing unless you do the inverse FFT and
> all, so you can see them in the time domain. Even then, it's not that
> useful.

Actually, computing the "power spectrum" for each frame of W shows how large an amount of the original signal at time offset j the echo canceller thinks should be removed from the current input frame. If you compute W*X for each j and IFFT, you'll get the original signal with each frequency component scaled and time-shifted according to what W was (for that j).

Anyway, I did some proper testing. I took my headset and bent the microphone arm so it's resting inside the .. uh.. whatever you call that large muffler thing that goes around your ear. This is an important test case, as a lot of our users have complained about hearing echo that is propagated at the remote end, either directly through the air from the "speaker" to the microphone (common with open headsets), or, with closed headsets, mechanically down the arm of the microphone.

Playing regular pop music (Garbage: Push It), things work out well, and the canceller ends up with semi-stable weights, almost entirely in the (j==M-1) bin (0-20 ms delay, which is quite natural). It's the same with normal speech as long as it's spoken reasonably fast. I see some "banding" of the output; it seems there's more output signal (and more to cancel) in the 1-3 kHz and 5-6 kHz areas, but I blame that on the headphones; they're cheap.
However, when switching to AC/DC: Big Gun, we see and hear a large residual echo from the opening electric guitar. This seems to be a result of a semi-stable sound that lasts more than 20 ms; the canceller finds a correlation in 4-5 time bins instead of just one. We could reproduce the same result by playing a human voice saying "aaaaaaaaaa" without variation in pitch; the weights for those frequency bins would increase for all the time slots in W. Now, people don't say "aaaaaaaaaaaaa" all that often, but they do play music that has a few "long" sounds, and saying "aaaaanyway" is enough to trigger this.

Next test: what happens if the user has an external (physical) on-off switch? Same setup, playing Big Gun as loud as it gets. Apart from the problems with the opening guitar, everything is good; we see the weights set as they should be, and things are cancelled out. So I switch the mic off externally with the switch. Input becomes practically zero, so the weights readjust to zero as well. Turn the microphone back on, and the echo canceller doesn't adapt. That is, no echo cancellation, and the weights all stay at their zero values. This can happen quite frequently, so it would be nice if the echo canceller could deal with this situation without a complete reset.

Now, when trying to visualize the weights to see a bit of what was going on, I also computed the phase for each frequency bin. When looking just at the phase, I can see a very clear and distinct pattern of going from -pi to +pi in the areas where I know there is echo (specifically, the lower 7 kHz of j==M-1), and what looks like random noise for the rest. Do you have any idea where this pattern originates from, and more importantly, could it be used as additional conditioning of W? (i.e.: if the phase doesn't match the pattern, reduce the amplitude, as it's a false match.)