Kevin Connor
2016-Jun-13 17:10 UTC
[opus] Opus application_mode==AUDIO, 20ms framing issue?
Hi Jean-Marc,

Sorry for the late reply, and thanks for your interest.

Quality is good for 10ms/audio and poorer for 20ms/audio. For mode=voip, quality is equivalent at 10ms and 20ms. PESQ was the tool that alerted me to something of interest, but I don't trust PESQ to almost any degree! It's good for flagging relative differences, of course, but not absolutes. Bitrate here was 28kbps, but I hear the same thing at 32kbps.

Please find attached a zip file with the audio files, converted to .wav for simpler listening.

https://www.dropbox.com/s/bzu4i3dmg5f91tv/20msAudioModeQuestion.zip?dl=0

If there is one single thing to listen to, it would be ar3_20_audio.wav: loop the section "china hit" starting at t=0.6s and listen for artifacts in the unvoiced speech. The reference is ar3.wav. By comparison, listen to ar2_10_audio.wav (same segment, sounds more like the reference ar3.wav).

Here is a cat of the README.txt. Thanks very much!

16bit, 16kHz input wav files (ar1, ar2, ar3), content from ~50Hz to near 8kHz.
All .pcm files are 16kHz, 16bit, signed ints, little (intel) endian.

./opus_demo -e voip 16000 1 28000 -framesize 20 ~/ar1.wav ar1_20_voip.bit
./opus_demo -d 16000 ar1_20_voip.bit ar1_20_voip.pcm

opus_demo reports version: libopus 1.1-alpha

Using recent pesq code compiled from src, +16000 option.
( same phenomenon seen with +16000 +wb option)

                 5ms     10ms       20ms    40ms
ar1_NN_voip     4.314   4.493      4.488   4.488
ar2_NN_voip     4.346   4.442      4.436   4.474
ar3_NN_voip     3.993   4.375      4.414   4.390

ar1_NN_audio    4.292   4.485  ->  4.313   4.313
ar2_NN_audio    4.364   4.460  ->  4.350   4.350
ar3_NN_audio    3.924   4.327  ->  4.218   4.218

Note that this size/type of pesq test is insufficient to draw ANY conclusions. However, it is useful for drawing attention to relative differences that might be interesting for HUMAN LISTENING.

So the question here was: is this pesq drop from 10ms to 20ms framesize, seen in the case of mode=AUDIO (but not VOIP), something REAL? It warranted listening.
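For anyone who wants to regenerate the whole grid rather than the single ar1/20ms/voip case, here is a rough Python sketch that only prints the command lines; it does not run anything. The argument order and file naming are copied from the two README commands. The helper names (`commands`, `all_commands`) are made up for illustration, and I am assuming the other runs used the same pattern with only the mode and `-framesize` value changed. The decode line is reproduced as written in the README; it may be worth checking the opus_demo usage text for whether your build expects a channel count after the sample rate.

```python
# Sketch only: builds opus_demo command strings for the test grid.
# Pattern copied from the README; nothing here is executed.
from itertools import product

FILES = ["ar1", "ar2", "ar3"]      # input stems from the README
MODES = ["voip", "audio"]          # the two application modes compared
FRAME_MS = [5, 10, 20, 40]         # frame sizes that appear in the tables
RATE_HZ = 16000
BITRATE = 28000


def commands(name, mode, frame_ms):
    """Return (encode, decode) command strings for one grid cell."""
    stem = f"{name}_{frame_ms}_{mode}"
    enc = (f"./opus_demo -e {mode} {RATE_HZ} 1 {BITRATE} "
           f"-framesize {frame_ms} ~/{name}.wav {stem}.bit")
    # Decode line kept exactly in the form the README shows.
    dec = f"./opus_demo -d {RATE_HZ} {stem}.bit {stem}.pcm"
    return enc, dec


def all_commands():
    """Flat list of encode/decode commands for every file, mode and frame size."""
    out = []
    for name, mode, frame_ms in product(FILES, MODES, FRAME_MS):
        out.extend(commands(name, mode, frame_ms))
    return out


if __name__ == "__main__":
    print("\n".join(all_commands()))
```

Piping the output into a shell would run the 24 encode/decode pairs, but reviewing the printed lines first is probably wiser.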
( same results, interleaved mode=VOIP,AUDIO numbers )

                 5ms     10ms    20ms     40ms
ar1_NN_voip     4.314   4.493   4.488*   4.488
ar1_NN_audio    4.292   4.485   4.313*   4.313

ar2_NN_voip     4.346   4.442   4.436*   4.474
ar2_NN_audio    4.364   4.460   4.350*   4.350

ar3_NN_voip     3.993   4.375   4.414*   4.390
ar3_NN_audio    3.924   4.327   4.218*   4.218

Same data, interleaved to highlight the fact that the drop is seen for the same sentences, from mode=VOIP to mode=AUDIO, at 20ms framesize. (40ms is the same processing as 20ms, I believe.)

So the question that is implied:
- is there a phenomenon for mode=AUDIO that results in lower scores for 20ms in particular, but not 10ms?

Listening to the processed files (sighted), I have the following subjective opinion:
- Given: sampling rate = 16000, bitrate = 28000. (also replicated at 32 kbps)
- the 10ms versions (voip,audio) and the 20ms (voip) version sound "focused" and have high fidelity to the ref.
- the 20ms mode=AUDIO versions sound "hollow", "smeared", "unfocused", especially during unvoiced segments.
- example "china hit" file ar3.pcm, t=0.6s. Very clear diff between 10ms and 20ms framesize in mode=audio.

This isn't about pesq scores. PESQ was just the "difference noticed" flag that got me to listen to some files.

I notice this same kind of de-focused sound in the same samples processed using a recent opus lib on linux.

I'm not surprised at a delta between mode=voip and mode=audio for a constant framesize. That's entirely expected. What I'm curious about is the delta between 10ms and 20ms for mode=audio.
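To put a number on the 10ms to 20ms step, here is a small Python sketch that computes the per-file change for each mode. The scores are typed in from the tables in this post; the names `SCORES` and `delta_10_to_20` are just illustrative. The same caveat applies as above: this is three sentences of PESQ, useful as a pointer for listening, not as evidence on its own.

```python
# PESQ scores copied from the tables in this post.
# Layout: {file: {mode: [5ms, 10ms, 20ms, 40ms]}}
SCORES = {
    "ar1": {"voip": [4.314, 4.493, 4.488, 4.488],
            "audio": [4.292, 4.485, 4.313, 4.313]},
    "ar2": {"voip": [4.346, 4.442, 4.436, 4.474],
            "audio": [4.364, 4.460, 4.350, 4.350]},
    "ar3": {"voip": [3.993, 4.375, 4.414, 4.390],
            "audio": [3.924, 4.327, 4.218, 4.218]},
}
IDX_10MS, IDX_20MS = 1, 2  # column positions within each score list


def delta_10_to_20(name, mode):
    """PESQ change going from 10ms to 20ms frames, rounded to 3 decimals."""
    row = SCORES[name][mode]
    return round(row[IDX_20MS] - row[IDX_10MS], 3)


if __name__ == "__main__":
    for name in SCORES:
        for mode in ("voip", "audio"):
            print(f"{name} {mode:5s} {delta_10_to_20(name, mode):+.3f}")
```

On these numbers, mode=audio drops by roughly 0.11 to 0.17 from 10ms to 20ms for all three files, while mode=voip stays within about 0.04 either way.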