Here's the problem:

> 2) Encoding with rate control as in single pass "Bitrate control" will not
> lead to better quality than fixed quant (with the right value of the fixed
> quant). Ratecontrol doesn't know anything about "quality". It will try to
> reach more-or-less CBR.
>
> But somehow this is not a fair comparison, because how do you determine
> the right quantizer value? You have to look at the material, so you have
> extra information.
>
> ---------------------------------------------------------------
>
> 3) Two-pass-encoding with varying quantizer can lead to better overall
> quality than fixed quantizer encoding.
>
> E.g.: Encode Barcelona with Quant 25, but Suzie with quant 8.
> Total size will be similar:
>
> Suzie-Q8:  275442 + Barcelona-Q25: 347980 = Total 623422
> Suzie-Q20: 115378 + Barcelona-Q20: 550760 = Total 666138
>
> But visual quality makes a real difference as you can see from the other
> attached pictures: Barcelona-Q25 isn't too much worse than Q20.
> Suzie-Q8 is _much_ better than Q20.
>
> These are just examples, of course...

Everything you say is basically true. However, what you are not accounting for is that it is the job of the codec to define what "Q=8" means. In the DIVX case, I would claim the codec is at fault for not accounting for the fact that some material will look terrible at Q=20, and for not redefining Q on that basis.

Your theory seems to be that this is the job of a hypothetical "2-pass encoder", but I don't see how multiple passes per se make any difference. It's an issue of where the logic resides. How does any encoder, whether one-pass, 2-pass, or whatever, determine that the 'Suzie' scenes need a different setting than the Barcelona clip to achieve subjectively similar quality?

I can tell you how this is usually dealt with in practice: most encoder apps provide modes where quality and bitrate can both be variable within some range.
In your example, we might say that Q can vary up to 25, but only if necessary to pull the bitrate down below some threshold. Below that threshold, Q can go down (i.e. quality increases in your example) until the threshold bitrate is approximated. 2-pass encoders simply have more information on how to do this effectively (i.e. knowing that a simple scene is coming up, they can increase quality on the cut so you don't see an ugly transitional period of a few frames). True CBR is basically this strategy rigorously enforced against a given transport speed and playback buffer model.

This sort of relates to the PSNR discussion in the following way: internally, when making various encoding choices (block type, quantizers), most video codecs simply use some variation of MSE (mean squared error, which is what PSNR is derived from), or more typically SAD (Sum of Absolute Differences), which is a very similar metric (but easier to calculate). In either case, as has been discussed, the results of this approach do not correlate very well with perceived quality, especially when taken over varying types of source material (as your examples show).

So, for my money, the codecs should be doing a better job of incorporating some intelligence to correlate their 'Q' values to actual perceived quality, rather than some arbitrary pixel difference value. That way, fixed-Q could actually mean something useful. I suspect that audio codecs, particularly Vorbis, do this intrinsically, because their internal psychoacoustic models tend to be rather complex. In the video world, for reasons that elude me, this is not the case. I know of no codec that incorporates any useful psycho-visual model into its encoder (though there are encoding apps that sit on top of codecs that claim to do this). IMSHO, this should be a major design goal of any improved Theora encoders we develop.
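
To make the bounded-Q strategy from this message concrete (Q may rise to a ceiling of 25 only when that is needed to get under a bitrate threshold, and otherwise drops until the threshold is approached), here is a minimal Python sketch. The function name, the default limits, and above all the toy size model `complexity / q` are assumptions invented for illustration; a real encoder would measure or predict frame sizes rather than use this formula.

```python
def choose_quantizer(complexity, bit_budget, q_min=2, q_max=25):
    """Pick the lowest quantizer (best quality) whose estimated size fits.

    Hypothetical size model: estimated_bits = complexity / q.
    If even q_max does not fit the budget, q_max is returned anyway,
    i.e. the quantizer ceiling wins over the bitrate threshold.
    """
    for q in range(q_min, q_max + 1):
        estimated_bits = complexity / q
        if estimated_bits <= bit_budget:
            return q
    return q_max


if __name__ == "__main__":
    budget = 10000
    # An easy scene, a moderate scene, and a hard scene (made-up numbers).
    for name, complexity in [("easy", 1000), ("moderate", 80000), ("hard", 500000)]:
        print(name, choose_quantizer(complexity, budget))
```

Under this toy model the easy scene settles at the lowest allowed Q, while the hard scene is pinned at the ceiling. A 2-pass encoder would have the per-scene complexity figures in advance, which is the extra information referred to above.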
--- >8 ---- List archives: http://www.xiph.org/archives/ Ogg project homepage: http://www.xiph.org/ogg/ To unsubscribe from this list, send a message to 'theora-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
On Tue, 2003-03-25 at 19:25, Dan Miller wrote:

> So, for my money, the codecs should be doing a better job of
> incorporating some intelligence to correlate their 'Q' values to
> actual perceived quality, rather than some arbitrary pixel difference
> value. That way, fixed-Q could actually mean something useful. I
> suspect that audio codecs, particularly Vorbis, do this intrinsically,
> because their internal psychoacoustic models tend to be rather
> complex. In the video world, for reasons that elude me, this is not
> the case. I know of no codec that incorporates any useful
> psycho-visual model into its encoder (though there are encoding apps
> that sit on top of codecs that claim to do this).

I'll ask the obvious follow-up. :)

Is there a reasonable "psycho-visual" model to work with?

---
Stan Seibert
From: "Dan Miller" <dan@on2.com>

> everything you say is basically true. However, what you are not
> accounting for is that it is the job of the codec to define what "Q=8"
> means.

I think the general assumption was that you meant quantizer by Q, not quality. Christoph most certainly means quantizer with Q.

> I suspect that audio codecs, particularly Vorbis, do this intrinsically,
> because their internal psychoacoustic models tend to be rather complex.
> In the video world, for reasons that elude me, this is not the case. I
> know of no codec that incorporates any useful psycho-visual model into its
> encoder (though there are encoding apps that sit on top of codecs that
> claim to do this).

I think audio is also easier because our hearing is mostly frequency sensitive, and our sight more structure sensitive. To oversimplify ... our hearing perceives the max error, which makes quantization for constant quality much easier, but our sight perceives an error which is a more complex function of the errors at the separate frequencies.

A coding mode which puts a hard limit on a MB's MSE shouldn't be too slow or hard to code, BTW. It would be an easy starting point for constant quality coding.

Marco
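
As a rough illustration of the "hard limit on a MB's MSE" idea, the Python sketch below picks the coarsest quantizer step whose reconstruction error stays within a cap. Everything named here is hypothetical: the helper names, the `q_max=31` default, and in particular the simplification of quantizing sample values directly with a uniform step. A real codec would quantize transform coefficients, so treat this only as the shape of the search, not as an encoder.

```python
def mse(original, decoded):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)


def quantize(values, q):
    """Uniform quantization followed by reconstruction (toy model)."""
    return [q * round(v / q) for v in values]


def coarsest_quantizer_within(values, mse_limit, q_max=31):
    """Return the largest step q in 1..q_max whose MSE is <= mse_limit.

    Every q is tried because the error is not guaranteed to grow
    monotonically with q. Step 1 reproduces integer input exactly,
    so it is always a valid fallback for integer samples.
    """
    best = 1
    for q in range(1, q_max + 1):
        if mse(values, quantize(values, q)) <= mse_limit:
            best = q
    return best


if __name__ == "__main__":
    block = [12, 14, 13, 90, 91, 15, 13, 12]  # made-up sample values
    q = coarsest_quantizer_within(block, mse_limit=4.0)
    print(q, mse(block, quantize(block, q)))
```

The search is exhaustive over a small range of steps, which fits the remark that such a mode should be neither slow nor hard to code.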
> From: Stan Seibert [mailto:volsung@mailsnare.net]
...
> Is there a reasonable "psycho-visual" model to work with?
>

(in booming narrator voice:) "Well Stan, that's an excellent question!!"

I'm just starting to review the present state of research (see my link in a previous post to the 'ITS' objective measurement stuff for instance -- I'm pretty impressed with their stuff so far). In my own research, I've looked at frequency-banded PSNR, as well as modifications to PSNR to account for the fact that low contrast scenes will have a much lower MSE for the same perceived error (presumably because the eye/brain is doing contrast adjustments on a region basis). This is a big issue -- more on that later.

(Quick point: PSNR is usually calculated with a presumed pixel value range of 0-255, i.e. 20 * log10(255 / sqrt(mse)). What if the image has a range of 50 to 200? Shouldn't the formula then be 20 * log10(150 / sqrt(mse))?)

All of this begs the question: what exactly does the eye/brain do with an image? One big problem that makes the video side harder than audio is that viewing conditions can vary so widely: everything from a movie theater (dark room with a large, hi-res screen) to looking at some multimedia on your iPAQ outside on a sunny day.

My general impression is that most people agree we perceive images through some sort of wavelet-like combination spatial/frequency decomposition. Obviously, we have circuits to do feature extraction at various levels (edge detectors, etc). So my guess would be that we need to break the image down into reasonably sized areas (the size of the regions is very dependent on viewing conditions; optimum is probably a specific angle of vision). We also have to consider how to segment an image into regions without problems arising at the region boundaries.
Then, within these regions, we need to do some sort of frequency domain analysis, and empirically learn what the JNDs (Just Noticeable Differences) are for various types of distortion (noise, low-pass, phase distortion, quantization...), all normalized to the overall energy of the region.

In other words, we need a comprehensive model of allowable threshold distortions (as a function of total energy) in a combined spatial/frequency domain. Then we can tune our codecs to produce errors that fall within those thresholds, allocating bits accordingly.

Yeah, something like that sounds nice.

-dan
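
The PSNR "quick point" raised earlier in this message can be checked numerically. The Python sketch below computes PSNR with a configurable peak value, so the conventional 0-255 assumption can be compared against the 50-to-200 case (a range of 150). The function name and the `peak` parameter are assumptions for this illustration; the formula itself is the one quoted in the message.

```python
import math


def psnr(mse, peak=255.0):
    """PSNR in dB: 20 * log10(peak / sqrt(mse)).

    `peak` is the assumed signal range; 255 is the usual 8-bit convention.
    A zero MSE (identical images) gives infinite PSNR.
    """
    if mse == 0:
        return math.inf
    return 20 * math.log10(peak / math.sqrt(mse))


if __name__ == "__main__":
    error = 25.0  # made-up MSE value
    full_range = psnr(error)            # presumed 0-255 range
    actual_range = psnr(error, 150.0)   # image only spans 50..200
    print(full_range, actual_range, full_range - actual_range)
```

For any fixed MSE the two versions differ by a constant 20 * log10(255 / 150), about 4.6 dB, so the conventional formula rates a low-contrast image that much higher than the range-adjusted one would for the same pixel error.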
> From: Marco Al [mailto:marco@simplex.nl]
...
> I think the general assumption was that you meant quantizer by Q, not
> quality. Christoph most certainly means quantizer with Q.

Fair enough. I guess then my point is that offering some sort of raw 'Quantizer' knob to an end user of a codec is a bad idea. The user usually wants to go for maximum quality M (Q could be confusing), limited to peak datarate P, with average datarate D. These are the sorts of knobs a good codec should be presenting to the world.