David Flynn has proposed a new Ogg Dirac mapping. The draft is here:

    http://davidf.woaf.net/dirac-mapping-ogg.pdf

This is a much bigger break from other codecs than my draft (at
http://wiki.xiph.org/index.php/OggDirac). We talked a bit about it on
IRC today. Below is my summary; hopefully David can correct anything I
got wrong or misleading. Comments?

There are two main differences from the earlier proposal:

* The granulepos is split into three fields instead of two, with the
  extra field encoding the reordering offset.

* The mapping requires a page flush after every frame data packet.

The first allows the actual presentation time of the corresponding
packet to be determined, while in my scheme a group of reordered frames
all get the same granulepos. The second assigns a granulepos to every
*packet* instead of every *page* as is usual, so the granulepos can be
used in practice to calculate a presentation timestamp for every frame.
An offset to a restart point for restarting after seek is included as
in my draft.

Pros:

The muxer doesn't have to crack data packets or maintain state to
figure out the presentation timestamps.

Demux code is simpler. Both presentation and decode timestamps are
readily available from a simple look at the granulepos on each packet
out of libogg.

The encoding is clever, so the frame number calculation by adding the
two halves according to the skeleton granule shift still works.

Cons:

Restart after seek still requires new code; that part of skeleton
doesn't work.

Muxing overhead for one-page-per-packet is excessive for small packets:
3% (vs 0.7%) for 230 kbps video, and up to 10% for 50 kbps. This isn't
going to work for cell phone video.

Many Ogg tools assume they can repaginate, and probably won't get the
one-page-per-packet stuff right. This leads to the usual argument that
the demuxer has to be able to reconstruct the timestamps anyway, if
it's going to be liberal in what it accepts, so the demuxer isn't
actually simpler.

The granulepos will no longer be numerically non-decreasing, so
implementations that make this assumption will break.

Summary:

Adding a new codec has always required code changes to the muxer in
Ogg. The question is whether this is a better precedent for future
codecs with future-predicted data. My draft tried to be minimally
different from previous practice: same granulepos logic as theora,
custom timestamp generation like all the other codecs. David's draft
requires new seek as well as timestamp generation code, but moves some
of the complexity for the latter from simplistic to sophisticated
implementations.

The idea of one-packet-per-page isn't unprecedented. CSIRO did that for
their mobile video version of theora (and reported the overhead was a
real problem). We also talked about long pages without spanning packets
at FOMS in January. There's a buffering issue with packets that are
both packed and spanning, and the overhead can actually be lower for
large (>8K) packets. I'd almost rather see us take this route, with a
new Ogg page type, if the Dirac developers want a timestamp per frame,
but that certainly doesn't minimize disruption.

 -r
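For readers unfamiliar with the "adding the two halves" calculation mentioned above, here is a minimal sketch of theora-style granulepos arithmetic. It is an illustration only: the function names, the shift of 6 and the 25 fps rate are assumptions chosen for the example, and real bitstreams may apply an off-by-one adjustment that is ignored here.

```python
from typing import Tuple


def split_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Split a granulepos into (high, low) fields at granule_shift bits."""
    high = granulepos >> granule_shift
    low = granulepos & ((1 << granule_shift) - 1)
    return high, low


def frame_index(granulepos: int, granule_shift: int) -> int:
    """Frame count: keyframe number (high) plus frames since keyframe (low)."""
    high, low = split_granulepos(granulepos, granule_shift)
    return high + low


def granule_time(granulepos: int, granule_shift: int,
                 fps_num: int, fps_den: int) -> float:
    """Map a granulepos to seconds, given the frame rate as a fraction."""
    return frame_index(granulepos, granule_shift) * fps_den / fps_num


# Hypothetical values: keyframe at frame 64, three frames later, shift of 6.
example_gp = (64 << 6) | 3
print(split_granulepos(example_gp, 6), frame_index(example_gp, 6))
```

Note that this mapping only goes one way: several different granulepos values can map to the same frame index, so a target time cannot in general be turned back into a single granulepos.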
> Many Ogg tools assume they can repaginate, and probably won't get the
> one-page-per-packet stuff right. This leads to the usual argument

This could be something to add to Skeleton. Kate (and probably CMML)
needs the one-packet-per-page thing also, and any discontinuous codec
probably needs it as well (well, not *need*, but no good buffering
without it). It's trivial for a muxer to do, and it's transparent to a
demuxer.

> The granulepos will no longer be numerically non-decreasing, so
> implementations that make this assumption will break.

Wouldn't pretty much anything that deals with Ogg be broken by this?
How would, say, oggz-validate deal with this, apart from treating Dirac
differently? Seeking would not work without knowing a particular stream
is Dirac too (would have to bsearch on time, not granpos directly). I'd
have thought this to be a hard requirement rather than an assumption.
If using Skeleton, seeking can now be done on time values even if you
don't know which codec it is. I think this breaks that too (though I
have not thought too hard about this).

> lower for large (>8K) packets. I'd almost rather see us take this
> route, with a new Ogg page type, if the Dirac developers want a

Intriguing, can you expand on what you mean by "new Ogg page type"?
On Tue, Aug 12, 2008 at 5:46 AM, ogg.k.ogg.k wrote:

> This could be something to add to Skeleton. Kate (and probably CMML)
> needs the one-packet-per-page thing also, and any discontinuous codec
> probably needs it as well (well, not *need*, but no good buffering without
> it). It's trivial for a muxer to do, and it's transparent to a demuxer.

That's a good idea. Any suggestions for where?

> Wouldn't pretty much anything that deal with Ogg be broken by this ?

It depends how they're written. If they calculate a numerical
granulepos for the desired point on the timeline and seek by comparing
stream granulepos to that numeric value, they will fail. If they
convert the stream values to time and compare that way, it will work
fine. We've tried to encourage that direction with calls like
th_granule_time(), and in general it's difficult to calculate a numeric
value for theora because of the skips at the keyframes.

David, can you think of a fancier encoding that would make your
granulepos values non-decreasing?

OTOH, the RFC can be read to require the numeric values be increasing.

    granule position: An increasing position number for a specific
    logical bitstream stored in the page header. Its meaning is
    dependent on the codec for that logical bitstream and specified in
    a specific media mapping.

The question is whether 'position number' is literally the same as the
value of the granule position field. :) Also, I say 'non-decreasing'
since a codec doesn't necessarily advance decoding with every packet.
The seeking algorithm only requires non-decreasing time-equivalents, so
'increasing' in the RFC is an artificial constraint.

>> I'd almost rather see us take this
>> route, with a new Ogg page type, if the Dirac developers want a
>
> Intriguing, can you expand on what you mean by "new Ogg page type" ?

The Ogg page header has a version field and 5 unused flag bits, so we
can add new page types if we want. ogg_stream_pagein() will reject
pages with newer version numbers, and extra flags are ignored.

There were a couple of things we talked about wanting to do if we
revised the Ogg page structure:

* The CRC rejection of corrupt data isn't always what you want (i.e.
  good for audio where digital noise is unacceptable, bad for video
  where a goopy picture is better than no picture at all) and is
  expensive to calculate when muxing high-bitrate streams. So it would
  be nice if we could flag the CRC as covering just the page header
  fields and not the packet data.

* Having packets both span and pack in the same page increases the
  expense of seeking, so it would be nice to do something about this.
  A preserve-flag in skeleton like you suggested for
  one-page-per-packet might help here, but we could also solve it with
  a new page type that just doesn't allow it.

* Various codecs have wanted one-page-per-packet regardless to reduce
  buffering requirements for low-frequency packet streams.

* The lacing method of encoding packet length and page spanning is
  less efficient for large packets, so for HD video a 'large page' type
  would be nice.

There hasn't been a concrete suggestion for this, but the general idea
is that we introduce a new Ogg page version 1, with a flag for whether
the CRC includes the payload data or not, and either another flag or
another page version that selects a different packet length encoding
which doesn't support packing. Whether we want to add multiple explicit
timestamp fields like mux authors have requested, I don't know.

 -r
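To make the "version field and 5 unused flag bits" concrete, the following sketch parses the fixed 27-byte part of a current (version 0) Ogg page header as laid out in RFC 3533. The names OGG_HEADER and parse_page_header are made up for this example; the sketch only mirrors the behaviour described above (reject newer versions, ignore extra flags) and is not libogg's code.

```python
import struct

# Fixed part of an Ogg page header (RFC 3533), little-endian, 27 bytes:
# capture pattern, version, header type flags, granule position,
# serial number, page sequence number, CRC, number of segments.
OGG_HEADER = struct.Struct("<4sBBqIIIB")

FLAG_CONTINUED = 0x01  # first packet on the page continues an earlier one
FLAG_BOS = 0x02        # beginning of logical stream
FLAG_EOS = 0x04        # end of logical stream


def parse_page_header(data: bytes) -> dict:
    """Parse the fixed header fields; reject unknown stream versions."""
    magic, version, flags, granulepos, serial, seqno, crc, nsegs = \
        OGG_HEADER.unpack_from(data)
    if magic != b"OggS":
        raise ValueError("not an Ogg page")
    if version != 0:
        raise ValueError("unsupported stream structure version %d" % version)
    return {
        "continued": bool(flags & FLAG_CONTINUED),
        "bos": bool(flags & FLAG_BOS),
        "eos": bool(flags & FLAG_EOS),
        "unused_flags": flags & 0xF8,  # the 5 bits not defined in version 0
        "granulepos": granulepos,
        "serial": serial,
        "sequence": seqno,
        "crc": crc,
        "segments": nsegs,
    }


# A hand-built BOS page header with one segment, for illustration only.
sample = OGG_HEADER.pack(b"OggS", 0, FLAG_BOS, 0, 1234, 0, 0, 1)
print(parse_page_header(sample)["bos"])
```

A revised page format would show up to this parser as a nonzero version byte, which is why an unmodified demuxer would refuse such pages outright.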
ogg.k.ogg.k at googlemail.com
2008-Aug-13 10:05 UTC
[ogg-dev] Fwd: New Ogg Dirac mapping draft
>> This could be something to add to Skeleton. Kate (and probably CMML)
>> needs the one-packet-per-page thing also, and any discontinuous codec
>> probably needs it as well (well, not *need*, but no good buffering without
>> it). It's trivial for a muxer to do, and it's transparent to a demuxer.
>
> That's a good idea. Any suggestions for where?

Well, there are reserved bits in fisbone just after the granule shift.
I doubt adding this to the message headers would be a good way to do
it, but that'd be another way.

>> Wouldn't pretty much anything that deal with Ogg be broken by this ?
>
> It depends how they're written. If they calculate a numerical
> granulepos for the desired point on the timeline and seek by comparing
> stream granulepos to that numeric value, they will fail. If they

And that's the canonical way AFAIK. Comparing times computed from the
granpos you get from pages you get from a bsearch requires good
knowledge of the codec, whereas comparing granpos can seek within any
codec.

> OTOH, the RFC can be read to require the numeric values be increasing.

I recall pointing out a discrepancy between Ogg docs and the RFC, and I
think someone (either Silvia or Conrad, probably Silvia) fixed the RFC.
(That was the reason I'd originally included the low counter bits for
Kate).

>> Intriguing, can you expand on what you mean by "new Ogg page type" ?
>
> The Ogg page header has a version field and 5 unused flag bits, so we
> can add new page types if we want. ogg_stream_pagein() will reject
> pages with newer version numbers, and extra flags are ignored.

Which is essentially creating a new format, as an old demuxer will be
able to do nothing at all with such a stream, or do you have a cunning
plan to make those backward compatible?

> * The lacing method of encoding packet length and page spanning is
> less efficient for large packets, so for HD video a 'large page' type
> would be nice.

I'll take this opportunity to mention my pet idea (don't think I ever
mentioned it on ogg-dev), to start with a byte, then, if larger than
255, add two bytes, then for packets larger than 255+65536, add 4
bytes. This lacing is worse than the current one only for packets
between 256 and 511 bytes (admittedly a probably common case, but
adding only one byte). Now that you mention flags, lacing type could be
put in flags, so this worsening could be avoided too.

> encoding which doesn't support packing. Whether we want to add
> multiple explicit timestamp fields like mux authors have requested, I
> don't know.

Certainly a criticism of Ogg I heard more than once :)
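The size comparison behind the "worse only for packets between 256 and 511 bytes" remark can be checked with a few lines. This is a rough sketch under stated assumptions: the current lacing costs one byte per 255 bytes of payload plus a terminating value below 255, and the escalating scheme is read as escaping on 255 at the first level and on 65535 at the second. Both function names are hypothetical.

```python
def traditional_lacing_bytes(packet_len: int) -> int:
    """Segment-table bytes for one whole packet under the current lacing."""
    # One lacing value of 255 per full 255 bytes, then a final value < 255.
    return packet_len // 255 + 1


def escalating_length_bytes(packet_len: int) -> int:
    """Length-field bytes under the 1 / 1+2 / 1+2+4 byte scheme (assumed)."""
    if packet_len < 255:
        return 1
    if packet_len < 255 + 65535:
        return 1 + 2
    return 1 + 2 + 4


# Packet sizes where the escalating scheme costs more than the current one.
worse = [n for n in range(0, 200000)
         if escalating_length_bytes(n) > traditional_lacing_bytes(n)]
print(worse[0], worse[-1])

# A large packet, as might occur in HD video (size chosen for illustration).
print(traditional_lacing_bytes(100000), escalating_length_bytes(100000))
```

Under these assumptions the penalty window is 255 to 509 bytes and costs one extra byte, close to the range quoted above; the exact boundaries depend on how the escape values are defined.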
On 2008-08-12, Ralph Giles <giles at xiph.org> wrote:

> David Flynn has proposed a new Ogg Dirac mapping.

I thought it'd be a good idea to explain some of the rationale for why
we want to change the definition of granulepos in the ogg-dirac
mapping.

Terms used in this document:
 - GP64 = The 64bit granule_position as found in the page header.
 - GPH+L = Granule pos high + low as split by granule_shift.
 - ST = System Time; this is the monotonically increasing decoder clock
 - PT = Presentation Time; Picture is displayed when PT = ST,
        which implies AV sync.

NOTE, we will not use the terms I,P,B -- they are mpeg2 terms which do
not map to constructs in dirac or h264 properly.

Properties of an Out-of-order video codec (dirac,h264,vc-1,mpeg2)
 - Each picture has a unique PT.
 - Pictures in the stream are not in PT order.
 - The decoder reorders pictures at output into PT order.
 - ST != PT in stream order (ie, input to decoder).

De facto rules of ogg (I've not found these actually written down
anywhere):
 [A1] One of GP64 or GPH+L must increase for each packet.
      For in-order codecs using keyframe-granuleshift, both are true.
 [A2] GPH+L == time. All codecs so far are in-order, so ST=PT=time.
 [A3] Page flushes are NOT invariant across remuxes.

The ogg RFC does state that GP64 is codec specific without any
restriction.

What is needed to decode & display Out-of-order coded video?
 Each picture must have a unique & accurate (correct) PT.
 ST needs to be derived from the stream correctly:
  - Can interpolate ST for a particular picture
  - Can not determine the starting value of ST from the first picture.
    - This happens in streaming, example:
        PT: 14 10 11 12 13
        ST: 10 11 12 13 14

What is problematic with the xiph mapping?
 - Here is an example using the xiph mapping:

     Sync point:  V           V                          V
     PT(actual):  0  3  1  2  6  4  5  9  7  8  c  a  b  d
     GP_high:     0  0  0  0  6  6  6  6  6  6  6  6  6  d
     GP_low:      1  1  2  3  1  1  1  1  2  3  3  4  5  1
     GPH+L-1:     0  0  1  2  6  6  6  6  7  8  8  9  a  d

 - Each picture does not have a unique value for granulepos
   => Cannot determine unique&correct PT
   => Cannot determine correct ST
   => If (due to paging) no GP64 is available for a frame, it is
      impossible to correctly interpolate the value of PT.
   => Don't know when to display pictures

 - Seeking is difficult:
   - Want to seek to frame N
   - GPH+L is non-unique (don't know if the right one has been found)
     => Some values of GPH+L do not exist (searches may fail)
   - and GPH+L != N (ie, may find the wrong frame)

 - Locating the sync point (eg, after seek) is irritating
   - GP_low != number of packets(pictures) since sync point.
     => Have to search backwards until GP_high changes

 - Copes badly with open gop:
   To correctly decode picture(PT=4) in above example, the sync point
   it depends upon is picture(PT=0). However, this would violate the
   property of GP64(n) > GP64(n-1).

 - It requires that a page is flushed before transmitting a sync point
   so that a syncpoint is guaranteed to have a valid GP64.
   This violates axiom A3

Some comments on choice of GP64 in bbc mapping:

 Consider axiom A2 (GPH+L == time), assume this is PT.
  - PT (not ST) makes sense for AV sync
  - PT (not ST) makes sense for locating pictures (seek)
    although a naive sync will find the wrong picture.
  - The stream is in the order required to satisfy decoding
    dependencies, ie PT jumps around.
    This violates axiom A1 (GPH+L(n) > GPH+L(n-1)).
    This violates axiom A1 (GP64(n) > GP64(n-1)).

 Consider axiom A2 (GPH+L == time), assume this is ST.
  - Complies with axiom A1 (GPH+L(n) > GPH+L(n-1)).
  - Complies with axiom A1 (GP64(n) > GP64(n-1)).
  - Is not useful for AV sync.
  - Is not useful for seeking (you will end up with the wrong picture).
  => No good reason for GPH+L == ST

 .'. choose GPH+L = PT.
Some interactions with skeleton:

> ... allowing to map a granule position [GPH+L] to time by calculating
> "granulepos [GPH+L] / granulerate"
 -- http://wiki.xiph.org/OggSkeleton

'.' the only useful time to decoding is the PT => GPH+L = PT.
Ie, you can seek based upon presentation time, however a binary search
can hit a reordered picture and therefore choose the wrong picture at
the end. The error is +/- one GOP.

---

> Restart after seek still requires new code; that part of skeleton
> doesn't work.
 -- http://article.gmane.org/gmane.comp.multimedia.ogg.devel/1118

Actually, ogg skeleton does not provide such information to any GOP
based video codec.
 - It only has Preroll, which in a GOP based video codec is constantly
   varying.
 - Preroll only makes sense for video when using something such as
   Ponly-with-intra-slice-refresh, where there are no keyframes.

Some final remarks:
 - one-packet-per-page: It has been said that one-packet-per-page (ie,
   a page flush per packet) upsets remuxing due to axiom A3. However,
   it is a requirement of the xiph mapping that a page flush occurs
   before a sync point.
   - To resolve the above contradiction, I assume that axiom A3 is
     invalid
 - The bbc mapping allows reconstruction of PT, ST and distance to
   syncpoint without any a priori information.
 - The bbc mapping does not require peeking into the packet payload to
   fill in the blanks
 - If GPH+L is to be useful, it is not possible to comply with axiom A1
   (GPH+L(n) > GPH+L(n-1)).

Stop press:
 - I've realised that it is possible to rearrange GP64 in such a way
   that:
   + Complies with axiom A1 (GP64(n) > GP64(n-1)).
   + violates axiom A1 (GPH+L(n) > GPH+L(n-1)).
   However, I doubt that is any use, since I hope any sane demuxer
   searches based upon GPH+L.

Regards,

..david
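The non-uniqueness argument in the xiph-mapping example can be checked mechanically. The sketch below takes the PT, GP_high and GP_low rows from the example, reading the single-character values as hexadecimal digits (an assumption, since the example uses a to d), recomputes GPH+L-1, and lists which values repeat and which presentation times never appear.

```python
from collections import Counter

# Rows copied from the xiph-mapping example, in stream order (hex digits).
pt      = [0x0, 0x3, 0x1, 0x2, 0x6, 0x4, 0x5, 0x9, 0x7, 0x8, 0xc, 0xa, 0xb, 0xd]
gp_high = [0x0, 0x0, 0x0, 0x0, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0xd]
gp_low  = [1, 1, 2, 3, 1, 1, 1, 1, 2, 3, 3, 4, 5, 1]

# GPH+L-1 for every picture, as in the last row of the example.
derived = [high + low - 1 for high, low in zip(gp_high, gp_low)]

# Values shared by more than one picture: a seek cannot tell them apart.
duplicates = sorted(v for v, count in Counter(derived).items() if count > 1)

# Presentation times that no picture's GPH+L-1 maps to: a search for
# these frame numbers cannot succeed.
missing = sorted(set(pt) - set(derived))

print("GPH+L-1:   ", [format(v, "x") for v in derived])
print("duplicates:", [format(v, "x") for v in duplicates])
print("missing:   ", [format(v, "x") for v in missing])
```

The recomputed row matches the one in the example, and the duplicate and missing sets are what the "searches may fail" and "may find the wrong frame" points refer to.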
On Wed, Aug 13, 2008 at 3:05 AM, ogg.k.ogg.k at googlemail.com wrote:

> And that's the canonical way AFAIK. Comparing times computed from
> the granpos you get from pages you get from a bsearch requires good
> knowledge of the codec, whereas comparing granpos can seek within
> any codec.

No, it's in general impossible to calculate the granulepos that
corresponds to a particular time in a theora stream; only the reverse
is possible. That's why David was talking about comparing High Word +
Low Word, which is the frame count in theora, and can be treated as the
seek time in different units.

>> OTOH, the RFC can be read to require the numeric values be increasing.
>
> I recall pointing out a discrepancy between Ogg docs and the RFC, and
> I think someone (either Silvia or Conrad, probably Silvia) fixed the RFC.
> (That was the reason I'd originally included the low counter bits for Kate).

Right, thanks for pointing that out. The correction is recorded at
http://wiki.xiph.org/RFC_3533_Errata

David's proposal still violates that without the "stop the press"
fancier encoding.

> I'll take this opportunity to mention my pet idea (don't think I ever mentioned
> it on ogg-dev), to start with a byte, then, if larger than 255, add two bytes,
> then for packets larger than 255+65536, add 4 bytes. This lacing is worse
> than the current one only for packets between 256 and 511 bytes (admittedly
> a probably common case, but adding only one byte). Now that you mention
> flags, lacing type could be put in flags, so this worsening could be avoided
> too.

So this would be:

    len = read_uint8()
    if (len == 255):
        len += read_uint16()
    if (len == 255+65535):
        len += read_uint32()

And a len of 2^32 - 1 would indicate a continued packet?

We can't change the lacing scheme without changing the stream structure
version.
So I'd propose something like:

    stream_structure_version = 1

    header_type_flags:
      bit 0  : fresh/continued packet
      bit 1  : bos
      bit 2  : eos
    new flags:
      bit 3  : CRC is only the header data
      bit 4-5: 0 is traditional lacing,
               1 is 16 bit packet length,
               2 is 32 bit packet length,
               3 is 64 bit packet length.

The packet length field would start on byte 26 where the segment table
length is in the traditional lacing. Using this kind of length encoding
implies one packet per page with no continuation.

If the lacing type and crc flags are zero, stream structure can be
zero. Muxers SHOULD NOT mix stream structure values within a logical
bitstream to avoid confusing legacy implementations.

 -r
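The read_uint8/read_uint16/read_uint32 pseudocode above can be turned into a small round-trip sketch. Several things here are assumptions made for illustration rather than part of any proposal: the little-endian byte order (chosen to match the rest of the Ogg page header), the function names, and the choice to simply refuse the all-ones 32-bit value, since whether that value should mean "continued packet" was left as an open question.

```python
import io
import struct


def write_length(out, packet_len: int) -> None:
    """Encode a packet length as 1, 1+2 or 1+2+4 bytes."""
    if packet_len < 255:
        out.write(struct.pack("<B", packet_len))
        return
    out.write(b"\xff")                  # escape: a 16-bit field follows
    rest = packet_len - 255
    if rest < 65535:
        out.write(struct.pack("<H", rest))
        return
    out.write(b"\xff\xff")              # escape: a 32-bit field follows
    rest -= 65535
    if rest >= 0xFFFFFFFF:
        # All-ones is left unused here; its meaning is an open question.
        raise ValueError("packet too large for this sketch")
    out.write(struct.pack("<I", rest))


def read_length(inp) -> int:
    """Decode a length written by write_length."""
    length = inp.read(1)[0]
    if length == 255:
        length += struct.unpack("<H", inp.read(2))[0]
    if length == 255 + 65535:
        length += struct.unpack("<I", inp.read(4))[0]
    return length


def encode(packet_len: int) -> bytes:
    buf = io.BytesIO()
    write_length(buf, packet_len)
    return buf.getvalue()


for n in (0, 254, 255, 65789, 65790, 1000000):
    data = encode(n)
    print(n, len(data), read_length(io.BytesIO(data)))
```

The decoder keeps the flat two-test shape of the pseudocode; the second test can only succeed when the first escape was taken, because a single byte below 255 can never equal 255+65535.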
We've been discussing this on irc. Short summary, followed by some
responses.

I think we've verified now that my old proposal works fine for MPEG-2
style reordered streams. I believe it can be made to work with 'open
gop' streams by making the granulepos assignment more sophisticated
than I described. However, Dirac allows essentially random reference
structures, so it's possible to construct streams with overlapping
keyframe dependencies my proposal can't handle without breaking the
numerically non-decreasing granulepos rule.

That's an argument for David's granulepos mapping, especially since the
open gop stuff in my mapping is hacky. My thinking now is that the
non-decreasing numeric encoding (the stop-the-presses version) is
better. GPH+GPL=frame works for theora, but doesn't do any better with
naive seeking than 'find this numerical granulepos' and doesn't
simplify frame-accurate seeking if you relax the one-page-per-packet
rule, which I think we must.

On Wed, Aug 13, 2008 at 1:08 PM, David Flynn <davidf+nntp at woaf.net> wrote:

> Defacto rules of ogg (I've not found these actually written down anywhere):

No, we've not really worked out these parts of the spec. Thanks for
helping!

> - Seeking is difficult:
>   - Want to seek to frame N
>   - GPH+L is non-unique (don't know if the right one has been found)
>     => Some values of GPH+L do not exist (searches may fail)
>   - and GPH+L != N (ie, may find the wrong frame)

You're really wanting the granulepos field to be a frame timestamp. Ogg
just isn't designed to provide this information. The granulepos isn't
present in the stream for every packet. They're just supposed to
provide "seeking signposts" during the bisection search and, mostly as
a side effect, let an encoder give some hints to the muxer about
interleave order to reduce buffering.
Your proposal stuffs sequence headers and other aux data units in with
the following frame in a single Ogg packet, and then insists on special
one-packet-per-page encapsulation, so you can get this frame timestamp
behaviour. I think that's why it feels like such a hack to me. New
constraints, breaking abstraction layers, to do something that the
format doesn't intend.

I agree seeking is hard. To recap, Monty's original vision was that
granulepos would be monotonically increasing, and you could map your
seek time onto a granulepos and bisection search for that number. That
worked great for vorbis-only streams, but as soon as you have
multiplexed data, you have multiple granulepos schemes (or just
timebases) so it's easier to map any granulepos you find to time and
then compare in that space. With theora, we took advantage of this to
squeeze in a reference to the closest restart point (keyframe) without
revising the container code. So you can't calculate f:time->granulepos
at all in general now, only its inverse.

And it turns out, because of packed and continued packets, that you
can't even find a single restart point, you have to find "the last page
with a timestamp that maps to a time prior to the seek point, for each
substream you care about, and start decoding each substream there." And
then you have to search again for keyframe streams, back up by the
preroll in lapped streams, etc.

This is all about being able to do frame accurate seeking. Maybe
applications don't actually care about that, just getting in the
neighborhood is good enough. There are things a muxer can do (and
mapping specs recommend) as "best practices" to improve the performance
of such naive implementations. Like strategic page flushes. I think
we're all for that, but assuming those practices will always happen in
an application (like an editor) that needs frame-accurate access
without fail violates 'liberal in what you accept'.

 -r
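The "map any granulepos you find to time and then compare in that space" approach can be sketched as a bisection over page granulepos values. This is a simplified illustration for a single substream: it works on an in-memory list rather than byte offsets in a file, uses theora-style high+low arithmetic, and the shift of 6, the 25 fps rate and the function names are all assumptions made for the example.

```python
from typing import List, Optional


def granule_to_time(granulepos: int, shift: int, fps: float) -> float:
    """Theora-style mapping: (keyframe + frames since keyframe) / fps."""
    frame = (granulepos >> shift) + (granulepos & ((1 << shift) - 1))
    return frame / fps


def find_seek_page(page_granulepos: List[int], target_time: float,
                   shift: int, fps: float) -> Optional[int]:
    """Index of the last page whose time is <= target_time, or None."""
    lo, hi = 0, len(page_granulepos)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare in the time domain, never on raw granulepos values.
        if granule_to_time(page_granulepos[mid], shift, fps) <= target_time:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1 if lo > 0 else None


# Hypothetical pages: keyframes at frames 0 and 10, shift of 6, 25 fps.
pages = [(0 << 6) | 5, (0 << 6) | 9, (10 << 6) | 0, (10 << 6) | 4]
index = find_seek_page(pages, 12 / 25, shift=6, fps=25.0)
print(index, pages[index] >> 6)
```

Once the page is found, the high field of its granulepos names the keyframe to restart decoding from, which is the second search described above. With several multiplexed substreams the same search would be repeated per substream, each with its own mapping.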
On Mon, Aug 18, 2008 at 2:25 AM, ogg.k.ogg.k at googlemail.com
<ogg.k.ogg.k at googlemail.com> wrote:

>> If the lacing type and crc flags are zero, stream structure can be
>> zero.
>
> I do not understand this.

My point here was that stream structure version 0 pages (the current
ogg spec) are a strict subset of stream structure version 1 pages in my
proposal. (It could be so in yours too.) The new flag values are chosen
so that if they are set to zero as they are in version 0 pages, the
header can be parsed just like a version 0 page, which hopefully
simplifies implementation.

 -r