Listening to the meeting on granule pos tonight/today, it became clear that the issues everyone is concerned with mostly don't affect my implementations, and the issues I have pretty much don't affect anyone else... and in the cases where they overlap, the reasoning seems to be different. Since everyone else has had a lot more time to consider all these issues and I'm pretty new to this, it's much harder for me to make a cogent argument on the fly. So I figured I'd spell out all the things I've come across in my implementation, just to put them out there. I'll preface this by saying that my experience in audio/video is probably considerably less than most of the others working on this stuff, so if I make false assumptions, am missing the point, etc., just tell me! :)

DShow Background
================

DirectShow is a very structured media framework; there are specific interfaces for communication and guidelines for how and when data can be passed. It is, however, highly modular and flexible, enabling hundreds of codecs to be implemented with it. Some background... there are a few major components: graphs, filters, pins, samples and allocator pools. See www.illiminable.com/ogg/graphedit.html for a look at what the graphs are like.

In order for two filters to connect, their pins need to offer certain media types specifying the type of data and various parameters of the data (frame rates, frame size, sample rate, etc.) depending on the media type.

Allocator pools exist between the connection of any two pins. An allocator pool is a fixed number of fixed-size samples. All data is passed through the allocator pools. Before the user starts the graph (presses play), no data is passed in the graph. When the user presses play, the graph goes into pause mode and data is pushed through the graph, filling up all the allocator pools until all the threads are blocked; then the graph goes into play mode. As the downstream end (renderers) pulls data out of the downstream allocators, it frees up a spot for an upstream filter, whose thread unblocks and fills the space, and so on.

DirectShow requires start and end times for all samples.

Demuxing
========

OK, so given that the graph has to be built before data is passed downstream, there is a problem. How can the demuxer know what filters to connect to (i.e. what the streams are)? The demux needs to read ahead enough to find the BOS pages. Now we know how many streams there are. How does it know what kind of streams they are? It has to be able to recognise the capture patterns of every possible codec. So a "codec oblivious" demux is already out of the question.

Let's look further downstream for the moment... we'll assume we have a Vorbis-only stream. The DirectSound audio renderer won't connect to any decoder unless it is told the audio parameters: number of channels, sample rate, etc. If no data can flow in the graph yet, how can the decoder have seen the header pages to know this? It can't. This information is considered part of the setup data. Hence the media parameters have to come from the demux when it connects to the decoder, i.e. the media type the demux offers is (Audio/Vorbis 2 channel 44100), for example.

So the demux has to be able to parse the BOS page headers to offer a useful media type. The demux therefore has to not only identify the streams but also know how to get at least the key information out of them. In other words, the demux has to know how to parse the header of every possible codec format it will offer.
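To make that concrete, here's roughly the kind of peek the demux ends up doing at a Vorbis BOS packet: a minimal sketch in plain C++ rather than the actual filter code. The names are just illustrative; the field offsets are from the Vorbis I spec.

    #include <cstdint>
    #include <cstring>

    // Parameters the demux must pull out of a Vorbis BOS packet before
    // the graph is built, so it can offer a complete media type.
    struct VorbisParams {
        uint32_t version;
        uint8_t  channels;
        uint32_t sampleRate;
    };

    // Read a little-endian 32-bit value from a byte buffer.
    static uint32_t U32LE(const uint8_t* p)
    {
        return uint32_t(p[0]) | uint32_t(p[1]) << 8 |
               uint32_t(p[2]) << 16 | uint32_t(p[3]) << 24;
    }

    // Parse the Vorbis identification header (packet type 0x01 +
    // "vorbis"). Returns false if the packet is too short or the
    // capture pattern does not match.
    bool ParseVorbisBOS(const uint8_t* p, size_t len, VorbisParams* out)
    {
        if (len < 16 || p[0] != 0x01 ||
            std::memcmp(p + 1, "vorbis", 6) != 0)
            return false;
        out->version    = U32LE(p + 7);   // vorbis_version (must be 0)
        out->channels   = p[11];          // audio_channels
        out->sampleRate = U32LE(p + 12);  // audio_sample_rate
        return true;
    }

And the demux needs an equivalent of this for every codec it wants to offer.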
Now, why isn't this an issue with every other codec, I assume you are thinking? The main reason is that the header format of Ogg codecs (i.e. Vorbis headers, Speex headers, etc.) is completely arbitrary and defined entirely by the codec. That's good in the sense that codecs can define whatever information they want. But it's bad in the sense that your demux can't be as dumb as you'd want. Other formats have at least portions of fixed header, where no matter what the exact details of the codec, some core information is guaranteed to be found at fixed locations. Codec identifiers are also of fixed (or at least bounded) size and in a fixed location. So you can, for example, map a FOURCC identifier to a DirectShow media-type GUID and get the key parameters from a fixed place. All this information is available up front, the demux doesn't need to know any specifics of codec headers, and it can handle new codecs without modification. Incidentally, this is all that OGM is: an extra header before the codec-specific ones that contains this information. Similarly, Annodex uses AnxData headers which preface each codec stream and contain information like granule rate and codec identifiers at fixed locations of bounded size.

The related issue is that of identifying streams... the codec identifier has no bounds; there is no way to say "this is the end of the identifier, and this is the rest of the header". In other words, \001vorbis is pretty much indistinguishable from \001vorbis2. How can you tell if the 2 is part of the identifier or part of the rest of the header?

Time Stamps
===========

DirectShow works in UNITS of 1/10,000,000 of a second; it knows nothing of granule pos. When something like Media Player requests a seek or a position report, it wants these units. So the seek request comes into the graph. It needs to be passed back to the demux, being the only portion of the graph with direct access to the data source. In order to seek in Ogg, you need granule pos, so again the demux needs to know how to make the conversion. The decoding filters can't make this conversion, because each one only knows about its own granule pos, so even if a decoder did convert and tried to get the demux to seek on that granule pos, it would restrict the available seeking landmarks to only that codec. So again the demux needs that "granule rate" information in order to make the conversion for each codec it may come across in its seek.

Now, after we seek, we hit a page we want to start from (and maybe go back a bit to ensure we get a keyframe, etc.)... so when we scan back, we find a new starting point. DirectShow now considers the time point it asked to seek to as time 0. It doesn't want to know about absolute times. So we are at a point a few pages before we want to start, and we have to make sure we hit one page of every logical stream in order to get a landmark granule pos. That's kind of OK for dense codecs... but what about sparse ones? With the end-timestamp scheme we have to find at least one page of every stream before we get our desired one. With the start-timestamp scheme we can resync as we hit a page: as we get a page, we know what time it starts at, and we then have a reference point to determine start and end times of every subsequent sample in that stream. This means less seeking back.
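For concreteness, a sketch of the conversions involved, assuming a codec like Vorbis whose granule pos is a plain sample count (not true for every codec); the names are mine, and UNITS is DirectShow's 100 ns tick:

    #include <cstdint>

    typedef int64_t REFERENCE_TIME;         // DirectShow 100 ns units
    static const int64_t UNITS = 10000000;  // one second

    // For a codec whose granule pos counts PCM samples, the mappings
    // in both directions are straight scale factors by the rate.
    REFERENCE_TIME GranuleToTime(int64_t gp, uint32_t granulesPerSec)
    {
        return gp * UNITS / granulesPerSec;
    }

    int64_t TimeToGranule(REFERENCE_TIME rt, uint32_t granulesPerSec)
    {
        // Rounding down lands us at or before the requested time,
        // which is what we want as a starting point for the backward
        // scan to a keyframe.
        return rt * granulesPerSec / UNITS;
    }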
My personal preference for the timestamp scheme would be start timestamps for all codecs: assuming you want both start and end times, finding the end given the start is much easier and more efficient than finding the start given the end. As for stream duration, I see no problem with having an empty EOS page which has the end time in it. But from the sounds of it, this isn't the general consensus.

===============================================

Anyway... I've said my bit for now! :) This is long enough and I'm tired!

Zen.
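P.S. A tiny sketch of why the start-stamp scheme is cheap on top of DirectShow's required start/end times, assuming a dense codec where each sample's duration is known once decoded (names are illustrative):

    #include <cstdint>

    typedef int64_t REFERENCE_TIME;         // DirectShow 100 ns units
    static const int64_t UNITS = 10000000;

    // Once one page gives us a start reference, every later sample's
    // stamps fall out forward as we decode: end = start + duration,
    // and the next sample starts where this one ended. No look-back
    // past the resync point is needed.
    struct StampState { REFERENCE_TIME next; };

    void StampSample(StampState& st, int64_t pcmSamples, uint32_t rate,
                     REFERENCE_TIME* start, REFERENCE_TIME* end)
    {
        *start  = st.next;
        *end    = *start + pcmSamples * UNITS / rate;
        st.next = *end;  // the following sample begins here
    }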
Timothy B. Terriberry
2004-May-08 13:58 UTC
[theora-dev] My issues with ogg and directshow...
> OK, so given that the graph has to be built before data is passed
> downstream, there is a problem. How can the demuxer know what filters to
> connect to (i.e. what the streams are)? The demux needs to read ahead
> enough to find the BOS pages. Now we know how many streams there are.
> How does it know what kind of streams they are? It has to be able to
> recognise the capture patterns of every possible codec. So a "codec
> oblivious" demux is already out of the question.

This is an issue of where the separation line is drawn, not whether or not separation can exist. The Ogg abstraction has a richer interaction between codec and muxer than the DS framework mandates. But this doesn't prevent you from defining an "Ogg codec" interface as a richer instance of the general DS codec interface, adding such things as generic functions to answer questions like, "Given this initial packet, can you decode this stream?" or, "What are the DS media parameters corresponding to this complete set of header packets?" or, "What is the time associated with this granule position?" The muxer can still rely wholly on the codecs to answer these questions; it just needs a richer codec API than the DS framework in general has. New codecs can still be added without modifications to the demux so long as they implement this extended API.

And as an aside, please don't use the phrase "granule rate"... it implies, incorrectly, that the granule position->time mapping can be accomplished by multiplying by a simple scale factor, and this is NOT true in general. In particular, it is not true for Theora.

> DirectShow works in UNITS of 1/10,000,000 of a second; it knows nothing
> of granule pos. When something like Media Player requests a seek or a
> position report, it wants these units. So the seek request comes into
> the graph.

Generally one seeks to a time, not a granule position. The granule position->time mapping is unique, but the reverse does not have to be. So when dealing with multiple codecs, you convert everything to a time in order to be able to compare values among them. It's unfortunate that DS does not let one work in the native time base of the streams, but units of 100 nanoseconds should be accurate enough for most purposes.
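Concretely, such an extended API might look something like this (a rough illustration only; the names and signatures are made up, not an actual implementation):

    #include <cstdint>
    #include <cstddef>

    // A sketch of an extended "Ogg codec" interface: the demux stays
    // codec-oblivious and delegates these questions to whichever
    // registered handler claims the stream.
    struct IOggCodecHandler {
        // "Given this initial packet, can you decode this stream?"
        virtual bool CanHandle(const uint8_t* bosPacket, size_t len) = 0;

        // "What are the DS media parameters for this complete set of
        // header packets?" Filled into a caller-supplied structure.
        virtual bool FillMediaType(const uint8_t* const* headers,
                                   const size_t* headerLens, size_t count,
                                   void* mediaTypeOut) = 0;

        // "What is the time associated with this granule position?"
        // Returned in DirectShow's 100 ns units. Note this is a
        // function call, not a scale factor: a Theora handler has to
        // split the keyframe field out of the high bits of the granule
        // position before it can produce a time.
        virtual int64_t GranuleToTime(int64_t granulePos) = 0;

        virtual ~IOggCodecHandler() {}
    };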
On Sun, May 09, 2004 at 03:14:37AM +0800, illiminable wrote:

> Listening to the meeting on granule pos tonight/today, it became clear
> that the issues everyone is concerned with mostly don't affect my
> implementations, and the issues I have pretty much don't affect anyone
> else... and in the cases where they overlap, the reasoning seems to be
> different. Since everyone else has had a lot more time to consider all
> these issues and I'm pretty new to this, it's much harder for me to make
> a cogent argument on the fly. So I figured I'd spell out all the things
> I've come across in my implementation, just to put them out there.

Thanks for putting this together, Zen. It's really nice to have a solid introduction to the issues from someone experienced with the framework.

> Allocator pools exist between the connection of any two pins. An
> allocator pool is a fixed number of fixed-size samples.

I can see how this works for fixed-bitrate codecs (and most uncompressed media, of course). Does one just use 'really big buffers' for VBR data?

> DirectShow requires start and end times for all samples.

And you've succeeded in calculating this for all our codecs?

> OK, so given that the graph has to be built before data is passed
> downstream, there is a problem. How can the demuxer know what filters to
> connect to (i.e. what the streams are)? The demux needs to read ahead
> enough to find the BOS pages. Now we know how many streams there are.
> How does it know what kind of streams they are? It has to be able to
> recognise the capture patterns of every possible codec. So a "codec
> oblivious" demux is already out of the question.
>
> Let's look further downstream for the moment... we'll assume we have a
> Vorbis-only stream. The DirectSound audio renderer won't connect to any
> decoder unless it is told the audio parameters: number of channels,
> sample rate, etc. If no data can flow in the graph yet, how can the
> decoder have seen the header pages to know this? It can't. This
> information is considered part of the setup data. Hence the media
> parameters have to come from the demux when it connects to the decoder,
> i.e. the media type the demux offers is (Audio/Vorbis 2 channel 44100),
> for example.
>
> So the demux has to be able to parse the BOS page headers to offer a
> useful media type. The demux therefore has to not only identify the
> streams but also know how to get at least the key information out of
> them. In other words, the demux has to know how to parse the header of
> every possible codec format it will offer.
>
> Now, why isn't this an issue with every other codec, I assume you are
> thinking?

To clarify here, it's my understanding that format parameter lookup is a feature of the AVI and OGM container formats (and ASF, presumably), not of any of the specific codecs. Is this correct? That's why lookup of this information is always possible there, and not for Ogg, even if we provide a convenience library that can do the header parse for all the codec embeddings it knows about, as I think derf was suggesting.

Practically speaking, I think this can be dealt with. After all, being able to identify a codec by FOURCC doesn't help if you can't find an implementing DLL. From the point of view of DirectShow, it's just a limitation of this particular container format. Not knowing anything about them, I'd guess that QuickTime can optionally provide a table with this information, and that MPEG program streams, like Ogg, don't provide much beyond the packet types.
How does DirectShow handle those containers?

> The related issue is that of identifying streams... the codec identifier
> has no bounds; there is no way to say "this is the end of the
> identifier, and this is the rest of the header". In other words,
> \001vorbis is pretty much indistinguishable from \001vorbis2. How can
> you tell if the 2 is part of the identifier or part of the rest of the
> header?

Yes. It's well defined in specific codec specs, but more flexible in general. Just looking, file-magic style, at some of the initial bytes should always work.

> With the start-timestamp scheme we can resync as we hit a page: as we
> get a page, we know what time it starts at, and we then have a reference
> point to determine start and end times of every subsequent sample in
> that stream. This means less seeking back.

This is another good example of problems with the end-time granule. Thanks.

> As for stream duration, I see no problem with having an empty EOS page
> which has the end time in it.

The only problem here is that you can't rely on the page being there (the stream might be truncated, and in fact may explicitly be so in Ogg Vorbis). So it's sugar, not something that's 'built in' to the format design.

> But from the sounds of it, this isn't the general consensus.

Dunno. Sounded like Aaron was on your side. :)

Cheers,

 -r
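P.S. For concreteness, roughly what that file-magic check could look like; the pattern table is illustrative, not exhaustive, and the names are made up:

    #include <cstdint>
    #include <cstring>

    // BOS capture patterns for codecs this build knows about. Checking
    // the longest matching pattern is how the "\001vorbis" versus
    // "\001vorbis2" ambiguity gets resolved: a (hypothetical) 8-byte
    // "\x01vorbis2" entry would win over the 7-byte "\x01vorbis" one.
    struct Magic { const char* pattern; size_t len; const char* codec; };

    static const Magic kKnown[] = {
        { "\x80theora", 7, "Theora" },
        { "\x01vorbis", 7, "Vorbis" },
        { "Speex   ",   8, "Speex"  },
    };

    const char* IdentifyBOS(const uint8_t* packet, size_t len)
    {
        const Magic* best = 0;
        for (size_t i = 0; i < sizeof(kKnown) / sizeof(kKnown[0]); ++i) {
            const Magic& m = kKnown[i];
            if (len >= m.len &&
                std::memcmp(packet, m.pattern, m.len) == 0 &&
                (!best || m.len > best->len))
                best = &m;
        }
        return best ? best->codec : 0;  // 0 = unknown stream type
    }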