Folks have noticed that the documentation is semi-silent about how to properly encode the granule position and interleave synchronization of keyframe-based video. The primary reasons for this:

a) we at Xiph hadn't had to do it yet
b) there are several easy possibilities, and the longer we had to think about it before mandating One True Spec, the better that spec would likely be.

The lack of a painfully explicit spec has led to the theory that it's not possible; that's not true, there are a few ways to do it. Several require no extension to Ogg stream v0. A last way requires an extra field (a point against it), but does not actually break any stream that currently exists.

The time has come to lay down the spec, as we're now building the real abstraction layers in a concrete Ogg framework, one where the Ogg engine, the codecs, and the overarching Ogg control layers are neatly put into boxes connected in formalized ways. Below I go into detail about each scheme in a 'thinking aloud' sort of way. This is not because I haven't already given the matter sufficient thought; it is because I wish to give the reader sufficient background to understand why one way is better than the others. This is not a call for input so much as an educational effort (and a public sanity check of my thinking; please do pipe up if it appears I missed a salient point).

Starting Assumptions:

1) Ogg is not a non-linear format. It is not a replacement for the scripting system of a DVD player. It is a media transport format designed to do nothing more than deliver content, in a stream, and have all the pieces arrive on time and in sync. It is not designed to *prevent* more complex use of content; it merely does not implement anything beyond a linear representation of the data contained within. If you want to build a real non-linear format, build it *from* Ogg, not *into* Ogg. This has been the intent from day 1.

2) The Ogg layer does not know the specifics of the codec data it's multiplexing into a stream. It knows nothing beyond 'Oooo, packets!', that the packets belong to different buckets, that the packets go in order, and that packets have position markers. Ogg does not even have a concept of 'time'; it only knows about the sequentially increasing, unitless position markers. It is up to higher layers, which have access to the codec APIs, to assign and convert units of framing or time.

3) Given pre-cached decode headers, a player may seek into a stream at any point and begin decode. Audio may start after video by a fraction of a second, or video may be blank until the stream hits the next keyframe, but this simplest case must just work, and there will be sufficient information to maintain perfect cross-media sync.

4) (This departs from current reality, but it will be the reality very soon; vorbisfile currently blurs the careful abstraction I'm about to describe.) Seeking at an arbitrary level of precision is a distributed abstraction in the larger Ogg picture. At the lowest-level Ogg stream abstraction, seeking is one operation: "find me the page from logical stream 'n' with granule position 'x'". All more complex seeking operations are a function of a higher-level layer (with knowledge of the media types and codecs in use) making intelligent use of this lowest Ogg abstraction. The Ogg stream abstraction need deal with nothing more complex than 'find this page'.

The various granulepos strategies for keyframes concern this last point.
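[Before turning to the strategies, a minimal C sketch of that single low-level primitive: bisect the physical stream for the last page whose granulepos is not past the target. read_page_at() here is a toy stand-in, not a libogg call; real code would do file I/O, skip pages of other logical streams, and skip pages carrying no granulepos (-1).]

    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-in for the physical stream: pages of one logical
     * stream, each with a byte offset and a granulepos. */
    typedef struct { int64_t offset, granulepos; } page_info;

    static const page_info stream[] = {
        {0, 0x000}, {900, 0x020}, {1700, 0x021}, {2500, 0x022},
        {3400, 0x040}, {4300, 0x041}, {5100, 0x042}, {6000, 0x100},
    };
    static const int64_t stream_len = 7000;

    /* first page starting at or after 'off'; -1 past the end */
    static int read_page_at(int64_t off, page_info *out)
    {
        for (size_t i = 0; i < sizeof stream / sizeof *stream; i++)
            if (stream[i].offset >= off) { *out = stream[i]; return 0; }
        return -1;
    }

    /* The single Ogg-level seek primitive: bisect for the last page
     * whose granulepos does not exceed the target. */
    static int64_t seek_granule(int64_t target)
    {
        int64_t begin = 0, end = stream_len, best = -1;
        while (begin < end) {
            int64_t mid = begin + (end - begin) / 2;
            page_info pi;
            if (read_page_at(mid, &pi) < 0 || pi.granulepos > target)
                end = mid;              /* probe landed past the target */
            else {
                best = pi.offset;       /* candidate; look further right */
                begin = pi.offset + 1;
            }
        }
        return best;   /* byte offset to begin reading pages from */
    }

    int main(void)
    {
        /* seek to granulepos 0x041: lands on the page at offset 4300 */
        printf("offset %lld\n", (long long)seek_granule(0x041));
        return 0;
    }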
The basic issue with video, from which the complexity arises, is that frames often depend on previous and possibly future frames. This is one instance of a larger, general category: codecs whose streams may not begin decode from just any packet, packets that may not represent an entire frame, or even codecs without a fixed-time sampling algorithm. It is a mistake to design a seeking system tied to an exact set of very specific cases. While one could implement an explicit keyframe mechanism at the Ogg level, this mechanism would not cover any of the other interesting seeking cases while, as I'll show below, not actually being necessary.

There will be a few complaints that Ogg is being unnecessarily subtle and shifts a great deal of complexity into software which a few extra page header fields could eliminate. Consider the following:

1) Ogg was designed to impose roughly .5-1% of overhead over the raw packet data across a wide range of packet usage patterns. 'A few extra fields' begins inflating that figure for specific special cases that apply to only a few stream types. Right now there is no header field that is not general to every stream. There is no fat in the page headers.

2) The Ogg-level seeking algorithm is exceptionally simple and can be described in a single sentence: "Find the earliest page with a granulepos less than but closest to 'x'". This shifts the onus of assembling more complex seeking operations requiring knowledge of a specific media type into a higher layer that has that knowledge. The higher layer becomes responsible for determining for what 'x' Ogg should search. The division of labor is clear and sensible.

3) Complex, precise seeking operations are still contained entirely within the framework, just at a higher layer than Ogg-stream. At no time is an application developer required to deal with seeking mechanisms within an Ogg stream or to manually maintain stream synchronization.

High level handwaving - How seeking really works:

The granulepos is intended to mean, roughly, 'if I stop decode at the end of this page, I will get data from my decoder up to position granulepos'. The granulepos simultaneously provides seeking information and a 'length-of-stream' indicator. Depending on the codec, it can also usually be used to indicate a timebase, but that isn't our problem right now.

By inference, the granulepos is also used to construct a value 'y' such that 'if I begin decode *from* point y, I will get data beginning at position granulepos'. Although in some codecs y = granulepos, that is not necessarily the case when decode can't begin at any arbitrary packet. The granulepos encoding method candidates I will now describe affect exactly this granulepos-to-y conversion process. Note also that none of these affect Ogg, only the higher decision-making layers. Different circumstances necessitated by different codecs can lead to different valid choices, all of which work as far as Ogg is concerned. However, for our I-/P-/B-frame video case, there is a pretty clear winner.

Strategy 1: Straight Granulepos, Keyframes Are Not Our Problem

In this scheme, the granulepos is a simple frame counter. The seeking decision-maker in the codec's framework plugin is responsible for determining whether a frame is a keyframe, and if it can't begin decode from a given frame, it must request an earlier frame until it finds a keyframe. If the codec so desires, it can store 'what is my keyframe?' information in the stream packets. A sketch of the resulting walk-back appears below.
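[A minimal sketch of the seek pattern strategy 1 implies. frame_is_keyframe() stands in for the codec plugin's knowledge and is purely illustrative; the linear walk-back is shown for clarity, where a real plugin would use the 'what is my keyframe?' hint stored in the packets to jump straight there.]

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical codec-plugin query: in strategy 1 the plugin, not
     * Ogg, knows which frames are keyframes (here: every 8th frame,
     * purely for illustration). */
    static int frame_is_keyframe(int64_t frame)
    {
        return frame % 8 == 0;
    }

    /* Strategy 1 seek pattern: one Ogg seek to the requested frame,
     * then further backward seeks until the plugin finds a frame it
     * can begin decode from. */
    static int64_t strategy1_seek(int64_t frame)
    {
        int64_t gp = frame;             /* granulepos == frame counter */
        while (gp > 0 && !frame_is_keyframe(gp))
            gp--;                       /* request an earlier frame    */
        return gp;  /* keyframe's granulepos: the 'y' for this frame  */
    }

    int main(void)
    {
        printf("seek to frame 13 starts decode at frame %lld\n",
               (long long)strategy1_seek(13));   /* prints 8 */
        return 0;
    }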
This case means that each seek to a *specific* frame in a video stream will generally result in two Ogg seeks: a first seek to the requested frame, then a second seek backwards to find that frame's keyframe.

A larger concern is the semantic accuracy of the granulepos; it's intended to reflect position accurately when decoding forward. In this scheme it's fine for a P-frame to update the counter (as it can be decoded going strictly forward), but B frames will also advance the counter, and they can't be decoded without subsequent P or I frames. Thus the semantic value of granulepos no longer strictly represents 'we can decode up to granulepos at the end of this frame'.

Strategy 2: Granulepos Represents Keyframes Only

In this scheme, only keyframes update the granulepos (monotonically or non-monotonically). It simplifies seeking to a keyframe, as an Ogg-level seek to page 'x' will always yield a page with a keyframe. In addition, granulepos will always mean 'we can decode up to *at least* this point in the stream'. If the stream is truncated at P or B frames past granulepos, the extra frames can be discarded. (A special case would need to be defined to terminate a stream that doesn't end on an I frame.)

The difficulty with this scheme is that it gives the software-level decoder slightly more to track; a proper frame number could not be determined internally without counting from an I frame. Also, the granulepos of an Ogg page would not necessarily map to the last packet on the page, or even to any packet on that page; multiple sequential pages could have the same granulepos. It is conceptually slightly messy, although the 'messiness' does not make it at all impractical.

Strategy 3: Granulepos Encodes Some State

In some ways this strategy is the most semantically 'over-clever', but it is also the easiest to implement and the one that gives the most correct, up-to-date sync information. Pending comments, it is the I/P/B video strategy I currently favor.

The granulepos is 64 bits, a size that is absolutely necessary if, for example, it represents the PCM sample count in an audio codec. When used to encode a video frame number, however, it is comparatively absurdly large*.

* Note that although granulepos is not permitted to wrap around, we can simply begin a new logical stream segment with a new serial number should a 30fps video stream ever hit the ten-billion year mark.

Thus we clearly have room to skim a few bits off the bottom of granulepos to represent I, P or B frames. These bits are not used as flags; rather, frame representation becomes a counting problem, arranged such that the count is still always strictly increasing.

For example, we know that I frames will never be more than 256 frames apart and P frames no more than 31 B frames apart, so the granulepos of an I frame can be defined to always satisfy granulepos & 0xff == 0. If we can have up to seven intervening P frames, they could be numbered granulepos-of-I-frame + 0x20, 0x40, 0x60... 0xe0. B frames between the I and P frames would use the remaining five bits and be numbered as sub-I and sub-P frames 1 through 31. Thus, starting from zero, the frames/packets in the pattern IPBBPBBI would be numbered 0x000, 0x020, 0x021, 0x022, 0x040, 0x041, 0x042, 0x100.

If we wish to preserve the ability to represent a timebase, the I-frame granulepos need not be a simple increment of the previous I-frame value; its upper bits can instead carry the absolute frame number.
The above example becomes 0x000, 0x020, 0x021, 0x022, 0x040, 0x041, 0x042, 0x700. To get the real frame number (from an I frame), we just shift: granulepos >> 8. This scheme can be taken further or modified to recover the frame number from any video frame.

In this way we can always seek, on the first try, to a desired keyframe page (by seeking to Ogg page 'x' where x & 0xff == 0). In addition, each frame still has a unique frame number and also a clear 'group' number, potentially useful information to the decoder. Lastly, granulepos is still semantically correct, although it is now, in a sense, representing a 'whole.fractional' frame number for buffering purposes (a concrete sketch of this packing appears below).

Scheme Four: Extra 'Seekpos' Field / Straw Man

Another possibility requires extension of the current Ogg page format. Although older players would reject any such extended pages as invalid, we do have versioning and typing fields, so there aren't actually any compatibility problems with current Ogg pages... in the future.

The idea in this scheme is to keep the current granulepos as a frame number field (ala scheme 1), but also add a new field 'seekpos' that is used, rather than granulepos, in seeking. The seekpos would represent the number of the last keyframe that passed by.

advantages:

1) The net effect of this strategy is to modify scheme 1 to require only one bisection seek rather than two. Some amount of code simplification (over scheme 1) at the decision-making level.

disadvantages:

1) The Ogg format will need to be revved. No current (ala 1.0) Ogg code will understand the new pages.

2) The header becomes larger, from a minimum size of 27 bytes to a minimum size of 35.

3) This strategy only enhances keyframes; it is of no use in other odd seeking cases.

4) It gives no more information than scheme 3, but is still more complicated, both in code and API (Ogg would have to understand keyframes).

Thus, there's no substantial reason to prefer extending the format over a scheme that's possible within the existing framework. Note that schemes 1-3 can all be implemented within the Ogg stream today.

Monty
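[A sketch of scheme 3's packing arithmetic, using the constants from Monty's example (8 low bits per group, P frames counting in steps of 0x20, B frames in the low five bits). The function names are illustrative, not a fixed spec; the printed sequence reproduces the IPBBPBBI example above.]

    #include <stdint.h>
    #include <assert.h>
    #include <stdio.h>

    #define GROUP_SHIFT 8      /* low 8 bits: position within group   */
    #define P_STEP      0x20   /* P frames count in steps of 0x20     */

    /* granulepos of an I frame at absolute frame number 'frame' */
    static int64_t gp_iframe(int64_t frame) { return frame << GROUP_SHIFT; }

    /* granulepos of the nth P frame (1..7) after the I frame */
    static int64_t gp_pframe(int64_t gp_i, int n) { return gp_i + n * P_STEP; }

    /* granulepos of the nth B frame (1..31) after an I or P frame */
    static int64_t gp_bframe(int64_t gp_ip, int n) { return gp_ip + n; }

    /* is this granulepos (or seek target) a keyframe page? */
    static int gp_is_keyframe(int64_t gp) { return (gp & 0xff) == 0; }

    /* absolute frame number of the keyframe heading this group */
    static int64_t gp_group_frame(int64_t gp) { return gp >> GROUP_SHIFT; }

    int main(void)
    {
        /* the IPBBPBBI example, with I frames carrying the absolute
           frame number in their upper bits (timebase-preserving form) */
        int64_t i0 = gp_iframe(0);
        int64_t p1 = gp_pframe(i0, 1), b1 = gp_bframe(p1, 1),
                b2 = gp_bframe(p1, 2);
        int64_t p2 = gp_pframe(i0, 2), b3 = gp_bframe(p2, 1),
                b4 = gp_bframe(p2, 2);
        int64_t i7 = gp_iframe(7);                       /* 0x700 */

        assert(gp_is_keyframe(i0) && gp_is_keyframe(i7));
        assert(!gp_is_keyframe(p2) && !gp_is_keyframe(b2));
        assert(gp_group_frame(i7) == 7);

        /* prints: 000 020 021 022 040 041 042 700 */
        printf("%03llx %03llx %03llx %03llx %03llx %03llx %03llx %03llx\n",
               (long long)i0, (long long)p1, (long long)b1, (long long)b2,
               (long long)p2, (long long)b3, (long long)b4, (long long)i7);
        return 0;
    }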
> For example, we know that I frames will never be more than 256
> frames apart and P frames no more than 31 B frames apart, so the
> granulepos of an I frame can be defined to always satisfy...

s/we know/if we know/

Monty
Nicely put. I vote for scheme #3, if only because you have clearly thought about this more than I. For the record, VP3 in its present form doesn't use B frames. However, I am all for a solution that supports them, as Ogg needs to be as flexible as possible.

-dbm

-----Original Message-----
From: Monty [mailto:xiphmont@xiph.org]
Sent: Fri 9/13/2002 2:50 PM
To: theora-dev@xiph.org
Subject: [theora-dev] Keyframe seeking in Ogg and spec

[quoted original message trimmed]
On Friday, Sep 13, 2002, at 19:50 Europe/London, Monty wrote:

> This is not a call for input so much as an educational effort (and a
> public sanity check of my thinking; please do pipe up if it appears
> I missed a salient point).

Just some nitpicks, feel free to ignore:

> 2) The Ogg layer does not know the specifics of the codec data it's
> multiplexing into a stream. It knows nothing beyond 'Oooo, packets!',
> that the packets belong to different buckets, that the packets go in
> order, and that packets have position markers.

I understand this description is meant to be fuzzy, but it is a little misleading. The Ogg layer doesn't know about anything but packets in buckets, but anything that operates on Ogg streams must use some higher-level information if it is going to redistribute packets or re-interleave streams. Vorbis specifies that a page break occur after the last required header for the convenience of streaming/editing applications; it seems likely that this is a rule we'll want to follow with other codec data. And to do a good job, any mux/interleave tool will have to understand the timecodes of the various streams it's dealing with.

> 2) The Ogg-level seeking algorithm is exceptionally simple and can be
> described in a single sentence: "Find the earliest page with a
> granulepos less than but closest to 'x'".

Wouldn't "find the latest page with a granulepos not later than x" be a simpler description?

> * Note that although granulepos is not permitted to wrap around, we
> can simply begin a new logical stream segment with a new serial
> number should a 30fps video stream ever hit the ten-billion year
> mark.

This is a bit flip. If you can do this with video, you can do it with audio.

-r
jbradford@dial.pipex.com wrote (2002-Sep-14 14:02 UTC):
> please do pipe up if it appears I missed a salient point.

I might be missing something obvious here (probably am :-) ), but here goes: shouldn't we include metadata that indicates which packets can be ignored to create a lower bandwidth version of the stream? How does Vorbis do this? What I'm referring to is the ability to take, e.g., a 128 kbps stream and create the 64 kbps stream that *would* have been created had the material been encoded at 64 kbps to begin with, not just re-lossy-compressing an existing compressed stream.

The reason I'm mentioning this is that whilst with Vorbis the idea was primarily, I think, to allow people to encode at a high bit rate and then create low bit rate streams from that, with video, isn't it possible that a server might be configured to drop packets gracefully if bandwidth was short? So low bandwidth would not result in jumpy, skippy video, but just lower quality video. The problem is, substream priority identification would really have to be implemented at the Ogg container level, so that the server didn't have to actually parse the Theora stream to intelligently drop packets.

John.
> Is there a maximum I frame spacing?

no

> Should this be left open and set in the header?

yes

Monty
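[If the maximum keyframe spacing is declared in the codec's setup header rather than fixed, the number of low granulepos bits skimmed off under scheme 3 can be derived from it. A sketch under that assumption; the header field and function names are invented for illustration, and the example's 256-frame maximum recovers the 8-bit layout used above.]

    #include <stdint.h>
    #include <stdio.h>

    /* smallest shift whose 2^shift granule-group covers the maximum
     * keyframe spacing declared in the (hypothetical) setup header */
    static int granule_shift(uint32_t max_keyframe_spacing)
    {
        int shift = 0;
        while (((uint32_t)1 << shift) < max_keyframe_spacing)
            shift++;
        return shift;
    }

    int main(void)
    {
        int shift = granule_shift(256);   /* the example's 256-frame max */
        int64_t gp = (int64_t)7 << shift; /* I frame at frame 7: 0x700   */

        /* keyframe test and group frame number then generalize to: */
        printf("shift %d, keyframe %d, group frame %lld\n", shift,
               (gp & (((int64_t)1 << shift) - 1)) == 0,
               (long long)(gp >> shift));
        return 0;
    }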
Hi, Monty. One problem I see with your proposed implementation is that it does not specify how audio and video would be synchronized, since there are no absolute time stamps for the video frames. So it is difficult to account for audio latency on different systems, much like when you play an AVI file. Ideally it would be wonderful to have timestamps that could be used to synchronize audio and video in a meaningful way, and if I understand correctly the proposed granulepos formats do not address this.

One obvious thing to add is a header to the VP3 stream (first page of the Ogg stream) specifying the video size and frame rate, much as you do with audio quality and audio rate in Vorbis. Each player application could cache audio and video data as it arrives, buffer it, and sync the video display to the actual audio output. This however assumes that you can find the absolute frame number without errors, and that there is no variable frame duration... Of course, it may be that the Ogg stream does not need to know anything about the duration of each frame in milliseconds and its position in time, but I find that difficult to believe. As an application writer I would like to know things about the file I have just opened, or some high-level API should be able to provide this info to me. The big question is whether you want to squish this info into the VP3 stream or into the Ogg stream, imo. The current VP3 stream does not provide this info; it is all provided in the AVI or QT standard header structures.

In Ogg, things like the duration of the whole video could come in the first page (you could send -1 for live transmissions, for example). But information about the absolute position of the frame in time and its duration, and whether the data represents a keyframe or not, needs to be provided with each frame (page), don't you think? To be fair, I am mainly interested in file-based playback, so ideally I would love to have a structure with the location of each keyframe, the duration of the file, etc., without having to parse the stream (or at least a way to optionally put this information in a header page). But I understand this is not what Ogg was designed for, and I will probably need to implement this using a custom file format.

Of the solutions you proposed I believe option 4 is actually better. The playing application could keep a list of keyframes as they arrive, in order to optimize seek operations, and you could extend the format to include absolute timestamping and the duration of each frame. Option 3 is clever, but it breaks if you consider that you can have keyframe spacing greater than 255 frames (although it is probably terrible to seek in a file with very widely spaced keyframes anyway).

On a separate note, I spent some hours looking at the various "Ogg Media" utilities available on the net, and people are mainly using them to multiplex existing video formats with Vorbis streams. There is even a DirectShow filter that will split these OGM files and pass the appropriate bits to the decoders on Windows. Of course these are not "official" Ogg files. People have been trying to guess a recipe to add video frames to an Ogg stream, each one using a different strategy for the granulepos field, for example, and sometimes leveraging the synchronization capabilities of other architectures. I understand that once we have an "official" way to mark video streams this problem will be alleviated...
Regards,

Mauricio Piacentini
Tabuleiro

----- Original Message -----
From: "Monty" <xiphmont@xiph.org>
To: <theora-dev@xiph.org>
Sent: Friday, September 13, 2002 3:50 PM
Subject: [theora-dev] Keyframe seeking in Ogg and spec

[quoted original message trimmed]
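[On Mauricio's timestamp point: if the frame rate lives in a codec setup header, as he suggests, a player can derive an absolute time from a scheme-3 granulepos with no per-frame timestamp field. A sketch under that assumption, using the 8-bit layout from Monty's example; only the group-level (I-frame) time is recovered here, since the low-byte P/B sub-numbering is codec-defined, but group time is what a player needs at seek points for A/V sync.]

    #include <stdint.h>
    #include <stdio.h>

    /* milliseconds for a scheme-3 granulepos, given a frame rate of
     * fps_num/fps_den read from a (hypothetical) codec setup header */
    static int64_t gp_to_ms(int64_t gp, int64_t fps_num, int64_t fps_den)
    {
        int64_t group_frame = gp >> 8;   /* absolute I-frame number */
        return group_frame * 1000 * fps_den / fps_num;
    }

    int main(void)
    {
        /* I frame number 300 at 30000/1001 fps (NTSC) -> 10010 ms */
        printf("%lld ms\n",
               (long long)gp_to_ms((int64_t)300 << 8, 30000, 1001));
        return 0;
    }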