Hello:

As seen in some of the MP3-oriented P2P programs and audio organizing
tools, the underlying uniqueness of a given MP3 file can be learned (for
the most part) by, for instance, taking a hash of the first 300,000
bytes of the non-ID3-tag content of the file to obtain a content
signature. (This hash could then be paired with the length of the
non-tag portion of the entire file for an even more distinctive
signature.) This signature could then be stored in an organizer program
DB (or in a P2P system DB) so that even though the filenames and tag
content can change or come from different sources, the underlying audio
content can be tied back to a DB entry via the signature. Note that this
scheme is understood not to work for identifying identical content
encoded under different bitrate/quality settings.

Can anyone guide me on whether there is any way to accomplish the same
goal with Vorbis using the existing APIs, that is, getting at the first
x bytes of non-tagging/metadata content of a stream, and similarly,
getting the length of the non-tagging/metadata portion of an entire file
stream? Or, if not that, any ideas on obtaining "uniqueness" through
another means in Vorbis?

One might say, "Why not just put a unique identifier in a tag in each
file, and not worry about this hash business?" To preemptively respond
to this, arguments against this approach follow:

1) The DB program (organizer or P2P system) might not have write access
to the files, and thus can't set an identifier tag. For instance, users
with large collections (let's call large 20,000 to 30,000 files) are
likely to have a good portion of them set to read-only (not to mention
read-only media) for archival purposes. Also, large collection holders
probably have a specific tagging/metadata program that they trust, and
don't want a program that they just downloaded deciding to write to
every single one of their content files.
2) Files can't be checked for duplication of the underlying audio
content other than through tagging / file size methods, which are
generally inadequate because of differing tagging/filename schemes.

Another might say, "How about decoding the first x seconds, and taking a
hash of that, to get uniqueness?" This could work, except that different
decoder implementations/versions might produce different hashes for the
same file, and decoding is likely to be a much slower technique.

Tom Wadzinski

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-request@xiph.org'
containing only the word 'unsubscribe' in the body. No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
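
For reference, the MP3 scheme described in the message above could be sketched roughly as follows. This is only an illustration, not the code of any particular P2P or organizer program: the function names, the choice of SHA-1, and the decision to recognize only a leading ID3v2 tag and a trailing 128-byte ID3v1 tag are all assumptions.

```python
import hashlib


def id3v2_size(data: bytes) -> int:
    """Return the byte length of a leading ID3v2 tag, or 0 if none is present."""
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    # The tag size is stored as a "syncsafe" integer: four bytes, 7 bits each.
    size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
    # ID3v2.4 may append a 10-byte footer, signalled by flag bit 0x10.
    footer = 10 if data[5] & 0x10 else 0
    return 10 + size + footer


def mp3_signature(data: bytes, limit: int = 300_000):
    """Return (hash of the first `limit` non-tag bytes, non-tag length)."""
    start = id3v2_size(data)
    end = len(data)
    # An ID3v1 tag is a fixed 128-byte block at the end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    audio = data[start:end]
    # SHA-1 is an arbitrary choice here; the original post names no hash.
    return hashlib.sha1(audio[:limit]).hexdigest(), len(audio)
```

Two copies of the same encoded audio with different tags should then yield the same pair, while a re-encode at another bitrate will not.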
Tom Wadzinski (twadzins@yahoo.com) wrote:

> Can anyone guide me on whether or not there any way to accomplish the
> same goal with Vorbis using the existing APIs, that is, getting at the
> first x bytes of non-tagging/metadata content of a stream

> One might say, "Why not just put a unique identifier in a tag in each
> file, and not worry about this hash business?"

Because you can't write to the files. You covered that in the part I
snipped.

> Another might say, "How about decoding the first x seconds, and taking a
> hash of that, to get uniqueness?".

Because MP3 decoding, at least, is nondeterministic. I'm not sure about
Vorbis.

Why don't you just hash the first 300 kbytes like giFT does? Then you
can be content-neutral, and your implementation is greatly simplified.
Third-party implementations might even get it right! ;-)

Besides that, imagine I download an Ogg from a P2P network and find that
it's not tagged. I might decide to add at least ARTIST and TITLE tags.
If you hash my tagged version against the original, but ignore the tags,
then you get the same hash for both files. That's no good -- you can't
swarm the file if it's not the same file on both sources.

So, if you skip over some part of the file arbitrarily, you ruin the
whole purpose of making a hash in the first place.

-- 
Greg Wooledge                  |   "Truth belongs to everybody."
greg@wooledge.org              |    - The Red Hot Chili Peppers
http://wooledge.org/~greg/     |
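
The content-neutral alternative suggested in the reply above, hashing the first 300 kbytes of the file exactly as stored, might look like the sketch below. It is not giFT's actual code: the name `prefix_hash`, the use of SHA-1, and the reading of "300 kbytes" as 300,000 bytes are assumptions made purely for illustration.

```python
import hashlib


def prefix_hash(stream, limit=300_000, chunk_size=65536):
    """Hash up to `limit` bytes from a binary stream, tags and all."""
    digest = hashlib.sha1()  # assumed hash; the post does not name one
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # file is shorter than `limit`
            break
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()
```

A typical call would be `prefix_hash(open(path, "rb"))`. Because nothing is skipped, a copy with added ARTIST and TITLE tags hashes differently from the untagged original, which is the behaviour wanted for swarming and the opposite of what a tag-independent organizer signature needs.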
This is an interesting question which I can't fully answer. I presume
this technique is used for MP3 files in some file sharing apps such as
Morpheus & Audiogalaxy. A similar routine will have to be developed for
Ogg files.

Something like this should not be too difficult. I think all that is
needed is to omit all page headers and comments. We would need to hear
from the Ogg file format experts out there to confirm this. I doubt it
is possible with existing library functions.

Regards,
Ross Levis.
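
One way to read the suggestion in the reply above (omit all page headers and comments) is sketched below. It parses the Ogg page framing by hand rather than calling libogg or libvorbis, and it rests on several assumptions: the file holds a single, unchained, unmultiplexed Vorbis stream, the whole file fits in memory, page CRCs are not verified, and SHA-1 is used as the hash. All function names are hypothetical.

```python
import hashlib


def iter_ogg_packets(data: bytes):
    """Yield complete packets from a single-stream Ogg file held in memory."""
    pos = 0
    pending = bytearray()  # carries a packet that spans several pages
    while pos + 27 <= len(data):
        if data[pos:pos + 4] != b"OggS":
            raise ValueError("lost Ogg page sync at offset %d" % pos)
        n_segments = data[pos + 26]
        lacing = data[pos + 27:pos + 27 + n_segments]
        body = pos + 27 + n_segments  # payload starts after the 27-byte
        for size in lacing:           # header and the segment table
            pending += data[body:body + size]
            body += size
            if size < 255:  # a lacing value below 255 ends the packet
                yield bytes(pending)
                pending.clear()
        pos = body


def vorbis_content_signature(data: bytes, limit: int = 300_000):
    """Return (hash of first `limit` non-comment packet bytes, their total)."""
    digest = hashlib.sha1()
    hashed = 0
    total = 0
    for packet in iter_ogg_packets(data):
        # The Vorbis comment header is packet type 3 followed by "vorbis".
        # Audio packets have an even first byte, so they never match.
        if packet[:7] == b"\x03vorbis":
            continue
        total += len(packet)
        if hashed < limit:
            chunk = packet[:limit - hashed]
            digest.update(chunk)
            hashed += len(chunk)
    return digest.hexdigest(), total
```

Since page headers (which carry the stream serial number and sequence numbers) and the comment packet are both excluded, retagging a file should leave the returned pair unchanged, even when the larger comment block shifts page boundaries. Whether the identification and setup headers should also be excluded is a design choice the thread leaves open; this sketch keeps them.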