I've found a much better solution; it's standard (part of Unicode itself), simple, and more flexible: Unicode language tagging. It was made for just this purpose, in fact.

A technical description is at http://www.unicode.org/unicode/reports/tr27/#tag which, like all specs, makes it sound a bit more complicated than it really is.

It comes down to this: mark the language of text with U+E0001 LANGUAGE TAG, followed by the RFC 3066 language ID (e.g. "ja"), with each lowercase ASCII character encoded as its code point plus 0xE0000.

There's really nothing needed in the spec, except to 1: recommend its use, and 2: define where language tags go out of scope (at the end of the tag; that is, a language tag's scope shouldn't carry over from one comment tag to the next).

Programs which don't want to interpret these can simply ignore them; they're zero-width, non-printing characters intended to be ignorable.

This also eliminates the major restrictions of UTF8_LANG; you can change language if you want, wherever you want. (We were already able to use this, as it's part of Unicode; nobody would use it if it weren't explicitly recommended, though.)

We could even make vcomment automatically add some of these tags. If the local encoding is Shift-JIS, we're pretty safe adding the Japanese tag. (It might be English, or some other language written in Roman letters, but it's OK to display those in a Japanese font.) Of course, this should be optional, though I don't see any problem with it defaulting to on.

Suggested wording for the spec:

"The use of Unicode language tags is encouraged in tag data. Language tags go out of scope at the end of each tag. See http://www.unicode.org/unicode/reports/tr27/#tag for a full description."

(This also means that we're back to all tags actually holding data that the user is likely to care about directly. UTF8_LANG is data that only a program is likely to care about, and a program that simply displays all tags is likely to make users go "what the heck is UTF8_LANG?")

-- 
Glenn Maynard

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
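
For illustration, here is a minimal Python sketch of the encoding Glenn describes: U+E0001, then each character of the language ID shifted up by 0xE0000. The function names `language_tag` and `tag_text` are invented for this example and are not part of vcomment or any Vorbis library. The validation is deliberately loose and only approximates what RFC 3066 allows.

```python
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000          # added to each ASCII code point of the language ID


def language_tag(lang_id):
    """Return the invisible tag sequence for a language ID such as "ja".

    Hypothetical helper: accepts ASCII letters, digits and "-" only, which is
    a loose approximation of an RFC 3066 identifier, not a full validator.
    """
    lang_id = lang_id.lower()
    if not lang_id or not all(
        c.isascii() and (c.isalnum() or c == "-") for c in lang_id
    ):
        raise ValueError("not a plausible language ID: %r" % lang_id)
    return LANGUAGE_TAG + "".join(chr(ord(c) + TAG_OFFSET) for c in lang_id)


def tag_text(lang_id, text):
    """Prefix text with the language tag sequence for lang_id."""
    return language_tag(lang_id) + text


# The result is ordinary Unicode text and is stored as UTF-8 like any other
# comment value.
title = tag_text("ja", "日本語")
utf8_bytes = title.encode("utf-8")
```

For "ja" the prefix is three code points: U+E0001, U+E006A, U+E0061. Each one takes four bytes in UTF-8, so the tag adds twelve bytes to the comment.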
begin Glenn Maynard quotation:
> It comes down to this: mark the language of text with U+E0001 LANGUAGE
> TAG, followed by the RFC 3066 language ID (e.g. "ja") encoded in
> lowercase ASCII plus 0xE0000.

When I read this, I wondered how those "lowercase ASCII" characters could be ignored by a program that doesn't handle the marker character. Obviously, if they were the ordinary lowercase ASCII characters U+0061 - U+007A, it wouldn't work, since they would be displayed as normal text if U+E0001 weren't recognized. However, the standard actually says that the characters used for the language ID part are U+E0020 - U+E007E, which map to their ASCII counterparts by the transform c - 0xE0000. Thus, they don't display as normal ASCII characters.

Another thing to mention is that this standard is part of Unicode 3.1, which is not widely supported yet, AFAIK. Hopefully, most existing Unicode implementations are smart enough to skip over characters in planes they don't implement, which would make the visual result on non-supporting platforms the same as not using the tags at all.

-md
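
As a rough sketch of the point above, a program that does not want to interpret language tags can drop every code point in the Unicode Tags block (U+E0000 - U+E007F) before display. The helper names below are made up for this example; the second function shows the c - 0xE0000 transform on its own.

```python
TAGS_BLOCK_FIRST = 0xE0000   # start of the Unicode Tags block
TAGS_BLOCK_LAST = 0xE007F    # U+E007F CANCEL TAG, end of the block


def is_tag_char(ch):
    """True for any code point in the Tags block, including U+E0001."""
    return TAGS_BLOCK_FIRST <= ord(ch) <= TAGS_BLOCK_LAST


def strip_language_tags(text):
    """Remove all tag characters, leaving only the visible text."""
    return "".join(ch for ch in text if not is_tag_char(ch))


def tag_char_to_ascii(ch):
    """Map one tag character (U+E0020 - U+E007E) to its ASCII counterpart."""
    cp = ord(ch)
    if not 0xE0020 <= cp <= 0xE007E:
        raise ValueError("not a tag character: U+%04X" % cp)
    return chr(cp - 0xE0000)


tagged = "\U000E0001\U000E006A\U000E0061Tokyo"
visible = strip_language_tags(tagged)   # "Tokyo"
```

Because the ID characters live in plane 14 and not in the ASCII range, stripping them never touches real text.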
On Thu, Jan 10, 2002 at 06:06:49AM -0500, Glenn Maynard wrote:
> A technical description is at http://www.unicode.org/unicode/reports/tr27/#tag
> which, like all specs, makes it sound a bit more complicated than it
> really is.

Thank you Glenn. I have updated the proposal in light of your information. Excellent research. It now reads like this:

Character Set Encoding of Tags:
===============================

UTF-8 is the default encoding for tag data. Unfortunately Unicode muffed it for Asian languages by doing the equivalent of giving the same character codes to English, Russian, and Greek letters. So originally we were going to let people use RFC 2047 encoding, or a UTF8_LANG tag. Fortunately Unicode itself has an internal, standard solution to the problem:

http://www.unicode.org/unicode/reports/tr27/#tag

which basically says: mark the language of text with U+E0001 LANGUAGE TAG, followed by the RFC 3066 language ID (e.g. "ja"), with each lowercase ASCII character encoded as its code point plus 0xE0000. This is the only mechanism recognized by the standard. Programs which don't want to interpret such markup can simply ignore it; it is zero width. The scope of the language setting is until the end of the tag, or until a new language setting is encountered, whichever comes first.
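
The scope rule in the proposal can be sketched as follows. This is an illustrative Python reading of it, not code from any Vorbis tool, and `split_language_runs` is a hypothetical name. Each comment value is processed on its own, so a language setting can never leak from one tag into the next; inside a value, a new LANGUAGE TAG replaces the current setting. Treating U+E007F CANCEL TAG as "no language" is an assumption based on TR27, since the proposal text does not mention it.

```python
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F     # U+E007F CANCEL TAG
TAG_OFFSET = 0xE0000


def split_language_runs(value):
    """Split ONE comment value into (language_id, text) runs.

    language_id is None for text that no language setting applies to.
    """
    runs = []
    lang = None   # scope starts empty for every value
    buf = []
    i = 0
    while i < len(value):
        cp = ord(value[i])
        if cp == LANGUAGE_TAG or cp == CANCEL_TAG:
            # The previous setting ends here; close its run of text.
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            i += 1
            ident = []
            if cp == LANGUAGE_TAG:
                # Collect the language ID characters that follow the marker.
                while i < len(value) and 0xE0020 <= ord(value[i]) <= 0xE007E:
                    ident.append(chr(ord(value[i]) - TAG_OFFSET))
                    i += 1
            lang = "".join(ident) or None
        elif 0xE0000 <= cp < CANCEL_TAG:
            # Stray tag character with no preceding LANGUAGE TAG: skip it.
            i += 1
        else:
            buf.append(value[i])
            i += 1
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


JA = "\U000E0001\U000E006A\U000E0061"   # tag sequence for "ja"
EN = "\U000E0001\U000E0065\U000E006E"   # tag sequence for "en"

# Two separate comment values: the "ja" setting in the first one
# does not apply to the second.
comments = [JA + "東京" + EN + " (Tokyo)", "Live recording"]
parsed = [split_language_runs(v) for v in comments]
```

A display program could use the language of each run to pick a font, and fall back to its default font for runs whose language is None.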
On Thu, 10 Jan 2002, Glenn Maynard wrote:
> It comes down to this: mark the language of text with U+E0001 LANGUAGE
> TAG, followed by the RFC 3066 language ID (e.g. "ja") encoded in
> lowercase ASCII plus 0xE0000.
> There's really nothing needed in the spec, except to 1: recommend its
> use, and 2: define where language tags go out of scope (at the end of
> the tag--that is, language tags shouldn't scope between tags.)
> This also eliminates the major restrictions of UTF8_LANG; you can change
> language if you want, wherever you want.

This seems to be a better solution.

> We could even make vcomment automatically add some of these tags.

I would hope so. If the stock vorbis utils don't handle it, you could hardly blame others for not doing it either :-P

> If the local encoding is Shift-JIS, we're pretty safe adding the Japanese
> tag.

The local encoding for Japanese on Linux is EUC-JP (for Korean it's EUC-KR). I don't know about others.

-Dan

-- 
[-] Omae no subete no kichi wa ore no mono da. [-]
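
One way the "guess the language from the local encoding" idea might look, as a hedged Python sketch. The table holds only the three encodings mentioned in this thread; the function names and the table itself are assumptions for illustration, and vcomment is not known to contain anything like this. Anything not in the table yields None, meaning "add no tag".

```python
import codecs
import locale

# Illustrative table only: the encodings named in this thread.
ENCODING_TO_LANGUAGE = {
    "shift_jis": "ja",
    "euc_jp": "ja",
    "euc_kr": "ko",
}


def guess_language(encoding_name):
    """Return a language ID for a legacy encoding name, or None if unknown."""
    try:
        # codecs.lookup() normalizes spellings such as "Shift-JIS" or "EUC-JP".
        canonical = codecs.lookup(encoding_name).name
    except LookupError:
        return None
    return ENCODING_TO_LANGUAGE.get(canonical)


def default_language_for_locale():
    """Guess a language ID from the user's preferred local encoding."""
    return guess_language(locale.getpreferredencoding(False))
```

Returning None for UTF-8 and other multi-language encodings keeps the guess conservative, which matches the suggestion that automatic tagging should be optional.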