John Alexander Thacker
2005-Nov-21 08:50 UTC
[Fontconfig] Font matching in Unicode locales
On Sat, Oct 25, 2003 at 09:35:24AM -0400, Owen Taylor wrote:
> Well, actually, you are almost certainly going to have to tell the
> rendering system that you are rendering Japanese to get good results
> when you render Japanese. There is no reliable automatic way to tell
> that text is in Japanese rather than Chinese or Korean.

Sure, that's true, although it still should be possible to set a preferred CJK font for rendering Han characters, IMO.

> It's also possible to put magic in your fonts.conf to replace 'en'
> language tags with 'en,ja'. That would be an immediate workaround
> for your problem, though I don't consider it a solution.

Wouldn't the right workaround be to replace "ja" with "ja,en" instead? I want to make the Japanese fonts available under an English Unicode locale, right? Now, in the current Fedora Core release, you made a workaround since language matching doesn't work correctly, so I guess I have to match on the font names explicitly. I thought the correct syntax should look like:

    <match target="font">
      <test name="family" compare="eq">
        <string>Kochi Mincho</string>
      </test>
      <edit name="lang" mode="append">
        <string>en</string>
      </edit>
    </match>

But this doesn't seem to have any effect. Nor does putting a comma before "en" work. Using "assign" instead of "append" and trying to assign "ja,en" doesn't work either. It seems that no matter what I do, Kochi Mincho keeps the same list of supported languages, according to fc-list. Perhaps I have the syntax wrong?

Maybe I should try 2.2.90 or the CVS version of fontconfig?

John Thacker
On Sat, 2003-10-25 at 01:09, John Alexander Thacker wrote:
> I have a real problem with fontconfig-2.2.1 selecting the wrong fonts for
> the Sans alias when I'm in a Unicode locale, specifically en_US.UTF-8. It
> seems that fontconfig prioritizes the current language setting too much,
> which is a real problem when the intended glyphs come from a language other
> than English.
>
> Specifically, I occasionally view Japanese Unicode files, though I want
> my UI to remain English, and hence keep my locale as en_US.UTF-8. I'd
> like the Sans alias to use the preferred Kochi Gothic for viewing Japanese
> glyphs, and similarly to use Kochi Mincho for Serif. In fact, these are
> listed as preferred by default in fonts.conf, if I understand the file.
>
> However, neither Kochi Gothic nor Kochi Mincho list "en" as a supported
> language. This means that fontconfig will strongly prefer any other font
> which has those glyphs to Kochi Gothic or Kochi Mincho. On a RedHat 9
> installation, this will mean preferring to use the far inferior hiragana
> and katakana contained in the MiscFixed bitmap font to that in Kochi Gothic
> or Kochi Mincho. (MiscFixed is installed by default as part of the
> bitmap-fonts package, and placed in /usr/share/fonts/bitmap-fonts.)
> Mozilla, of course, will completely refuse to use Kochi Gothic or Kochi
> Mincho for Unicode at all, since, as previously discussed, it is even
> more restrictive based on the language supported by a font.

Yes, this is a known difficulty. I've discussed it with Keith at length several times, but we've never come up with a satisfactory solution.

I have a pretty good workaround that will be in Pango-1.4. How that will work is that Pango computes the script for each run of text (Latin, Arabic, Han, Hiragana, etc.) If that script isn't one of the scripts used for the language tag, the language tag is removed and replaced either with an appropriate language tag for the script (Arabic => ar, Greek => el, etc.), or if, as for Han characters, there is no good default language tag, with ??.

This approach could be copied by other apps, though they'd need an equivalent of the codepoint => script and script <=> language code that Pango has.

[...]

> It seems to me that the promise of Unicode and fontconfig would be that
> I could just set all my applications to use "Sans" (or "Monospace" or
> "Serif") by default, and that would always select the proper font. I
> shouldn't have to change my locale or my font just to view something in
> Japanese-- that breaks the whole point of Unicode.

Well, actually, you are almost certainly going to have to tell the rendering system that you are rendering Japanese to get good results when you render Japanese. There is no reliable automatic way to tell that text is in Japanese rather than Chinese or Korean.

Generally, this has to occur at the application level. For Pango-1.4 I may add some hooks so that you can configure things so that Han characters will get, say, a ja language tag when no other information is available instead of ??, but that is going to be an expert thing.

It's also possible to put magic in your fonts.conf to replace 'en' language tags with 'en,ja'. That would be an immediate workaround for your problem, though I don't consider it a solution.

Regards,
Owen
John Alexander Thacker
2005-Nov-21 08:50 UTC
[Fontconfig] Font matching in Unicode locales
This mailing list is being very slow for me, and I'm seeing the responses in the online archives long before I'm receiving the emails. So I'll respond now anyway.

First, a problem I noticed when upgrading: confdir.sgml.in isn't getting added to the list of files to include in the tarball created by doing "make dist". This causes compilation problems when using said tarball. I don't know my way around automake that well, but it seems to me that this solves the problem:

    --- fontconfig/configure.in     2003-06-26 04:19:10.000000000 -0400
    +++ fontconfig-work/configure.in        2003-10-25 15:08:33.000000000 -0400
    @@ -390,6 +390,7 @@
     fc-list/Makefile
     fc-match/Makefile
     doc/Makefile
    +doc/confdir.sgml
     doc/version.sgml
     test/Makefile
     fontconfig.spec

Thanks for the responses, and I think I have a better idea of how this all works.

Keith Packard's suggestion of adding this to local.conf:

    <!-- set desired language if unset -->
    <match target="pattern">
      <test compare="eq" name="lang" qual="all" >
        <string>unset</string>
      </test>
      <edit name="lang" >
        <string>en</string>
        <string>ja</string>
      </edit>
    </match>

works perfectly as expected when my locale is unset. If I do have LANG set to en_US.UTF-8 there's no change, of course, since fontconfig looks for fonts supporting "en," or possibly "en_US."

I haven't tried Keith's other suggestion, and I haven't been able to get Owen Taylor's suggestion to change the behavior at all. If I upgrade to CVS or 2.2.90 and change Owen's suggestion from

    <test compare="eq" name="lang"><string>en</string></test> ...

to

    <test compare="contains" name="lang"><string>en</string></test> ...

then applications (gucharmap, gedit, etc.) start preferring to use Kochi Gothic and Kochi Mincho for Roman characters as well, over all the English fonts that appear first in "fc-match --sort." I don't really understand why, though.

I'll look into the strong binding suggestion.

John Thacker
On Sat, 2003-10-25 at 11:36, John Alexander Thacker wrote:
> On Sat, Oct 25, 2003 at 09:35:24AM -0400, Owen Taylor wrote:
> > Well, actually, you are almost certainly going to have to tell the
> > rendering system that you are rendering Japanese to get good results
> > when you render Japanese. There is no reliable automatic way to tell
> > that text is in Japanese rather than Chinese or Korean.
>
> Sure, that's true, although it still should be possible to set a
> preferred CJK font for rendering Han characters, IMO.

Yes, with the Pango-1.4 scheme, that will work as expected - you'll get a language tag of ?? so the ordering in your Sans alias will be respected.

> > It's also possible to put magic in your fonts.conf to replace 'en'
> > language tags with 'en,ja'. That would be an immediate workaround
> > for your problem, though I don't consider it a solution.
>
> Wouldn't the right workaround be to replace "ja" with "ja,en" instead?
> I want to make the Japanese fonts available under an English Unicode
> locale, right? Now, in the current Fedora Core release, you made a
> workaround since language matching doesn't work correctly, so I guess
> I have to match on the font names explicitly. I thought the correct
> syntax should look like:
>
> <match target="font">
> <test name="family" compare="eq">
> <string>Kochi Mincho</string>
> </test>
> <edit name="lang" mode="append">
> <string>en</string>
> </edit>
> </match>
>
> But this doesn't seem to have any effect. Nor does putting a comma
> before "en" work. Using "assign" instead of "append" and trying to
> assign "ja,en" doesn't work either. It seems that no matter what I
> do, Kochi Mincho keeps the same list of supported languages, according
> to fc-list. Perhaps I have the syntax wrong?

<match target="font"> applies when fontconfig has found the final font, not to the patterns that are matched against. Basically, how it works is:

    Apply <match target="pattern"> rules to the pattern
    Match against all fonts on the system
    Apply <match target="font"> rules to the result

In the original that you copied, what it is doing is saying "When you find Kochi Mincho, don't prefer embedded bitmaps".

What you want to say is "When someone is looking for English fonts, Japanese fonts are OK too". Something like:

    <match target="pattern">
      <test name="lang" compare="eq">
        <string>en</string>
      </test>
      <edit name="lang" mode="append">
        <string>ja</string>
      </edit>
    </match>

Regards,
Owen
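
For readers who want to try the pattern rule quoted above, here is a minimal sketch of how it might sit in a complete configuration file. The surrounding boilerplate (XML declaration, DOCTYPE, fontconfig root element) is the standard wrapper for fontconfig files; where the file lives (a per-user file or local.conf) is an assumption for illustration, and nothing here has been tested against fontconfig 2.2.x.

```xml
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <!-- Stage 1 of the three stages listed above: the pattern is edited
       before it is matched against the installed fonts. -->
  <match target="pattern">
    <!-- Applies when the requested language is "en". -->
    <test name="lang" compare="eq">
      <string>en</string>
    </test>
    <!-- Append "ja" so that fonts covering Japanese also satisfy
         the language part of the request. -->
    <edit name="lang" mode="append">
      <string>ja</string>
    </edit>
  </match>
</fontconfig>
```

The rule targets the pattern rather than the font because, as described above, font-target rules only run after matching has already chosen a font. Comparing the output of "fc-match --sort" before and after the change is one way to see whether the rule takes effect.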
John Alexander Thacker
2005-Nov-21 08:50 UTC
[Fontconfig] Font matching in Unicode locales
On Sat, Oct 25, 2003 at 09:35:24AM -0400, Owen Taylor wrote:
> It's also possible to put magic in your fonts.conf to replace 'en'
> language tags with 'en,ja'. That would be an immediate workaround
> for your problem, though I don't consider it a solution.

I don't think that this kind of magic is possible at all, but I could be wrong. I've been running fc-cache with debugging turned on, and looking through the fontconfig sources. It seems like fontconfig won't let the settings in fonts.conf override the languages that it determines that a font supports.

John Thacker
Around 11 o'clock on Oct 25, John Alexander Thacker wrote:
> Wouldn't the right workaround be to replace "ja" with "ja,en" instead?
> I want to make the Japanese fonts available under an English Unicode
> locale, right?

One of the design limitations of fontconfig is that font information is immutable -- there's no way to adjust the information about fonts found in your system. That makes the cached font information independent of the configuration, allowing it to be shared among multiple users or even across multiple machines with completely separate font configurations.

So, one of the things you cannot do is force fontconfig to believe that Kochi Mincho supports English; it's missing several characters which occur occasionally in English text.

What you *can* do is adjust the language tags that fontconfig will use to perform matching, and you can create substitution rules that make certain fonts take precedence over the language matching rules.

Recall that in fontconfig, matching is done based on values provided by the application as edited by the configuration. There is a fixed ordered list of values used; mis-matching in earlier entries dominates matching in later entries. The relevant part of the matching list for this discussion is:

    "strong" family names
    language tags
    "weak" family names

Each value in the system has either a "strong" binding or a "weak" binding; values provided by the application are (by default) strong while values provided by a substitution rule are (by default) weak. So, any aliases in your configuration file add family names which will *always* have less influence on the overall match than the language tag values.

This mechanism was designed (or at least putatively designed) to provide reasonable default fonts based on the language present in the document for cases when a specified font name is either not provided or unavailable. When Mozilla renders HTML pages with language tags, the selected fonts will always match the language of the text, unless that text also has a specific font tag which will override the language tag.

As a (semi) kludge, this system was amended to assume that font requests without a specific language should probably select fonts suitable for the current locale. That's where you're finding adventure today.

You can "solve" your problem in two possible ways; either whack your default language to include Japanese or create some strong bindings to force font matching to prefer those fonts directly.

I added configuration entries to default the lang value to include both en and ja:

    <!-- set desired language if unset -->
    <match target="pattern">
      <test compare="eq" name="lang" qual="all" >
        <string>unset</string>
      </test>
      <edit name="lang" >
        <string>en</string>
        <string>ja</string>
      </edit>
    </match>

This makes all unspecified Han usage reference a Japanese font instead of a Chinese or Korean one. The precise font used now depends on the generic aliases used.

The strong binding alternative would be to do something like:

    <match target="pattern">
      <test name="family">
        <string>sans-serif</string>
      </test>
      <edit binding="strong" mode="prepend">
        <string>Kochi Mincho</string>
      </edit>
    </match>

(this is untested and may have errors)

This creates a "strong" alias for sans-serif that will override language concerns.

Please feel free to try and come up with an alternative matching scheme; the current mechanism is not sacred in any way, it's just the result of a couple of years of iterative design and obviously is still imperfect in many ways.

-keith
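
A possible cleaned-up form of the strong-binding rule is sketched below. It is only an illustration: the edit element in the message above carries no name attribute, and as far as I know fontconfig expects one, so name="family" is added here as an assumption. Kochi Gothic is swapped in for Kochi Mincho because the thread describes Gothic as the preferred font for Sans; that substitution is also an assumption, not something stated in the message.

```xml
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <match target="pattern">
    <!-- Applies when any requested family is the generic sans-serif. -->
    <test name="family" qual="any">
      <string>sans-serif</string>
    </test>
    <!-- Prepend a family with a strong binding, so it is ranked with
         application-supplied families, ahead of the language tags. -->
    <edit name="family" mode="prepend" binding="strong">
      <string>Kochi Gothic</string>
    </edit>
  </match>
</fontconfig>
```

Because a strong family outranks the language tags in the ordered list given above, a rule like this would presumably also make the Japanese font win for Latin text whenever it has the glyphs, which may or may not be what is wanted.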
On Sat, Oct 25, 2003 at 09:06:32AM -0700, Keith Packard wrote:
> So, one of the things you cannot do is force fontconfig to
> believe that Kochi Mincho supports English; it's missing
> several characters which occur occasionally in English text.

If I could ask the question again, would this mean that "Big5" would be considered "supported" only when all the 13051 characters are supported?

Unfortunately, in the case of "Big5", there are fonts (esp. some professional fonts for special effects) with support of only the 5401 "frequently used characters". I understand that a similar situation exists also for Japanese; i.e., there are professional fonts where Japanese should be considered supported, when the font only supports a small subset of the JIS character set.

--
Ambrose LI Cheuk-Wing <a.c.li@ieee.org>
http://ada.dhs.org/~acli/
This problem also plagues Chinese speakers (e.g., specifying Sans or in fact any sans font would make the application use kanji from perhaps two different fonts, with the wrong prioritization), but mentioning Mozilla is probably not fair for fontconfig, since Mozilla seems to be doing its own thing (e.g., it seems to always prefer Korean fonts by default for some strange reason -- something that does not seem to be happening with other apps --, even though using Korean punctuation marks would be usually wrong for both Chinese and Japanese).

On Sat, Oct 25, 2003 at 01:09:50AM -0400, John Alexander Thacker wrote:
> I have a real problem with fontconfig-2.2.1 selecting
> the wrong fonts for the Sans alias when I'm in a Unicode
> locale, specifically en_US.UTF-8. It seems that fontconfig
> prioritizes the current language setting too much, which is
> a real problem when the intended glyphs come from a language
> other than English.
[...]

--
Ambrose LI Cheuk-Wing <a.c.li@ieee.org>
http://ada.dhs.org/~acli/
John Alexander Thacker
2005-Nov-21 08:50 UTC
[Fontconfig] Font matching in Unicode locales
I have a real problem with fontconfig-2.2.1 selecting the wrong fonts for the Sans alias when I'm in a Unicode locale, specifically en_US.UTF-8. It seems that fontconfig prioritizes the current language setting too much, which is a real problem when the intended glyphs come from a language other than English.

Specifically, I occasionally view Japanese Unicode files, though I want my UI to remain English, and hence keep my locale as en_US.UTF-8. I'd like the Sans alias to use the preferred Kochi Gothic for viewing Japanese glyphs, and similarly to use Kochi Mincho for Serif. In fact, these are listed as preferred by default in fonts.conf, if I understand the file.

However, neither Kochi Gothic nor Kochi Mincho list "en" as a supported language. This means that fontconfig will strongly prefer any other font which has those glyphs to Kochi Gothic or Kochi Mincho. On a RedHat 9 installation, this will mean preferring to use the far inferior hiragana and katakana contained in the MiscFixed bitmap font to that in Kochi Gothic or Kochi Mincho. (MiscFixed is installed by default as part of the bitmap-fonts package, and placed in /usr/share/fonts/bitmap-fonts.) Mozilla, of course, will completely refuse to use Kochi Gothic or Kochi Mincho for Unicode at all, since, as previously discussed, it is even more restrictive based on the language supported by a font.

The end result of this is fairly ridiculous. Despite my locale being en_US, when I'm in Unicode, I want to view Japanese characters with a Japanese font, not a font which is primarily an English font which happens to have a few badly drawn Japanese characters! Ideally, I shouldn't have to uninstall or unconfigure all fonts supporting English which also have badly drawn characters from other languages. And fonts which only contain characters from other languages, but contain no Roman characters or are otherwise unsuitable for English, shouldn't have to list "en" (and every other language) as a supported language just to work right with Unicode locales.

It seems to me that the promise of Unicode and fontconfig would be that I could just set all my applications to use "Sans" (or "Monospace" or "Serif") by default, and that would always select the proper font. I shouldn't have to change my locale or my font just to view something in Japanese-- that breaks the whole point of Unicode.

Perhaps this is different in 2.2.90 or CVS, but it doesn't look like it to me from the Changelog.

John Thacker
Around 10 o'clock on Oct 26, Ambrose Li wrote:
> Firstly, from the viewpoint of a (traditional) Chinese speaker,
> "supporting Big5" is indistinguishable from "supporting
> traditional Chinese"

That was the assumption I made in using the Han coverage from Big5 as the basis for the zh-tw orthography; similarly, I used GB18030 for zh-cn coverage and JIS for ja coverage. As you say though, there are an awful lot of glyphs in that encoding which are so rarely used that a font without them could still be usefully used for essentially all documents.

> And not all traditional Chinese fonts support the bulk of Big5; decorative
> fonts may support only the "frequently used characters", i.e., the first
> 5401/13051 = 41% of the original Big5 space, a lot of these aren't even Han
> characters; I believe this would make fontconfig conclude that such fonts
> do not support Chinese.

Insomuch as they can't really be used to correctly display nearly all documents, this statement is true. Much as many decorative Latin fonts include only upper case, you wouldn't want one of them picked to present messages to the user.

> This is probably ultimately more philosophical than practical,
> unless an app plainly refuses to work with a font when
> fontconfig tells it that a language is not supported. Mozilla
> used to be such an app, but it seems to have somewhat improved
> in this regard.

The current design of fontconfig places the application (and indirectly the user) requested family name at the highest priority -- if an application requests 'Bitstream Vera Serif' while running in a zh-tw locale, it will get 'Bitstream Vera Serif' and not have its choice overridden to a font which supports zh-tw.

Mozilla may still limit the fall-back font choices in its dialogs to those which do support the specified language, but it will not override document specified families.

-keith
Around 23 o'clock on Oct 25, Ambrose Li wrote:
> If I could ask the question again, would this mean that
> "Big5" would be considered "supported" only when all the
> 13051 characters are supported?

That's a trick question -- 'Big5' is not a language, but a text encoding. Fontconfig doesn't concern itself with supporting encodings, but only in supporting 'orthographies' for individual languages, those Unicode values necessary to represent the bulk of words in the standard character set for the language as used in a particular territory.

I gratefully accept authoritative changes to the orthographies that fontconfig uses in computing language support for each font; the ones that I have were gathered from a wide variety of sources.

For European scripts, I was able to rely on the fine work of Michael Everson (http://www.evertype.com). For other Latin and Cyrillic scripts, I found the Institute of the Estonian language (http://www.eki.ee) very useful. For other languages, I scrounged around the net. I recall spending a day or so looking for an authoritative reference for the orthography of Luxembourgish, which is related to German but had no official written representation until sometime after WWII.

The Unicode standard provided quite a bit of help with languages using unique scripts, although the coverage for those languages is often far more comprehensive than used with any kind of regularity (or provided in fonts, for some). Again, local expertise is the best information; unfortunately, fluency in a language does not equate to expertise in the character set.

For the Han languages, I relied heavily on the tables which transcode between Unicode and standard local encodings. I know those are probably way too inclusive, but I don't have a better source at the current time, and as the goal is to identify fonts supporting a particular language, it turns out to work relatively well -- most fonts for zh-tw started as Big5 fonts and generally cover all of the Han glyphs in that encoding quite well.

I'd really like to get better references for these orthographies; perhaps someone on this list can point me at national standards for each language that list exactly the characters considered "required" for representing each one.

-keith
Hi,

sorry that my question was poorly worded (and thus perceived as a "trick question").

Firstly, from the viewpoint of a (traditional) Chinese speaker, "supporting Big5" is indistinguishable from "supporting traditional Chinese" (part of the reason being, strictly speaking, "Big5" *not* supporting traditional Chinese [at least a small but significant number of characters used in real-life situations] anyway -- i.e., we are used to our own de facto standard encoding not supporting our own language for perhaps 15 years). I'd say that ordinary end users (i.e., people who do not know what an encoding is) are even more likely to not distinguish between the two.

And not all traditional Chinese fonts support the bulk of Big5; decorative fonts may support only the "frequently used characters", i.e., the first 5401/13051 = 41% of the original Big5 space, a lot of these aren't even Han characters; I believe this would make fontconfig conclude that such fonts do not support Chinese. I reason that if this is the case it would not make sense for the end user (since if he/she possesses such a font, he/she would likely be a graphic designer and know what he/she is doing.)

(Is this valid, or am I mistaken? Perhaps I'm mistaken. Or perhaps I'm too old fashioned and perhaps we don't find such fonts very often any more.)

This is probably ultimately more philosophical than practical, unless an app plainly refuses to work with a font when fontconfig tells it that a language is not supported. Mozilla used to be such an app, but it seems to have somewhat improved in this regard.

--
Ambrose LI Cheuk-Wing <a.c.li@ieee.org>
http://ada.dhs.org/~acli/
On Sun, Oct 26, 2003 at 07:42:16AM -0800, Keith Packard wrote:
> > And not all traditional Chinese fonts support the bulk of
> > Big5; decorative fonts may support only the "frequently
> > used characters", i.e., the first 5401/13051 = 41% of
> > the original Big5 space, a lot of these aren't even Han
> > characters; I believe this would make fontconfig conclude
> > that such fonts do not support Chinese.
>
> Insomuch as they can't really be used to correctly display
> nearly all documents, this statement is true. Much as many
> decorative Latin fonts include only upper case, you wouldn't
> want one of them picked to present messages to the user.

Theoretically this is correct and makes perfect sense.

However, I would reason that beyond a certain percentage of incomplete coverage (perhaps with the additional condition that a font must completely cover the first 5401 characters in Big5, and perhaps only if the FreeType backend is being used), a font should be considered "supports Chinese" even though it does not cover all 13051 glyphs in Big5.

Revisiting the "sans" family: Even in Microsoft Windows, there is *no* free (in whatever sense) sans font covering all of Big5 (unless you only use Windows 2000 or later, in which case you have "SimSun"). Instead, an augmented GB2312 font, "MS Hei", is used for the "sans" font for traditional Chinese. This is despite the fact that "MS Hei" cannot display "nearly all documents" encoded in Big5.

(On GNU/Linux systems, AR PL KaiTiM Big5 is often mapped as "sans". Typographically speaking, this is wrong, though in practice probably nothing better could be done; the "kai" family is a stylized calligraphic type, which when mapped to Western typographic terminology should strictly speaking be classed as "italic" [cursive forms deriving from stylized handwriting, not necessarily slanted per se].)

Or perhaps there should be a property specifying that a font "most likely will support" a given language. I don't know if I am making sense here.

--
Ambrose LI Cheuk-Wing <a.c.li@ieee.org>
http://ada.dhs.org/~acli/
On Sun, Oct 26, 2003 at 11:12:47AM -0500, Ambrose Li wrote:
>
> Revisiting the "sans" family: Even in Microsoft Windows, there
> is *no* free (in whatever sense) sans font covering all of Big5
> (unless you only use Windows 2000 or later, in which case you
> have "SimSun"). [...skipped...]

I was too quick to hit Send without checking my post carefully. Of course I meant "SimHei" instead of "SimSun" :-(

--
Ambrose LI Cheuk-Wing <a.c.li@ieee.org>
http://ada.dhs.org/~acli/