On Thu, Dec 09, 2004 at 12:02:10PM -0800, Keith Packard wrote:
>
> Around 9 o'clock on Dec 9, John Thacker wrote:
>
> > Therefore, 0406 and 0456 should be removed from or commented out of ru.orth
>
> Done.

Thanks

> I know in my own typographical history I've seen the slow increase in the
> set of glyphs considered "necessary" for proper English publication; novels
> which 30 years ago would have been published in essentially ASCII
> (originating as they did on typewriters) are now starting to include
> accented characters as a guide to both the etymology and pronunciation of
> words. Certainly common words like naïve or résumé are more accurately
> spelled with the appropriate accents than without.

I don't know that I'd say that they are "more accurately spelled," honestly. I grant that in the case of résumé the final accent is useful due to the pronunciation, and to distinguish it from resume. Would you argue that hôtel should still be spelled that way? It is "more accurate," in a historical sense. Ångström? coöperate? (The last for phonetic but not etymological reasons-- you will see it used in the New Yorker for example.) To get really silly, à propos? Cañon instead of canyon?

I agree that modern computers and technology make it easier to use accents, but I'm not completely convinced that their use is actually on the upswing. Most of the sources I've consulted claim that their use has been decreasing, though I suppose recent technology may have reversed that. In my experience, diacritical marks are used only in words perceived as foreign or of obvious foreign origin, and they tend to lose their accents.

That said, they certainly are used. OTOH, not every font will include them even when they intend to cover English. It's an annoying problem. Being rendered in a single face is nice-- but I deal with a lot of Japanese documents which contain Latin characters, and often mixed Japanese/English. My Japanese fonts mostly do not contain all the accented Latin characters; this causes problems because Pango wants to render English words with fonts which support "en." The result is that a single face is not used even though the preferred Japanese font contains all characters on the page. Or, if you're not using Pango, then applications tend to assume that your en_US.UTF-8 locale means you want fonts which support "en," and thus will render Japanese UTF-8 documents using ugly fonts like MiscFixed in preference to other Japanese fonts.

I know how to tweak the problem myself using weak family bindings, but there's a lot of discussion of it by Owen Taylor and others in a few places:

https://bugzilla.redhat.com/beta/show_bug.cgi?id=138783
https://bugzilla.redhat.com/beta/show_bug.cgi?id=107952
https://bugzilla.redhat.com/beta/show_bug.cgi?id=107617

> So, I guess it's my cultural bias to encourage people to spell words right
> rather than spelling them as if they were using an IBM selectric
> typewriter. It was not done randomly.

Most dictionaries I've seen contain the unaccented spelling only, or indicate it as the primary one with the accented as a variant, for most of those words, including naive/naïve and resume/résumé. (The latter actually has a common variant with only the final accent mark, resumé, found in most dictionaries as well.)

Of course, not only is the "certainly" disputed, but there are certainly those prescriptivists who argue that accents are not a part of English orthography at all, and belong only on foreign words. That said, many people do use them of course, and I'm a descriptivist. I may disagree with your "certainly" remark, but the usage may be enough to make it worth it. For me personally, however, it's a pain, since quite a lot of font writers do not consider accent marks to be essential for English support. (And really, they aren't essential at all.)

John Thacker
On Wed, Dec 08, 2004 at 12:27:46PM -0800, Keith Packard wrote:
> That's probably just that the ko.lang file has too many glyphs. We can fix
> this easily enough. Place all of your Korean fonts in a directory and run:
>
> $ cd <directory containing korean fonts>
> $ FC_DEBUG=384 fc-cache -f .

Ah, thanks, I always forget that command. Here's a couple of notes:

Currently Russian (ru) requires 0406 and 0456 (І and і), but these were eliminated in Russian in 1918 in favor of 0418 and 0438 (И and и), and don't even appear in KOI8-R. (The hypothesis that they don't appear in KOI8-R due to their similarity with Latin I and i is eliminated by their presence in KOI8-U.) I have a couple of fonts with Russian support that don't have the letter.

Therefore, 0406 and 0456 should be removed from or commented out of ru.orth

How necessary are the accented characters for English? I have a decent number of fonts designed to cover English but which lack the accented characters currently required (i.e., 00c0, 00c7-00cb, 00cf, 00d1, 00d4, 00d6, 00e0, 00e7-00eb, 00ef, 00f1, 00f4, 00f6). My understanding of English orthography is that they are optional.

(Similarly, we don't require the oe and especially the ae ligatures, despite their use in British spellings, and æ's presence in ISO 8859-1.)

John Thacker
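The two claims in the message above can be checked with a short script. This is a minimal sketch, not fontconfig code: `encodable` and `parse_codepoints` are hypothetical helper names, and the check relies on Python's standard `koi8_r` and `koi8_u` codecs as a stand-in for the character sets themselves.

```python
def encodable(ch, codec):
    """Return True if the single character `ch` exists in `codec`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def parse_codepoints(spec):
    """Expand a list like '00c0, 00c7-00cb' into a set of integers."""
    result = set()
    for item in spec.split(","):
        first, _, last = item.strip().partition("-")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


# The accented characters currently required for English, as listed above.
EN_ACCENTED = parse_codepoints(
    "00c0, 00c7-00cb, 00cf, 00d1, 00d4, 00d6, "
    "00e0, 00e7-00eb, 00ef, 00f1, 00f4, 00f6"
)

# Compare the pre-1918 letters (0406, 0456) with their replacements.
for cp in (0x0406, 0x0456, 0x0418, 0x0438):
    ch = chr(cp)
    print(f"U+{cp:04X}",
          "KOI8-R:", encodable(ch, "koi8_r"),
          "KOI8-U:", encodable(ch, "koi8_u"))

print(len(EN_ACCENTED), "accented codepoints currently required for English")
```

The range parser is only a convenience for reading lists written the way the message writes them; it is not the .orth file parser.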
hi

> Thanks muchly for the additional data. The list of alternate common (but
> non-free) family names is really nice to have.

the list is far from complete, but at least a starting point.

> If this stuff isn't working for you, we've just got bugs in fontconfig
> that need fixing.

this approach as-is has two problems for asian fonts. one: if you forget to select language you tend to end up with times or verdana. two: even if you do set the language, you may end up with slightly less-than-preferred fonts. for example selecting lang=zh-tw you may end up with arial unicode or a simplified chinese font...

asian fonts in the same family are more substitutable than latin fonts, since they are mostly mono-spaced. they are somewhere in between the 'times/times new roman' and 'serif' categorisation.

stratifying this into two layers would be a solution. how about splitting the categories into separate scripts and chaining them together at the end. this way if you ask for a typically cyrillic font and one doesn't find an 'exact' substitute you'll go through the list of cyrillic generic fonts first before trying arial unicode or falling back to verdana?

how about a font config that does the following:

first, exact aliases:

 * alias to truetype substitutes -- Times to Times New Roman
 * alias to ghostscript urw substitutes -- Times to Nimbus Roman No9 L

then, classification of known fonts:

 * classify known fonts into generic+script -- Kochi Gothic to 'sans-serif+japan'

some cleaning of the pattern, tie together generic families with priority:

 * if asked for generic and language tag is set, change to generic+script
 * after the 'generic+script' append a pure 'generic' to catch unicode fonts
 * if not classified and language tag is set, add sans-serif+script
 * if not classified yet, add 'sans-serif'
 * after pure 'generic' add all 'generic+script' versions
   to catch the case of wanting all 'generic' fonts

and finally...

 * font substitution 'generic+script' to preferred -- serif+korea to Baekmuk Batang
 * font substitution 'generic' to preferred unicode font

example:

 Batang
 Batang, serif+korea
 Batang, serif+korea, serif
 Batang, serif+korea, serif, serif+latin,serif+chinasimp,....
 Batang, Baekmuk Batang, serif+korea, serif, ...latin fonts..., serif+latin, ...chinasimp fonts......

does that sound reasonable?

attached is a subs.conf that should be included instead of the current alias/substitute stuff in the default fonts.conf.

> > i understand that due to the incapability of freetype to use CMaps to
> > encode CID fonts, the ability to use CID-fonts with fontconfig is severely
> > limited. however, it would be really really nice if fontconfig were extended
> > in this area.
>
> I don't have a lot of experience with Type1 CID fonts as I've tried to
> stick to TrueType which supports Unicode so much more nicely.
>
> If you know of code or even documentation which clearly shows how to get
> from Unicode to CID stuff, it would be greatly appreciated.

 Adobe-CNS1-UCS2
 Adobe-GB1-UCS2
 Adobe-Japan1-UCS2
 Adobe-Korea1-UCS2

the above CMaps map from CID to unicode. inferring the reverse mapping should be trivial. if you need code, i have a cmap data structure and parser that are part of my pdf project that should be easy to extract, or you can just generate static mapping tables with a python script. i'd be happy to code one up if you need it.

> If I can get from Unicode to CID, I can generate FC_LANG tags, but I'm not
> sure what other information belongs in fontconfig itself; remember that
> fontconfig is designed to provide information needed to select among
> fonts, not all of the information needed once you have a font in hand.

now how about a char *FcFindCMap(char *name) function?

if i find the CID-font with fontconfig, it is useless without the CMaps. CMaps are basically a size-optimisation, instead of adding encoding tables to all of the fonts like you do with truetype.

tor

-------------- next part --------------
A non-text attachment was scrubbed...
Name: subs.conf
Type: application/octet-stream
Size: 13539 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/fontconfig/attachments/20041209/9e3cd194/subs.obj
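The chaining proposed in the message above can be sketched in a few lines. This is only an illustration of the proposal, not the contents of the attached subs.conf and not how fontconfig evaluates its configuration: the tables and the `build_chain` function are hypothetical, and the table entries are limited to names that appear in the message.

```python
GENERICS = {"serif", "sans-serif", "monospace"}
SCRIPTS = ["latin", "korea", "japan", "chinasimp"]

# Step 1: exact aliases (intentionally similar faces).
EXACT_ALIASES = {"Times": ["Times New Roman", "Nimbus Roman No9 L"]}

# Step 2: classification of known fonts into (generic, script).
CLASSIFIED = {
    "Kochi Gothic": ("sans-serif", "japan"),
    "Batang": ("serif", "korea"),
}

# Final step: preferred faces for a 'generic+script' or plain 'generic' name.
PREFERRED = {"serif+korea": ["Baekmuk Batang"]}


def build_chain(family, script=None):
    """Return the ordered list of names to try for a requested family."""
    names = []
    if family in GENERICS:
        generic = family
    else:
        names.append(family)
        names.extend(EXACT_ALIASES.get(family, []))
        if family in CLASSIFIED:
            generic, script = CLASSIFIED[family]
        else:
            generic = "sans-serif"  # unclassified fonts fall back to sans-serif
    if script:
        names.append(f"{generic}+{script}")
    names.append(generic)  # pure generic, to catch unicode fonts
    names.extend(f"{generic}+{s}" for s in SCRIPTS if s != script)

    # Substitute preferred faces in front of each generic name.
    chain = []
    for name in names:
        chain.extend(PREFERRED.get(name, []))
        chain.append(name)
    return chain


print(", ".join(build_chain("Batang")))
print(", ".join(build_chain("serif", script="korea")))
```

With these toy tables, a request for Batang tries Baekmuk Batang and the Korean serif group before any other script, which is the ordering the example in the message walks through.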
On Sun, Dec 12, 2004 at 08:39:20PM -0800, Keith Packard wrote:
>
> Around 10 o'clock on Dec 13, Tor Andersson wrote:
>
> > 8822 hangul syllables in unicode are missing from KS X 1001, and most
> > of the fonts miss them too. about 5000 chinese characters are also in
> > the old ko.orth that are missing from a lot of korean fonts.
>
> It almost seems like we should preserve the KSC 5601-1992 orthography
> somehow; is there a territorial difference where the Han glyphs are used
> in one area and not in another? Or is it that Han is just slowly leaving
> the "normal" Korean language and that fontconfig should respect that
> change?

Well, in the DPRK Han glyphs (Hanja) haven't been used at all officially since 1949. In the ROK their use has definitely been steadily decreasing as well. I've heard that learning all the standard set is no longer mandatory in high schools, for example. There are occasions where they are used still, though.

According to most things I've seen (e.g., http://jshin.net/faq/qa8.html ) KS C 5601-1992 is the same as KS X 1001:1992, there was merely a change in format and number scheme made in August 1997. The "extra" 8822 precomposed hangul syllables are NOT an official part of the standard, but there is a provision on how to represent them in the document. The later KS C 5700 / KS X 1005-1 does include them, however. Yet, since so many Korean fonts are based on KS X 1001 / KS C 5601, it's reasonable to follow it and not include the extra precomposed syllables, and restrict to only the Hangul syllables in the list Tor provided.

However, the 4882 Hanja are definitely part of the standard no matter what. I'm somewhat surprised that so many fonts would not have them at all, but it certainly must be easier. (Just as it's easier to not include accented characters in fonts designed for English.)

John Thacker
Owen Taylor
2005-Nov-21 08:51 UTC
[Fontconfig] Overly aggresive English orthography? [was asian font configuration]
On Mon, 2004-12-13 at 10:56 -0800, Keith Packard wrote:
> Around 2 o'clock on Dec 13, John Thacker wrote:
>
> > However, the 4882 Hanja are definitely part of the standard no matter
> > what. I'm somewhat surprised that so many fonts would not have them at all,
> > but it certainly must be easier.
>
> That's what I thought; we have a 'standard orthography' for Korean as used
> in the Republic of Korea which includes a large set of glyphs which are
> no-longer in common use in the Republic of Korea.
>
> There is also a standard from the Democratic People's Republic of Korea
> (KPS 9566-97) which includes 4653 Korean Hanja characters.
>
> http://www.itscj.ipsj.or.jp/ISO-IR/202.pdf
>
> The guiding principle for fontconfig's orthography construction is to
> select fonts capable of displaying the preponderance of documents in the
> given language. The English orthography, as an example, includes uncommonly
> used accented letters like ? and ? as they appear in many documents,
> although many people accept and use alternate spellings without them.

If you aren't using Pango-style language tag refinement, this causes some bad problems, see, e.g.:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107952

I don't actually see much value in such an extensive orthography for English ... if fontconfig is hunting through all the fonts on a system for a font to display English, the chances of it coming up with a nice one are pretty minimal, just on a numerical basis.

Regards,
Owen
On Mon, Dec 13, 2004 at 12:22:21PM -0800, Keith Packard wrote:
>
> Around 14 o'clock on Dec 13, Owen Taylor wrote:
>
> > If you aren't using Pango-style language tag refinement, this causes
> > some bad problems, see, e.g.:
> >
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107952
>
> Well, that's just a symptom of a more serious issue -- that selection of
> glyphs beyond the orthography of the current locale is driven by font
> suitability for the locale.

That is part of it. The problem is that there are several possible goals and situations. There are two different approaches. One is to render an entire document with a single font when possible. Another approach is to render different orthographies in a single document with the "best" font for each orthography/language.

There are cases where the first approach looks better (a combination of non-accented Latin characters and Japanese, all contained in one Japanese font), and cases where it looks worse (a particularly varied combination of glyphs that are only all contained by an ugly "fallback" font of last resort, like MiscFixed).

It's pretty easy to come up with other examples, too: imagine a document mostly written in English that contains a few words of foreign origin using accents outside the list currently provided in en.orth. Many people would like to view the entire document in a single font, rather than switching fonts just for those words, especially if the extra glyphs are related to the Latin alphabet.

E.g., if I'm reading something in English which suddenly quotes Dutch and uses ? or references a Welsh placename and uses ?, then maybe I just want the whole document to use Verdana, which contains both, rather than using Luxi Sans for everything except the words containing those two letters, even though Luxi Sans is normally my first choice for English.

Another big problem is that the Unicode standard treats fullwidth (doublewide) Latin characters as part of the Latin alphabet, and as inappropriate for Japanese, Chinese, etc. This is despite their use being almost completely limited to Far Eastern languages in my experience. For that reason, Pango doesn't allow 'ja' language tags on the fullwidth Latin characters, AIUI.

In any case, Japanese fairly frequently contains unaccented Latin characters, and the fonts reflect that. However, fontconfig and Pango conspire to prevent Japanese fonts from ever using their Latin characters.

John Thacker
Around 1 o'clock on Dec 9, Tor Andersson wrote:

> korean language detection is "broken". only two of all the korean fonts i
> have on my system are correctly identified as being korean:

That's probably just that the ko.lang file has too many glyphs. We can fix this easily enough. Place all of your Korean fonts in a directory and run:

    $ cd <directory containing korean fonts>
    $ FC_DEBUG=384 fc-cache -f .

If there are only a few codepoints in ko.lang which aren't included in your Korean fonts, that will list them as in:

    ch(6) { 00c2 00d1 00dc 00e2 00f1 00fc }

This says the font in question (Watanabe) is missing six glyphs required to support Chamorro. If there are more than 10 missing glyphs, it won't list the individual glyphs; we'd have to change the library to display them all.

When you get the list of glyphs, you can remove them from ko.lang, rebuild the library and see if fontconfig now recognises the fonts correctly.

If the font advertises support for a single Han language, and that language is not Korean, then the language coverage checking code won't even consider Korean when checking for language coverage. You can distinguish this case by the absence of any 'ko(xx)' debug output in the above list. That would be more worrisome; I haven't yet found any fonts which mis-mark their Han language support.

If you can (legally) send the fonts in question along, I'd be happy to do this work.

> the default configuration for fontconfig is a bit on the scarce side
> regarding font aliases and substitutions for asian fonts.

Thanks muchly for the additional data. The list of alternate common (but non-free) family names is really nice to have. I think I'd like to try them in a different part of the configuration and see if they work for you.

For example, as 'Kochi Gothic' is in the mapping from 'sans-serif' to a set of family names, it should suffice to place the other gothic Japanese fonts in the mapping from family names to 'sans-serif'. From there, it will be mapped into the list of acceptable 'sans-serif' faces, including 'Kochi Gothic'.

Also, you shouldn't need to map specific languages to specific family names; the language identification in fontconfig should suffice to select among the families listed for the generic aliases. Just placing the family names in the generic 'serif', 'sans-serif', and 'monospace' aliases should select them when the application specifies the corresponding language.

For both of these, the guiding principle is that we do font substitution in two ways:

1) For intentionally similar faces (Helvetica/Arial, Times/Times New Roman), we have specific aliases mapping those names:

    <alias>
        <family>Times</family>
        <accept><family>Times New Roman</family></accept>
    </alias>
    <alias>
        <family>Helvetica</family>
        <accept><family>Arial</family></accept>
    </alias>

2) To set the preferred faces to use when an intentionally similar face is not available, we map first from the given family to one of the generic names:

    <alias>
        ...
        <family>Times</family>
        ...
        <default><family>serif</family></default>
    </alias>

   Then we map from the generic name to a list of suitable fonts:

    <alias>
        <family>serif</family>
        <prefer>
            ...
            <family>FreeSerif</family>
            ...
        </prefer>
    </alias>

If this stuff isn't working for you, we've just got bugs in fontconfig that need fixing.

> i understand that due to the incapability of freetype to use CMaps to
> encode CID fonts, the ability to use CID-fonts with fontconfig is severely
> limited. however, it would be really really nice if fontconfig were extended
> in this area.

I don't have a lot of experience with Type1 CID fonts as I've tried to stick to TrueType which supports Unicode so much more nicely.

If you know of code or even documentation which clearly shows how to get from Unicode to CID stuff, it would be greatly appreciated.

> i think that fontconfig should look at the registry-ordering in the
> CID System Info dict and put it to good use. add a property FC_CSI and
> put in a corresponding FC_LANG tag for CID-fonts.

If I can get from Unicode to CID, I can generate FC_LANG tags, but I'm not sure what other information belongs in fontconfig itself; remember that fontconfig is designed to provide information needed to select among fonts, not all of the information needed once you have a font in hand.

-keith
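The coverage report described in the message above amounts to a set difference between the codepoints an orthography requires and the codepoints a font provides. A minimal sketch follows; it only imitates the shape of the debug output quoted in the message and is not fontconfig's implementation. The function name `missing_report` and the sample coverage are assumptions made for illustration.

```python
MAX_LISTED = 10  # per the message: longer lists are counted but not itemised


def missing_report(lang, required, coverage):
    """Describe which required codepoints a font lacks for one language."""
    missing = sorted(set(required) - set(coverage))
    if not missing or len(missing) > MAX_LISTED:
        return f"{lang}({len(missing)})"
    glyphs = " ".join(f"{cp:04x}" for cp in missing)
    return f"{lang}({len(missing)}) {{ {glyphs} }}"


# Toy data: a font covering printable ASCII only, and a requirement made of
# A-Z plus the six codepoints from the example in the message.
ascii_font = set(range(0x20, 0x7F))
ch_required = set(range(0x41, 0x5B)) | {0xC2, 0xD1, 0xDC, 0xE2, 0xF1, 0xFC}

print(missing_report("ch", ch_required, ascii_font))
print(missing_report("xx", range(0x100, 0x10B), ascii_font))
```

Removing the listed codepoints from the requirement, as the message suggests doing for ko.lang, brings the count to zero for that font.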
gah! my reply keeps getting stuck in the moderation filter.

i made a new ko.orth which only contains the ideographs and syllables in KS X 1001. it makes a world of difference, now it recognizes all my korean fonts as such.

the old ko.orth gave the following results on my fonts:

Baekmuk fonts:
    Scanning file ./batang.ttf... ko(0)
    Scanning file ./dotum.ttf... ko(8822)
    Scanning file ./gulim.ttf... ko(0)
    Scanning file ./hline.ttf... ko(13703)

Apple fonts:
    Scanning file ./#Gungseouche.dfont... ko(8827)
    Scanning file ./#HeadlineA.dfont... ko(13710)
    Scanning file ./#PCmyoungjo.dfont... ko(8826)
    Scanning file ./#Pilgiche.dfont... ko(13710)
    Scanning file ./AppleGothic.dfont... ko(8827)
    Scanning file ./AppleMyungjo.dfont... ko(8824)

8822 hangul syllables in unicode are missing from KS X 1001, and most of the fonts miss them too. about 5000 chinese characters are also in the old ko.orth that are missing from a lot of korean fonts.

find the new ko.orth here:

http://ghostscript.com/~tor/software/patches/ko.orth

tor
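The figure of 8822 in the message above follows from simple arithmetic, sketched here together with a small parser for the quoted scan lines. The count of 2350 Hangul syllables in KS X 1001 is a commonly cited figure that the thread itself does not state, so treat it as an assumption; `parse_missing` and the regular expression are hypothetical helpers, not part of fontconfig.

```python
import re

# Unicode precomposed Hangul syllables: 19 lead consonants x 21 vowels
# x 28 tail positions (27 tail consonants plus "no tail").
UNICODE_SYLLABLES = 19 * 21 * 28
KS_X_1001_SYLLABLES = 2350  # assumption: commonly cited count, not from the thread
EXTRA_SYLLABLES = UNICODE_SYLLABLES - KS_X_1001_SYLLABLES

SCAN_LINE = re.compile(r"Scanning file (\S+?)\.\.\.\s*ko\((\d+)\)")


def parse_missing(line):
    """Return (filename, missing count) for one scan line, or None."""
    match = SCAN_LINE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


sample = [
    "Scanning file ./batang.ttf... ko(0)",
    "Scanning file ./dotum.ttf... ko(8822)",
    "Scanning file ./#HeadlineA.dfont... ko(13710)",
]

print("syllables beyond KS X 1001:", EXTRA_SYLLABLES)
for line in sample:
    name, missing = parse_missing(line)
    beyond = missing >= EXTRA_SYLLABLES
    print(f"{name}: {missing} missing, at least the extra syllables: {beyond}")
```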
Around 9 o'clock on Dec 9, John Thacker wrote:

> Therefore, 0406 and 0456 should be removed from or commented out of ru.orth

Done.

> How necessary are the accented characters for English? I have a decent
> number of fonts designed to cover English but which lack the accented
> characters currently required (i.e., 00c0, 00c7-00cb, 00cf, 00d1, 00d4,
> 00d6, 00e0, 00e7-00eb, 00ef, 00f1, 00f4, 00f6). My understanding of
> English orthography is that they are optional.

The problem with English orthography is that there is no authority upon which we can lean. I adapted orthographies published by Michael Everson and others for our use, preferring to err slightly on the side of including more characters than would commonly appear to ensure that the bulk of English documents could be rendered in a single face.

I know in my own typographical history I've seen the slow increase in the set of glyphs considered "necessary" for proper English publication; novels which 30 years ago would have been published in essentially ASCII (originating as they did on typewriters) are now starting to include accented characters as a guide to both the etymology and pronunciation of words. Certainly common words like naïve or résumé are more accurately spelled with the appropriate accents than without.

I did elide letters associated with older English spelling like ? and ?; their use in modern English documents seems an anachronism.

So, I guess it's my cultural bias to encourage people to spell words right rather than spelling them as if they were using an IBM selectric typewriter. It was not done randomly.

> (Similarly, we don't require the oe and especially the ae ligatures,
> despite their use in British spellings, and æ's presence in ISO 8859-1.)

I wanted to stick to just the orthography and not typographical niceties, so I left out all ligatures, which is how I think these are used in English (unlike in Danish where I believe the æ is treated as a separate letter). Similarly, I had considered adding the conventional quotation marks for each language and decided to leave those out.

Obviously, a lot of this is up to the author of the orthography files themselves, and in many cases there are no ideal answers. The goal of the whole exercise is to identify fonts which can accurately reproduce the bulk of documents in the given language, a much different goal than enumerating either commonly used letters or all possible letters.

-keith
Around 13 o'clock on Dec 9, Tor Andersson wrote:

> stratifying this into two layers would be a solution. how about splitting
> the categories into separate scripts and chaining them together at
> the end. this way if you ask for a typically cyrillic font and one doesnt
> find an 'exact' substitute you'll go through the list of cyrillic generic fonts
> first before trying arial unicode or falling back to verdana?

We currently have three layers here: the first is exact family matches ("strong" family matches), the second is language and territory matches and the third is generic alias matches ("weak" family matches). This is supposed to do exactly what you want.

> * classify known fonts into generic+script -- Kochi Gothic to 'sans-serif+japan'

That's done by adding the Kochi Gothic -> sans-serif alias entry. The Japanese classification is done automatically by fontconfig's language support detection code.

> the above CMaps map from CID to unicode. inferring the reverse mapping
> should be trivial. if you need code, i have a cmap data structure and
> parser that are part of my pdf project that should be easy to extract,
> or you can just generate static mapping tables with a python script.
> i'd be happy to code one up if you need it.

Fontconfig needs to do two things. The first is to construct a list of Unicode values supported by the font. For this it needs to enumerate the encoded glyphs and compute Unicode values for each one. The second ability is mapping from Unicode codepoints to glyphs; this is largely a convenience for applications which don't want to deal with the obscurities of non-Unicode mappings.

Fontconfig already has a couple of built-in transcoding tables to handle Apple Roman and Adobe Symbol encoded fonts, which often are either missing Unicode mappings or which have broken Unicode mappings (I have about 1200 fonts with broken Unicode mappings which have functional Apple Roman mappings).

FreeType has functions to enumerate the encoded glyphs in a font (FT_Get_First_Char and FT_Get_Next_Char) which fontconfig happily uses to enumerate glyphs, but for non-Unicode mappings it will need an additional function to convert from the encoded value to a Unicode value. Please take a look at FcFreeTypeCharSetAndSpacing in fcfreetype.c to see how that all works.

> now how about a char *FcFindCMap(char *name) function?
>
> if i find the CID-font with fontconfig, it is useless without the CMaps.
> CMaps are basically a size-optimisation, instead of adding encoding
> tables to all of the fonts like you do with truetype.

As I recall, the CMaps are external to the fonts themselves, just like kerning data in the .pfa files. Is there a standard naming convention which is used to locate CMap files for particular font files? Or could we construct such a convention? I don't see how fontconfig would otherwise locate the files, and if there is a suitable convention, we should just get applications to use the same.

-keith
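The two things fontconfig needs, as described above, can both be derived from a CID-to-Unicode table of the kind Tor mentions. The sketch below uses a made-up toy table rather than data from any real Adobe CMap, and `invert_cmap` is a hypothetical name; it shows only the inversion step, not CMap parsing.

```python
def invert_cmap(cid_to_unicode):
    """Build a Unicode -> CID mapping from a CID -> Unicode mapping.

    Several CIDs may share one Unicode value (for example alternate glyph
    forms); this sketch keeps the lowest CID, which is an arbitrary choice.
    """
    unicode_to_cid = {}
    for cid in sorted(cid_to_unicode):
        unicode_to_cid.setdefault(cid_to_unicode[cid], cid)
    return unicode_to_cid


# Toy table, values invented for illustration only.
toy_cid_to_unicode = {1: 0x0020, 2: 0x0041, 3: 0x0042, 900: 0x0041}

reverse = invert_cmap(toy_cid_to_unicode)

# First need: the set of Unicode values the font supports.
charset = set(reverse)
# Second need: look up the glyph (CID) for a Unicode codepoint.
print(sorted(hex(cp) for cp in charset))
print("CID for U+0041:", reverse[0x0041])
```

Whether the lowest CID is the right representative for a shared codepoint is a design question the thread does not address.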
Around 10 o'clock on Dec 13, Tor Andersson wrote:

> 8822 hangul syllables in unicode are missing from KS X 1001, and most
> of the fonts miss them too. about 5000 chinese characters are also in
> the old ko.orth that are missing from a lot of korean fonts.

It almost seems like we should preserve the KSC 5601-1992 orthography somehow; is there a territorial difference where the Han glyphs are used in one area and not in another? Or is it that Han is just slowly leaving the "normal" Korean language and that fontconfig should respect that change?

-keith
On Mon, Dec 13, 2004 at 02:17:28AM -0500, John Thacker wrote:
> Yet, since so many Korean fonts are based on KS X 1001 / KS
> C 5601, it's reasonable to follow it and not include the
> extra precomposed syllables, and restrict to only the Hangul
> syllables in the list Tor provided. However, the 4882 Hanja
> are definitely part of the standard no matter what. I'm
> somewhat surprised that so many fonts would not have them at
> all, but it certainly must be easier. (Just as it's easier
> to not include accented characters in fonts designed for
> English.)

A similar situation exists for Chinese fonts: Many commercial traditional Chinese fonts (esp. if they are "decorative" in some sense) used to only contain the "frequently used" portion of Big5; such fonts would not be detected as Chinese fonts.

However, I haven't been using commercial Chinese fonts for the past few years, so the above is likely very outdated information and probably no longer true.

That said, I still imagine the fonts that are missing the hanja would likely also be "decorative" in nature, or used for emphasis, etc.
Around 2 o'clock on Dec 13, John Thacker wrote:

> However, the 4882 Hanja are definitely part of the standard no matter
> what. I'm somewhat surprised that so many fonts would not have them at all,
> but it certainly must be easier.

That's what I thought; we have a 'standard orthography' for Korean as used in the Republic of Korea which includes a large set of glyphs which are no-longer in common use in the Republic of Korea.

There is also a standard from the Democratic People's Republic of Korea (KPS 9566-97) which includes 4653 Korean Hanja characters.

http://www.itscj.ipsj.or.jp/ISO-IR/202.pdf

The guiding principle for fontconfig's orthography construction is to select fonts capable of displaying the preponderance of documents in the given language. The English orthography, as an example, includes uncommonly used accented letters like ? and ? as they appear in many documents, although many people accept and use alternate spellings without them.

Where available, fontconfig leans on official standards published by relevant bodies like the Académie française, but in the case of Korean, the standards above appear to be aimed at representing more than just Korean as currently written. Is there perhaps a more relevant standard than the encoding tables here? An actual orthography would be really nice to have.

-keith
Around 8 o'clock on Dec 13, Ambrose Li wrote:

> That said, I still imagine the fonts that are missing the hanja
> would likely to also be "decorative" in nature, or used for
> emphasis, etc.

The question is whether Hanja is still present in any significant fraction of Korean texts; if so, then these "decorative" faces might be thought of as equivalent to a Latin font without accents -- usable if specifically requested by name, but not otherwise presented to the user as a reasonable font to use.

-keith
Keith Packard
2005-Nov-21 08:51 UTC
[Fontconfig] Overly aggresive English orthography? [was asian font configuration]
Around 14 o'clock on Dec 13, Owen Taylor wrote:

> If you aren't using Pango-style language tag refinement, this causes
> some bad problems, see, e.g.:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107952

Well, that's just a symptom of a more serious issue -- that selection of glyphs beyond the orthography of the current locale is driven by font suitability for the locale.

Perhaps what we need is a better sorting mechanism for FcFontSort. What we want is for the fonts that cover the specified language(s) to be at the front of the list, and for the remaining elements of the list to be sorted without regard to language.

That seems doable, but it may make FcFontSort even slower than it is today.

-keith
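A minimal sketch of the two-tier ordering described above, written in Python rather than against the real FcFontSort. Everything here is an assumption for illustration: the dictionary layout, the `rank` field standing in for all non-language scoring, the rule that a font "covers" the request if it supports at least one requested language, and the language tags attached to each family.

```python
# Illustrative only: fonts covering a requested language come first; both
# tiers are then ordered by a language-independent rank (lower is better).
# The data layout and the "at least one language" rule are assumptions.

def sort_fonts(fonts, wanted_langs):
    wanted = set(wanted_langs)

    def key(font):
        covers = bool(wanted & font["langs"])
        # Tier 0: covers a requested language. Tier 1: everything else.
        return (0 if covers else 1, font["rank"])

    return sorted(fonts, key=key)


# Toy data; the language sets are made up for the example.
fonts = [
    {"family": "Kochi Gothic", "langs": {"ja"}, "rank": 0},
    {"family": "Luxi Sans", "langs": {"en"}, "rank": 1},
    {"family": "Verdana", "langs": {"en"}, "rank": 2},
    {"family": "Baekmuk Dotum", "langs": {"ko"}, "rank": 3},
]

ordered = [f["family"] for f in sort_fonts(fonts, ["en"])]
print(ordered)
# ['Luxi Sans', 'Verdana', 'Kochi Gothic', 'Baekmuk Dotum']
```

In this toy run the Japanese face keeps its high rank within the second tier, so it is not pushed behind other non-matching fonts merely because it lacks "en".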
Around 17 o'clock on Dec 13, John Thacker wrote:

> That is part of it. The problem is that there are several possible goals
> and situations. There are two different approaches. One is to render
> an entire document with a single font when possible. Another approach is
> to render different orthographies in a single document with the "best"
> font for each orthography/language.

I believe these two approaches must be used in conjunction, and that software can 'guess' when each approach should be used, but provisions for user-specified overrides may need to be permitted. No single approach will work for all documents and users. However, we should strive for reasonable and predictable behaviour from fontconfig so that people aren't just confused by the weird results and have some chance of actually figuring out how to make it do what they want.

> E.g., if I'm reading something in English which suddenly quotes Dutch
> and uses ? or references a Welsh placename and uses ?, then maybe I just
> want the whole document to use Verdana, which contains both, rather than
> using Luxi Sans for everything except the words containing those two
> letters, even though Luxi Sans is normally my first choice for English.

As Fontconfig can't know about the actual document content directly, the only way to have applications automatically present this as you desire is to have them construct the set of necessary Unicode values for the document and ask Fontconfig for fonts containing those codepoints. The use of lang tags in fontconfig is both a short-hand notation for these Unicode sets and a predictive mechanism for guessing what future glyphs may be presented for drawing.

-keith
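The application-side approach described above (collect the document's codepoints, then ask for a font containing all of them) might look roughly like the following sketch. It does not call fontconfig; the function names, the font families, and their coverage sets are hypothetical stand-ins.

```python
# Illustrative only: choose one face for a whole document by codepoint
# coverage. Font names and coverage sets are hypothetical.

def document_codepoints(text):
    """Collect the Unicode values that actually need glyphs."""
    return {ord(ch) for ch in text if not ch.isspace()}


def pick_single_font(text, fonts):
    """Return the first family, in preference order, covering every codepoint.

    Returns None when no single font suffices, in which case the caller
    would have to fall back to choosing fonts per run of text.
    """
    needed = document_codepoints(text)
    for family, coverage in fonts:
        if needed <= coverage:
            return family
    return None


ASCII = set(range(0x21, 0x7F))
fonts_in_preference_order = [
    ("PreferredSans", ASCII),                     # first choice, ASCII only
    ("WiderSans", ASCII | {0x00E9, 0x00EF}),      # also has e-acute and i-diaeresis
]

print(pick_single_font("plain English text", fonts_in_preference_order))
print(pick_single_font("a na\u00efve r\u00e9sum\u00e9", fonts_in_preference_order))
print(pick_single_font("\ud55c\uae00", fonts_in_preference_order))
```

The first call keeps the preferred face, the second switches the whole document to the wider face, and the third returns None because neither toy font covers Hangul.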
hi keithp and all.

a few issues relating to asian font support in fontconfig.

one! korean language detection is "broken". only two of all the korean fonts i have on my system are correctly identified as being korean:

# fc-list ':lang=ko'
Baekmuk Batang:style=Regular
Baekmuk Gulim:style=Regular

all of apple's korean fonts that ship with macos x, and even two of the baekmuk fonts are missing. this needs to be fixed.

two! the default configuration for fontconfig is a bit on the scarce side regarding font aliases and substitutions for asian fonts. please consider incorporating the attached file into the distribution. exactly which default substitution fonts should be used -- batang or dotum, mincho or gothic, etc -- should probably be discussed.

three! i understand that due to the incapability of freetype to use CMaps to encode CID fonts, the ability to use CID-fonts with fontconfig is severely limited. however, it would be really really nice if fontconfig were extended in this area. i would like to query fontconfig for the location of CMap resource files. they are font-related and forcing pdf and postscript tools to use a separate configuration for finding them kinda stinks.

i think that fontconfig should look at the registry-ordering in the CID System Info dict and put it to good use. add a property FC_CSI and put in a corresponding FC_LANG tag for CID-fonts.
Munhwa-Regular:csi=Adobe-Korea1:lang=ko
MOEKai:csi=Adobe-CNS1:lang=zh-TW

thanks,
tor

-------------- next part --------------
<?xml version="1.0"?>
<fontconfig>

  <dir>/usr/local/share/ghostscript/fonts</dir>

  <!-- Defaults by asian language tag -->

  <match target="pattern">
    <test name="lang"><string>zh-TW</string></test>
    <edit name="family" mode="append"><string>AR PL KaitiM Big5</string></edit>
  </match>

  <match target="pattern">
    <test name="lang"><string>zh-CN</string></test>
    <edit name="family" mode="append"><string>AR PL KaitiM GB</string></edit>
  </match>

  <match target="pattern">
    <test name="lang"><string>ja</string></test>
    <edit name="family" mode="append"><string>Kochi Gothic</string></edit>
  </match>

  <match target="pattern">
    <test name="lang"><string>ko</string></test>
    <edit name="family" mode="append"><string>Baekmuk Dotum</string></edit>
  </match>

  <!-- workaround korean lang not being detected correctly ... -->

  <match target="pattern">
    <test name="lang"><string>ko</string></test>
    <!-- hmm there is no mode="delete" ... -->
    <edit name="lang" mode="assign"><string>zu</string></edit>
  </match>

  <!-- Traditional Chinese -->

  <alias>
    <family>MSung-Light</family>
    <family>MSung-Medium</family>
    <accept><family>AR PL Mingti2L Big5</family></accept>
  </alias>

  <alias>
    <family>MHei-Medium</family>
    <family>MKai-Medium</family>
    <family>MingLiU</family>
    <family>PMingLiU</family>
    <accept><family>AR PL KaitiM Big5</family></accept>
  </alias>

  <!-- Simplified Chinese -->

  <alias>
    <family>STSong-Light</family>
    <family>STFangsong-Light</family>
    <family>SimSun</family>
    <family>NSimSun</family>
    <accept><family>AR PL SungtiL GB</family></accept>
  </alias>

  <alias>
    <family>STHeiti-Regular</family>
    <family>STKaiti-Regular</family>
    <family>SimHei</family>
    <accept><family>AR PL KaitiM GB</family></accept>
  </alias>

  <!-- Japanese -->

  <alias>
    <family>Ryumin-Light</family>
    <family>Ryumin-Medium</family>
    <family>HeiseiMin-W3</family>
    <family>MS-Mincho</family>
    <family>MS-PMincho</family>
    <accept><family>Hiragino Mincho Pro W3</family></accept>
    <accept><family>Sazanami Mincho</family></accept>
    <accept><family>Kochi Mincho</family></accept>
  </alias>

  <alias>
    <family>GothicBBB-Medium</family>
    <family>HeiseiKakuGo-W5</family>
    <family>MS-Gothic</family>
    <family>MS-PGothic</family>
    <family>MS-UIGothic</family>
    <accept><family>Hiragino Kaku Gothic Pro W3</family></accept>
    <accept><family>Sazanami Gothic</family></accept>
    <accept><family>Kochi Gothic</family></accept>
  </alias>

  <!-- Korean -->

  <alias>
    <family>Batang</family>
    <family>BatangChe</family>
    <family>Gungsuh</family>
    <family>GungsuhChe</family>
    <family>HYSMyeongJo-Medium</family>
    <accept><family>Baekmuk Batang</family></accept>
  </alias>

  <alias>
    <family>Gulim</family>
    <family>GulimChe</family>
    <family>HYRGoThic-Medium</family>
    <accept><family>Baekmuk Gulim</family></accept>
  </alias>

  <alias>
    <family>Dotum</family>
    <family>DotumChe</family>
    <family>HYGoThic-Medium</family>
    <accept><family>Baekmuk Dotum</family></accept>
  </alias>

</fontconfig>
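As a rough illustration of the third point in the message above (deriving a lang tag from a CID font's registry-ordering), the mapping could be sketched like this. The `csi` property is only the proposal from the message, not an existing fontconfig property, and the helper name is hypothetical. The Adobe-Korea1 and Adobe-CNS1 entries come from the message's own examples; Adobe-GB1 and Adobe-Japan1 are added here on the assumption that the same scheme would extend to them.

```python
# Illustrative only: map a CID font's registry-ordering to a lang tag and
# build a pattern string in the form used in the message. The "csi"
# property is a proposal from the thread, not a real fontconfig property.

CSI_TO_LANG = {
    "Adobe-Korea1": "ko",      # from the message
    "Adobe-CNS1": "zh-TW",     # from the message
    "Adobe-GB1": "zh-CN",      # added here by analogy
    "Adobe-Japan1": "ja",      # added here by analogy
}


def pattern_for_cid_font(name, registry, ordering):
    """Build 'name:csi=Registry-Ordering[:lang=xx]' for a CID-keyed font."""
    csi = f"{registry}-{ordering}"
    parts = [name, f"csi={csi}"]
    lang = CSI_TO_LANG.get(csi)
    if lang is not None:
        parts.append(f"lang={lang}")
    return ":".join(parts)


print(pattern_for_cid_font("Munhwa-Regular", "Adobe", "Korea1"))
# Munhwa-Regular:csi=Adobe-Korea1:lang=ko
print(pattern_for_cid_font("MOEKai", "Adobe", "CNS1"))
# MOEKai:csi=Adobe-CNS1:lang=zh-TW
```

An unrecognised registry-ordering simply yields a pattern without a lang tag, which leaves the decision to whatever other detection is in place.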