thr3ads.net - Fontconfig - [Fontconfig] asian font configuration [Nov 2005]

John Thacker

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

On Thu, Dec 09, 2004 at 12:02:10PM -0800, Keith Packard wrote:
> 
> Around 9 o'clock on Dec 9, John Thacker wrote:
> 
> > Therefore, 0406 and 0456 should be removed from or commented out of ru.orth
> 
> Done.
Thanks
> I know in my own typographical history I've seen the slow increase in the
> set of glyphs considered "necessary" for proper English publication; novels
> which 30 years ago would have been published in essentially ASCII
> (originating as they did on typewriters) are now starting to include
> accented characters as a guide to both the etymology and pronunciation of
> words.  Certainly common words like naïve or résumé are more accurately
> spelled with the appropriate accents than without.
I don't know that I'd say that they are "more accurately spelled,"
honestly.  I grant that in the case of résumé the final accent is
useful due to the pronunciation, and to distinguish it from resume.
Would you argue that hôtel should still be spelled that way?  It is
"more accurate," in a historical sense.  Ångström?  coöperate?
(The last for phonetic but not etymological reasons-- you will see
it used in the New Yorker for example.)  To get really silly, à propos?
Cañon instead of canyon?

I agree that modern computers and technology make it easier to use
accents, but I'm not completely convinced that their use is actually
on the upswing.  Most of the sources I've consulted claim that their
use has been decreasing, though I suppose recent technology may have
reversed that.  In my experience, diacritical marks are used only in
words perceived as foreign or of obvious foreign origin, and those
words tend to lose their accents over time.

That said, they certainly are used.  OTOH, not every font will include
them even when it is intended to cover English.  It's an annoying problem.
Being rendered in a single face is nice-- but I deal with a lot of
Japanese documents which contain Latin characters, and often mixed
Japanese/English.  My Japanese fonts mostly do not contain all the
accented Latin characters; this causes problems because Pango wants
to render English words with fonts which support "en."  The result is
that a single face is not used even though the preferred Japanese font
contains all characters on the page.  Or, if you're not using Pango, then
applications tend to assume that your en_US.UTF-8 locale means you want
fonts which support "en," and thus will render Japanese UTF-8 documents
using ugly fonts like MiscFixed in preference to other Japanese fonts.

I know how to tweak the problem myself using weak family bindings, but
there's a lot of discussion of it by Owen Taylor and others in a few
places:

https://bugzilla.redhat.com/beta/show_bug.cgi?id=138783
https://bugzilla.redhat.com/beta/show_bug.cgi?id=107952
https://bugzilla.redhat.com/beta/show_bug.cgi?id=107617
> So, I guess it's my cultural bias to encourage people to spell words right
> rather than spelling them as if they were using an IBM selectric 
> typewriter.  It was not done randomly.
Most dictionaries I've seen contain the unaccented spelling only, or
indicate it as the primary one with the accented as a variant, for most
of those words, including naive/naïve and resume/résumé.  (The latter
actually has a common variant with only the final accent mark, resumé,
found in most dictionaries as well.)  Of course, not only is the "certainly"
disputed, but there are certainly those prescriptivists who argue that
accents are not a part of English orthography at all, and belong only on
foreign words.  That said, many people do use them of course, and I'm a
descriptivist.  I may disagree with your "certainly" remark, but the usage
may be enough to make it worth it.  For me personally, however, it's a pain,
since quite a lot of font writers do not consider accent marks to be
essential for English support.  (And really, they aren't essential at all.)

John Thacker

John Thacker

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

On Wed, Dec 08, 2004 at 12:27:46PM -0800, Keith Packard wrote:
> That's probably just that the ko.lang file has too many glyphs.  We can fix
> this easily enough.  Place all of your Korean fonts in a directory and run:
> 
>  $ cd <directory containing korean fonts>
>  $ FC_DEBUG=384 fc-cache -f .
Ah, thanks, I always forget that command.  Here are a couple of notes:

Currently Russian (ru) requires 0406 and 0456 (І and і), but these
were eliminated in Russian in 1918 in favor of 0418 and 0438 (И and и),
and don't even appear in KOI8-R.  (The hypothesis that they don't appear
in KOI8-R due to their similarity with Latin I and i is eliminated by
their presence in KOI8-U.)  I have a couple of fonts with Russian support
that don't have these letters.

Therefore, 0406 and 0456 should be removed from or commented out of ru.orth
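
The KOI8-R / KOI8-U claim above can be checked with a short script. This is
only a sketch: it relies on Python's built-in koi8_r and koi8_u codecs as a
stand-in for the encoding tables, and the helper names are made up for
illustration.

```python
# Check which of the four Cyrillic code points each legacy encoding can hold.
# Uses Python's bundled koi8_r / koi8_u codecs as the reference tables.

CODEPOINTS = (0x0406, 0x0456, 0x0418, 0x0438)


def encodable(codepoint, encoding):
    """Return True if the character has a byte value in the given encoding."""
    try:
        chr(codepoint).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage(encoding, codepoints=CODEPOINTS):
    """Map each code point (as 4-digit hex) to whether the encoding has it."""
    return {f"{cp:04x}": encodable(cp, encoding) for cp in codepoints}


if __name__ == "__main__":
    for enc in ("koi8_r", "koi8_u"):
        print(enc, coverage(enc))
```

If the codecs agree with the description above, 0406 and 0456 come back as
not encodable in koi8_r while all four code points are encodable in koi8_u.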

How necessary are the accented characters for English?  I have a decent
number of fonts designed to cover English but which lack the accented
characters currently required (i.e., 00c0, 00c7-00cb, 00cf, 00d1, 00d4,
00d6, 00e0, 00e7-00eb, 00ef, 00f1, 00f4, 00f6).  My understanding of
English orthography is that they are optional.  (Similarly, we don't
require the oe and especially the ae ligatures, despite their use in
British spellings, and æ's presence in ISO 8859-1.)
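
To see which letters the hex list above actually names, the ranges can be
expanded with a few lines of Python. A minimal sketch: the list is copied
from the paragraph above, and the function names are illustrative only (they
are not part of fontconfig or of the .orth file tooling).

```python
import unicodedata

# Code point list as quoted above (hex, with inclusive ranges).
REQUIRED = ("00c0, 00c7-00cb, 00cf, 00d1, 00d4, 00d6, "
            "00e0, 00e7-00eb, 00ef, 00f1, 00f4, 00f6")


def expand(spec):
    """Expand 'xxxx' and 'xxxx-yyyy' hex items into a list of code points."""
    codepoints = []
    for item in spec.split(","):
        first, _, last = item.strip().partition("-")
        start = int(first, 16)
        end = int(last, 16) if last else start
        codepoints.extend(range(start, end + 1))
    return codepoints


def describe(spec):
    """Return (hex, character, Unicode name) for every listed code point."""
    return [(f"{cp:04x}", chr(cp), unicodedata.name(chr(cp)))
            for cp in expand(spec)]


if __name__ == "__main__":
    for row in describe(REQUIRED):
        print(*row)
```

The list expands to twenty Latin-1 letters, ten uppercase and ten lowercase,
including the e-acute (00e9) and i-diaeresis (00ef) discussed in this thread.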

John Thacker

Tor Andersson

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

hi
> Thanks muchly for the additional data.  The list of alternate common (but
> non-free) family names is really nice to have.
the list is far from complete, but at least a starting point.
> If this stuff isn't working for you, we've just got bugs in fontconfig
> that need fixing.
this approach as-is has two problems for asian fonts. one: if you forget to
select language you tend to end up with times or verdana. two: even if you
do set the language, you may end up with slightly less-than-preferred fonts.
for example selecting lang=zh-tw you may end up with arial unicode or a
simplified chinese font...

asian fonts in the same family are more substitutable than latin fonts,
since they are mostly mono-spaced. they are somewhere in between
the 'times/times new roman' and 'serif' categorisation.

stratifying this into two layers would be a solution. how about splitting
the categories into separate scripts and chaining them together at
the end. this way if you ask for a typically cyrillic font and one doesn't
find an 'exact' substitute you'll go through the list of cyrillic generic
fonts first before trying arial unicode or falling back to verdana?

how about a font config that does the following:

first, exact aliases:

* alias to truetype substitutes -- Times to Times New Roman
* alias to ghostscript urw substitutes -- Times to Nimbus Roman No9 L

then, classification of known fonts:

* classify known fonts into generic+script -- Kochi Gothic to 'sans-serif+japan'

some cleaning of the pattern, tie together generic families with priority:

* if asked for generic and language tag is set, change to generic+script
* after the 'generic+script' append a pure 'generic' to catch unicode fonts

* if not classified and language tag is set, add sans-serif+script
* if not classified yet, add 'sans-serif'

* after pure 'generic' add all 'generic+script' versions to catch the
  case of wanting all 'generic' fonts

and finally...

* font substitution 'generic+script' to preferred -- serif+korea to Baekmuk Batang
* font substitution 'generic' to preferred unicode font

example:

  Batang
  Batang, serif+korea
  Batang, serif+korea, serif
  Batang, serif+korea, serif, serif+latin,serif+chinasimp,....
  Batang, Baekmuk Batang, serif+korea, serif, ...latin fonts...,
    serif+latin, ...chinasimp fonts......

does that sound reasonable?
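
The staged expansion in the example can be sketched in Python. This is a
rough model of the proposal only, not of how fontconfig works: the tables,
the script tag list and the function names are all hypothetical, and the
language-tag steps of the proposal are left out.

```python
# Hypothetical classification and preference tables for illustration.
CLASSIFY = {
    "Batang": ("serif", "korea"),
    "Kochi Gothic": ("sans-serif", "japan"),
}
SCRIPTS = ["latin", "korea", "japan", "chinasimp"]
PREFERRED = {
    "serif+korea": ["Baekmuk Batang"],
    "serif+latin": ["Nimbus Roman No9 L"],
}


def expand(family):
    """Build family -> generic+script -> generic -> all generic+script tags."""
    chain = [family]
    # Unclassified families fall back to plain sans-serif, as proposed.
    generic, script = CLASSIFY.get(family, ("sans-serif", None))
    if script is not None:
        chain.append(f"{generic}+{script}")
    chain.append(generic)
    for other in SCRIPTS:
        tag = f"{generic}+{other}"
        if tag not in chain:
            chain.append(tag)
    return chain


def substitute(chain):
    """Insert the preferred concrete fonts just before each generic tag."""
    out = []
    for name in chain:
        for font in PREFERRED.get(name, []):
            if font not in out:
                out.append(font)
        out.append(name)
    return out


if __name__ == "__main__":
    print(substitute(expand("Batang")))
```

With these toy tables, asking for Batang yields Baekmuk Batang as the first
fallback, ahead of any Latin or Unicode catch-all font.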

attached is a subs.conf that should be included instead of the
current alias/substitute stuff in the default fonts.conf.
> > i understand that due to the incapability of freetype to use CMaps to
> > encode CID fonts, the ability to use CID-fonts with fontconfig is severely
> > limited. however, it would be really really nice if fontconfig were extended
> > in this area.
> 
> I don't have a lot of experience with Type1 CID fonts as I've tried to
> stick to TrueType which supports Unicode so much more nicely.
> 
> If you know of code or even documentation which clearly shows how to get
> from Unicode to CID stuff, it would be greatly appreciated.
Adobe-CNS1-UCS2
Adobe-GB1-UCS2
Adobe-Japan1-UCS2
Adobe-Korea1-UCS2

the above CMaps map from CID to unicode. inferring the reverse mapping
should be trivial. if you need code, i have a cmap data structure and
parser that are part of my pdf project that should be easy to extract,
or you can just generate static mapping tables with a python script.
i'd be happy to code one up if you need it.
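
A minimal sketch of the reverse mapping, assuming the CID-to-Unicode table
has already been parsed into a Python dict. The sample data is invented and
not taken from any real Adobe CMap, and keeping the lowest CID when several
CIDs share a code point is an arbitrary choice made for this illustration.

```python
def invert_cmap(cid_to_unicode):
    """Build a Unicode -> CID table from a CID -> Unicode table.

    When several CIDs map to the same code point, the lowest CID wins
    (an assumption for this sketch, not a rule from the CMap format).
    """
    unicode_to_cid = {}
    for cid in sorted(cid_to_unicode):
        unicode_to_cid.setdefault(cid_to_unicode[cid], cid)
    return unicode_to_cid


# Toy data only: CIDs 100 and 101 both map to U+AC00.
SAMPLE = {1: 0x0020, 2: 0x0021, 100: 0xAC00, 101: 0xAC00}

if __name__ == "__main__":
    for codepoint, cid in sorted(invert_cmap(SAMPLE).items()):
        print(f"U+{codepoint:04X} -> CID {cid}")
```
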
> If I can get from Unicode to CID, I can generate FC_LANG tags, but I'm not
> sure what other information belongs in fontconfig itself; remember that
> fontconfig is designed to provide information needed to select among
> fonts, not all of the information needed once you have a font in hand.
now how about a char *FcFindCMap(char *name) function?

if i find the CID-font with fontconfig, it is useless without the CMaps.
CMaps are basically a size-optimisation, instead of adding encoding
tables to all of the fonts like you do with truetype.

tor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: subs.conf
Type: application/octet-stream
Size: 13539 bytes
Desc: not available
Url :
http://lists.freedesktop.org/archives/fontconfig/attachments/20041209/9e3cd194/subs.obj

John Thacker

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

On Sun, Dec 12, 2004 at 08:39:20PM -0800, Keith Packard wrote:
> 
> Around 10 o'clock on Dec 13, Tor Andersson wrote:
> 
> > 8822 hangul syllables in unicode are missing from KS X 1001, and most
> > of the fonts miss them too. about 5000 chinese characters are also in
> > the old ko.orth that are missing from a lot of korean fonts.
> 
> It almost seems like we should preserve the KSC 5601-1992 orthography 
> somehow; is there a territorial difference where the Han glyphs are used 
> in one area and not in another?  Or is it that Han is just slowly leaving 
> the "normal" Korean language and that fontconfig should respect that
> change?
Well, in the DPRK Han glyphs (Hanja) haven't been used at all officially
since 1949.  In the ROK their use has definitely been steadily decreasing
as well.  I've heard that learning the full standard set is no longer
mandatory in high schools, for example.  There are occasions where they
are used still, though.

According to most things I've seen (e.g., http://jshin.net/faq/qa8.html),
KS C 5601-1992 is the same as KS X 1001:1992; there was merely a change
in format and numbering scheme made in August 1997.  The "extra" 8822
precomposed hangul syllables are NOT an official part of the standard, but
there is a provision in the document on how to represent them.  The later
KS C 5700 / KS X 1005-1 does include them, however.
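
The 8822 figure can be cross-checked with a short script. This is a sketch
that treats Python's euc_kr codec as a proxy for KS X 1001: a syllable is
counted as part of the standard only if it encodes to a plain two-byte code.
Anything the codec rejects, or encodes as a longer sequence, is counted as
missing. The function names are illustrative.

```python
# Unicode Hangul Syllables block: U+AC00 .. U+D7A3.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def in_ks_x_1001(codepoint):
    """True if the syllable has a plain two-byte EUC-KR (KS X 1001) code."""
    try:
        return len(chr(codepoint).encode("euc_kr")) == 2
    except UnicodeEncodeError:
        return False


def count_syllables():
    """Return (total in Unicode, covered by KS X 1001, missing)."""
    total = HANGUL_LAST - HANGUL_FIRST + 1
    covered = sum(in_ks_x_1001(cp)
                  for cp in range(HANGUL_FIRST, HANGUL_LAST + 1))
    return total, covered, total - covered


if __name__ == "__main__":
    total, covered, missing = count_syllables()
    print(f"Unicode: {total}  KS X 1001: {covered}  missing: {missing}")
```

Unicode has 11172 precomposed syllables and KS X 1001 has 2350, which leaves
the 8822 mentioned above.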

Yet, since so many Korean fonts are based on KS X 1001 / KS C 5601, it's
reasonable to follow it and not include the extra precomposed syllables,
and restrict to only the Hangul syllables in the list Tor provided.
However, the 4882 Hanja are definitely part of the standard no matter
what.  I'm somewhat surprised that so many fonts would not have them at all,
but it certainly must be easier.  (Just as it's easier to not include
accented characters in fonts designed for English.)

John Thacker

Owen Taylor

2005-Nov-21 08:51 UTC

[Fontconfig] Overly aggressive English orthography? [was asian font configuration]

On Mon, 2004-12-13 at 10:56 -0800, Keith Packard wrote:
> Around 2 o'clock on Dec 13, John Thacker wrote:
> 
> > However, the 4882 Hanja are definitely part of the standard no matter 
> > what.  I'm somewhat surprised that so many fonts would not have them at all,
> > but it certainly must be easier.  
> 
> That's what I thought; we have a 'standard orthography' for Korean as used
> in the Republic of Korea which includes a large set of glyphs which are
> no longer in common use in the Republic of Korea.
> 
> There is also a standard from the Democratic People's Republic of Korea
> (KPS 9566-97) which includes 4653 Korean Hanja characters.
> 
> 	http://www.itscj.ipsj.or.jp/ISO-IR/202.pdf
> 
> The guiding principle for fontconfig's orthography construction is to
> select fonts capable of displaying the preponderance of documents in the
> given language. The English orthography, as an example, includes uncommonly
> used accented letters like ? and ? as they appear in many documents,
> although many people accept and use alternate spellings without them.
If you aren't using Pango-style language tag refinement, this causes
some bad problems, see, e.g.:

  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107952

I don't actually see much value in such an extensive orthography
for English ... if fontconfig is hunting through all the fonts on
a system for a font to display English, the chances of it coming
up with a nice one are pretty minimal, just on a numerical basis.

Regards,
						Owen


John Thacker

2005-Nov-21 08:51 UTC

[Fontconfig] Overly aggressive English orthography?

On Mon, Dec 13, 2004 at 12:22:21PM -0800, Keith Packard wrote:
> 
> Around 14 o'clock on Dec 13, Owen Taylor wrote:
> 
> > If you aren't using Pango-style language tag refinement, this causes
> > some bad problems, see, e.g.:
> > 
> >   https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107952
> 
> Well, that's just a symptom of a more serious issue -- that selection of
> glyphs beyond the orthography of the current locale is driven by font 
> suitability for the locale.
That is part of it.  The problem is that there are several possible goals
and situations.  There are two different approaches.  One is to render
an entire document with a single font when possible.  Another approach is 
to render different orthographies in a single document with the "best"
font for each orthography/language.

There are cases where the first approach looks better (a combination of
non-accented Latin characters and Japanese, all contained in one Japanese
font), and cases where it looks worse (a particularly varied combination
of glyphs that are only all contained by an ugly "fallback" font of last
resort, like MiscFixed).  It's pretty easy to come up with other examples,
too:  imagine a document mostly written in English that contains a few
words of foreign origin using accents outside the list currently provided
in en.orth.  Many people would like to view the entire document in a single
font, rather than switching fonts just for those words, especially if the
extra glyphs are related to the Latin alphabet.

E.g., if I'm reading something in English which suddenly quotes Dutch
and uses ? or references a Welsh placename and uses ?, then maybe I just
want the whole document to use Verdana, which contains both, rather than
using Luxi Sans for everything except the words containing those two
letters, even though Luxi Sans is normally my first choice for English.

Another big problem is that the Unicode standard treats fullwidth
(doublewide) Latin characters as part of the Latin alphabet, and as
inappropriate for Japanese, Chinese, etc.  This is despite their use
being almost completely limited to Far Eastern languages in my experience.
For that reason, Pango doesn't allow 'ja' language tags on the fullwidth
Latin characters, AIUI.  In any case, Japanese fairly frequently contains
unaccented Latin characters, and the fonts reflect that.  However,
fontconfig and Pango conspire to prevent Japanese fonts from ever using
their Latin characters.
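
How Unicode classifies a fullwidth letter can be inspected with Python's
unicodedata module. A small sketch: the module does not expose the Script
property directly, so the character name, the East Asian Width class and the
NFKC compatibility mapping are shown instead. The helper name is made up.

```python
import unicodedata


def describe(ch):
    """Collect the Unicode properties relevant to fullwidth Latin letters."""
    return {
        "name": unicodedata.name(ch),
        # 'F' marks a fullwidth character, 'Na' a narrow one.
        "east_asian_width": unicodedata.east_asian_width(ch),
        # Compatibility normalization folds fullwidth forms to ASCII.
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


if __name__ == "__main__":
    for ch in ("\uff21", "A"):
        print(f"U+{ord(ch):04X}", describe(ch))
```

U+FF21 is named FULLWIDTH LATIN CAPITAL LETTER A and normalizes to plain
"A", which illustrates why software tends to file it under Latin rather than
under any East Asian language.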

John Thacker

Keith Packard

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

Around 1 o'clock on Dec 9, Tor Andersson wrote:
> korean language detection is "broken". only two of all the korean fonts i
> have on my system are correctly identified as being korean:
That's probably just that the ko.lang file has too many glyphs.  We can fix
this easily enough.  Place all of your Korean fonts in a directory and run:

 $ cd <directory containing korean fonts>
 $ FC_DEBUG=384 fc-cache -f .

If there are only a few codepoints in ko.lang which aren't included in
your Korean fonts, that will list them as in:

ch(6) { 00c2 00d1 00dc 00e2 00f1 00fc }

This says the font in question (Watanabe) is missing six glyphs required
to support Chamorro.  If there are more than 10 missing glyphs, it won't
list the individual glyphs; we'd have to change the library to display them
all.  When you get the list of glyphs, you can remove them from ko.lang,
rebuild the library and see if fontconfig now recognises the fonts
correctly.
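
Collecting the missing codepoints from that debug output is easy to script.
A minimal sketch, assuming each entry has the `lang(count) { hex ... }` shape
shown above (the braces are absent when the count is over the limit). The
exact output format may differ between fontconfig versions, and the function
name is made up for illustration.

```python
import re

# lang(count), optionally followed by { hex hex ... }
ENTRY = re.compile(r"\b([A-Za-z][\w-]*)\((\d+)\)(?:\s*\{([^}]*)\})?")


def parse_missing(text):
    """Return {lang: (missing_count, [codepoints])} for one font's output."""
    result = {}
    for lang, count, glyphs in ENTRY.findall(text):
        codepoints = [int(token, 16) for token in glyphs.split()]
        result[lang] = (int(count), codepoints)
    return result


if __name__ == "__main__":
    sample = "ch(6) { 00c2 00d1 00dc 00e2 00f1 00fc }"
    for lang, (count, cps) in parse_missing(sample).items():
        print(lang, count, [f"U+{cp:04X}" for cp in cps])
```

The codepoints returned for `ko` are the candidates to remove from ko.lang
before rebuilding.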

If the font advertises support for a single Han language, and that
language is not Korean, then the language coverage checking code won't
even consider Korean when checking for language coverage.  You can
distinguish this case by the absence of any 'ko(xx)' debug output in the
above list.  That would be more worrisome; I haven't yet found any fonts
which mis-mark their Han language support.

If you can (legally) send the fonts in question along, I'd be happy to do
this work.
> the default configuration for fontconfig is a bit on the scarce side
> regarding font aliases and substitutions for asian fonts.
Thanks muchly for the additional data.  The list of alternate common (but
non-free) family names is really nice to have.  

I think I'd like to try them in a different part of the configuration and
see if they work for you.  For example, as 'Kochi Gothic' is in the mapping
from 'sans-serif' to a set of family names, it should suffice to place the
other gothic Japanese fonts in the mapping from family names to
'sans-serif'.  From there, it will be mapped into the list of acceptable
'sans-serif' faces, including 'Kochi Gothic'.

Also, you shouldn't need to map specific languages to specific family
names; the language identification in fontconfig should suffice to select
among the families listed for the generic aliases.  Just placing the family
names in the generic 'serif', 'sans-serif', and 'monospace' aliases should
select them when the application specifies the corresponding language.

For both of these, the guiding principle is that we do font substitution 
in two ways:

 1)	For intentionally similar faces (Helvetica/Arial, Times/Times New 
	Roman), we have specific aliases mapping those names:

        <alias>
                <family>Times</family>
                <accept><family>Times New Roman</family></accept>
        </alias>

        <alias>
                <family>Helvetica</family>
                <accept><family>Arial</family></accept>
        </alias>

 2)	To set the preferred faces to use when an intentionally similar
	face is not available, we map first from the given family to one
	of the generic names:

        <alias>
                ...
                <family>Times</family>
                ...
                <default><family>serif</family></default>
        </alias>

	Then we map from the generic name to a list of suitable fonts:

        <alias>
                <family>serif</family>
                <prefer>
                        ...
                        <family>FreeSerif</family>
                        ...
                </prefer>
        </alias>

If this stuff isn't working for you, we've just got bugs in fontconfig
that need fixing.
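
The effect of the three alias forms on a pattern's family list can be
modelled in a few lines of Python. This is a simplified model, not
fontconfig code: it follows the documented behaviour that prefer entries go
before the matching family, accept entries go after it, and default entries
go at the end of the list, and it ignores binding strength and duplicate
handling. The dict layout is invented for the sketch.

```python
def apply_alias(families, alias):
    """Apply one alias rule to an ordered list of family names.

    alias is a dict with 'family' plus optional 'prefer', 'accept' and
    'default' lists (a made-up representation of an <alias> element).
    """
    if alias["family"] not in families:
        return list(families)
    out = []
    for name in families:
        if name == alias["family"]:
            out.extend(alias.get("prefer", []))   # before the match
            out.append(name)
            out.extend(alias.get("accept", []))   # after the match
        else:
            out.append(name)
    out.extend(alias.get("default", []))          # end of the list
    return out


# The rules from the message above, in configuration order.
ALIASES = [
    {"family": "Times", "accept": ["Times New Roman"]},
    {"family": "Times", "default": ["serif"]},
    {"family": "serif", "prefer": ["FreeSerif"]},
]


def resolve(families, aliases=ALIASES):
    for alias in aliases:
        families = apply_alias(families, alias)
    return families


if __name__ == "__main__":
    print(resolve(["Times"]))
```

Under this model a request for Times ends up as Times, Times New Roman,
FreeSerif, serif, so the intentionally similar face is tried before the
generic fallbacks.
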
> i understand that due to the incapability of freetype to use CMaps to
> encode CID fonts, the ability to use CID-fonts with fontconfig is severely
> limited. however, it would be really really nice if fontconfig were extended
> in this area.
I don't have a lot of experience with Type1 CID fonts as I've tried to
stick to TrueType which supports Unicode so much more nicely.

If you know of code or even documentation which clearly shows how to get 
from Unicode to CID stuff, it would be greatly appreciated.
> i think that fontconfig should look at the registry-ordering in the
> CID System Info dict and put to good use. add a property FC_CSI and 
> put in a corresponding FC_LANG tag for CID-fonts.
If I can get from Unicode to CID, I can generate FC_LANG tags, but I'm not
sure what other information belongs in fontconfig itself; remember that 
fontconfig is designed to provide information needed to select among 
fonts, not all of the information needed once you have a font in hand.

-keith


Tor Andersson

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

gah!

my reply keeps getting stuck in the moderation filter.
i made a new ko.orth which only contains the ideographs and syllables
in KS X 1001. it makes a world of difference, now it recognizes all my
korean fonts as such.

the old ko.orth gave the following results on my fonts:

Baekmuk fonts:
    Scanning file ./batang.ttf...  ko(0)
    Scanning file ./dotum.ttf...  ko(8822)
    Scanning file ./gulim.ttf...  ko(0)
    Scanning file ./hline.ttf...  ko(13703)

Apple fonts:
    Scanning file ./#Gungseouche.dfont...  ko(8827)
    Scanning file ./#HeadlineA.dfont...  ko(13710)
    Scanning file ./#PCmyoungjo.dfont...  ko(8826)
    Scanning file ./#Pilgiche.dfont...  ko(13710)
    Scanning file ./AppleGothic.dfont...  ko(8827)
    Scanning file ./AppleMyungjo.dfont...  ko(8824)

8822 hangul syllables in unicode are missing from KS X 1001, and most
of the fonts miss them too. about 5000 chinese characters are also in
the old ko.orth that are missing from a lot of korean fonts.

find the new ko.orth here:

http://ghostscript.com/~tor/software/patches/ko.orth

tor

Keith Packard

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

Around 9 o'clock on Dec 9, John Thacker wrote:
> Therefore, 0406 and 0456 should be removed from or commented out of ru.orth
Done.
> How necessary are the accented characters for English?  I have a decent
> number of fonts designed to cover English but which lack the accented
> characters currently required (i.e., 00c0, 00c7-00cb, 00cf, 00d1, 00d4,
> 00d6, 00e0, 00e7-00eb, 00ef, 00f1, 00f4, 00f6).  My understanding of
> English orthography is that they are optional.
The problem with English orthography is that there is no authority upon 
which we can lean.  I adapted orthographies published by Michael Everson 
and others for our use, preferring to err slightly on the side of 
including more characters than would commonly appear to ensure that the 
bulk of English documents could be rendered in a single face.

I know in my own typographical history I've seen the slow increase in the
set of glyphs considered "necessary" for proper English publication; novels
which 30 years ago would have been published in essentially ASCII
(originating as they did on typewriters) are now starting to include
accented characters as a guide to both the etymology and pronunciation of
words.  Certainly common words like naïve or résumé are more accurately
spelled with the appropriate accents than without.

I did elide letters associated with older English spelling like ? and ?; 
their use in modern English documents seems an anachronism.

So, I guess it's my cultural bias to encourage people to spell words right
rather than spelling them as if they were using an IBM selectric 
typewriter.  It was not done randomly.
> (Similarly, we don't require the oe and especially the ae ligatures,
> despite their use in British spellings, and æ's presence in ISO 8859-1.)
I wanted to stick to just the orthography and not typographical niceties,
so I left out all ligatures, which is how I think these are used in English
(unlike in Danish where I believe the æ is treated as a separate letter).

Similarly, I had considered adding the conventional quotation marks for 
each language and decided to leave those out.

Obviously, a lot of this is up to the author of the orthography files
themselves, and in many cases there are no ideal answers. The goal of the
whole exercise is to identify fonts which can accurately reproduce the bulk
of documents in the given language, a much different goal than enumerating 
either commonly used letters or all possible letters.

-keith


Keith Packard

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

Around 13 o'clock on Dec 9, Tor Andersson wrote:
> stratifying this into two layers would be a solution. how about splitting
> the categories into separate scripts and chaining them together at
> the end. this way if you ask for a typically cyrillic font and one doesn't
> find an 'exact' substitute you'll go through the list of cyrillic generic
> fonts first before trying arial unicode or falling back to verdana?
We currently have three layers here: the first is exact family matches
("strong" family matches), the second is language and territory matches and
the third is generic alias matches ("weak" family matches).  This is
supposed to do exactly what you want.
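
The three-layer priority can be illustrated with a toy ranking function. It
only models the ordering described above (strong family, then language, then
weak family). The real matcher compares many more properties, and the font
records and function name here are invented for the example.

```python
def rank(fonts, strong_family, lang, weak_families):
    """Sort candidate fonts by strong family, then language, then weak family.

    fonts: list of dicts with 'family' (str) and 'langs' (set of tags).
    weak_families: ordered list of generic-alias substitutes.
    """
    def key(font):
        strong_miss = font["family"] != strong_family
        lang_miss = lang not in font["langs"]
        if font["family"] in weak_families:
            weak_rank = weak_families.index(font["family"])
        else:
            weak_rank = len(weak_families)
        # False sorts before True, so earlier layers dominate later ones.
        return (strong_miss, lang_miss, weak_rank)

    return sorted(fonts, key=key)


# Hypothetical installed fonts.
FONTS = [
    {"family": "Verdana", "langs": {"en"}},
    {"family": "Kochi Gothic", "langs": {"ja"}},
]

if __name__ == "__main__":
    best = rank(FONTS, "Batang", "ja", ["Verdana", "Kochi Gothic"])
    print([font["family"] for font in best])
```

Here the requested family is not installed, so the language layer decides:
Kochi Gothic is ranked ahead of Verdana even though Verdana comes first in
the weak alias list.
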
> * classify known fonts into generic+script -- Kochi Gothic to 'sans-serif+japan'
That's done by adding the Kochi Gothic -> sans-serif alias entry.  The
Japanese classification is done automatically by fontconfig's language
support detection code.
> the above CMaps map from CID to unicode. inferring the reverse mapping
> should be trivial. if you need code, i have a cmap data structure and
> parser that are part of my pdf project that should be easy to extract,
> or you can just generate static mapping tables with a python script.
> i'd be happy to code one up if you need it.
Fontconfig needs to do two things.  The first is construct a list of 
Unicode values supported by the font.  For this it needs to enumerate the
encoded glyphs and compute Unicode values for each one.

The second ability is mapping from Unicode codepoints to glyphs; this is
largely a convenience for applications which don't want to deal with the
obscurities of non-Unicode mappings.  Fontconfig already has a couple of
built-in transcoding tables to handle Apple Roman and Adobe Symbol encoded
fonts, which often are either missing Unicode mappings or which have
broken Unicode mappings (I have about 1200 fonts with broken Unicode
mappings which have functional Apple Roman mappings).

FreeType has functions to enumerate the encoded glyphs in a font 
(FT_Get_First_Char and FT_Get_Next_Char) which fontconfig happily uses to 
enumerate glyphs, but for non-Unicode mappings it will need an additional 
function to convert from the encoded value to a Unicode value.

Please take a look at FcFreeTypeCharSetAndSpacing in fcfreetype.c to see 
how that all works.
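
The transcoding step for a non-Unicode charmap can be sketched as follows.
This is an illustration only: the list of character codes stands in for what
FT_Get_First_Char and FT_Get_Next_Char would enumerate, and Python's
mac_roman codec is used in place of fontconfig's own Apple Roman table.

```python
def charcodes_to_unicode(charcodes, encoding="mac_roman"):
    """Map encoded character codes (0-255) to a set of Unicode code points."""
    covered = set()
    for code in charcodes:
        try:
            text = bytes([code]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this code has no Unicode equivalent in the table
        covered.update(ord(ch) for ch in text)
    return covered


if __name__ == "__main__":
    # Pretend the font's Apple Roman charmap encodes these two codes.
    encoded = [0x41, 0x8E]
    print(sorted(f"U+{cp:04X}" for cp in charcodes_to_unicode(encoded)))
```

The resulting set is the kind of Unicode coverage list that language
detection would then compare against an orthography.
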
> now how about a char *FcFindCMap(char *name) function?
> 
> if i find the CID-font with fontconfig, it is useless without the CMaps.
> CMaps are basically a size-optimisation, instead of adding encoding
> tables to all of the fonts like you do with truetype.
As I recall, the CMaps are external to the fonts themselves, just like 
kerning data in the .pfa files.  Is there a standard naming convention 
which is used to locate CMap files for particular font files?   Or could 
we construct such a convention?  I don't see how fontconfig would 
otherwise locate the files, and if there is a suitable convention, we 
should just get applications to use the same.

-keith



Keith Packard

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

Around 10 o'clock on Dec 13, Tor Andersson wrote:
> 8822 hangul syllables in unicode are missing from KS X 1001, and most
> of the fonts miss them too. about 5000 chinese characters are also in
> the old ko.orth that are missing from a lot of korean fonts.
It almost seems like we should preserve the KSC 5601-1992 orthography 
somehow; is there a territorial difference where the Han glyphs are used 
in one area and not in another?  Or is it that Han is just slowly leaving 
the "normal" Korean language and that fontconfig should respect that 
change?

-keith



Ambrose Li

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

On Mon, Dec 13, 2004 at 02:17:28AM -0500, John Thacker wrote:
> Yet, since so many Korean fonts are based on KS X 1001 / KS
> C 5601, it's reasonable to follow it and not include the
> extra precomposed syllables, and restrict to only the Hangul
> syllables in the list Tor provided.  However, the 4882 Hanja
> are definitely part of the standard no matter what.  I'm
> somewhat surprised that so many fonts would not have them at
> all, but it certainly must be easier.  (Just as it's easier
> to not include accented characters in fonts designed for
> English.)
A similar situation exists for Chinese fonts: Many commercial
traditional Chinese fonts (esp. if they are "decorative" in some
sense) used to only contain the "frequently used" portion of
Big5; such fonts would not be detected as Chinese fonts.

However, I haven't been using commercial Chinese fonts for the
past few years, so the above is likely very outdated information
and probably no longer true.

That said, I still imagine the fonts that are missing the hanja
would likely also be "decorative" in nature, or used for
emphasis, etc.

Keith Packard

2005-Nov-21 08:51 UTC

[Fontconfig] asian font configuration

Around 2 o'clock on Dec 13, John Thacker wrote:
> However, the 4882 Hanja are definitely part of the standard no matter
> what. I'm somewhat surprised that so many fonts would not have them at all,
> but it certainly must be easier.
That's what I thought; we have a 'standard orthography' for Korean as used
in the Republic of Korea which includes a large set of glyphs which are
no longer in common use in the Republic of Korea.

There is also a standard from the Democratic People's Republic of Korea
(KPS 9566-97) which includes 4653 Korean Hanja characters.

http://www.itscj.ipsj.or.jp/ISO-IR/202.pdf

The guiding principle for fontconfig's orthography construction is to
select fonts capable of displaying the preponderance of documents in the
given language. The English orthography, as an example, includes uncommonly
used accented letters like ? and ? as they appear in many documents,
although many people accept and use alternate spellings without them.

Where available, fontconfig leans on official standards published by
relevant bodies like the Académie française, but in the case of Korean,
the standards above appear to be aimed at representing more than just
Korean as currently written.

Is there perhaps a more relevant standard than the encoding tables here?
An actual orthography would be really nice to have.

-keith


Keith Packard

2005-Nov-21 08:51 UTC


[Fontconfig] asian font configuration

Around 8 o'clock on Dec 13, Ambrose Li wrote:
> That said, I still imagine the fonts that are missing the hanja
> would likely also be "decorative" in nature, or used for
> emphasis, etc.
The question is whether Hanja is still present in any significant fraction
of Korean texts; if so, then these "decorative" faces might be thought of
as equivalent to a Latin font without accents -- usable if specifically
requested by name, but not otherwise presented to the user as a reasonable
font to use.

-keith



Keith Packard

2005-Nov-21 08:51 UTC


[Fontconfig] Overly aggresive English orthography? [was asian font configuration]

Around 14 o'clock on Dec 13, Owen Taylor wrote:
> If you aren't using Pango-style language tag refinement, this causes
> some bad problems, see, e.g.:
> 
>   https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107952
Well, that's just a symptom of a more serious issue -- that selection of
glyphs beyond the orthography of the current locale is driven by font
suitability for the locale.

Perhaps what we need is a better sorting mechanism for FcFontSort.

What we want is for the fonts that cover the specified language(s) to be
at the front of the list, and for the remaining elements of the list to be
sorted without regard to language.

That seems doable, but it may make FcFontSort even slower than it is today.
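
The ordering proposed above can be sketched as a two-level sort key. This
is only an illustration under stated assumptions: sort_fonts, the rank
table and the per-font language sets are all hypothetical toy data, "covers"
is taken to mean "covers every requested language", and nothing here
reflects how FcFontSort is actually implemented.

```python
# Hypothetical sketch: language-covering fonts first, the rest ranked
# without regard to language. Not FcFontSort's real algorithm.

def sort_fonts(fonts, wanted_langs, rank):
    """fonts: list of (name, langs); rank: name -> number, lower is better."""
    wanted = set(wanted_langs)

    def key(font):
        name, langs = font
        covers_all = wanted <= set(langs)
        # Group 0 (covers the requested languages) sorts before group 1;
        # inside each group only the language-independent rank matters.
        return (0 if covers_all else 1, rank[name])

    return [name for name, _ in sorted(fonts, key=key)]


# Toy data: the language sets and ranks are invented for the example.
FONTS = [
    ("Luxi Sans", {"en"}),
    ("Baekmuk Batang", {"ko"}),
    ("Verdana", {"en", "nl"}),
    ("Kochi Gothic", {"ja", "en"}),
]
RANK = {"Luxi Sans": 0, "Verdana": 1, "Kochi Gothic": 2, "Baekmuk Batang": 3}

if __name__ == "__main__":
    print(sort_fonts(FONTS, ["ko"], RANK))
    print(sort_fonts(FONTS, ["en"], RANK))
```

The extra cost mentioned above would come from evaluating the coverage
test for every font before ranking; the sketch does it once per font
inside the key function.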

-keith


Keith Packard

2005-Nov-21 08:51 UTC


[Fontconfig] Overly aggresive English orthography?

Around 17 o'clock on Dec 13, John Thacker wrote:
> That is part of it.  The problem is that there are several possible goals
> and situations.  There are two different approaches.  One is to render
> an entire document with a single font when possible.  Another approach is
> to render different orthographies in a single document with the "best"
> font for each orthography/language.
I believe these two approaches must be used in conjunction, and that
software can 'guess' when each approach should be used, but provisions for
user-specified overrides may need to be permitted.  No single approach
will work for all documents and users.

However, we should strive for reasonable and predictable behaviour from 
fontconfig so that people aren't just confused by the weird results and
have some chance of actually figuring out how to make it do what they want.
> E.g., if I'm reading something in English which suddenly quotes Dutch
> and uses ? or references a Welsh placename and uses ?, then maybe I just
> want the whole document to use Verdana, which contains both, rather than
> using Luxi Sans for everything except the words containing those two 
> letters, even though Luxi Sans is normally my first choice for English.
As Fontconfig can't know about the actual document content directly, the
only way to have applications automatically present this as you desire is 
to have them construct the set of necessary Unicode values for the 
document and ask Fontconfig for fonts containing those codepoints.  The 
use of lang tags in fontconfig is both a short-hand notation for these 
Unicode sets and a predictive mechanism for guessing what future glyphs 
may be presented for drawing.
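
The codepoint-set approach described above can be sketched in a few lines.
This is an illustration only: needed_codepoints, fonts_for_text and the
COVERAGE table are hypothetical, and the coverage data is invented. In the
C API an application would presumably build an FcCharSet and add it to the
pattern under FC_CHARSET rather than compare Python sets.

```python
# Hypothetical sketch: pick fonts by the codepoints a document needs.
import string


def needed_codepoints(text):
    """Set of Unicode codepoints used by the text, ignoring whitespace."""
    return {ord(ch) for ch in text if not ch.isspace()}


def fonts_for_text(text, coverage):
    """Names of fonts whose coverage includes every needed codepoint."""
    need = needed_codepoints(text)
    return [name for name, cps in coverage.items() if need <= cps]


# Toy coverage table; real coverage would come from the font files.
ASCII_LETTERS = {ord(c) for c in string.ascii_letters}
COVERAGE = {
    "Plain Latin": ASCII_LETTERS,
    "Extended Latin": ASCII_LETTERS | {ord("\u0175")},  # adds w-circumflex
}

if __name__ == "__main__":
    print(fonts_for_text("water", COVERAGE))
    print(fonts_for_text("d\u0175r", COVERAGE))
```

A lang tag then acts as a precomputed stand-in for the set returned by
needed_codepoints, which is why it can also predict glyphs the document
has not used yet.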

-keith



Tor Andersson

2005-Nov-21 08:51 UTC


[Fontconfig] asian font configuration

hi keithp and all.

a few issues relating to asian font support in fontconfig.

one!

korean language detection is "broken". only two of all the korean fonts i
have on my system are correctly identified as being korean:

# fc-list ':lang=ko'
Baekmuk Batang:style=Regular
Baekmuk Gulim:style=Regular

all of apple's korean fonts that ship with macos x, and even two of
the baekmuk fonts are missing. this needs to be fixed.


two!

the default configuration for fontconfig is a bit on the scarce side
regarding font aliases and substitutions for asian fonts.
please consider incorporating the attached file into the distribution.
exactly which default substitution fonts should be used -- batang or dotum,
mincho or gothic, etc -- should probably be discussed.


three!

i understand that due to the incapability of freetype to use CMaps to
encode CID fonts, the ability to use CID-fonts with fontconfig is severely
limited. however, it would be really really nice if fontconfig were extended
in this area.

i would like to query fontconfig for the location of CMap resource files.
they are font-related and forcing pdf and postscript tools to use a separate
configuration for finding them kinda stinks. 

i think that fontconfig should look at the registry-ordering in the
CID System Info dict and put it to good use. add a property FC_CSI and
put in a corresponding FC_LANG tag for CID-fonts.

Munhwa-Regular:csi=Adobe-Korea1:lang=ko
MOEKai:csi=Adobe-CNS1:lang=zh-TW
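
a minimal sketch of the mapping proposed above, for illustration only.
FC_CSI does not exist in fontconfig; the csi= element, the CSI_TO_LANG
table and the describe helper are hypothetical. the Korea1 and CNS1 entries
come from the two examples above; the GB1 and Japan1 entries are added here
as assumptions about the other common Adobe collections.

```python
# Hypothetical sketch: derive a lang tag from CIDSystemInfo registry-ordering.

CSI_TO_LANG = {
    "Adobe-Korea1": "ko",     # from the example above
    "Adobe-CNS1": "zh-TW",    # from the example above
    "Adobe-GB1": "zh-CN",     # assumption: simplified Chinese collection
    "Adobe-Japan1": "ja",     # assumption: Japanese collection
}


def describe(font_name, registry, ordering):
    """Build a pattern-like string with the proposed csi and lang elements."""
    csi = f"{registry}-{ordering}"
    parts = [font_name, f"csi={csi}"]
    lang = CSI_TO_LANG.get(csi)
    if lang is not None:
        # Unknown collections get a csi element but no guessed language.
        parts.append(f"lang={lang}")
    return ":".join(parts)


if __name__ == "__main__":
    print(describe("Munhwa-Regular", "Adobe", "Korea1"))
    print(describe("MOEKai", "Adobe", "CNS1"))
```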

thanks,
tor
-------------- next part --------------
<?xml version="1.0"?>

<fontconfig>

<dir>/usr/local/share/ghostscript/fonts</dir>

<!-- Defaults by asian language tag -->

<match target="pattern">
	<test name="lang"><string>zh-TW</string></test>
	<edit name="family" mode="append"><string>AR PL KaitiM Big5</string></edit>
</match>

<match target="pattern">
	<test name="lang"><string>zh-CN</string></test>
	<edit name="family" mode="append"><string>AR PL KaitiM GB</string></edit>
</match>

<match target="pattern">
	<test name="lang"><string>ja</string></test>
	<edit name="family" mode="append"><string>Kochi Gothic</string></edit>
</match>

<match target="pattern">
	<test name="lang"><string>ko</string></test>
	<edit name="family" mode="append"><string>Baekmuk Dotum</string></edit>
</match>

<!-- workaround korean lang not being detected correctly ... -->
<match target="pattern">
	<test name="lang"><string>ko</string></test>
	<!-- hmm there is no mode="delete" ... -->
	<edit name="lang" mode="assign"><string>zu</string></edit>
</match>

<!-- Traditional Chinese -->

<alias>
	<family>MSung-Light</family>
	<family>MSung-Medium</family>
	<accept><family>AR PL Mingti2L Big5</family></accept>
</alias>

<alias>
	<family>MHei-Medium</family>
	<family>MKai-Medium</family>
	<family>MingLiU</family>
	<family>PMingLiU</family>
	<accept><family>AR PL KaitiM Big5</family></accept>
</alias>

<!-- Simplified Chinese -->

<alias>
	<family>STSong-Light</family>
	<family>STFangsong-Light</family>
	<family>SimSun</family>
	<family>NSimSun</family>
	<accept><family>AR PL SungtiL GB</family></accept>
</alias>

<alias>
	<family>STHeiti-Regular</family>
	<family>STKaiti-Regular</family>
	<family>SimHei</family>
	<accept><family>AR PL KaitiM GB</family></accept>
</alias>

<!-- Japanese -->

<alias>
	<family>Ryumin-Light</family>
	<family>Ryumin-Medium</family>
	<family>HeiseiMin-W3</family>
	<family>MS-Mincho</family>
	<family>MS-PMincho</family>
	<accept><family>Hiragino Mincho Pro W3</family></accept>
	<accept><family>Sazanami Mincho</family></accept>
	<accept><family>Kochi Mincho</family></accept>
</alias>

<alias>
	<family>GothicBBB-Medium</family>
	<family>HeiseiKakuGo-W5</family>
	<family>MS-Gothic</family>
	<family>MS-PGothic</family>
	<family>MS-UIGothic</family>
	<accept><family>Hiragino Kaku Gothic Pro W3</family></accept>
	<accept><family>Sazanami Gothic</family></accept>
	<accept><family>Kochi Gothic</family></accept>
</alias>

<!-- Korean -->

<alias>
	<family>Batang</family>
	<family>BatangChe</family>
	<family>Gungsuh</family>
	<family>GungsuhChe</family>
	<family>HYSMyeongJo-Medium</family>
	<accept><family>Baekmuk Batang</family></accept>
</alias>

<alias>
	<family>Gulim</family>
	<family>GulimChe</family>
	<family>HYRGoThic-Medium</family>
	<accept><family>Baekmuk Gulim</family></accept>
</alias>

<alias>
	<family>Dotum</family>
	<family>DotumChe</family>
	<family>HYGoThic-Medium</family>
	<accept><family>Baekmuk Dotum</family></accept>
</alias>

</fontconfig>
