thr3ads.net - Fontconfig - [Fontconfig] Regularizing contains operator semantics [Nov 2005]

Owen Taylor

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

On Sat, 2003-07-12 at 01:00, Ambrose Li wrote:
> With the current version of fontconfig (and gtk2), it is getting
> difficult to get X applications to let me use, for example, a
> Japanese font for traditional Chinese, even if the font is
> perfectly fine for the task I do, because the application will
> believe the fontconfig notion of "support" for my locale, and
> filter out all the "unsupported" fonts.  I think it
> counter-productive to put so much trust in the mechanical notion
> of "complete code space coverage".
Can you give a concrete example? As Keith said, if you specify
a font explicitly, you'll get that font for every character
it contains.

Regards,
						Owen

Owen Taylor

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

On Sat, 2003-07-12 at 12:14, Keith Packard wrote:
> Around 11 o'clock on Jul 12, Owen Taylor wrote:
> 
> >  when a key is referred
> > to as having the value "foo,bar", there are three possible
> > interpretations of that:
> > 
> >  * A string containing an embedded comma
> >  * A pattern with multiple values with the same key
> >  * A pattern with a single value with a composite type (LangSet)
> 
> and the winner is 2) -- foo,bar represents a pattern with multiple values 
> for the same key.  LangSets and Charsets were designed to be a more compact
> representation of this idea for those specific kinds of values; I think 
> there are some places where the fact that they are stored in a single 
> entry is exposed to the user and I'd like to close those holes.
What about in <match><test>? What does 

 <string>times,courier</string>
 <lang>en,de</lang>

mean there? If it means "an embedded comma", then I would suggest
that fontconfig should probably print a warning like:

 "Interpreting ',' as part of value"

because otherwise, people will definitely get confused.

(the docs say currently "These elements hold a single value of the
indicated type.")

Regards,
						Owen

Owen Taylor

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

On Sat, 2003-07-12 at 12:04, Keith Packard wrote:
> Around 10 o'clock on Jul 12, Owen Taylor wrote:
> 
> > Can you give a concrete example? As Keith said, if you specify
> > a font explicitly, you'll get that font for every character
> > it contains.
> 
> I thought the problem mentioned was that applications were using lang to 
> restrict the presented list of available fonts in some context.  I know 
> Mozilla does this when selecting preferred fonts for language groups; I 
> can believe that other apps also do this; perhaps we should find a way to 
> deprecate this activity.  The Mozilla behaviour was inherited from the 
> core font listing techniques and so is not specific to its interactions
> with fontconfig.
gtk2 was explicitly mentioned, and gtk2 doesn't expose fontconfig's
listing system at all.

But maybe "Mozilla using gtk2" was meant.

Regards,
						Owen

Ambrose Li

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

On Fri, Jul 11, 2003 at 01:54:56PM -0700, Keith Packard wrote:
> The font supports all of the langs requested by the
> application.  I think this means that the font 'contains'
> all of the langs requested by the application (remember,
> we're talking about LISTING here).  Now, the tricky part of
> defining what 'support' means for a specific lang entry.  When
> the application provides a language/territory pair, then the
> font must either provide a matching language/territory pair,
> or a bare language entry.  When the application provides
> a bare language, the font must either provide a matching
> bare language entry or a language/territory pair with *any*
> territory:
> 
> 	application	font		"supports"
> 	-----------	----		----------
> 	zh		zh_cn		YES
> 	zh_tw		zh_cn		NO
This is theoretically sound. However, for practical purposes it
is wrong; fonts with incomplete coverage are often still useful
(not for everything, but for particular tasks like typesetting
short pieces or even longer pieces of text). Especially with the
scarcity of free CJK fonts, it is almost a must to, for example,
use zh_CN or even ja/ko fonts for zh_TW in certain cases. (The
reverse is also true; i.e., a zh_TW and/or zh_CN font will be
useful for setting Japanese in a limited way.) In fact, there
are even commercial zh_TW fonts that cover less than half of the
Big5 code space (e.g., only the "frequently used characters"
space, i.e., 4501 code points out of the complete Big5 coverage
of 17552; because of the structure of Big5, just having 4501
of the "most frequently used" characters should at least already
make the font "support zh_TW").

With the current version of fontconfig (and gtk2), it is getting
difficult to get X applications to let me use, for example, a
Japanese font for traditional Chinese, even if the font is
perfectly fine for the task I do, because the application will
believe the fontconfig notion of "support" for my locale, and
filter out all the "unsupported" fonts.  I think it
counter-productive to put so much trust in the mechanical notion
of "complete code space coverage".


Regards,
-- 
Ambrose LI Cheuk-Wing  <a.c.li@ieee.org>

http://ada.dhs.org/~acli/

Keith Packard

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

Around 10 o'clock on Jul 12, Owen Taylor wrote:
> Can you give a concrete example? As Keith said, if you specify
> a font explicitly, you'll get that font for every character
> it contains.
I thought the problem mentioned was that applications were using lang to 
restrict the presented list of available fonts in some context.  I know 
Mozilla does this when selecting preferred fonts for language groups; I 
can believe that other apps also do this; perhaps we should find a way to 
deprecate this activity.  The Mozilla behaviour was inherited from the 
core font listing techniques and so is not specific to its interactions
with fontconfig.

-keith

Owen Taylor

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

On Fri, 2003-07-11 at 16:54, Keith Packard wrote:
> LISTING FONTS
> 
> When listing fonts, contains should have "obvious" semantics, I suggest
> that those semantics depend on the type of the value:
> 
> 	string, number, boolean:
> 
> font has an equal value for every value in the pattern.  This means
> that using 'times,courier' for the family will result in no fonts
> being listed as no font has both times and courier family names.  In fact, I
> can't see a good use for multiple values here as it would require multiple
> values in the fonts; let's see if that is broken.  For strings, the change
> here is that 'contains' does not mean sub string -- list 'courier' and you
> won't see 'courier 10 pitch'.  I think strings should be treated as atomic
> values in this context; fontconfig doesn't have string operators, which
> is at least consistent.
What you are saying in this mail generally makes sense, but when
I get down to details I get a little confused, especially about
the interpretation of multiple values - when a key is referred
to as having the value "foo,bar", there are three possible
interpretations of that:

 * A string containing an embedded comma
 * A pattern with multiple values with the same key
 * A pattern with a single value with a composite type (LangSet)

When reading through your mail, I had some trouble figuring out
when each of these interpretations was applicable in what context,
and in fact, it's not always clear to me in practice using fontconfig
either. If I do:

   fc-list times,courier

I assume that the resulting pattern has two FC_FAMILY elements, one
for times, and one for courier.

But then I don't see how your proposed changes section:
> 1)      Use a Contains-alike operator for LISTING which does exact
>        matching for strings, permit Contains for EDITING to do
>        substring matching
is going to result in going from the current result:

 List both fonts with a family of Times and those with a family
 of Courier

to the behavior described above.

Regards,
						Owen

Keith Packard

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

Around 1 o'clock on Jul 12, Ambrose Li wrote:
> With the current version of fontconfig (and gtk2), it is getting
> difficult to get X applications to let me use, for example, a
> Japanese font for traditional Chinese, even if the font is
> perfectly fine for the task I do, because the application will
> believe the fontconfig notion of "support" for my locale, and
> filter out all the "unsupported" fonts.
Perhaps this is not a problem with fontconfig, but rather with how 
applications interpret its interface in presenting fonts.  Fontconfig 
always places application specified families higher in precedence than 
fonts selected strictly through language coverage concerns, so you should 
be able to specify any font family by name and have it work in whatever 
locale you are using.

The notion of language support is designed precisely for the case where 
no specified font family is available on the system and a 'fall-back' to
available fonts is required; choosing one with 'support' for the language
ensures that multiple fonts won't be needed.  Font substitution is a hard
problem, and this language coverage mechanism has made a positive change 
in many environments on the resulting presentation of documents.
> I think it counter-productive to put so much trust in the mechanical notion
> of "complete code space coverage".
Perhaps we need to create better interfaces for applications to help 
clarify where language coverage is intended to be used.  Suggestions on 
what should be done are welcome.

-keith

Keith Packard

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

Around 11 o'clock on Jul 12, Owen Taylor wrote:
>  when a key is referred
> to as having the value "foo,bar", there are three possible
> interpretations of that:
> 
>  * A string containing an embedded comma
>  * A pattern with multiple values with the same key
>  * A pattern with a single value with a composite type (LangSet)
and the winner is 2) -- foo,bar represents a pattern with multiple values 
for the same key.  LangSets and Charsets were designed to be a more compact 
representation of this idea for those specific kinds of values; I think 
there are some places where the fact that they are stored in a single 
entry is exposed to the user and I'd like to close those holes.
> If I do:
> 
>    fc-list times,courier
> 
> I assume that the resulting pattern has two FC_FAMILY elements, one
> for times, and one for courier. 
yes, that's correct -- commas separate multiple values with the same key.
> But then I don't see how your proposed changes section:
> 
> > 1)      Use a Contains-alike operator for LISTING which does exact
> >        matching for strings, permit Contains for EDITING to do
> >        substring matching
>
> (will result in a change ...) to the behavior described above.
I think I missed a step -- LISTING will require matches for all values of 
each key, so

	$ fc-list times,courier

will list only fonts with *both* family times and family courier (i.e. no 
fonts at all).  Yes, this is useless, but I want to make sure that

	$ fc-list :lang=en,de

means to list only fonts with *both* english and german support.  Having 
different meanings for different keys seems like a really bad idea, worse 
than defining the behaviour of 'fc-list times,courier' as useless.
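
The "every value of every key" rule for LISTING can be sketched in a few lines of Python. This is only a toy model of the semantics proposed in this thread, not fontconfig code: the font records, the `lists` helper and the `fc_list` function are all made up for illustration, and values are compared as plain atomic strings.

```python
# Hypothetical font records: each key maps to the values the font provides.
FONTS = {
    "Times": {"family": ["times"], "lang": ["en", "de", "fr"]},
    "Courier": {"family": ["courier"], "lang": ["en"]},
}


def lists(font, pattern):
    """True if the font provides every value of every key in the pattern."""
    return all(
        value in font.get(key, ())
        for key, values in pattern.items()
        for value in values
    )


def fc_list(pattern):
    """Names of all toy fonts that satisfy the pattern, sorted."""
    return sorted(name for name, font in FONTS.items() if lists(font, pattern))


# Pattern for "times,courier": two family values, both required.
print(fc_list({"family": ["times", "courier"]}))
# Pattern for ":lang=en,de": both languages required.
print(fc_list({"lang": ["en", "de"]}))
```

The same rule is applied to every key, which is the point being argued: the family query comes back empty for the same reason the lang query is narrowed to fonts covering both languages.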

Thanks for reading through this stuff; I'm hoping to get a chance to write
down a specification for the library semantics from this discussion.

-keith

Keith Packard

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

Around 14 o'clock on Jul 12, Owen Taylor wrote:
> What about in <match><test>? What does 
> 
>  <string>times,courier</string>
>  <lang>en,de</lang>
> 
> Mean there? If it means "an embedded comma" then I would suggest
> that fontconfig should probably print a warning like:
Sigh.  Yes, it means an embedded comma; only the string name parser 
(FcNameParse) splits things at punctuation.  This is useful for '-' where
<string>sans-serif</string> means the sans-serif family and not the sans
family at size 'serif'.
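
To make the difference concrete, here is a rough Python sketch of how a name string is split at punctuation. It is a simplified stand-in, not the real FcNameParse: the `parse_name` function is hypothetical, sizes are kept as strings, and escaping and symbolic constants (which the real parser would need to handle) are ignored.

```python
def parse_name(name):
    """Split 'families-sizes:key=v1,v2:...' into a dict of value lists.

    Simplified illustration only: commas separate multiple values, the
    first '-' separates families from sizes, ':' separates properties.
    """
    head, *props = name.split(":")
    families, _, sizes = head.partition("-")
    pattern = {}
    if families:
        pattern["family"] = families.split(",")
    if sizes:
        pattern["size"] = sizes.split(",")
    for prop in props:
        if "=" not in prop:
            continue  # bare constants such as ':bold' are skipped here
        key, _, values = prop.partition("=")
        pattern.setdefault(key, []).extend(values.split(","))
    return pattern


print(parse_name("times,courier"))
print(parse_name(":lang=en,de"))
print(parse_name("sans-serif"))
```

In this model "sans-serif" comes out as family "sans" at size "serif", which is exactly the misreading that the XML string element avoids by never being split.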

If you want to check for any of a list, you can have multiple values in the
<test> case:

	<test name="family" qual="any">
		<string>times</string>
		<string>courier</string>
	</test>

That will look for either 'times' or 'courier'.

Or, you can use:

	<test name="lang" qual="all">
		<string>en</string>
		<string>de</string>
	</test>

to check for both en and de.

I'd prefer to not emit warnings for reasonable syntax; I'm not sure how
one would rewrite the values to avoid the warnings, which seems pretty 
harsh.

-keith

Keith Packard

2005-Nov-21 08:50 UTC

[Fontconfig] Regularizing contains operator semantics

"Contains" matching issues.

The contains operator is currently used in font listing and can be used in
match/edit rules.

LISTING FONTS

When listing fonts, contains should have "obvious" semantics, I suggest
that those semantics depend on the type of the value:

	string, number, boolean:

font has an equal value for every value in the pattern.  This means
that using 'times,courier' for the family will result in no fonts
being listed as no font has both times and courier family names.  In fact, I
can't see a good use for multiple values here as it would require multiple
values in the fonts; let's see if that is broken.  For strings, the change
here is that 'contains' does not mean sub string -- list 'courier' and you
won't see 'courier 10 pitch'.  I think strings should be treated as atomic
values in this context; fontconfig doesn't have string operators, which
is at least consistent.

	charset:
	
font contains listed Unicode codepoints, in other words, the charset provided
by the font 'contains' all of the glyphs requested by the application.

	lang:

(Remember that 'lang' is a composite value consisting of a language value and
 a territory value.  The list of lang values in a font is computed from
 Unicode coverage ranges based on orthographies.  Except for Chinese, all of
 these coverage ranges are (currently) associated only with a language and not
 a territory.  Chinese is (currently) split into three territory groups
 (mainland China and Singapore, Hong Kong, Taiwan and Macau).  So, most
 language comparisons will be done with a language/territory pair supplied by
 the application (often from the current locale) against fonts which know
 only languages and not territories.  However, applications will also provide
 only languages at times to be matched against fonts which have languages and
 territories.)

The font supports all of the langs requested by the application.  I think
this means that the font 'contains' all of the langs requested by the
application (remember, we're talking about LISTING here).  Now for the tricky
part: defining what 'support' means for a specific lang entry.  When
the application provides a language/territory pair, then the font must
either provide a matching language/territory pair, or a bare language entry.
When the application provides a bare language, the font must either provide
a matching bare language entry or a language/territory pair with *any*
territory:

	application	font		"supports"
	-----------	----		----------
	zh		zh_cn		YES
	zh_tw		zh_cn		NO
	en_gb		en		YES
	en		en		YES
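
The table above can be reproduced with a short Python sketch of the proposed rule. This is an illustration of the semantics described in this message, not fontconfig's implementation; `split_lang` and `supports` are hypothetical helper names, and tags are assumed to use '_' or '-' between language and territory.

```python
def split_lang(tag):
    """Split 'zh_tw' into ('zh', 'tw'); a bare 'zh' gives ('zh', None)."""
    lang, _, territory = tag.lower().replace("-", "_").partition("_")
    return lang, territory or None


def supports(requested, provided):
    """True if a font lang entry satisfies the application's lang request."""
    req_lang, req_terr = split_lang(requested)
    font_lang, font_terr = split_lang(provided)
    if req_lang != font_lang:
        return False
    if req_terr is None:
        return True  # bare language request: any territory will do
    # language/territory request: same territory, or a bare language entry
    return font_terr is None or font_terr == req_terr


for app, font in [("zh", "zh_cn"), ("zh_tw", "zh_cn"), ("en_gb", "en"), ("en", "en")]:
    print(app, font, "YES" if supports(app, font) else "NO")
```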

MATCHING

The LISTING algorithm is designed to sharply restrict the set of provided
fonts; an empty list is often the result of overspecified patterns; that
matches the expected usage of providing precise information to users about
what actual fonts are available, rather than what font will be used when a
specific pattern is matched.  In contrast, MATCHING is designed to always
provide a font, and in fact to provide a score measuring how accurate that
match is so that the set of available fonts can be sorted by this metric 
and returned to the application.

When matching fonts, we're not using the boolean 'contains' operators, but
rather measuring distance from the pattern to the font (in CS terms, LISTING
is a constraint satisfaction problem while MATCHING is a constraint
optimization problem).

	string, boolean:

Distance in these objects is measured with only two values -- matching and
nonmatching -- matching strings or booleans have distance 0 while
mismatching values have distance 1.

	number:
	
Distance between two numbers is just the absolute value of their difference
(the obvious value).  This is used for things like weight and slant; the
numeric values for those constants were carefully chosen to prefer reasonable
substitutions (italic and oblique are closer together than either is to
roman).
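
As a small illustration of these two distance rules, here is a Python sketch. The function names are made up for this example; the slant values are the ones defined in fontconfig.h (roman 0, italic 100, oblique 110).

```python
# Slant constants as defined in fontconfig.h.
FC_SLANT_ROMAN = 0
FC_SLANT_ITALIC = 100
FC_SLANT_OBLIQUE = 110


def atom_distance(pattern_value, font_value):
    """Strings and booleans: 0 when equal, 1 otherwise."""
    return 0 if pattern_value == font_value else 1


def number_distance(pattern_value, font_value):
    """Numbers: absolute value of the difference."""
    return abs(pattern_value - font_value)


# Asking for italic: an oblique face is a much closer substitute than roman.
print(number_distance(FC_SLANT_ITALIC, FC_SLANT_OBLIQUE))
print(number_distance(FC_SLANT_ITALIC, FC_SLANT_ROMAN))
```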

	charset:

Distance between two charsets is the count of characters requested by
the pattern but not provided by the font.  This means that a font which
fully covers the requested characters has distance '0'.
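
A minimal Python sketch of the charset distance, modelling a charset as a plain set of code points. The `charset_distance` name and the sample sets are assumptions for illustration; fontconfig's own charset type is a more compact structure.

```python
def charset_distance(pattern_chars, font_chars):
    """Count of characters requested by the pattern but missing from the font."""
    return len(set(pattern_chars) - set(font_chars))


requested = set("caf\u00e9")            # c, a, f, e-acute
ascii_only = {chr(c) for c in range(0x20, 0x7F)}
latin1 = {chr(c) for c in range(0x20, 0x100)}

print(charset_distance(requested, ascii_only))  # e-acute is missing
print(charset_distance(requested, latin1))      # fully covered
```

Note that the measure is one-sided: extra characters in the font never add to the distance.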

	lang: 

Distance has three values:

	0:	pattern and font have equal language/country,
		or pattern has only language and font has language with
		any country.

	1:	Pattern and font have equal language and different
		country (zh_CN vs zh_TW)

	2:	Pattern and font have different language
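
The three lang distances could be sketched in Python as follows. The helper names are hypothetical. One case is not spelled out in the list above: a pattern with a country against a font with only a bare language. The sketch assumes distance 0 there, by analogy with the LISTING table (en_gb against en is "YES"); treat that as an assumption rather than settled behaviour.

```python
def split_lang(tag):
    """Split 'zh_tw' into ('zh', 'tw'); a bare 'zh' gives ('zh', None)."""
    lang, _, country = tag.lower().replace("-", "_").partition("_")
    return lang, country or None


def lang_distance(pattern, font):
    """0, 1 or 2 as described above for MATCHING."""
    p_lang, p_country = split_lang(pattern)
    f_lang, f_country = split_lang(font)
    if p_lang != f_lang:
        return 2
    # Assumption: a bare language on either side counts as a full match.
    if p_country is None or f_country is None or p_country == f_country:
        return 0
    return 1


for pattern, font in [("zh_cn", "zh_cn"), ("zh", "zh_tw"), ("zh_cn", "zh_tw"), ("zh", "ja")]:
    print(pattern, font, lang_distance(pattern, font))
```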

EDITING

The EDITING algorithm needs a method for matching patterns for each edit
operation; this is another constraint satisfaction problem as the edit rules
are either applied or not applied.

Match rules in edit instructions can use many different operators to
constrain pattern selection:

	eq
	not_eq
	less
	less_eq
	more
	more_eq
	contains
	not_contains

Each of these operators behaves differently for each datatype.  For
datatypes which aren't ordered, I've defined the ordered operators to always
return false.

	string:
	
I think these should be treated as unordered objects so that collation
isn't visible to the user.  The remaining question is whether the 'contains'
operator should be used to detect sub-string presence.  The LISTING
operation above should not do this as the operator is not selectable, but
allowing 'contains' to do substring detection in an EDITING context means
that LISTING won't use Contains, but rather some Contains-like analog which
is actually Equal for strings.  Hmm.  Permitting Contains for EDITING would
probably be useful, especially for FC_STYLE pattern elements.

	boolean, number:

These have obvious semantics for all of the operators if
contains/not_contains are allowed to be synonyms for eq/not_eq.

	charset, lang:

I think the semantics described above for LISTING should apply here.
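
For strings, numbers and booleans, the EDITING operator rules above can be sketched in Python like this. The `evaluate` function is hypothetical and only models the semantics proposed in this message; it assumes the pattern's value is the left-hand side of each comparison, uses exact, case-sensitive string comparison for simplicity, and leaves out charset and lang.

```python
import operator

# Ordered operators, meaningful only for ordered datatypes.
ORDERED = {
    "less": operator.lt,
    "less_eq": operator.le,
    "more": operator.gt,
    "more_eq": operator.ge,
}


def evaluate(op, value, operand):
    """Apply a match operator to a pattern value and a rule operand."""
    if isinstance(value, str):
        if op == "eq":
            return value == operand
        if op == "not_eq":
            return value != operand
        if op == "contains":
            return operand in value          # substring test
        if op == "not_contains":
            return operand not in value
        return False  # strings are unordered: ordered operators are false
    # Numbers and booleans: contains/not_contains are synonyms for eq/not_eq.
    if op in ("eq", "contains"):
        return value == operand
    if op in ("not_eq", "not_contains"):
        return value != operand
    return ORDERED[op](value, operand)


print(evaluate("contains", "Bold Italic", "Italic"))  # style substring match
print(evaluate("less", "courier", "times"))           # always false for strings
print(evaluate("more", 200, 100))
```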

PROPOSED CHANGES

I believe the only changes necessary to implement these semantics are:

1)	Use a Contains-alike operator for LISTING which does exact
	matching for strings, permit Contains for EDITING to do
	substring matching

2)	Change lang Contains semantics to make ll_xx contain ll and
	ll contain ll_xx (currently, I believe ll_xx does not contain ll)
