thr3ads.net - freebsd stable - Uppercase RE matching problems in FreeBSD 11 [Nov 2016]

Stefan Bethke

2016-Nov-06 21:49 UTC

Uppercase RE matching problems in FreeBSD 11

On 06.11.2016 at 22:27, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>
>> But under what circumstances would [A-Z] mean anything other than a
>> character whose Unicode codepoint is between U+0041 and U+005A, inclusive?
>> Especially given the locale in the example is en_US.UTF-8.  Or, put another
>> way, why would an implementation interpret [A-Z] as anything other than
>> [ABCDE…XYZ]?
> 
> The collation rules for Unicode come from http://cldr.unicode.org/ and they
> match the ones on Linux, for example, and on illumos.
>
> Some GNU tools explicitly decide not to be locale-aware to avoid that kind
> of "surprise".
>> 
>> From reading your reference, I can see in 9.3.5.7:
>>> In the POSIX locale, a range expression represents the set of collating
>>> elements that fall between two elements in the collation sequence,
>>> inclusive. In other locales, a range expression has unspecified behavior […]
>>
>> So even if the observed behaviour is conforming, I'd think it's still
>> highly undesirable.
>> 
> That works for the POSIX locale, aka C, aka the ASCII-only world.

So what do I set my LANG and LC variables to?  I do want UTF-8, but I also
want my scripts to continue to work.  Clearly, en_US.UTF-8 is not what I want.
Is it C.UTF-8?  Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?


Stefan

-- 
Stefan Bethke <stb at lassitu.de>   Fon +49 151 14070811
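
[Editorial illustration, not part of the original thread.] The disagreement above is about whether a range such as [A-Z] is taken from code points or from the locale's collation sequence. The Python sketch below models the second reading with a deliberately simplified, hypothetical sequence (a A b B ... z Z); real CLDR collation data is much larger and compares on several levels, so this only shows why a collation-based range can pick up lowercase letters.

```python
import string

# Hypothetical collation sequence, for illustration only: each lowercase
# letter is immediately followed by its uppercase form (a A b B ... z Z).
COLLATION = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]


def range_by_collation(start, end):
    """Elements between start and end in COLLATION, inclusive."""
    i, j = COLLATION.index(start), COLLATION.index(end)
    return COLLATION[i:j + 1]


def range_by_codepoint(start, end):
    """Characters whose code points lie between start and end, inclusive."""
    return [chr(cp) for cp in range(ord(start), ord(end) + 1)]


by_collation = range_by_collation("A", "Z")
by_codepoint = range_by_codepoint("A", "Z")

print("code point range:", "".join(by_codepoint))
print("collation range: ", "".join(by_collation))
```

Under this toy sequence the collation-based range starts at "A", so it leaves out "a" but takes in every other lowercase letter, while the code point range contains only the 26 uppercase ASCII letters.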

Charles Swiger

2016-Nov-07 21:13 UTC

On Nov 6, 2016, at 1:49 PM, Stefan Bethke <stb at lassitu.de> wrote:
> On 06.11.2016 at 22:27, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>> That works for the POSIX locale, aka C, aka the ASCII-only world.
>
> So what do I set my LANG and LC variables to?  I do want UTF-8, but I do
> also want my scripts to continue to work.  Clearly, en_US.UTF-8 is not what
> I want.  Is it C.UTF-8?  Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?

If you want to use a UTF-8 locale, then you must start using character classes
like '[:upper:]' and '[:lower:]', because those will (or at least "should",
modulo bugs) properly handle the collation issues, including for languages
which do not have a 1-1 mapping between upper and lower case letters.

Someone with a German email address is presumably familiar with ß / Eszett...
:-)

Regards,
-- 
-Chuck
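
[Editorial illustration, not part of the original thread.] Inside a bracket expression the classes mentioned above are written [[:upper:]] and [[:lower:]]. Python's re module does not support that POSIX syntax, so the sketch below uses str.isupper() and str.upper() as a stand-in: they follow Unicode character properties rather than the C library's locale, which makes this an analogy for what a class-based test is meant to do, not a reproduction of FreeBSD's regex behaviour. The helper name ascii_range_upper is made up for the example.

```python
def ascii_range_upper(ch):
    """True if ch lies in the code point range U+0041..U+005A ("A".."Z")."""
    return "A" <= ch <= "Z"


# A plain ASCII capital, an ASCII lowercase letter, an accented capital,
# and the German Eszett mentioned in the thread.
samples = ["A", "z", "É", "ß"]

for ch in samples:
    print(
        repr(ch),
        "in A..Z:", ascii_range_upper(ch),
        "isupper:", ch.isupper(),
        "upper():", repr(ch.upper()),
    )
```

The accented capital is uppercase by its Unicode property but falls outside the A..Z code point range, and upper-casing "ß" produces two characters, which is the kind of non 1-1 case mapping the message refers to.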

Mark Martinec

2016-Nov-07 23:12 UTC

2016-11-06 22:49, Stefan Bethke wrote:
> So what do I set my LANG and LC variables to?  I do want UTF-8, but I
> do also want my scripts to continue to work.  Clearly, en_US.UTF-8 is
> not what I want.  Is it C.UTF-8?
> Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?

Yes, that is the safest bet. LANG sets a default, but LC_COLLATE,
LC_TIME, LC_NUMERIC and LC_MONETARY are better set to "C" to
override LANG in their domains.

Leave LC_ALL undefined or empty, as it overrides every other
locale setting (unless you really want everything to be set
to "C").

   Mark
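
[Editorial illustration, not part of the original thread.] The advice above relies on the precedence POSIX gives the locale variables: a non-empty LC_ALL wins, then the category's own LC_* variable, then LANG. The Python sketch below models that lookup on a plain dictionary; the function name effective_locale is made up for the example, and the final fallback to "C" is an assumption, since POSIX leaves the default implementation-defined.

```python
def effective_locale(env, category):
    """Resolve one locale category from an environment mapping.

    Order of precedence: LC_ALL, then the category variable itself
    (for example "LC_COLLATE"), then LANG. Unset and empty are
    treated the same way.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # Assumption: fall back to "C"; the real default is implementation-defined.
    return "C"


# The setup suggested in the thread: UTF-8 by default, "C" for collation.
env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}
print("LC_COLLATE ->", effective_locale(env, "LC_COLLATE"))
print("LC_CTYPE   ->", effective_locale(env, "LC_CTYPE"))

# Setting LC_ALL overrides both of the above.
env_all = dict(env, LC_ALL="de_DE.UTF-8")
print("LC_COLLATE with LC_ALL ->", effective_locale(env_all, "LC_COLLATE"))
```

With LANG=en_US.UTF-8 and LC_COLLATE=C, character handling (LC_CTYPE) stays UTF-8 while collation resolves to "C"; once LC_ALL is set, the per-category setting no longer has any effect.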
