On 06.11.2016 at 22:27, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>
>> But under what circumstances would [A-Z] mean anything other than a
>> character whose Unicode codepoint is between U+0041 and U+005A,
>> inclusive? Especially given the locale in the example is en_US.UTF-8.
>> Or, put another way, why would an implementation interpret [A-Z] as
>> anything other than [ABCDE…XYZ]?
>
> The collation rules for Unicode come from http://cldr.unicode.org/ and they
> match the ones on Linux, for example, and the ones on illumos.
>
> Some GNU tools explicitly decide to be non-locale-aware to avoid that
> kind of "surprise".
>>
>> From reading your reference, I can see in 9.3.5.7:
>>> In the POSIX locale, a range expression represents the set of
>>> collating elements that fall between two elements in the collation
>>> sequence, inclusive. In other locales, a range expression has
>>> unspecified behavior […]
>>
>> So even if the observed behaviour is conforming, I'd think it's still
>> highly undesirable.
>>
> That works for POSIX locale aka C aka ASCII only world

So what do I set my LANG and LC variables to? I do want UTF-8, but I do also want my scripts to continue to work. Clearly, en_US.UTF-8 is not what I want. Is it C.UTF-8? Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?

Stefan

--
Stefan Bethke <stb at lassitu.de> Fon +49 151 14070811
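[Editorial illustration] The difference between a codepoint range and a collation-order range can be sketched in a few lines of Python. This is only a toy model: the interleaved sequence `a A b B ... z Z` is an assumption chosen for illustration, not the actual CLDR table, and the helper names `in_codepoint_range` and `in_collation_range` are hypothetical.

```python
import string

# ASSUMPTION: a simplified, hypothetical collation sequence that interleaves
# lowercase and uppercase letters (a A b B ... z Z). Real CLDR-derived tables
# are far larger; this only illustrates the ordering effect.
COLLATION = [c
             for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
             for c in pair]
RANK = {ch: i for i, ch in enumerate(COLLATION)}


def in_codepoint_range(ch, lo, hi):
    """Range as the C/POSIX locale sees it: compare code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_collation_range(ch, lo, hi):
    """Range as a collation-aware implementation might see it."""
    return ch in RANK and RANK[lo] <= RANK[ch] <= RANK[hi]


sample = "aAbBzZ"
by_codepoint = [c for c in sample if in_codepoint_range(c, "A", "Z")]
by_collation = [c for c in sample if in_collation_range(c, "A", "Z")]

print(by_codepoint)  # ['A', 'B', 'Z']
print(by_collation)  # ['A', 'b', 'B', 'z', 'Z']
```

Under this assumed ordering, everything between "A" and "Z" in the sequence falls inside the range, so lowercase "b" through "z" match while "a" does not.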
On Nov 6, 2016, at 1:49 PM, Stefan Bethke <stb at lassitu.de> wrote:
> On 06.11.2016 at 22:27, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>> That works for POSIX locale aka C aka ASCII only world
>
> So what do I set my LANG and LC variables to? I do want UTF-8, but I do
> also want my scripts to continue to work. Clearly, en_US.UTF-8 is not
> what I want. Is it C.UTF-8? Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?

If you want to use a UTF-8 locale, then you must start using character classes like '[:upper:]' and '[:lower:]', because those will (or at least "should", modulo bugs) properly handle the collation issues, including for languages which do not possess a 1-1 mapping between upper and lower case letters.

Someone with a German email address is presumably familiar with ß / Eszett...? :-)

Regards,
--
-Chuck
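[Editorial illustration] As a rough analogue of the character-class advice, the sketch below contrasts a codepoint range test with Python's Unicode-aware `str.isupper()` and `str.islower()`. Note the assumption: these methods consult Python's Unicode database, not the C library's locale tables, so this is not how `[:upper:]` is implemented in regex tools; it only shows the kind of case a fixed A-Z range misses. The helper `is_upper_ascii_range` is a hypothetical name.

```python
def is_upper_ascii_range(ch):
    """Equivalent of [A-Z] interpreted strictly by code point."""
    return "A" <= ch <= "Z"


A_UMLAUT = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS
ESZETT = "\u00df"    # LATIN SMALL LETTER SHARP S

for ch in ["A", A_UMLAUT, ESZETT, "z"]:
    # Columns: character, in A-Z by code point, Unicode upper, Unicode lower
    print(ch, is_upper_ascii_range(ch), ch.isupper(), ch.islower())

# Eszett has no single-character uppercase form under Python's default
# case mapping, so the mapping is not 1-1.
print(ESZETT.upper())  # SS
```

The capital A with diaeresis is uppercase by class but outside the A-Z codepoint range, and Eszett uppercases to two characters, which is the non-1-1 mapping mentioned above.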
2016-11-06 22:49, Stefan Bethke wrote:
> So what do I set my LANG and LC variables to? I do want UTF-8, but I
> do also want my scripts to continue to work. Clearly, en_US.UTF-8 is
> not what I want. Is it C.UTF-8?

> Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?

Yes, that is the safest bet. LANG sets a default, but LC_COLLATE, LC_TIME, LC_NUMERIC and LC_MONETARY are better set to "C" to override LANG in their domains. Leave LC_ALL undefined or empty, as it overrides every other locale setting (unless you really want everything to be set to "C").

Mark
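[Editorial illustration] The precedence described above (LC_ALL first, then the specific LC_* variable, then LANG) can be modelled with a small Python function. This is a sketch of the lookup order only: `effective_locale` is a hypothetical helper, and the final fallback to "C" is an assumption, since the default when nothing is set is left to the implementation.

```python
def effective_locale(env, category):
    """Resolve one locale category: LC_ALL, then LC_<category>, then LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated the same
            return value
    return "C"  # ASSUMPTION: fall back to "C" when nothing is set


# The setup suggested in the thread: UTF-8 by default, C collation.
env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}
print(effective_locale(env, "LC_COLLATE"))  # C
print(effective_locale(env, "LC_CTYPE"))    # en_US.UTF-8

# Setting LC_ALL overrides everything, including LC_COLLATE=C.
env_all = dict(env, LC_ALL="de_DE.UTF-8")
print(effective_locale(env_all, "LC_COLLATE"))  # de_DE.UTF-8
```

To inspect a live environment, the same function could be called with `os.environ` in place of the dictionary.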