On 06.11.2016 21:57, Stefan Bethke wrote:>
>> Am 06.11.2016 um 12:07 schrieb Baptiste Daroussin
>> <bapt at FreeBSD.org>:
>>
>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>> I happened to run an old script today that uses sed(1) to extract
>>> the system boot time from the kern.boottime sysctl MIB. On 11.0
>>> this no longer works as expected:
..>>> Here sed thinks every lowercase character except for 'a' is
>>> uppercase! This differs from the first test where sed did not
>>> think 'o' is uppercase. Again, the above behaves as
expected with
>>> LANG=C.
>>>
>>> Does anyone have any insight into this? This is likely to break a
>>> lot of existing code.
>>>
>>
>> Yes A-Z only means uppercase in an ASCII only world in a unicode
>> world it means AaBb... Z because there are way more characters that
>> simple A-Z. In FreeBSD 11 we have a unicode collation instead of
>> falling back in on LC_COLLATE=C which means ascii only
>>
>> For regrexp for example one should use the classes: :upper: or
>> :lower:.
>
> That is rather surprising. Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?
I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
Apparently the meaning of ranges is unspecified outside the "C"
locale.
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:
"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched"