thr3ads.net - freebsd stable - Uppercase RE matching problems in FreeBSD 11 [Nov 2016]

Stefan Ehmann

2016-Nov-06 21:14 UTC

Uppercase RE matching problems in FreeBSD 11

On 06.11.2016 21:57, Stefan Bethke wrote:
>
>> On 06.11.2016 at 12:07, Baptiste Daroussin
>> <bapt at FreeBSD.org> wrote:
>> 
>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>> I happened to run an old script today that uses sed(1) to extract
>>> the system boot time from the kern.boottime sysctl MIB. On 11.0
>>> this no longer works as expected:
..
>>> Here sed thinks every lowercase character except for 'a' is
>>> uppercase! This differs from the first test where sed did not
>>> think 'o' is uppercase. Again, the above behaves as expected with
>>> LANG=C.
>>> 
>>> Does anyone have any insight into this? This is likely to break a
>>> lot of existing code.
>>> 
>> 
>> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode
>> world it means AaBb...Z, because there are far more characters than
>> simply A-Z. In FreeBSD 11 we have a Unicode collation instead of
>> falling back to LC_COLLATE=C, which means ASCII only.
>> 
>> For regexps, for example, one should use the classes [:upper:] or
>> [:lower:].
> 
> That is rather surprising.  Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Apparently the meaning of ranges is unspecified outside the "C"
locale.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched."
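
A minimal sketch of the two portable workarounds that follow from the above: use a character class instead of a range, or pin the locale to C so that the range is well defined. The helper names mask_upper and mask_upper_c are hypothetical and only serve the illustration; the test string is the one used elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: replace every uppercase letter with X using the
# POSIX character class, which does not depend on collation order.
mask_upper() {
    sed 's/[[:upper:]]/X/g'
}

# Hypothetical helper: force the POSIX ("C") locale for this one sed call,
# so that the range A-Z is defined and covers only the ASCII capitals.
mask_upper_c() {
    LC_ALL=C sed 's/[A-Z]/X/g'
}

echo 'abcdABCD' | mask_upper
echo 'abcdABCD' | mask_upper_c
```

The class form is probably the better default for new scripts; pinning LC_ALL=C is the smaller change for an old script that is full of ranges.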

Stefan Bethke

2016-Nov-06 21:30 UTC

Uppercase RE matching problems in FreeBSD 11

> On 06.11.2016 at 22:14, Stefan Ehmann <shoesoft at gmx.net> wrote:
> 
>> That is rather surprising.  Is there a normative reference for the
>> treatment of bracket expressions and character classes when using
>> locales other than C and/or encodings like UTF-8?
> 
> I found an interesting article about this issue in gawk:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

OK, I give up.  Back to jwz: "now you have two problems."

Although with en_US.UTF-8 on other systems, I have not had that experience.  A
quick check on stuff I have immediate access to:

macOS 10.12:
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
abcdXXXX

Ubuntu 14.04.5:
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
abcdXXXX

FreeBSD 10-stable:
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
abcdXXXX
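
To repeat this kind of comparison on other machines, something like the following sketch might help. It asks grep whether the range [A-Z] matches any lowercase ASCII letter under a given locale. The function name range_matches_lowercase is made up for this example, and en_US.UTF-8 is only assumed to be installed; if it is not, the tools typically fall back to C behaviour.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the range [A-Z] matches at least
# one lowercase ASCII letter when the locale named in $1 is in effect.
range_matches_lowercase() {
    printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' | LC_ALL="$1" grep -q '[A-Z]'
}

# en_US.UTF-8 is an assumption; substitute any locale listed by `locale -a`.
for loc in C en_US.UTF-8; do
    if range_matches_lowercase "$loc"; then
        echo "$loc: [A-Z] also matches lowercase letters"
    else
        echo "$loc: [A-Z] matches no lowercase letters"
    fi
done
```

In the C locale the check should always report no lowercase matches; what it reports for other locales depends on the system's collation data.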


Stefan

-- 
Stefan Bethke <stb at lassitu.de>   Fon +49 151 14070811
