thr3ads.net - freebsd stable - Uppercase RE matching problems in FreeBSD 11 [Nov 2016]

Stefan Bethke

2016-Nov-06 20:57 UTC

Uppercase RE matching problems in FreeBSD 11

> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
> 
> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the
>> system boot time from the kern.boottime sysctl MIB. On 11.0 this no
>> longer works as expected:
>> 
>> $ sysctl kern.boottime
>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>> v  5 16:18:34 2016
>> 
>> sed passes over 'S' and 'N' until it hits 'v', which it considers
>> uppercase apparently. This is with LANG=en_US.UTF-8. If I set LANG=C,
>> it works as expected:
>> 
>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>> Nov  5 16:18:34 2016
>> 
>> Testing every lowercase character separately gives even more
>> inconsistent results:
>> 
>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/?p
>> Here sed thinks every lowercase character except for 'a' is
>> uppercase! This differs from the first test where sed did not think
>> 'o' is uppercase. Again, the above behaves as expected with LANG=C.
>> 
>> Does anyone have any insight into this? This is likely to break a lot
>> of existing code.
>> 
> 
> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world
> it means AaBb... Z, because there are far more characters than simple A-Z.
> In FreeBSD 11 we have a Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
> 
> For regexps, for example, one should use the character classes
> [:upper:] or [:lower:].

That is rather surprising.  Is there a normative reference for the treatment
of bracket expressions and character classes when using locales other than C
and/or encodings like UTF-8?


Stefan

-- 
Stefan Bethke <stb at lassitu.de>   Fon +49 151 14070811

Baptiste Daroussin

2016-Nov-06 21:06 UTC

Uppercase RE matching problems in FreeBSD 11

On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> That is rather surprising.  Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

For example:

"Regular expressions are a context-independent syntax that can represent a
wide variety of character sets and character set orderings, where these
character sets are interpreted according to the current locale. While many
regular expressions can be interpreted differently depending on the current
locale, many features, such as character class expressions, provide for
contextual invariance across locales."

Best regards,
Bapt
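
A minimal sketch of the character-class approach suggested above, applied to
the sed command from the original report. It assumes the sample line from
the report stands in for live `sysctl kern.boottime` output; the variable
names `line` and `result` are illustrative and not from the thread.

```shell
#!/bin/sh
# Sample line copied from the original report; on a live system it would
# come from `sysctl kern.boottime` instead (assumption for illustration).
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# [[:upper:]] is a character class rather than a range, so the match does
# not depend on where lowercase letters sort in the locale's collation order.
result=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

printf '%s\n' "$result"
```

Because the leading `.*` is greedy, the class matches the last uppercase
letter on the line, which is the 'N' of "Nov". This is the same result the
report shows for the LANG=C run.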

Stefan Ehmann

2016-Nov-06 21:14 UTC

Uppercase RE matching problems in FreeBSD 11

On 06.11.2016 21:57, Stefan Bethke wrote:
> That is rather surprising.  Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Apparently the meaning of ranges is unspecified outside the "C" locale.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched"
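
Since the quoted text defines range expressions only for the POSIX locale, a
script that must keep using `[A-Z]` could pin the locale for that one
command. The following is a minimal sketch under that assumption. It uses
LC_ALL rather than the LANG=C from the original report, because LC_ALL takes
precedence over LC_COLLATE and LANG if either is already exported. The names
`line` and `result_c` are illustrative only.

```shell
#!/bin/sh
# Sample line copied from the original report (stands in for the output of
# `sysctl kern.boottime`; an assumption for illustration).
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# LC_ALL=C applies to this sed invocation only, so the range A-Z is
# evaluated in the POSIX locale regardless of the caller's environment.
result_c=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

printf '%s\n' "$result_c"
```

The trade-off is that the whole command then runs in the C locale, so this
fits ASCII-only input such as the sysctl output here. For locale-aware text,
a character class such as `[[:upper:]]` is the more portable choice.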
