thr3ads.net - freebsd stable - Uppercase RE matching problems in FreeBSD 11 [Nov 2016]

Baptiste Daroussin

2016-Nov-06 11:07 UTC

Uppercase RE matching problems in FreeBSD 11

On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> I happened to run an old script today that uses sed(1) to extract the system
> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> expected:
> 
> $ sysctl kern.boottime
> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> v  5 16:18:34 2016
> 
> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> expected:
> 
> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> Nov  5 16:18:34 2016
> 
> Testing every lowercase character separately gives even more inconsistent
> results:
> 
> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
> > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > i
> > j
> > k
> > l
> > m
> > n
> > o
> > p
> > q
> > r
> > s
> > t
> > u
> > v
> > w
> > x
> > y
> > z
> > !
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
> 
> Here sed thinks every lowercase character except for 'a' is uppercase! This
> differs from the first test where sed did not think 'o' is uppercase. Again,
> the above behaves as expected with LANG=C.
> 
> Does anyone have any insight into this? This is likely to break a lot of
> existing code.
> 
Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
means AaBb... Z, because there are far more characters than simply A-Z. In
FreeBSD 11 we have a Unicode collation instead of falling back on
LC_COLLATE=C, which means ASCII only.

For regexps, for example, one should use the character classes [:upper:] or
[:lower:].
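
As a minimal sketch of that advice, the original sed command can be rewritten with the POSIX character class in place of the A-Z range. The sample line is copied from the report above, and the variable names `line` and `boot` are only illustrative; on a live system the input would come from `sysctl kern.boottime`.

```shell
# Sample input copied from the report; on a real system use: sysctl kern.boottime
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# [[:upper:]] selects uppercase letters via the locale's character
# classification (LC_CTYPE), not via its collation order, so it does not
# pick up lowercase letters the way the range A-Z can.
boot=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

printf '%s\n' "$boot"   # prints: Nov  5 16:18:34 2016
```

Note that the class name goes inside a bracket expression, hence the doubled brackets in `[[:upper:]]`.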

Best regards,
Bapt

Mark Martinec

2016-Nov-06 12:26 UTC

Uppercase RE matching problems in FreeBSD 11

2016-11-06 12:07, Baptiste Daroussin wrote:
> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
> means AaBb... Z, because there are far more characters than simply A-Z. In
> FreeBSD 11 we have a Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the character classes [:upper:] or
> [:lower:].

It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?) at "C"
when LANG or LC_CTYPE is set to something else, otherwise unexpected
things may happen.
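
A hedged sketch of this suggestion: the old A-Z pattern is left untouched, and only the collation category is pinned to "C" for the one sed invocation. The sample line is copied from the original report, and the variable names are illustrative. LC_ALL is unset inside the command substitution because, when set, it takes precedence over both LANG and the individual LC_* variables.

```shell
# Sample input copied from the report; on a real system use: sysctl kern.boottime
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# Keep the UTF-8 locale for everything else, but force "C" collation so the
# range A-Z is evaluated in plain byte order. The unset only affects the
# subshell created by the command substitution.
boot=$(unset LC_ALL
       printf '%s\n' "$line" |
       LANG=en_US.UTF-8 LC_COLLATE=C sed -e 's/.*\([A-Z].*\)$/\1/')

printf '%s\n' "$boot"
```

To apply the same setting to a whole session rather than one command, `export LC_COLLATE=C` in the shell startup file would be one option, again assuming LC_ALL is not set elsewhere.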

   Mark


Stefan Bethke

2016-Nov-06 20:57 UTC

Uppercase RE matching problems in FreeBSD 11

> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
> 
> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
> means AaBb... Z, because there are far more characters than simply A-Z. In
> FreeBSD 11 we have a Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the character classes [:upper:] or
> [:lower:].

That is rather surprising.  Is there a normative reference for the treatment of
bracket expressions and character classes when using locales other than C and/or
encodings like UTF-8?


Stefan

-- 
Stefan Bethke <stb at lassitu.de>   Fon +49 151 14070811
