On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> I happened to run an old script today that uses sed(1) to extract the system
> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> expected:
>
> $ sysctl kern.boottime
> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> v 5 16:18:34 2016
>
> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> expected:
>
> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> Nov 5 16:18:34 2016
>
> Testing every lowercase character separately gives even more inconsistent
> results:
>
> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
> > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > i
> > j
> > k
> > l
> > m
> > n
> > o
> > p
> > q
> > r
> > s
> > t
> > u
> > v
> > w
> > x
> > y
> > z
> > !
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
>
> Here sed thinks every lowercase character except for 'a' is uppercase! This
> differs from the first test where sed did not think 'o' is uppercase. Again,
> the above behaves as expected with LANG=C.
>
> Does anyone have any insight into this? This is likely to break a lot of
> existing code.
>

Yes. A-Z only means uppercase in an ASCII-only world; in a Unicode world it
means AaBb...Z, because there are far more characters than simply A-Z. In
FreeBSD 11 we have Unicode collation instead of falling back on LC_COLLATE=C,
which means ASCII only.

For regexps, for example, one should use the character classes [:upper:] or
[:lower:].

Best regards,
Bapt
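A minimal sketch of the character-class suggestion above, assuming the sample line from the original report as input (on a live system it would come from `sysctl kern.boottime`). The variable names `line` and `boot_date` are illustrative only and appear in no message in the thread. Because `[[:upper:]]` is defined by character classification rather than by collation order, the result should not depend on LC_COLLATE.

```shell
#!/bin/sh
# Sample line copied from the original report; on a real system this
# would be the output of: sysctl kern.boottime
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Same substitution as in the report, with the range [A-Z] replaced by the
# POSIX character class [[:upper:]]. The leading .* is greedy, so the
# capture starts at the LAST uppercase letter on the line ('N' in "Nov").
boot_date=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

printf '%s\n' "$boot_date"
```

With the sample line this prints `Nov 5 16:18:34 2016`, the same result the original command gives under LANG=C.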
2016-11-06 12:07, Baptiste Daroussin wrote:
> Yes. A-Z only means uppercase in an ASCII-only world; in a Unicode world it
> means AaBb...Z, because there are far more characters than simply A-Z. In
> FreeBSD 11 we have Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the character classes [:upper:] or
> [:lower:].

It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?) at "C"
when LANG or LC_CTYPE is set to something else; otherwise unexpected things
may happen.

Mark
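A rough sketch of the setup Mark describes, assuming an `en_US.UTF-8` LANG and a Bourne-style shell. The test input and the variable name `matches` are made up for illustration. Note that LC_ALL, if set, overrides LC_COLLATE, so the sketch clears it first.

```shell
#!/bin/sh
# LC_ALL takes precedence over every other LC_* variable, so make sure
# it is not set before relying on LC_COLLATE.
unset LC_ALL

# Keep the UTF-8 locale for everything else, but force collation to "C".
LANG=en_US.UTF-8
LC_COLLATE=C
export LANG LC_COLLATE

# With collation forced to "C", the range [A-Z] covers only the ASCII
# uppercase letters, so of the three input lines only "Z" is printed.
matches=$(printf '%s\n' a b Z | sed -n -e '/^[A-Z]$/p')

printf '%s\n' "$matches"
```

This only works around the range problem for scripts that cannot be changed. Where the script can be edited, character classes such as `[[:upper:]]` avoid the dependency on collation altogether.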
> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>
> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the system
>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
>> expected:
>>
>> $ sysctl kern.boottime
>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>> v 5 16:18:34 2016
>>
>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>> expected:
>>
>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>> Nov 5 16:18:34 2016
>>
>> Testing every lowercase character separately gives even more inconsistent
>> results:
>>
>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
>>
>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>> differs from the first test where sed did not think 'o' is uppercase. Again,
>> the above behaves as expected with LANG=C.
>>
>> Does anyone have any insight into this? This is likely to break a lot of
>> existing code.
>>
>
> Yes. A-Z only means uppercase in an ASCII-only world; in a Unicode world it
> means AaBb...Z, because there are far more characters than simply A-Z. In
> FreeBSD 11 we have Unicode collation instead of falling back on LC_COLLATE=C,
> which means ASCII only.
>
> For regexps, for example, one should use the character classes [:upper:] or
> [:lower:].

That is rather surprising. Is there a normative reference for the treatment of
bracket expressions and character classes when using locales other than C
and/or encodings like UTF-8?

Stefan

-- 
Stefan Bethke <stb at lassitu.de>   Fon +49 151 14070811