> On 06.11.2016 at 22:06, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>
> On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
>>
>>> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>>>
>>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>>> I happened to run an old script today that uses sed(1) to extract the
>>>> system boot time from the kern.boottime sysctl MIB. On 11.0 this no
>>>> longer works as expected:
>>>>
>>>> $ sysctl kern.boottime
>>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> v 5 16:18:34 2016
>>>>
>>>> sed passes over 'S' and 'N' until it hits 'v', which it apparently
>>>> considers uppercase. This is with LANG=en_US.UTF-8. If I set LANG=C, it
>>>> works as expected:
>>>>
>>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> Nov 5 16:18:34 2016
>>>>
>>>> Testing every lowercase character separately gives even more
>>>> inconsistent results:
>>>>
>>>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
>>
>>>> Here sed thinks every lowercase character except for 'a' is uppercase!
>>>> This differs from the first test, where sed did not think 'o' is
>>>> uppercase. Again, the above behaves as expected with LANG=C.
>>>>
>>>> Does anyone have any insight into this? This is likely to break a lot
>>>> of existing code.
>>>>
>>>
>>> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world
>>> it means AaBb...Z, because there are way more characters than simple A-Z.
>>> In FreeBSD 11 we have a Unicode collation instead of falling back on
>>> LC_COLLATE=C, which means ASCII only.
>>>
>>> For regexps, for example, one should use the classes [:upper:] or
>>> [:lower:].
>>
>> That is rather surprising. Is there a normative reference for the
>> treatment of bracket expressions and character classes when using locales
>> other than C and/or encodings like UTF-8?
>
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
>
> For example:
>
> "Regular expressions are a context-independent syntax that can represent a
> wide variety of character sets and character set orderings, where these
> character sets are interpreted according to the current locale. While many
> regular expressions can be interpreted differently depending on the current
> locale, many features, such as character class expressions, provide for
> contextual invariance across locales."

Sorry, maybe I wasn't clear enough with my question. When a character class
fits the problem, it is clearly advantageous.

But under what circumstances would [A-Z] mean anything other than a character
whose Unicode codepoint is between U+0041 and U+005A, inclusive? Especially
given the locale in the example is en_US.UTF-8. Or, put another way, why would
an implementation interpret [A-Z] as anything other than [ABCDE...XYZ]?

From reading your reference, I can see in 9.3.5.7:

> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified behavior
> [...]

So even if the observed behaviour is conforming, I'd think it's still highly
undesirable.

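For existing scripts, a minimal sketch of three ways the extraction from the
original report could be written so that it does not depend on how the locale
collates the range A-Z. The function names and the hard-coded sample line are
illustrative assumptions only; on a real system the input would come from
sysctl kern.boottime. I have not checked this on 11.0 itself, so treat it as a
sketch rather than a verified fix.

```shell
#!/bin/sh
# Sample line copied from the report; a stand-in for `sysctl kern.boottime`.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Variant 1: use the [:upper:] character class inside the bracket
# expression. Class membership comes from LC_CTYPE, not from the
# collation order, so it avoids the range entirely.
extract_with_class() {
    sed -e 's/.*\([[:upper:]].*\)$/\1/'
}

# Variant 2: keep the range, but run this one sed invocation in the
# POSIX locale, where range expressions are defined. LC_ALL overrides
# LANG and the individual LC_* variables.
extract_with_c_locale() {
    LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/'
}

# Variant 3: spell the letters out, so no range is involved at all.
extract_with_list() {
    sed -e 's/.*\([ABCDEFGHIJKLMNOPQRSTUVWXYZ].*\)$/\1/'
}

printf '%s\n' "$line" | extract_with_class
printf '%s\n' "$line" | extract_with_c_locale
printf '%s\n' "$line" | extract_with_list
```

Variant 2 is the smallest change to an old script, but it also switches
character handling for that command to single bytes, which matters if the
input can contain non-ASCII text. Variant 1 is what the character class
suggestion earlier in the thread amounts to.
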
Stefan
--
Stefan Bethke <stb at lassitu.de> Fon +49 151 14070811