thr3ads.net - R devel - [Rd] why does [A-Z] include 'T' in an Estonian locale? [May 2023]

Ben Bolker

2023-May-30 15:45 UTC

[Rd] why does [A-Z] include 'T' in an Estonian locale?

Inspired by this old Stack Overflow question

https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

I was wondering why this is TRUE:

Sys.setlocale("LC_ALL", "et_EE")
grepl("[A-Z]", "T")

TRE's documentation at 
<https://laurikari.net/tre/documentation/regex-syntax/> says that a 
range "is shorthand for the full range of characters between those two 
[endpoints] (inclusive) in the collating sequence".

Yet, T is *not* between A and Z in the Estonian collating sequence:

  sort(LETTERS)
   [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
  [20] "Z" "T" "U" "V" "W" "X" "Y"

   I realize that this may be a question about TRE rather than about R 
*per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so 
the question also applies to PCRE), but I'm wondering if anyone has any 
insights ...  (and yes, I know that the correct answer is "use [:alpha:] 
and don't worry about it")

(In contrast, the ICU engine underlying stringi/stringr says "[t]he 
characters to include are determined by Unicode code point ordering" - see

https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

for links)
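
For reference, a minimal sketch of the "use [:alpha:]" route mentioned above. In R a POSIX class has to sit inside a bracket expression, so it is written "[[:alpha:]]" or "[[:upper:]]". The sketch uses only base R and plain ASCII input, for which these classes should behave the same in any locale; note that they also match non-ASCII letters, so they are not an exact substitute for an ASCII-only range.

```r
# POSIX character classes must appear inside a bracket expression in R,
# hence the doubled brackets.
upper <- "[[:upper:]]"
alpha <- "[[:alpha:]]"

grepl(upper, "T")   # "T" is an upper-case letter
grepl(upper, "t")   # a lower-case letter does not match [[:upper:]]
grepl(alpha, "t")   # but it does match [[:alpha:]]
grepl(alpha, "1")   # digits are not alphabetic
```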

Martin Maechler

2023-Jun-01 08:11 UTC


>>>>> Ben Bolker 
>>>>>     on Tue, 30 May 2023 11:45:20 -0400 writes:
    > Inspired by this old Stack Overflow question

    > https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

    > I was wondering why this is TRUE:

    > Sys.setlocale("LC_ALL", "et_EE")
    > grepl("[A-Z]", "T")

    > TRE's documentation at
    > <https://laurikari.net/tre/documentation/regex-syntax/> says that a
    > range "is shorthand for the full range of characters between those two
    > [endpoints] (inclusive) in the collating sequence".

    > Yet, T is *not* between A and Z in the Estonian collating sequence:

    > sort(LETTERS)
    >  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    > [20] "Z" "T" "U" "V" "W" "X" "Y"

    > I realize that this may be a question about TRE rather than about R
    > *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
    > the question also applies to PCRE), but I'm wondering if anyone has any
    > insights ...  (and yes, I know that the correct answer is "use [:alpha:]
    > and don't worry about it")

    > (In contrast, the ICU engine underlying stringi/stringr says "[t]he
    > characters to include are determined by Unicode code point ordering" - see

    > https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

    > for links)

Your last (<sentence>) may point to the solution of the riddle:
Nowadays, typically in R

> capabilities()[["ICU"]]
[1] TRUE

but of course now one has to study if / why ICU seems to take
precedence over the locale's internal "sort"ing ..
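
A sketch for putting the two orderings side by side, using only base R. sort() with method = "radix" is documented not to respect the locale for character vectors (it uses C-locale order), while the default method follows the current locale's collation. The output of the locale-dependent lines varies by platform, so no result is shown for them.

```r
# Is this build of R using ICU for collation? (platform-dependent)
capabilities("ICU")

# Locale-aware collation: in an Estonian locale "Z" is expected after "S".
sort(LETTERS)

# C-locale order, independent of the current locale.
c_order <- sort(LETTERS, method = "radix")
identical(c_order, LETTERS)
```

If the last line is TRUE while sort(LETTERS) shows "Z" before "T", the locale's collation order and the C-locale order disagree, which is the situation described in the original question.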


Best regards,
Martin

Tomas Kalibera

2023-Jun-01 09:53 UTC


On 5/30/23 17:45, Ben Bolker wrote:
> Inspired by this old Stack Overflow question
>
> https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
>
> I was wondering why this is TRUE:
>
> Sys.setlocale("LC_ALL", "et_EE")
> grepl("[A-Z]", "T")
>
> TRE's documentation at
> <https://laurikari.net/tre/documentation/regex-syntax/> says that a
> range "is shorthand for the full range of characters between those two
> [endpoints] (inclusive) in the collating sequence".
>
> Yet, T is *not* between A and Z in the Estonian collating sequence:
>
>   sort(LETTERS)
>    [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
>   [20] "Z" "T" "U" "V" "W" "X" "Y"
>
> I realize that this may be a question about TRE rather than about R
> *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
> the question also applies to PCRE), but I'm wondering if anyone has
> any insights ... (and yes, I know that the correct answer is "use
> [:alpha:] and don't worry about it")

The correct answer depends on what you want to do, but please see 
?regexp in R:

"Because their interpretation is locale- and implementation-dependent, 
character ranges are best avoided."

and

"The only portable way to specify all ASCII letters is to list them all
as the character class
‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."

This is from the POSIX specification:

"In the POSIX locale, a range expression represents the set of collating 
elements that fall between two elements in the collation sequence, 
inclusive. In other locales, a range expression has unspecified 
behavior: strictly conforming applications shall not rely on whether the 
range expression is valid, or on the set of collating elements matched. 
A range expression shall be expressed as the starting point and the 
ending point separated by a <hyphen-minus> ( '-' )."
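
The explicit class recommended in ?regexp does not have to be typed by hand. A minimal sketch, assuming only base R's built-in LETTERS and letters constants (the variable name ascii_letters is just for illustration):

```r
# Build the explicit ASCII-letter class from R's built-in constants.
ascii_letters <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")
ascii_letters

# The class lists its members one by one, so no range has to be interpreted.
grepl(ascii_letters, c("T", "z", "1"))
```

Because every member is listed, the match does not depend on how a given engine or locale interprets a range such as A-Z.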

If you really want to know why the current implementation of R, TRE and 
PCRE2 works in a certain way, you can check the code, but I don't think 
it would be a good use of the time given what is written above.

It may be that TRE has a bug, maybe it doesn't do what was intended (see 
comment "XXX - Should use collation order instead of encoding values in 
character ranges." in the code), but I didn't check the code
thoroughly.
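
To illustrate what comparing "encoding values" would mean for the example at hand, here is a small sketch. It is not TRE's or PCRE2's actual code; it only compares Unicode code points with utf8ToInt(), and the helper name in_codepoint_range is made up for the illustration.

```r
# Hypothetical helper: is `ch` within [lo, hi] when compared by code point?
in_codepoint_range <- function(ch, lo, hi) {
  cp <- utf8ToInt(ch)
  utf8ToInt(lo) <= cp && cp <= utf8ToInt(hi)
}

utf8ToInt("A")  # 65
utf8ToInt("T")  # 84
utf8ToInt("Z")  # 90
in_codepoint_range("T", "A", "Z")  # TRUE
```

Under this comparison "T" lies between "A" and "Z" whatever the locale's collating sequence says, which would be consistent with the TRUE result reported for the Estonian locale.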

Best
Tomas
>
> (In contrast, the ICU engine underlying stringi/stringr says "[t]he
> characters to include are determined by Unicode code point ordering" -
> see
>
> https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
>
> for links)

peter dalgaard

2023-Jun-16 09:16 UTC


Just for amusement: Similar messups occur with Danish and its three extra
letters:

> Sys.setlocale("LC_ALL", "da_DK")
[1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
> sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"
> grepl("[A-Å]", "Ø")
[1] FALSE
> grepl("[A-Å]", "Æ")
[1] FALSE
> grepl("[A-Æ]", "Å")
[1] TRUE
> grepl("[A-Æ]", "Ø")
[1] FALSE
> grepl("[A-Ø]", "Æ")
[1] TRUE
> grepl("[A-Ø]", "Å")
[1] TRUE

So for character ranges, the order is Å,Æ,Ø (which is how they'd collate in
Swedish, except that Swedish uses diacriticals rather than Æ and Ø).

> Sys.setlocale("LC_ALL", "sv_SE")
[1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
> sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
> sort(c(LETTERS,"Ä","Ö","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"

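As a side check, the range order Å, Æ, Ø is also the order of these letters by Unicode code point, which can be verified without changing the locale. A minimal sketch in base R; the vector names AE, OE and AA are just labels chosen for the illustration, and the letters are written as \u escapes to sidestep encoding problems.

```r
# The three extra Danish letters: AE = Æ, OE = Ø, AA = Å.
extra <- c(AE = "\u00c6", OE = "\u00d8", AA = "\u00c5")

# Code point of each letter, then the labels ordered by code point.
cp <- vapply(extra, utf8ToInt, integer(1))
cp
names(sort(cp))  # "AA" "AE" "OE", i.e. Å, Æ, Ø
```

This is consistent with the regex engines comparing code points in ranges, but it does not by itself show that they do.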

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
