Inspired by this old Stack Overflow question

https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

I was wondering why this is TRUE:

    Sys.setlocale("LC_ALL", "et_EE")
    grepl("[A-Z]", "T")

TRE's documentation at <https://laurikari.net/tre/documentation/regex-syntax/> says that a range "is shorthand for the full range of characters between those two [endpoints] (inclusive) in the collating sequence".

Yet, T is *not* between A and Z in the Estonian collating sequence:

    sort(LETTERS)
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    [20] "Z" "T" "U" "V" "W" "X" "Y"

I realize that this may be a question about TRE rather than about R *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so the question also applies to PCRE), but I'm wondering if anyone has any insights ... (and yes, I know that the correct answer is "use [:alpha:] and don't worry about it")

(In contrast, the ICU engine underlying stringi/stringr says "[t]he characters to include are determined by Unicode code point ordering" - see

https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

for links)
Martin Maechler
2023-Jun-01 08:11 UTC
[Rd] why does [A-Z] include 'T' in an Estonian locale?
>>>>> Ben Bolker
>>>>>     on Tue, 30 May 2023 11:45:20 -0400 writes:

    > [...]
    > (In contrast, the ICU engine underlying stringi/stringr says "[t]he
    > characters to include are determined by Unicode code point ordering" - see
    > https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
    > for links)

Your last (<sentence>) may point to the solution of the riddle:

Nowadays, typically in R

    > capabilities()[["ICU"]]
    [1] TRUE

but of course now one has to study if / why ICU seems to take precedence over the locale's internal "sort"ing ..

Best regards,
Martin
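A minimal sketch for checking the two orderings side by side. It assumes, as far as I know, that ICU in base R is used for collation (`sort()`, `order()`, string comparison) rather than by `grepl()`, and it relies on `sort(..., method = "radix")` ordering character vectors as in the C locale. The variable names are illustrative only.

```r
# Does this build of R have ICU available for collation?
has_icu <- capabilities("ICU")

# Locale-aware collation: the result depends on the current LC_COLLATE
# setting (e.g. "Z" sorts before "T" in an Estonian locale).
locale_sorted <- sort(LETTERS)

# C-locale (encoding value) ordering, independent of the session locale.
codepoint_sorted <- sort(LETTERS, method = "radix")

# TRUE in any locale: LETTERS is already in code point order.
identical(codepoint_sorted, LETTERS)

# TRUE only when the locale's collation agrees with code point order.
identical(locale_sorted, LETTERS)
```

If the last comparison is FALSE while `grepl("[A-Z]", "T")` is still TRUE, that would suggest the regex engine is not consulting the same collation that `sort()` uses.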
Tomas Kalibera
2023-Jun-01 09:53 UTC
[Rd] why does [A-Z] include 'T' in an Estonian locale?
On 5/30/23 17:45, Ben Bolker wrote:
> [...]
> I realize that this may be a question about TRE rather than about R
> *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
> the question also applies to PCRE), but I'm wondering if anyone has
> any insights ... (and yes, I know that the correct answer is "use
> [:alpha:] and don't worry about it")

The correct answer depends on what you want to do, but please see ?regexp in R:

"Because their interpretation is locale- and implementation-dependent, character ranges are best avoided."

and

"The only portable way to specify all ASCII letters is to list them all as the character class ‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."

This is from the POSIX specification:

"In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the starting point and the ending point separated by a <hyphen-minus> ( '-' )."

If you really want to know why the current implementation of R, TRE and PCRE2 works in a certain way, you can check the code, but I don't think it would be a good use of time given what is written above. It may be that TRE has a bug, maybe it doesn't do what was intended (see the comment "XXX - Should use collation order instead of encoding values in character ranges." in the code), but I didn't check the code thoroughly.

Best
Tomas
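The portable class quoted from ?regexp can be built from R's `LETTERS` and `letters` constants instead of being typed out. The sketch below also includes `in_codepoint_range()`, a hypothetical helper (not part of R or TRE) that mimics what a range would do if it compared encoding values, which is what the TRE source comment hints at; whether TRE really behaves this way in every case is an assumption.

```r
# Explicit character class listing all 52 ASCII letters.
ascii_letters <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")

# An explicit list does not depend on how ranges are interpreted.
explicit_match <- grepl(ascii_letters, c("T", "z", "1"))
explicit_match
# [1]  TRUE  TRUE FALSE

# Hypothetical helper: is `ch` between `from` and `to` by code point?
# Each argument is assumed to be a single character.
in_codepoint_range <- function(ch, from, to) {
  cp <- utf8ToInt(ch)
  cp >= utf8ToInt(from) && cp <= utf8ToInt(to)
}

in_codepoint_range("T", "A", "Z")  # TRUE:  65 <= 84 <= 90
in_codepoint_range("a", "A", "Z")  # FALSE: 97 > 90
```

Under the code point reading, "T" falls inside [A-Z] regardless of where the locale places "Z" when sorting.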
peter dalgaard
2023-Jun-16 09:16 UTC
[Rd] why does [A-Z] include 'T' in an Estonian locale?
Just for amusement: Similar messups occur with Danish and its three extra letters:

    > Sys.setlocale("LC_ALL", "da_DK")
    [1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
    > sort(c(LETTERS,"Æ","Ø","Å"))
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    [20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"
    > grepl("[A-Å]", "Æ")
    [1] FALSE
    > grepl("[A-Å]", "Ø")
    [1] FALSE
    > grepl("[A-Æ]", "Å")
    [1] TRUE
    > grepl("[A-Æ]", "Ø")
    [1] FALSE
    > grepl("[A-Ø]", "Å")
    [1] TRUE
    > grepl("[A-Ø]", "Æ")
    [1] TRUE

So for character ranges, the order is Å,Æ,Ø (which is how they'd collate in Swedish, except that Swedish uses diacriticals rather than Æ and Ø).

    > Sys.setlocale("LC_ALL", "sv_SE")
    [1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
    > sort(c(LETTERS,"Æ","Ø","Å"))
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    [20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
    > sort(c(LETTERS,"Ä","Ö","Å"))
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    [20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
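One reading that would be consistent with the Danish example is that character ranges compare Unicode code points rather than collation keys. This is a sketch under that assumption, not a statement about what TRE or PCRE do internally. The letters Å, Æ and Ø are written as `\u` escapes so the code does not depend on the file encoding.

```r
# The three extra Danish letters, in Danish alphabetical order.
danish <- c("\u00c6", "\u00d8", "\u00c5")  # Æ, Ø, Å

# Unicode code point of each letter.
codepoints <- vapply(danish, utf8ToInt, integer(1), USE.NAMES = FALSE)
codepoints
# [1] 198 216 197

# Reordering by code point gives Å, Æ, Ø, which differs from the
# Danish alphabet order Æ, Ø, Å.
by_codepoint <- danish[order(codepoints)]
by_codepoint
```

With this ordering, a range ending at Æ would include Å but not Ø, and a range ending at Å would include neither Æ nor Ø.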