Hugh Parsonage
2018-Feb-17 10:10 UTC
[Rd] readLines interaction with gsub different in R-dev
I was told to re-raise this issue with R-dev:

In the documentation of R-dev and R-3.4.3, under ?gsub

> replacement
> ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.

However, the following code runs differently:

tempf <- tempfile()
writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
gsub("(\\w)", "\\U\\1", entry, perl = TRUE)

"AUTHOR: AMÉLIE" # R-3.4.3

"A" # R-dev

Best,
Hugh Parsonage
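The reproduction above depends on `entry` being read back from a file with `encoding = "UTF-8"`. As a hedged aid for anyone narrowing this down, the following sketch checks how R has marked the resulting string and how many characters and bytes it holds. The `\u00e9` escape is an assumption made here so that the snippet does not depend on the encoding of the script itself; the variable names `enc_mark`, `n_chars` and `n_bytes` are illustrative only.

```r
# Write the test string as UTF-8 bytes and read it back, as in the report.
tempf <- tempfile()
writeLines(enc2utf8("author: Am\u00e9lie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")

# Declared encoding of the string that gsub() will receive.
enc_mark <- Encoding(entry)

# Character count versus byte count: the accented letter takes two bytes in UTF-8.
n_chars <- nchar(entry, type = "chars")
n_bytes <- nchar(entry, type = "bytes")

print(enc_mark)
print(c(chars = n_chars, bytes = n_bytes))

# Clean up the temporary file.
unlink(tempf)
```

If the mark is "UTF-8" and the byte count exceeds the character count, the string is a non-ASCII, UTF-8-marked input, which is the case the report is about.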
Dirk Eddelbuettel
2018-Feb-17 15:15 UTC
[Rd] readLines interaction with gsub different in R-dev
On 17 February 2018 at 21:10, Hugh Parsonage wrote:
| I was told to re-raise this issue with R-dev:
|
| In the documentation of R-dev and R-3.4.3, under ?gsub
|
| > replacement
| > ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.
|
| However, the following code runs differently:
|
| tempf <- tempfile()
| writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
| entry <- readLines(tempf, encoding = "UTF-8")
| gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
|
|
| "AUTHOR: AMÉLIE" # R-3.4.3
|
| "A" # R-dev

Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp you use wrong, i.e. isn't R-devel giving the correct answer?

R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AMÉLIE"

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
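For comparison, here is a minimal sketch of what the documented "\U" behaviour looks like on a plain ASCII string, where no encoding marks are involved. The input string and the variable names are assumptions chosen for illustration; this is not a statement about what R-devel does with the UTF-8 input above.

```r
# ASCII-only input, so encoding does not come into play.
x <- "author: Amelie"

# Every word character is matched on its own and upper-cased.
per_char <- gsub("(\\w)", "\\U\\1", x, perl = TRUE)

# Every run of word characters is matched as a whole and upper-cased.
per_word <- gsub("(\\w+)", "\\U\\1", x, perl = TRUE)

# "\\E" ends the case conversion, so only the first group is upper-cased.
first_only <- gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", x, perl = TRUE)

print(per_char)
print(per_word)
print(first_only)
```

On ASCII input the per-character and per-word patterns should give the same string, and none of the three calls should drop the text after the first match.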
Hugh Parsonage
2018-Feb-17 15:35 UTC
[Rd] readLines interaction with gsub different in R-dev
| Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least it is not consistent with the documentation). My interpretation of

gsub("(\\w)", "\\U\\1", entry, perl = TRUE)

is "Take every word character and replace it with itself, converted to uppercase."

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie" # OK, but very different to 'A', even though the only change is dropping "\\U"

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE" # OK, but very different to 'A'

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
"AUTHOR" # Where did everything after the first group go?

I should note the following example too:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AM??LIE" # latin1 encoding

A call to `readLines` (and possibly `scan()`, `read.table` and friends) is essential to reproduce this.