Hugh Parsonage
2018-Feb-17 15:35 UTC
[Rd] readLines interaction with gsub different in R-dev
| Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
is "Take every word character and replace it with itself, converted to uppercase."

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie" # OK, but very different to 'A', despite only not specifying uppercase

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE" # OK, but very different to 'A',

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
"AUTHOR" # Where did everything after the first group go?

I should note the following example too:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AM??LIE" # latin1 encoding

A call to `readLines` (possibly `scan()` and `read.table` and friends) is essential.

On 18 February 2018 at 02:15, Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 17 February 2018 at 21:10, Hugh Parsonage wrote:
> | I was told to re-raise this issue with R-dev:
> |
> | In the documentation of R-dev and R-3.4.3, under ?gsub
> |
> | > replacement
> | > ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.
> |
> | However, the following code runs differently:
> |
> | tempf <- tempfile()
> | writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
> | entry <- readLines(tempf, encoding = "UTF-8")
> | gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
> |
> | "AUTHOR: AM?LIE" # R-3.4.3
> |
> | "A" # R-dev
>
> Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
> you use wrong, ie isn't R-devel giving the correct answer?
>
> R> tempf <- tempfile()
> R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
> R> entry <- readLines(tempf, encoding = "UTF-8")
> R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
> [1] "A"
> R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
> [1] "AUTHOR"
> R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
> [1] "AUTHOR: AM?LIE"
> R>
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
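For reference, the documented meaning of "\U", "\L" and "\E" quoted above can be illustrated on ASCII-only input, where the non-ASCII problem reported in this thread does not come into play. This is only a minimal sketch; the variable names (`x`, `all_upper`, `key_upper`, `pairs`) are made up for illustration, and the first and third calls mirror ASCII examples already posted in the thread.

```r
# ASCII-only input, so the non-ASCII issue discussed in this thread is not involved.
x <- "author: amelie"

# "\\U" upper-cases the rest of the replacement, so every word character
# is replaced by its upper-case form.
all_upper <- gsub("(\\w)", "\\U\\1", x, perl = TRUE)

# "\\E" ends case conversion, so only the first group is upper-cased;
# the unmatched remainder of the string is left untouched.
key_upper <- gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", x, perl = TRUE)

# "\\L" lower-cases, and a later "\\U" takes over within the same replacement.
pairs <- gsub("(\\w)(\\w)", "<\\L\\1\\U\\2>", "Amelia", perl = TRUE)

print(all_upper)  # "AUTHOR: AMELIE"
print(key_upper)  # "AUTHOR: amelie"
print(pairs)      # "<aM><eL><iA>"
```

The point of contention is that text outside the capturing groups should survive the substitution, which is what the ASCII case shows.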
William Dunlap
2018-Feb-17 19:24 UTC
[Rd] readLines interaction with gsub different in R-dev
I think the problem in R-devel happens when there are non-ASCII characters in any
of the strings passed to gsub.

txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
                   as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))), rawToChar, "")
txt
#[1] "Amélie" "Amelia"
Encoding(txt)
#[1] "unknown" "unknown"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
#[1] "<a" "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
#[1] "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
#[1] "<aM><eL><iA>"

I can change the Encoding to "latin1" or "UTF-8" and get similar results from gsub.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
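If the trigger really is a non-ASCII character anywhere in the input, as suggested above, one could flag such inputs and do the case conversion without "\U" at all. The following is a rough sketch, not something proposed in the thread: `has_non_ascii` and `upper_word_chars` are hypothetical helper names, and the workaround swaps in `toupper()` via `regmatches<-` in place of the "\U" replacement escape. Whether it matches non-ASCII letters depends on how `\w` is interpreted for the input's encoding, so only the ASCII result is shown.

```r
# Hypothetical helper: TRUE for elements that contain any byte above 0x7f.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical workaround: upper-case each matched word character with
# toupper() through regmatches<-, instead of using "\\U" in gsub().
upper_word_chars <- function(x) {
  m <- gregexpr("\\w", x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), toupper)
  x
}

txt <- c("Am\u00e9lie", "Amelia")
print(has_non_ascii(txt))          # TRUE FALSE
print(upper_word_chars("Amelia"))  # "AMELIA"
```

Because the replacement text is computed outside the regex engine, text between matches is never rewritten, which sidesteps the truncation seen with "\U".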
Tomas Kalibera
2018-Feb-19 14:58 UTC
[Rd] readLines interaction with gsub different in R-dev
Thank you for the report and analysis. Now fixed in R-devel.

Tomas
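To see whether a given R build is affected, the original reproduction can be wrapped in a small check. This is only a sketch under stated assumptions: `case_conversion_intact` is a hypothetical name, and the test is deliberately loose (no bytes dropped, ASCII prefix upper-cased) because the thread does not settle whether the accented character itself should be upper-cased.

```r
# Hypothetical check: TRUE when "\\U" leaves the rest of the string intact.
case_conversion_intact <- function() {
  tempf <- tempfile()
  on.exit(unlink(tempf))

  # Same steps as the original report; "\u00e9" is the accented e in "Amelie".
  input <- enc2utf8("author: Am\u00e9lie")
  writeLines(input, con = tempf, useBytes = TRUE)
  entry <- readLines(tempf, encoding = "UTF-8")

  out <- gsub("(\\w)", "\\U\\1", entry, perl = TRUE)

  # Upper-casing must not drop bytes, and the ASCII prefix must be upper-cased.
  nchar(out, type = "bytes") == nchar(input, type = "bytes") &&
    startsWith(out, "AUTHOR: AM")
}

print(case_conversion_intact())
```

On the affected R-devel build the substitution returned "A", so this check would return FALSE there.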