thr3ads.net - R devel - [Rd] readLines interaction with gsub different in R-dev [Feb 2018]

Hugh Parsonage

2018-Feb-17 10:10 UTC

[Rd] readLines interaction with gsub different in R-dev

I was told to re-raise this issue with R-dev:

In the documentation of R-dev and R-3.4.3, under ?gsub
> replacement
>    ... For perl = TRUE only, it can also contain "\U" or "\L" to
>    convert the rest of the replacement to upper or lower case and
>    "\E" to end case conversion.
However, the following code runs differently:

tempf <- tempfile()
writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
gsub("(\\w)", "\\U\\1", entry, perl = TRUE)


"AUTHOR: AM?LIE"  # R-3.4.3

"A"                              # R-dev



Best,

Hugh Parsonage.

Dirk Eddelbuettel

2018-Feb-17 15:15 UTC

On 17 February 2018 at 21:10, Hugh Parsonage wrote:
| I was told to re-raise this issue with R-dev:
| 
| In the documentation of R-dev and R-3.4.3, under ?gsub
| 
| > replacement
| >    ... For perl = TRUE only, it can also contain "\U" or "\L" to
| >    convert the rest of the replacement to upper or lower case and
| >    "\E" to end case conversion.
| 
| However, the following code runs differently:
| 
| tempf <- tempfile()
| writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
| entry <- readLines(tempf, encoding = "UTF-8")
| gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
| 
| 
| "AUTHOR: AM?LIE"  # R-3.4.3
| 
| "A"                              # R-dev

Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
you use wrong, ie isn't R-devel giving the correct answer?

R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AM?LIE"
R> 

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
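
For reference, the difference between `sub` and `gsub` with `\\U` can be checked on a plain ASCII string, where encoding plays no part. This is only a sketch of the documented behaviour; the variable names are illustrative and not from the thread.

```r
# ASCII-only input, so no encoding issues are involved.
x <- "author: Amelie"

# sub() replaces only the first match; gsub() replaces every match.
first_only <- sub("(\\w)", "\\U\\1", x, perl = TRUE)
every_char <- gsub("(\\w)", "\\U\\1", x, perl = TRUE)
every_word <- gsub("(\\w+)", "\\U\\1", x, perl = TRUE)

# "\\E" ends the case conversion started by "\\U".
until_E <- gsub("^(\\w+): (\\w+)", "\\U\\1\\E: \\2", x, perl = TRUE)

print(first_only)  # "Author: Amelie"
print(every_char)  # "AUTHOR: AMELIE"
print(every_word)  # "AUTHOR: AMELIE"
print(until_E)     # "AUTHOR: Amelie"
```

In each case the text outside the match is kept, so a result of just "A" would not be expected from either function on this input.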

Hugh Parsonage

2018-Feb-17 15:35 UTC

| Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of
gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
is "Take every word character and replace it with itself, converted to
uppercase."

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie"   # OK, but very different to 'A', despite only not specifying uppercase

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE"  # OK, but very different to 'A',

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
[1] "AUTHOR"  # Where did everything after the first group go?

I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AM??LIE"  # latin1 encoding


A call to `readLines` (and possibly `scan()`, `read.table` and friends)
is essential to reproduce the problem.
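
One way to narrow this down is to inspect what `readLines` actually returns before it reaches `gsub`. The sketch below is a diagnostic under stated assumptions, not a fix: it writes the accented character with the `\u00e9` escape so that the script does not depend on the encoding of the source file, and the name `literal` is introduced here for illustration only.

```r
# Build the test string with a Unicode escape so the script itself is ASCII.
literal <- enc2utf8("author: Am\u00e9lie")

tempf <- tempfile()
writeLines(literal, con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")

# Declared encoding of the string that came back from the file.
print(Encoding(entry))

# Is the byte sequence valid UTF-8?
print(validUTF8(entry))

# Raw bytes: the e-acute should appear as the two bytes c3 a9.
print(charToRaw(entry))

# Do the bytes read from the file match the bytes of the literal?
print(identical(charToRaw(entry), charToRaw(literal)))

unlink(tempf)
```

If the bytes and the declared encoding match those of the literal string, any difference in the `gsub` result would point at the regex code path rather than at the file contents.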




