Hugh Parsonage
2017-Jun-19 11:50 UTC
[Rd] \U or \L perl regex in gsub removes text outside capturing group in UTF-8 contexts
I write to clarify the status of \U and \L when used in the replacement argument to gsub in R 3.5.0. The behaviour of gsub appears to have changed from R 3.4.0, but the documentation for the replacement argument has not.

## Reprex

(A call to readLines is essential. A URL is provided for convenience, but the behaviour should reproduce for local files.)

bib <- readLines("https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib", encoding = "UTF-8", n = 10)
bib8910 <- bib[8:10]
gsub("(\\w+)", "\\U\\1", bib8910, perl = TRUE)
#> [1] "@TECHREPORT" " AUTHOR" " TITLE"

Expected result (in R 3.4.0):

#> [1] "@TECHREPORT{WOODHUNTEROTOOLEETAL2012,"
#> [2] " AUTHOR = {TONY WOOD AND AM?LIE HUNTER AND MICHAEL O'TOOLE AND PRASANA VENKATARAMAN AND LUCY CARTER},"
#> [3] " TITLE = {PUTTING THE CUSTOMER BACK IN FRONT: HOW TO MAKE ELECTRICITY CHEAPER},"

## Likely point of breaking change

I was alerted on June 13 by Kurt Hornik that my package (TeXCheckR), which had previously been accepted on CRAN, was ERRORing, as a unit test relies on \L.

## sessionInfo()

R Under development (unstable) (2017-06-19 r72808)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

Many thanks,

Hugh Parsonage
Associate, Grattan Institute, Melbourne, AU
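
A network-free variant of the reprex can be sketched as follows. This is a minimal sketch and not part of the original report: the temporary file and its three lines are hypothetical stand-ins for the .bib file, and whether it triggers the truncation will depend on the R build in use. It keeps the two ingredients the report calls essential, namely a readLines call with encoding = "UTF-8" and at least one non-ASCII character in the input.

```r
# Hypothetical local stand-in for the .bib file used in the reprex.
lines_in <- c("@TechReport{key2012,",
              "  author = {Tony Wood and Am\u00e9lie Hunter},",
              "  title = {Putting the customer back in front},")

# Write the raw UTF-8 bytes so the file content does not depend on the locale.
tmp <- tempfile(fileext = ".bib")
writeLines(enc2utf8(lines_in), tmp, useBytes = TRUE)

# Read the file back, marking non-ASCII lines as UTF-8 (as in the reprex).
bib <- readLines(tmp, encoding = "UTF-8")

# Upper-case every run of word characters via \U in the replacement.
out <- gsub("(\\w+)", "\\U\\1", bib, perl = TRUE)
print(out)

# If no text outside the capturing group is dropped, each line keeps its length.
print(nchar(out) == nchar(bib))
```

If the second print shows FALSE for any line, text outside the capturing group has been lost, which is the symptom described in the report.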