I noticed that sub() gives unexpected results for the following test case. In the test case, the (initial) input is ASCII but the replacements are UTF-8. The first sub() produces a UTF-8 result with an "unknown" Encoding. This makes the result garbled in Windows (no UTF-8 locale there). The second sub() produces a correct result, although for some reason it is converted to the native Encoding in Windows.

I think the best result would be UTF-8 output marked as such.

foo <- c("a", "b")
foo <- sub("a", "\u00e4", foo)
print(Encoding(foo))
## [1] "unknown" "unknown"
foo <- sub("b", "\u00f6", foo)
print(Encoding(foo))
## [1] "unknown" "unknown" # Windows
## [1] "unknown" "UTF-8" # Linux
print(foo)
## [1] "??" "?" # Windows
## [1] "?" "?" # Linux

The output of sessionInfo() for both test systems follows.

> sessionInfo()
R version 3.5.1 Patched (2018-11-28 r75713)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.1252 LC_CTYPE=Finnish_Finland.1252
[3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C
[5] LC_TIME=Finnish_Finland.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

> sessionInfo()
R Under development (unstable) (2018-12-08 r75801)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
LAPACK: /home/mikko/root_R-devel-r75801/lib/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fi_FI.UTF-8 LC_COLLATE=fi_FI.UTF-8
[5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8
[7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.6.0
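Until the behaviour of sub() itself changes, one possible stopgap is to re-mark the affected elements by hand. The following is only a sketch, not a fix for the underlying problem: mark_utf8_if_valid() is a hypothetical helper (it is not part of R), and it assumes that any element with an "unknown" Encoding that contains non-ASCII bytes and happens to be valid UTF-8 really is UTF-8. That assumption can be wrong for native-encoded strings whose bytes also form valid UTF-8.

```r
# Hypothetical helper (not part of base R): mark elements as UTF-8 when they
# are currently "unknown", contain non-ASCII bytes, and are valid UTF-8.
mark_utf8_if_valid <- function(x) {
  stopifnot(is.character(x))
  # TRUE for elements with at least one byte above 0x7f
  non_ascii <- vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  fix <- Encoding(x) == "unknown" & non_ascii & validUTF8(x)
  # Only the encoding mark changes; the bytes are left untouched
  Encoding(x[fix]) <- "UTF-8"
  x
}

foo <- c("a", "b")
foo <- mark_utf8_if_valid(sub("a", "\u00e4", foo))
print(Encoding(foo))
print(charToRaw(foo[1]))  # inspect the raw bytes of the first element
```

Checking the bytes with charToRaw() alongside Encoding() helps to tell a missing encoding mark apart from an actual conversion of the string.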
>>>>> Korpela Mikko (MML)
>>>>>     on Sat, 8 Dec 2018 18:42:30 +0000 writes:

> I noticed that sub() gives unexpected results for the following test
> case. In the test case, the (initial) input is ASCII but the
> replacements are UTF-8. The first sub() produces an UTF-8 result with
> an "unknown" Encoding. This makes the result garbled in Windows (no
> UTF-8 locale there). The second sub() produces a correct result,
> although for some reason it is converted to the native Encoding in
> Windows.

> I think the best result would be UTF-8 output marked as such.

> foo <- c("a", "b")
> foo <- sub("a", "\u00e4", foo)
> print(Encoding(foo))
> ## [1] "unknown" "unknown"
> foo <- sub("b", "\u00f6", foo)
> print(Encoding(foo))
> ## [1] "unknown" "unknown" # Windows
> ## [1] "unknown" "UTF-8" # Linux
> print(foo)
> ## [1] "??" "?" # Windows
> ## [1] "?" "?" # Linux

I can confirm the problem on Windows, also for a recent version of R-devel.

Why not file this as a proper bug report at R's bugzilla? There's still no certainty that it will be fixed quickly, but bug reports (PRs) there are less easily forgotten.

Martin
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
Thanks for the confirmation. The bug report is now online at
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17509

- Mikko

-----Original Message-----
From: Martin Maechler [mailto:maechler at stat.math.ethz.ch]
Sent: Monday, December 10, 2018 12:09 PM
To: Korpela Mikko (MML)
Cc: r-devel at r-project.org
Subject: Re: [Rd] Possible encoding bug in sub()