Lixin Gong
2016-Sep-05 04:40 UTC
[Rd] How to print UTF-8 encoded strings from a C routine to R's output?
Dear R experts,

It seems that Rprintf has to be used to print from a C routine in order to guarantee writing to R's output, according to https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Printing.

However, if a string is UTF-8 encoded, non-ASCII characters (e.g., the infinity symbol, http://www.fileformat.info/info/unicode/char/221e/index.htm) are misprinted. Is this an unsupported feature, or is there a workaround for this limitation?

Thanks!
Michael
Duncan Murdoch
2016-Sep-05 10:31 UTC
[Rd] How to print UTF-8 encoded strings from a C routine to R's output?
On 05/09/2016 12:40 AM, Lixin Gong wrote:
> Dear R experts,
>
> It seems that Rprintf has to be used to print from a C routine to guarantee
> to write to R's output according to
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Printing.
>
> However if a string is UTF-8 encoded, non-ASCII characters (e.g., the
> infinity symbol http://www.fileformat.info/info/unicode/char/221e/index.htm)
> are misprinted.
> Is this an unsupported feature or is there a workaround for this limitation?

If you are working in a UTF-8 locale (as on most Unix-like systems), you should be fine. If not (as is normal on Windows), you'll need to translate the string to the local encoding. The Writing R Extensions manual section 6.11 tells you how to do the re-encoding.

Duncan Murdoch
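Editorial sketch of the re-encoding step described above. It is not code from the thread. It uses the standard iconv(3) calls so that it compiles without R; the Re-encoding section of Writing R Extensions documents Riconv_open, Riconv and Riconv_close (header R_ext/Riconv.h), which follow the same argument pattern, so inside a package the same loop could presumably be written with those and the result passed to Rprintf("%s", buf). The function name utf8_to_encoding and the choice of ISO-8859-1 as the example target are assumptions made for illustration only.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to `to_enc`
 * and store a NUL-terminated result in out[0 .. out_size-1].
 * Returns 0 on success. Returns -1 if the converter cannot be opened,
 * the input is malformed, the output buffer is too small, or a character
 * has no exact equivalent in the target encoding. */
int utf8_to_encoding(const char *utf8, const char *to_enc,
                     char *out, size_t out_size)
{
    assert(utf8 != NULL && to_enc != NULL && out != NULL && out_size > 0);

    iconv_t cd = iconv_open(to_enc, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    /* iconv() takes char ** for historical reasons; the input is not modified. */
    char *in_ptr = (char *)utf8;
    size_t in_left = strlen(utf8);
    char *out_ptr = out;
    size_t out_left = out_size - 1;      /* keep one byte for the terminator */

    /* A non-zero result means either an error ((size_t)-1) or that some
     * characters were converted in a non-reversible way. */
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc == 0)
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);  /* flush shift state */
    iconv_close(cd);

    if (rc != 0)
        return -1;

    *out_ptr = '\0';
    return 0;
}
```

A caller would check the return value before printing, because the conversion can only succeed when the target encoding actually contains the character. ISO-8859-1, for example, has no infinity symbol (U+221E), so converting "-∞" to it fails, whereas a string such as "café" converts cleanly.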
Lixin Gong
2016-Sep-06 01:05 UTC
[Rd] How to print UTF-8 encoded strings from a C routine to R's output?
Hi Duncan,

Thanks a lot for your quick reply pointing out the Re-encoding section that I missed!

Before trying out R's C-level interface to iconv's encoding conversion capabilities, I did some quick tests with Encoding() and iconv() on Windows with Rgui and Rterm. After Encoding(), non-ASCII characters are fine with Rgui but still wrong with Rterm. After iconv(), non-ASCII characters are still misprinted in both Rgui and Rterm.

Here is the code that I used:

(neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
(neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
Encoding(neg_inf_utf8)
Encoding(neg_inf_utf8) <- "UTF-8"
Encoding(neg_inf_utf8)
neg_inf_utf8
charToRaw(neg_inf_utf8)
iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)

Here is what I got with Rgui:

> (neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
[1] 2d e2 88 9e
> (neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
[1] "-???"
> Encoding(neg_inf_utf8)
[1] "unknown"
>
> Encoding(neg_inf_utf8) <- "UTF-8"
> Encoding(neg_inf_utf8)
[1] "UTF-8"
> neg_inf_utf8
[1] "-?"
>
> charToRaw(neg_inf_utf8)
[1] 2d e2 88 9e
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
[1] "-8"
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
[[1]]
[1] 2d 38

Here is what I got with Rterm:

> (neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
[1] 2d e2 88 9e
> (neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
[1] "-?^z"
> Encoding(neg_inf_utf8)
[1] "unknown"
>
> Encoding(neg_inf_utf8) <- "UTF-8"
> Encoding(neg_inf_utf8)
[1] "UTF-8"
> neg_inf_utf8
[1] "-8"
>
> charToRaw(neg_inf_utf8)
[1] 2d e2 88 9e
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
[1] "-8"
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
[[1]]
[1] 2d 38

Here is the sessionInfo:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Am I missing something obvious?

Thanks a lot for your help and your time!
Michael
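Editorial sketch, not part of the original message. One way to rule out bad input data is to decode the four test bytes (2d e2 88 9e) by hand and confirm that they are well-formed UTF-8 for U+002D followed by U+221E, the infinity symbol linked in the first post. The decoder below is a minimal illustration; the function name utf8_decode is hypothetical and nothing here is an R API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most n bytes available. On success stores the code point in *cp and
 * returns the number of bytes consumed (1 to 4). Returns 0 if the
 * sequence is truncated, malformed, overlong, a surrogate, or above
 * U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t len;
    uint32_t value, min;

    if (b0 < 0x80) {                     /* 0xxxxxxx: plain ASCII */
        *cp = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {    /* 110xxxxx: 2-byte sequence */
        len = 2; value = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {    /* 1110xxxx: 3-byte sequence */
        len = 3; value = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {    /* 11110xxx: 4-byte sequence */
        len = 4; value = b0 & 0x07; min = 0x10000;
    } else {
        return 0;                        /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                        /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)       /* continuation bytes are 10xxxxxx */
            return 0;
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;                        /* overlong, out of range, or surrogate */

    *cp = value;
    return len;
}
```

Applied to the bytes from the test above, the first call consumes one byte and yields U+002D, and the second consumes three bytes and yields U+221E. That suggests the string itself is valid and that the remaining question is how it is converted to, or displayed in, the native encoding.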