thr3ads.net - R help - [R] sink() and UTF-8 on non-UTF-8 systems [Apr 2014]


Milan Bouchet-Valat

2014-Apr-11 15:49 UTC

[R] sink() and UTF-8 on non-UTF-8 systems

Hi!

In the series "dealing with encoding madness on hostile systems", I'm
looking for help as regards capturing R UTF-8 output on a system where
the locale is not using UTF-8, and where some characters cannot even be
represented using the locale encoding. The case I have in mind is
printing a character vector with Russian text to the R Commander output
window on an English/French (CP1252) Windows system.

Here's a code snippet illustrating the problem:
> "\U41F"
[1] "П" # OK
> con <- file(open="w+", encoding="UTF-8")
> capture.output(cat("\U41F"), file=con)
> readLines(con, encoding="UTF-8")
[1] "<U+041F>" # Not OK

(same result without specifying 'encoding')


Now I have read ?sink and it is quite explicit about how this works:
> If file is a character string, the file will be opened using the
> current encoding. If you want a different encoding (e.g. to represent
> strings which have been stored in UTF-8), use a file connection -- but
> some ways to produce R output will already have converted such strings
> to the current encoding.

The last words seem to apply to the case above, i.e. somewhere in the
process the UTF-8 string is converted to the locale encoding. Is there
any solution to get the correct output?
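
For comparison, here is a minimal sketch (not part of the original post, and not a fix for sink() itself) of a path that never passes through cat() or print(): the string is written straight to a binary-mode file connection with useBytes=TRUE, so no conversion to the locale encoding should take place. The names x, tmp and back are only illustrative, and this assumes you have the character vector at hand rather than needing to capture arbitrary printed output.

```r
# Sketch: write a UTF-8 string to a file without going through sink()/cat().
x <- "\u041f"                      # Cyrillic capital letter Pe (U+041F)
tmp <- tempfile(fileext = ".txt")  # illustrative scratch file

con <- file(tmp, open = "wb")      # binary mode, default (native) encoding:
                                   # the connection does no re-encoding
# enc2utf8() makes sure the bytes are UTF-8; useBytes = TRUE writes them as-is
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the line back and declare its bytes to be UTF-8
back <- readLines(tmp, encoding = "UTF-8")

# Compare with the UTF-8 byte sequence of U+041F, which is d0 9f
ok <- identical(charToRaw(back), as.raw(c(0xd0, 0x9f)))
```

If ok is TRUE, the bytes survived the round trip, which would suggest that the <U+041F> escape in the snippet above is produced on the output side (cat() writing through the sink) rather than by the file connection.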


Thanks

> sessionInfo()
R Under development (unstable) (2014-04-10 r65396)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Milan Bouchet-Valat

2014-Apr-14 16:27 UTC


[R] sink() and UTF-8 on non-UTF-8 systems

Suggestions, anyone?

On Friday 11 April 2014 at 17:49 +0200, Milan Bouchet-Valat wrote:
> Hi!
> 
> In the series "dealing with encoding madness on hostile systems", I'm
> looking for help as regards capturing R UTF-8 output on a system where
> the locale is not using UTF-8, and where some characters cannot even be
> represented using the locale encoding. The case I have in mind is
> printing a character vector with Russian text to the R Commander output
> window on an English/French (CP1252) Windows system.
> 
> Here's a code snippet illustrating the problem:
> > "\U41F"
> [1] "П" # OK
> > con <- file(open="w+", encoding="UTF-8")
> > capture.output(cat("\U41F"), file=con)
> > readLines(con, encoding="UTF-8")
> [1] "<U+041F>" # Not OK
> 
> (same result without specifying 'encoding')
> 
> 
> Now I have read ?sink and it is quite explicit about how this works:
> > If file is a character string, the file will be opened using the
> > current encoding. If you want a different encoding (e.g. to represent
> > strings which have been stored in UTF-8), use a file connection -- but
> > some ways to produce R output will already have converted such strings
> > to the current encoding. 
> 
> The last words seem to apply to the case above, i.e. somewhere in the
> process the UTF-8 string is converted to the locale encoding. Is there
> any solution to get the correct output?
> 
> 
> Thanks
> 
> 
> > sessionInfo()
> R Under development (unstable) (2014-04-10 r65396)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> 
> locale:
> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
> [5] LC_TIME=French_France.1252    
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
