Majid Einian
2014-Feb-04 10:49 UTC
[R] Writing Unicode Text into Text File from R (in Windows)
Dear R Helpers,

See the code:

    a <- intToUtf8(1777)
    show(a)
    zz <- file(description="test.txt",open="w",encoding="UTF-8")
    cat(a, file = zz)
    close(zz)

In a Unicode-aware environment (such as the RGui console or the RStudio console) you will see this as output:

    [1] "۱"

but the character is not written correctly to the file test.txt (which is encoded in UTF-8 without BOM):

    <U+06F1>

The problem seems to be this: R converts the text to the system locale (for me this is Arabic Windows, code page 1256, which has no code for U+06F1), then converts it back to UTF-8 and writes it to the file.

What am I missing here? How can I write a Unicode string into a text file correctly?

Majid Einian,
Economics Researcher, Monetary and Banking Research Institute, Central Bank of Islamic Republic of Iran, Tehran, IRAN
and
PhD Candidate in "Economics", Graduate School of Management and Economics, Sharif University of Technology, Tehran, IRAN
Duncan Murdoch
2014-Feb-04 12:48 UTC
[R] Writing Unicode Text into Text File from R (in Windows)
On 14-02-04 5:49 AM, Majid Einian wrote:
> Dear R Helpers,
>
> See the Code:
>
> a <- intToUtf8(1777)
> show(a)
> zz <- file(description="test.txt",open="w",encoding="UTF-8")
> cat(a, file = zz)
> close(zz)
>
> in a Unicode aware environment (such as RGui console or RStudio Console)
> you will see this as output:
>
> [1] "۱"
>
> but the character is not written correctly in the file test.txt (which is
> encoded in UTF-8 without BOM) :
>
> <U+06F1>
>
> The problem seems to be this: R changes text to the locale of system (for
> me this is Arabic Windows (Codepage 1256) that does not have a relevant
> code for U+06F1, then changes it back to UTF-8 and writes it into file.
> What do I miss here?
> How can I write a Unicode string into a text file correctly?

There are a lot of places in R where it converts strings to the local encoding, perhaps too many. On the other hand, maybe Windows should be offering UTF-8 locales by now.

I haven't tested in your locale, but I believe writeLines() to a connection declared to be in a UTF-8 encoding will maintain the encoding.

You can declare a file to be in encoding "UTF-8-BOM" if you want to ignore a BOM on input; I forget whether it will write one on output. If it doesn't, you can always write one explicitly.

I was hoping to make some progress on this before R 3.1.0 so that more cases of writing strings to UTF-8 files would work, but time is running out.
Duncan Murdoch
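
For readers landing on this thread, here is a minimal sketch of one way to get the UTF-8 bytes into the file unchanged. It is not the exact approach suggested above: instead of declaring the connection as UTF-8, it opens the connection with no encoding= argument and passes useBytes = TRUE to writeLines(), on the assumption that this stops R from translating the string through the native code page first. It has not been tested in the original poster's code page 1256 locale. The helper name write_utf8 and its bom argument are made up for illustration; the BOM is written explicitly with writeBin(), as the reply mentions is possible.

```r
# Hypothetical helper (not from the thread): write text as raw UTF-8 bytes.
write_utf8 <- function(text, path, bom = FALSE) {
  # Binary mode and no encoding= argument, so the connection does no re-encoding.
  con <- file(path, open = "wb")
  on.exit(close(con))
  if (bom) {
    # Explicit UTF-8 byte-order mark: EF BB BF.
    writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
  }
  # enc2utf8() makes sure the strings are UTF-8;
  # useBytes = TRUE asks writeLines() to write them without translation.
  writeLines(enc2utf8(text), con, sep = "\n", useBytes = TRUE)
  invisible(path)
}

a <- intToUtf8(1777)                 # U+06F1, as in the original post
path <- tempfile(fileext = ".txt")   # temporary file instead of test.txt
write_utf8(a, path)

# Inspect the bytes that actually reached the file.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
print(bytes)   # expected bytes: db b1 0a (U+06F1 in UTF-8, then a newline)
```

If the file shows the literal text <U+06F1> instead of the two bytes db b1, the string was still converted through the native encoding somewhere before it was written. Calling write_utf8(a, path, bom = TRUE) puts ef bb bf in front of the same bytes.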