mark.redshaw at evonik.com
2010-May-19 16:35 UTC
[R] Multiple language output - Correct in RGui, wrong in .txt after sink()
I have the following problem with outputting multilingual data to a file. I get (except for Korean) what I expect as a result in the RGui, but when I use sink() to output to a text file I lose the characters in the foreign languages.

I post a small example below. Since I am not sure how well my email system and the list cope with all the different characters, I have additionally created a pdf version of this example.

The first part of the example behaves as I expect for all languages except Korean. I believe that the Korean language may be a problem with the font; it would be great if someone could confirm this?

In the second part, with output to the txt file, I get <U+FF71>-type Unicode escapes as output, not the expected characters. My main problem is: how can I output the characters as I expect?

> RM_EN <- c("Alfalfa hay","Alfalfa meal","Alfalfa silage")
> RM_DE <- c("Luzerneheu","Lurzernegr?nmehl","Luzernesilage")
> RM_RU <- c("?????????? ????","?????????? ???????? ????","???????????????")
> RM_CN <- c("????","????","????")
> RM_JP <- c("?????????","??????? ???","?????????????")
> RM_KR <- c("??? ??","??? ?","??? ????")
>
> RMLANG <- data.frame(RM_EN,RM_DE,RM_RU,RM_CN,RM_JP,RM_KR)
> nrm <- NROW(RMLANG)
>
> for(i in 1:nrm)
+ {
+ cat(format("English", width = 12, justify = c("left")), as.character(RMLANG$RM_EN[i]),"\n",sep="")
+ cat(format("Deutsch", width = 12, justify = c("left")), as.character(RMLANG$RM_DE[i]),"\n",sep="")
+ cat(format("Russian", width = 12, justify = c("left")), as.character(RMLANG$RM_RU[i]),"\n",sep="")
+ cat(format("Japanese", width = 12, justify = c("left")), as.character(RMLANG$RM_JP[i]),"\n",sep="")
+ cat(format("Chinese", width = 12, justify = c("left")), as.character(RMLANG$RM_CN[i]),"\n",sep="")
+ cat(format("Korean", width = 12, justify = c("left")), as.character(RMLANG$RM_KR[i]),"\n","\n","\n",sep="")
+ }
English     Alfalfa hay
Deutsch     Luzerneheu
Russian     ?????????? ????
Japanese    ?????????
Chinese     ????
Korean      ??? ??

English     Alfalfa meal
Deutsch     Lurzernegr?nmehl
Russian     ?????????? ???????? ????
Japanese    ??????? ???
Chinese     ????
Korean      ??? ?

English     Alfalfa silage
Deutsch     Luzernesilage
Russian     ?????????? ?????
Japanese    ??????? ??????
Chinese     ????
Korean      ??? ????

> for(i in 1:nrm)
+ {
+ sink("output.txt")
+ cat(format("English", width = 12, justify = c("left")), as.character(RMLANG$RM_EN[i]),"\n",sep="")
+ cat(format("Deutsch", width = 12, justify = c("left")), as.character(RMLANG$RM_DE[i]),"\n",sep="")
+ cat(format("Japanese", width = 12, justify = c("left")), as.character(RMLANG$RM_JP[i]),"\n",sep="")
+ cat(format("Chinese", width = 12, justify = c("left")), as.character(RMLANG$RM_CN[i]),"\n",sep="")
+ cat(format("Korean", width = 12, justify = c("left")), as.character(RMLANG$RM_KR[i]),"\n","\n","\n",sep="")
+ sink()
+ }
>

Output.txt contains:
""
English     Alfalfa hay
Deutsch     Luzerneheu
Japanese    <U+FF71><U+FF99><U+FF8C><U+FF67><U+FF99><U+FF8C><U+FF67><U+4E7
Chinese     <U+82DC><U+84FF><U+5E72><U+8349>
Korean      <U+C54C><U+D314><U+D30C> <U+AC74><U+CD08>

English     Alfalfa meal
Deutsch     Lurzernegr?nmehl
Japanese    <U+FF71><U+FF99><U+FF8C><U+FF67><U+FF99><U+FF8C><U+FF67> <U+FF
Chinese     <U+82DC><U+84FF><U+8349><U+7C89>
Korean      <U+C54C><U+D314><U+D30C> <U+BC15>

English     Alfalfa silage
Deutsch     Luzernesilage
Japanese    <U+FF71><U+FF99><U+FF8C><U+FF67><U+FF99><U+FF8C><U+FF67> <U+FF
Chinese     <U+82DC><U+84FF><U+9752><U+8D2E>
Korean      <U+C54C><U+D314><U+D30C> <U+C0AC><U+C77C><U+B9AC><U+C9C0>
""

many thanks
Mark Redshaw

Mark Redshaw
Animal Nutrition Services
Evonik Degussa GmbH, HN-M-AN, Rodenbacher Chaussee 4, 63457 Hanau, Germany
Tel: +49 61 81 59 6788
www.aminoacidsandmore.com
Prof Brian Ripley
2010-May-19 21:35 UTC
[R] Multiple language output - Correct in RGui, wrong in .txt after sink()
You haven't given us the 'at a minimum' information asked for in the posting guide (but we can guess you are using Windows), nor do we know the intended encoding of this email (I see no encoding in the header as it reached me, but it seems sensible viewed as UTF-8). And the absence of basic information does make it *really* hard to help here -- this reply is my third guess at what might be happening.

We also do not know the font you are using in RGui, but I am not aware of any Windows font which correctly covers both Russian and CJK. However, it is not just a question of knowing the font name: different versions of Windows, including different language-specific versions, have different fonts with the same name.

RGui (since about R 2.7.0) works in UCS-2 encoding. Sink files work in the locale's encoding (another of the pieces of information you did not tell us, but on Windows it is 8-bit or specific to one of Simplified Chinese, Traditional Chinese, Japanese or Korean -- I'd guess from your address it was CP1252, but it *is* part of the 'at a minimum').

So whereas R can store non-native strings in UTF-8 (provided you get them in as such), it can only output them if told how to: the designer of RGui did so, but you, in using sink('output.txt'), did not.

cat+sink is an inefficient way to write to a file: try using the file= argument on an opened connection. And you can set the encoding on that connection.

I really don't know what you meant by 'the characters as I expect': in a file they have to be in *some* encoding, and you are not looking at bits but at a representation in some unspecified file viewer. One possibility is that you meant UCS-2 (what Windows tends incorrectly to call 'Unicode' files), in which case you can use something like

    con <- file("foo", encoding="UCS-2LE")
    cat(..., file=con)
    ...
    close(con)

You can use a connection with sink() too.
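
As a rough illustration of writing such strings through a connection that is opened once, here is a minimal sketch. It makes several assumptions that are not in the original posts: the test strings are rebuilt from the code points visible in the output.txt listing, the output path is a temporary file, and the target encoding is UTF-8 rather than UCS-2LE. Note also that it does not set encoding= on the connection as suggested above; instead it converts the strings with enc2utf8() and writes them with writeLines(..., useBytes = TRUE), a different technique that keeps the strings from being translated through the locale's encoding on the way to the file.

```r
# Hypothetical test strings, built from \u escapes so that this script is
# plain ASCII. The code points are those visible in the output.txt listing.
rm_en <- "Alfalfa hay"
rm_cn <- "\u82dc\u84ff\u5e72\u8349"
rm_kr <- "\uc54c\ud314\ud30c \uac74\ucd08"

# Build the labelled lines once; enc2utf8() ensures every element is UTF-8.
labels <- format(c("English", "Chinese", "Korean"), width = 12)
lines_out <- enc2utf8(paste0(labels, c(rm_en, rm_cn, rm_kr)))

# Open the connection once, write everything, then close it.
# useBytes = TRUE writes the UTF-8 bytes as they are, without re-encoding.
out_path <- tempfile(fileext = ".txt")  # hypothetical output location
con <- file(out_path, open = "wb")
writeLines(lines_out, con, useBytes = TRUE)
close(con)

# Read the file back, declaring its contents to be UTF-8, and compare.
lines_in <- readLines(out_path, encoding = "UTF-8")
identical(lines_in, lines_out)
```

The resulting file holds UTF-8 bytes, so it needs to be opened in a viewer that is told (or can detect) that encoding.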

Think of it more as a miracle (and much unappreciated hard work and inspired design) that any of this works on Windows, and if you want it to work transparently, change to an OS with UTF-8 locales (these days, just about anything else).

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595