thr3ads.net - R help - [R] Problem with writing a file in UTF-8 [Feb 2011]

tpklein

2011-Feb-17 21:54 UTC

[R] Problem with writing a file in UTF-8

Hello,

I am working with a data frame containing character strings with many special
symbols from various European languages.  When writing such character
strings to a file using the UTF-8 encoding, some of them are converted in a
strange way.  See the following example, run in R 2.12.1 on Windows 7:

out <- file( description="out.txt", open="w", encoding="UTF-8" )
write( x="???????", file=out )
close( con=out )

The last two symbols in the character string are converted to "uL" while all
other characters are not changed (which is what I want).  How to explain
this?  Does it have something to do with my locale?  And is there a way to
work around this problem? -- Any help would be greatly appreciated.

Thomas
-- 
View this message in context:
http://r.789695.n4.nabble.com/Problem-with-writing-a-file-in-UTF-8-tp3311721p3311721.html
Sent from the R help mailing list archive at Nabble.com.

Matt Shotwell

2011-Feb-21 16:47 UTC

Thomas, 

I wasn't able to reproduce your finding. The last two characters in my
'out.txt' file were just as expected. But I'm in a UTF-8 locale. Your
locale determines the native encoding of characters on your platform. If
you're not in a UTF-8 locale, then characters are converted from your
native encoding to UTF-8 (when you specify encoding="UTF-8"). Information
can be lost in that conversion. You can test whether there is a loss (or
rather, a change) when R writes these characters like so:

# what does ?? look like in binary (hex)?
raw_before <- charToRaw("??")

# write 'out.txt' as before
out <- file(description="out.txt", open="w", encoding="UTF-8")
write(x="??", file=out)
close(con=out)

# read in the two characters
out <- file(description="out.txt", open="r", encoding="UTF-8")
raw_after <- charToRaw(readChar(con=out, nchars=2))
close(con=out)

# compare the raw representations
identical(raw_before, raw_after)

This test passes on my machine. But there's also the question of
whether these characters made it onto the R-help list unaltered. Also,
please include the result of sessionInfo() in your subsequent messages.
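
If the conversion step does turn out to be the culprit, one thing that might be worth trying is to convert the string to UTF-8 in memory and write the bytes untouched, so the connection does no re-encoding of its own. This is only a sketch under assumptions: the sample string and the temporary file are made up for illustration, and \u escapes are used so that the script itself stays plain ASCII.

```r
# Sketch only: write a string as UTF-8 without relying on the connection's
# 'encoding' argument. The sample text and file path are hypothetical.
x <- "\u00fc\u0141"               # two non-ASCII characters, given as escapes

path <- tempfile(fileext = ".txt")

# Convert to UTF-8 first, then write the bytes as they are.
# A binary-mode connection is used so the line ending is always "\n".
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con = con, useBytes = TRUE)
close(con)

# Read the line back, declaring it as UTF-8, and compare with the input.
y <- readLines(path, encoding = "UTF-8")
round_trip_ok <- identical(enc2utf8(x), y)

# The raw bytes on disk: the UTF-8 form of x followed by one newline.
bytes_on_disk <- readBin(path, what = "raw", n = 100)
```

Whether this helps depends on the characters still being intact by the time they reach writeLines(). If they were typed as literals in a script saved in the native encoding, they may already have been changed before this point.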

Best,
Matt
> sessionInfo()
R version 2.11.1 (2010-05-31) 
i686-pc-linux-gnu 

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base  
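
For comparing setups, the locale part of the output above is what matters. A quick way to look at just that, plus the declared encoding of a particular string, could be something like the following. It is a sketch with a made-up sample string, and the first two results will differ from one machine to the next.

```r
# Sketch only: inspect the session's encoding and that of one string.
x <- "\u00fc\u0141"                       # hypothetical sample text

ctype <- Sys.getlocale("LC_CTYPE")        # character-type locale in effect
in_utf8_locale <- l10n_info()[["UTF-8"]]  # TRUE in a UTF-8 locale
declared <- Encoding(x)                   # encoding mark on the string
utf8_bytes <- charToRaw(enc2utf8(x))      # bytes a UTF-8 file should hold

print(ctype)
print(in_utf8_locale)
print(declared)
print(utf8_bytes)
```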

On Thu, 2011-02-17 at 13:54 -0800, tpklein wrote:
> Hello,
> 
> I am working with a data frame containing character strings with many special
> symbols from various European languages.  When writing such character
> strings to a file using the UTF-8 encoding, some of them are converted in a
> strange way.  See the following example, run in R 2.12.1 on Windows 7:
> 
> out <- file( description="out.txt", open="w", encoding="UTF-8" )
> write( x="???????", file=out )
> close( con=out )
> 
> The last two symbols in the character string are converted to "uL" while all
> other characters are not changed (which is what I want).  How to explain
> this?  Does it have something to do with my locale?  And is there a way to
> work around this problem? -- Any help would be greatly appreciated.
> 
> Thomas
