Davor Josipovic
2018-Feb-15 07:36 UTC
[Rd] writeLines argument useBytes = TRUE still making conversions
I think this behavior is inconsistent with the documentation:

    tmp <- 'é'
    tmp <- iconv(tmp, to = 'UTF-8')
    print(Encoding(tmp))
    print(charToRaw(tmp))
    tmpfilepath <- tempfile()
    writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)

    [1] "UTF-8"
    [1] c3 a9

Raw text as hex: c3 83 c2 a9

If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.

Any thoughts? This behavior is related to this issue: https://github.com/yihui/knitr/issues/1509
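For anyone reproducing this, here is a minimal sketch of one way to look at the bytes that actually end up in a file. The helper name file_bytes is made up for illustration and is not part of the original post; the demo writes two known bytes with writeBin() so the result does not depend on the session's native encoding.

```r
# Hypothetical helper (not from the original post): return a file's raw bytes.
file_bytes <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

# Write two known bytes (the UTF-8 encoding of U+00E9) and read them back.
demo_path <- tempfile()
writeBin(as.raw(c(0xc3, 0xa9)), demo_path)
demo_bytes <- file_bytes(demo_path)
print(demo_bytes)
```

Calling file_bytes(tmpfilepath) after the writeLines() call is one way to get a hex dump like the one reported above; the exact bytes will depend on the native encoding of the session.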
Kevin Ushey
2018-Feb-15 16:19 UTC
[Rd] writeLines argument useBytes = TRUE still making conversions
I suspect your UTF-8 string is being stripped of its encoding before write, and so assumed to be in the system native encoding, and then re-encoded as UTF-8 when written to the file. You can see something similar with:

    > tmp <- 'é'
    > tmp <- iconv(tmp, to = 'UTF-8')
    > Encoding(tmp) <- "unknown"
    > charToRaw(iconv(tmp, to = "UTF-8"))
    [1] c3 83 c2 a9

It's worth saying that:

    file(..., encoding = "UTF-8")

means "attempt to re-encode strings as UTF-8 when writing to this file". However, if you already know your text is UTF-8, then you likely want to avoid opening a connection that might attempt to re-encode the input. Conversely (assuming I'm understanding the documentation correctly)

    file(..., encoding = "native.enc")

means "assume that strings are in the native encoding, and hence translation is unnecessary". Note that it does not mean "attempt to translate strings to the native encoding".

Also note that writeLines(..., useBytes = FALSE) will explicitly translate to the current encoding before sending bytes to the requested connection. In other words, there are two locations where translation might occur in your example:

1) In the call to writeLines(),
2) When characters are passed to the connection.

In your case, it sounds like translation should be suppressed at both steps.

I think this is documented correctly in ?writeLines (and also the Encoding section of ?file), but the behavior may feel unfamiliar at first glance.

Kevin
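A minimal sketch of the two points in this reply, under stated assumptions. The string is built from raw bytes so the example does not depend on how the script file is encoded, and from = "latin1" is an assumption standing in for a Latin-1-like native encoding (the console example in the reply relied on the session's native encoding instead). The second part shows one way to suppress translation at both steps: a connection opened with encoding = "native.enc" together with useBytes = TRUE.

```r
# c3 a9 is the UTF-8 encoding of U+00E9; mark the string as UTF-8.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(x) <- "UTF-8"

# 1) Double encoding: drop the encoding mark, then convert the same bytes as
#    if they were Latin-1. "latin1" is an assumption for illustration.
unmarked <- x
Encoding(unmarked) <- "unknown"
double_encoded <- charToRaw(iconv(unmarked, from = "latin1", to = "UTF-8"))
print(double_encoded)

# 2) Suppress translation at both steps: the connection does no re-encoding
#    ("native.enc") and writeLines() passes the bytes through (useBytes = TRUE).
path <- tempfile()
con <- file(path, open = "wb", encoding = "native.enc")
writeLines(x, con, useBytes = TRUE)
close(con)
written <- readBin(path, what = "raw", n = file.size(path))
print(written)
```

Here double_encoded comes out as c3 83 c2 a9, the same bytes reported in the original post, while written holds c3 a9 followed by the newline byte 0a that writeLines() appends.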
Ista Zahn
2018-Feb-15 17:16 UTC
[Rd] writeLines argument useBytes = TRUE still making conversions
On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinushey at gmail.com> wrote:
> It's worth saying that:
>
>     file(..., encoding = "UTF-8")
>
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
>
>     file(..., encoding = "native.enc")
>
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".

If all that is true I think ?file needs some attention. I've read it several times now and I just don't see how it can be interpreted as you've described it.

Best,
Ista