thr3ads.net - R devel - [Rd] writeLines argument useBytes = TRUE still making conversions [Feb 2018]

Ista Zahn

2018-Feb-15 17:16 UTC

[Rd] writeLines argument useBytes = TRUE still making conversions

On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinushey at gmail.com> wrote:
> I suspect your UTF-8 string is being stripped of its encoding before
> write, and so assumed to be in the system native encoding, and then
> re-encoded as UTF-8 when written to the file. You can see something
> similar with:
>
>     > tmp <- 'é'
>     > tmp <- iconv(tmp, to = 'UTF-8')
>     > Encoding(tmp) <- "unknown"
>     > charToRaw(iconv(tmp, to = "UTF-8"))
>     [1] c3 83 c2 a9
>
> It's worth saying that:
>
>     file(..., encoding = "UTF-8")
>
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
>
>     file(..., encoding = "native.enc")
>
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".

If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.

Best,
Ista
>
> Also note that writeLines(..., useBytes = FALSE) will explicitly
> translate to the current encoding before sending bytes to the
> requested connection. In other words, there are two locations where
> translation might occur in your example:
>
>    1) In the call to writeLines(),
>    2) When characters are passed to the connection.
>
> In your case, it sounds like translation should be suppressed at both
> steps.
>
> I think this is documented correctly in ?writeLines (and also the
> Encoding section of ?file), but the behavior may feel unfamiliar at
> first glance.
>
> Kevin
>
> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <davorj at live.com> wrote:
>>
>> I think this behavior is inconsistent with the documentation:
>>
>>   tmp <- 'é'
>>   tmp <- iconv(tmp, to = 'UTF-8')
>>   print(Encoding(tmp))
>>   print(charToRaw(tmp))
>>   tmpfilepath <- tempfile()
>>   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>
>> [1] "UTF-8"
>> [1] c3 a9
>>
>> Raw text as hex: c3 83 c2 a9
>>
>> If I switch to useBytes = FALSE, then the variable is written correctly
>> as c3 a9.
>>
>> Any thoughts? This behavior is related to this issue:
>> https://github.com/yihui/knitr/issues/1509
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

Kevin Ushey

2018-Feb-17 22:19 UTC

From my understanding, translation is implied in this line of ?file (from the
Encoding section):

    The encoding of the input/output stream of a connection can be specified
    by name in the same way as it would be given to iconv: see that help page
    for how to find out what encoding names are recognized on your platform.
    Additionally, "" and "native.enc" both mean the 'native' encoding, that is
    the internal encoding of the current locale and hence no translation is
    done.
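
As an aside on what an unwanted translation looks like: the double-encoded bytes reported at the start of the thread (c3 83 c2 a9) can be reproduced on any platform by naming the source encoding explicitly instead of relying on the locale. This is only a sketch. It assumes that iconv recognizes the names "latin1" and "UTF-8", and the variable names are made up for illustration.

```r
# 'é' as UTF-8 bytes, built from raw values so the script itself stays ASCII.
utf8_bytes <- as.raw(c(0xc3, 0xa9))
x <- rawToChar(utf8_bytes)

# Deliberately misread the two UTF-8 bytes as two Latin-1 characters and
# convert them to UTF-8: each byte becomes its own two-byte sequence.
double_encoded <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(double_encoded))

# Converting back from UTF-8 to Latin-1 recovers the original two bytes.
round_trip <- iconv(double_encoded, from = "UTF-8", to = "latin1")
print(charToRaw(round_trip))
```

The first print shows c3 83 c2 a9 and the second shows c3 a9, which matches the symptom described in the original report.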

This is also hinted at in the documentation in ?readLines for its 'encoding'
argument, which has a different semantic meaning from the 'encoding' argument
as used with R connections:

    encoding to be assumed for input strings. It is used to mark character
    strings as known to be in Latin-1 or UTF-8: it is not used to re-encode
    the input. To do the latter, specify the encoding as part of the
    connection con or via options(encoding=): see the examples.
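
The difference between marking and re-encoding can be seen with a small file whose bytes are known in advance. The sketch below is illustrative rather than authoritative: the file is written with writeBin so that no translation can occur, and the result of the second read depends on the native encoding of the session, so no particular output is claimed for it.

```r
# Write the two UTF-8 bytes of 'é' plus a newline, bypassing any translation.
path <- tempfile()
writeBin(as.raw(c(0xc3, 0xa9, 0x0a)), path)

# 'encoding' on readLines() only marks the result; the bytes are untouched.
marked <- readLines(path, encoding = "UTF-8")
print(Encoding(marked))
print(charToRaw(marked))

# 'encoding' on the connection asks R to convert the stream while reading.
# The resulting bytes depend on the native encoding of the session.
con <- file(path, encoding = "UTF-8")
converted <- readLines(con)
close(con)
print(lapply(converted, charToRaw))

unlink(path)
```

In the first read the string is marked "UTF-8" and still holds the bytes c3 a9. In the second read the connection converts from UTF-8 to the native encoding, so a UTF-8 session and a Latin-1 session may print different bytes.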

It might be useful to augment the documentation in ?file with something like:

    The 'encoding' argument is used to request the translation of strings when
    writing to a connection.

and, perhaps to further drive home the point about not translating when
encoding = "native.enc":

    Note that R will not attempt translation of strings when encoding is
    either "" or "native.enc" (the default, as per getOption("encoding")).
    This implies that attempting to write, for example, UTF-8 encoded content
    to a connection opened using "native.enc" will retain its original UTF-8
    encoding -- it will not be translated.

It is a bit surprising that 'native.enc' means "do not translate" rather than
"attempt translation to the encoding associated with the current locale", but
those are the semantics and they are not bound to change.

This is the code I used to convince myself of that case:

    conn <- file(tempfile(), encoding = "native.enc", open = "w+")

    before <- iconv('é', to = "UTF-8")
    cat(before, file = conn, sep = "\n")
    after <- readLines(conn)

    charToRaw(before)
    charToRaw(after)

with output:

    > charToRaw(before)
    [1] c3 a9
    > charToRaw(after)
    [1] c3 a9
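
Reading the file back with readLines() sends the data through R's input handling a second time, so a check that inspects the bytes on disk with readBin() may be easier to interpret. The sketch below is one way to do that. It assumes that a binary-mode connection ("wb") with the default encoding, combined with useBytes = TRUE, suppresses translation at both stages discussed in this thread. The file and variable names are hypothetical.

```r
# 'é' as a UTF-8 marked string, built from raw bytes to avoid source-file
# encoding issues.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(x) <- "UTF-8"

path <- tempfile()
con <- file(path, open = "wb")        # binary mode, default encoding
writeLines(x, con, useBytes = TRUE)   # no translation inside writeLines()
close(con)

# Inspect exactly what reached the disk, without any decoding step.
on_disk <- readBin(path, what = "raw", n = file.size(path))
print(on_disk)

unlink(path)
```

If translation is suppressed at both stages, the file holds c3 a9 followed by the newline byte 0a. Any other byte sequence suggests that a conversion took place on the way out.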

Best,
Kevin



Kevin Ushey

2018-Feb-17 22:24 UTC

Of course, right after writing this e-mail I tested on my Windows
machine and did not see what I expected:

    > charToRaw(before)
    [1] c3 a9
    > charToRaw(after)
    [1] e9

so obviously I'm misunderstanding something as well.
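
One possible reading of the e9 result is that the string was translated to a Latin-1-style native encoding somewhere between cat() and readLines(). The sketch below does not show where that happened. It only shows that converting the UTF-8 string to Latin-1 yields exactly that byte, and it prints the session details worth comparing between machines. That the Windows session used a Latin-1-compatible code page is an assumption, not something stated in the thread.

```r
# 'é' as a UTF-8 marked string.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(x) <- "UTF-8"

# Explicit conversion to Latin-1 gives the single byte reported on Windows.
as_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
print(charToRaw(as_latin1))

# Session details that determine what "native" means on a given machine.
print(l10n_info())
print(getOption("encoding"))
```

Comparing the l10n_info() output from the two machines should show whether the sessions differ in their native encoding, which would be consistent with the different results.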

Best,
Kevin

