thr3ads.net - R devel - [Rd] String encoding problem [Jul 2016]

Hadley Wickham

2016-Jul-07 16:15 UTC

[Rd] String encoding problem

>>> I'm not sure what should happen here, but that's not a legal string in a
>>> UTF-8 locale, so it's not too surprising that things go wonky.
>>
>> Here's a bit more context on how I got that sequence of bytes:
>>
>> x <- "こんにちは"
>> y <- iconv(x, to = "Shift-JIS")
>> Encoding(y)
>> y
>>
>> I did this to create an example to demonstrate how to handle encoding
>> problems, and it's a bit frustrating that I have to manually mangle the
>> string in order to be able to re-use it in another session.  Maybe
>> strings with unknown encoding shouldn't use unicode escapes?
>>
>
> The real issue is that the only supported encodings of strings in R are
> native (= current locale), latin1, and UTF-8. So unless you're running in a
> Shift-JIS locale, that encoding is not supported in your R, so the result of
> the iconv() above is not a valid R string, just a sequence of bytes that R
> doesn't know how to deal with. It tries to interpret it in your locale
> (UTF-8) just as a guess, but that doesn't quite work. To illustrate, doing
> this in the C locale yields a different result:
>
>> x
> [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
>> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
>> y
> [1] "\202\261\202\361\202\311\202\277\202\315"
>
> If you want a result that does not depend on your locale and is none of the
> supported encodings, you have to declare it as bytes (back in UTF-8):
>
>> Encoding(y)="bytes"
>> y
> [1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
>> iconv(y, from="Shift-JIS", to="utf-8")
> [1] "こんにちは"
>
> But that has its own perils such as the fact that you cannot dput()
> byte-encoded strings.

Right - I'm aware of that.  But to me, it doesn't seem correct to
print a string that is not a valid R string. Why is an unknown
encoding printed like UTF-8?
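
For reusing the example in another session, one possible approach is to build the Shift-JIS bytes from raw values, so nothing depends on how a string literal is printed or re-parsed. This is only a sketch: the name `sjis_bytes` is made up for illustration, and it assumes the platform's iconv knows the name "Shift-JIS" and accepts a "bytes"-marked input as it does in the quoted example.

```r
# Shift-JIS bytes for the five hiragana, written as raw values so the source
# stays pure ASCII and does not depend on the session locale.
sjis_bytes <- as.raw(c(0x82, 0xb1, 0x82, 0xf1, 0x82, 0xc9, 0x82, 0xbf, 0x82, 0xcd))

y <- rawToChar(sjis_bytes)
Encoding(y) <- "bytes"   # mark as bytes so R does not guess an encoding

# Convert back to UTF-8 explicitly, naming the source encoding.
x <- iconv(y, from = "Shift-JIS", to = "UTF-8")
Encoding(x)
utf8ToInt(x)   # code points U+3053 U+3093 U+306B U+3061 U+306F, as integers
```

The raw vector can be dput() without trouble, which sidesteps the dput() limitation on byte-encoded strings mentioned above.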

Hadley

-- 
http://hadley.nz

peter dalgaard

2016-Jul-07 16:51 UTC

[Rd] String encoding problem

> On 07 Jul 2016, at 18:15, Hadley Wickham <h.wickham at gmail.com> wrote:
> 
> Right - I'm aware of that.  But to me, it doesn't seem correct to
> print a string that is not a valid R string. Why is an unknown
> encoding printed like UTF-8?
> 
It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead horse,
but it seems to me that there are three alternatives:

- refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
- refuse to print it (print(x) gives "cannot print non-UTF-8 string")
- what happens now

and a fourth one might be to actually allow mixing of \u0007 and \x07 and \007,
but I suspect that there are demons down the line which is why it is not
happening now. (Does it ring a bell with anyone?)
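
To see where the three-byte example stops being valid UTF-8, something like the following could be used. It is a sketch that assumes R 3.3.0 or later, where validUTF8() is available in base; the variable name `bytes` is only illustrative.

```r
# The three bytes from the example, built from raw values.
bytes <- as.raw(c(0xc9, 0x82, 0xbf))

# validUTF8() checks the bytes themselves and ignores any declared encoding.
validUTF8(rawToChar(bytes[1:2]))  # TRUE: 0xc9 0x82 is a complete two-byte sequence
validUTF8(rawToChar(bytes))       # FALSE: the trailing 0xbf has no lead byte
```

So the first two bytes happen to decode as one character, and only the stray continuation byte at the end is left over for a hex escape.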

-pd


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Duncan Murdoch

2016-Jul-07 19:23 UTC

[Rd] String encoding problem

On 07/07/2016 12:51 PM, peter dalgaard wrote:
> > On 07 Jul 2016, at 18:15, Hadley Wickham <h.wickham at gmail.com> wrote:
> >
> > Right - I'm aware of that.  But to me, it doesn't seem correct to
> > print a string that is not a valid R string. Why is an unknown
> > encoding printed like UTF-8?
> >
>
> It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead
> horse, but it seems to me that there are three alternatives:
>
> - refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
> - refuse to print it (print(x) gives "cannot print non-UTF-8 string")
> - what happens now
>
> and a fourth one might be to actually allow mixing of \u0007 and \x07 and
> \007, but I suspect that there are demons down the line which is why it is
> not happening now. (Does it ring a bell with anyone?)

A fifth option would be to use only hex escapes when invalid UTF-8 was
found.  That would echo back the input in this case.  No idea if it
would cause other problems.
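
A rough user-level sketch of what that fifth option might look like is below. It is not how R's printing code works; `hex_escape` and `show_string` are hypothetical helpers written only to illustrate the idea, and validUTF8() assumes R 3.3.0 or later.

```r
# Hypothetical helper: render a string with \x escapes for every byte that is
# not printable ASCII (backslash and double quote are escaped as well).
hex_escape <- function(s) {
  bytes <- charToRaw(s)
  b <- as.integer(bytes)
  out <- sprintf("\\x%02x", b)
  keep <- b >= 0x20 & b < 0x7f & b != 0x5c & b != 0x22
  out[keep] <- rawToChar(bytes[keep], multiple = TRUE)
  paste(out, collapse = "")
}

# Hypothetical display rule: leave valid UTF-8 alone, hex-escape anything else.
show_string <- function(s) if (validUTF8(s)) s else hex_escape(s)

bad <- rawToChar(as.raw(c(0xc9, 0x82, 0xbf)))
cat(show_string(bad), "\n")
```

With this rule the invalid input is echoed back byte for byte, with no \u escape mixed in, so the escaped text could be pasted into a string literal in another session.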

Duncan Murdoch
