>>> I'm not sure what should happen here, but that's not a legal string in a
>>> UTF-8 locale, so it's not too surprising that things go wonky.
>>
>> Here's a bit more context on how I got that sequence of bytes:
>>
>> x <- "こんにちは"
>> y <- iconv(x, to = "Shift-JIS")
>> Encoding(y)
>> y
>>
>> I did this to create an example to demonstrate how to handle encoding
>> problems, and it's a bit frustrating that I have to manually mangle the
>> string in order to be able to re-use it in another session. Maybe
>> strings with unknown encoding shouldn't use unicode escapes?
>>
>
> The real issue is that the only supported encodings of strings in R are native (= current locale), latin1, and UTF-8. So unless you're running in a Shift-JIS locale, that encoding is not supported in your R, so the result of the iconv() above is not a valid R string, just a sequence of bytes that R doesn't know how to deal with. It tries to interpret it in your locale (UTF-8) just as a guess, but that doesn't quite work. To illustrate, doing this in the C locale yields a different result:
>
>> x
> [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
>> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
>> y
> [1] "\202\261\202\361\202\311\202\277\202\315"
>
> If you want a result that does not depend on your locale and is none of the supported encodings, you have to declare it as bytes (back in UTF-8):
>
>> Encoding(y)="bytes"
>> y
> [1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
>> iconv(y, from="Shift-JIS", to="utf-8")
> [1] "こんにちは"
>
> But that has its own perils, such as the fact that you cannot dput() byte-encoded strings.

Right - I'm aware of that. But to me, it doesn't seem correct to
print a string that is not a valid R string. Why is an unknown
encoding printed like UTF-8?

Hadley

--
http://hadley.nz
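
A minimal sketch of one way to carry the Shift-JIS example between sessions without depending on how R prints the string: keep the bytes in a raw vector, which dput() can write out, and convert only when needed. The byte values are the ones in the octal output quoted above; the variable names are made up for illustration, and the sketch assumes the local iconv recognises the name "Shift-JIS", as it did in the quoted session.

```r
# Shift-JIS bytes for U+3053 U+3093 U+306B U+3061 U+306F, copied from the
# octal escapes in the C-locale output (\202\261 is 0x82 0xb1, and so on).
sjis <- as.raw(c(0x82, 0xb1, 0x82, 0xf1, 0x82, 0xc9, 0x82, 0xbf, 0x82, 0xcd))

# A raw vector is plain data, so dput() writes something that can be pasted
# into another session regardless of locale.
dput(sjis)

# The byte sequence is not valid UTF-8, which is why treating it as a
# string in a UTF-8 locale goes wrong.
validUTF8(rawToChar(sjis))

# iconv() accepts a list of raw vectors, so the conversion never has to go
# through a string in an unsupported encoding.
utf8 <- iconv(list(sjis), from = "Shift-JIS", to = "UTF-8")

# Compare against the same text written with ASCII-only \u escapes.
expected <- "\u3053\u3093\u306b\u3061\u306f"
identical(charToRaw(utf8), charToRaw(expected))
```

Going the other way, iconv(x, from = "UTF-8", to = "Shift-JIS", toRaw = TRUE) returns a list of raw vectors rather than a string, so the encoded form can stay as bytes throughout.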
> On 07 Jul 2016, at 18:15 , Hadley Wickham <h.wickham at gmail.com> wrote:
>
> Right - I'm aware of that. But to me, it doesn't seem correct to
> print a string that is not a valid R string. Why is an unknown
> encoding printed like UTF-8?
>

It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead horse, but it seems to me that there are three alternatives:

- refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
- refuse to print it (print(x) gives "cannot print non-UTF-8 string")
- what happens now

and a fourth one might be to actually allow mixing of \u0007 and \x07 and \007, but I suspect that there are demons down the line which is why it is not happening now. (Does it ring a bell with anyone?)

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
On 07/07/2016 12:51 PM, peter dalgaard wrote:
>
> > On 07 Jul 2016, at 18:15 , Hadley Wickham <h.wickham at gmail.com> wrote:
> >
> > Right - I'm aware of that. But to me, it doesn't seem correct to
> > print a string that is not a valid R string. Why is an unknown
> > encoding printed like UTF-8?
> >
>
> It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead horse, but it seems to me that there are three alternatives:
>
> - refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
> - refuse to print it (print(x) gives "cannot print non-UTF-8 string")
> - what happens now
>
> and a fourth one might be to actually allow mixing of \u0007 and \x07 and \007, but I suspect that there are demons down the line which is why it is not happening now. (Does it ring a bell with anyone?)

A fifth option would be to use only hex escapes when invalid UTF-8 was found. That would echo back the input in this case. No idea if it would cause other problems.

Duncan Murdoch
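
The fifth option can be sketched at user level, purely as an illustration and not as a description of how R's own printing works or would be changed. The helper name hex_escape_invalid is hypothetical. It returns valid UTF-8 input untouched; otherwise it writes every byte outside printable ASCII as a \x escape. It does not escape backslashes or quotes, which a real implementation would need to handle.

```r
# Hypothetical helper: show a single string using only hex escapes when
# its bytes are not valid UTF-8; valid input is returned unchanged.
hex_escape_invalid <- function(s) {
  stopifnot(is.character(s), length(s) == 1L)
  if (validUTF8(s)) return(s)

  b <- as.integer(charToRaw(s))          # byte values, 0..255
  printable <- b >= 0x20 & b < 0x7f      # printable ASCII stays as-is

  out <- sprintf("\\x%02x", b)           # default: \xNN for every byte
  out[printable] <- rawToChar(as.raw(b[printable]), multiple = TRUE)
  paste(out, collapse = "")
}

# The three bytes from the example in the quoted message, built from raw
# values so that this file itself stays pure ASCII.
x <- rawToChar(as.raw(c(0xc9, 0x82, 0xbf)))
writeLines(hex_escape_invalid(x))        # prints \xc9\x82\xbf
```

Here 0xc9 0x82 on its own would be a valid two-byte UTF-8 sequence; it is the trailing 0xbf that makes the whole string invalid, so all three bytes are shown as hex escapes, which echoes back the input as typed.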