thr3ads.net - R devel - [Rd] String encoding problem [Jul 2016]

Hadley Wickham

2016-Jul-07 15:40 UTC

[Rd] String encoding problem

On Thu, Jul 7, 2016 at 10:11 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 07/07/2016 10:57 AM, Hadley Wickham wrote:
>>
>> If you print:
>>
>> "\xc9\x82\xbf"
>>
>> you get
>>
>>  "\u0242\xbf"
>>
>> But if you try and evaluate that string you get:
>>
>>>  "\u0242\xbf"
>>
>> Error: mixing Unicode and octal/hex escapes in a string is not allowed
>>
>> (Probably will only happen on mac/linux with default utf-8 encoding)
>
>
> I'm not sure what should happen here, but that's not a legal string in a
> UTF-8 locale, so it's not too surprising that things go wonky.

Here's a bit more context on how I got that sequence of bytes:

x <- "こんにちは"
y <- iconv(x, to = "Shift-JIS")
Encoding(y)
y

I did this to create an example to demonstrate how to handle encoding
problems, and it's a bit frustrating that I have to manually mangle the
string in order to be able to re-use it in another session.  Maybe
strings with unknown encoding shouldn't use Unicode escapes?

Hadley

-- 
http://hadley.nz

Simon Urbanek

2016-Jul-07 16:00 UTC

> On Jul 7, 2016, at 11:40 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
>
> On Thu, Jul 7, 2016 at 10:11 AM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>> On 07/07/2016 10:57 AM, Hadley Wickham wrote:
>>>
>>> If you print:
>>>
>>> "\xc9\x82\xbf"
>>>
>>> you get
>>>
>>> "\u0242\xbf"
>>>
>>> But if you try and evaluate that string you get:
>>>
>>>> "\u0242\xbf"
>>>
>>> Error: mixing Unicode and octal/hex escapes in a string is not allowed
>>>
>>> (Probably will only happen on mac/linux with default utf-8 encoding)
>>
>>
>> I'm not sure what should happen here, but that's not a legal string in a
>> UTF-8 locale, so it's not too surprising that things go wonky.
>
> Here's a bit more context on how I got that sequence of bytes:
>
> x <- "こんにちは"
> y <- iconv(x, to = "Shift-JIS")
> Encoding(y)
> y
>
> I did this to create an example to demonstrate how to handle encoding
> problems, and it's a bit frustrating that I have to manually mangle the
> string in order to be able to re-use it in another session.  Maybe
> strings with unknown encoding shouldn't use Unicode escapes?
>
The real issue is that the only supported string encodings in R are native
(=current locale), latin1, and UTF-8. So unless you're running in a Shift-JIS
locale, that encoding is not supported in your R, and the result of the iconv()
above is not a valid R string, just a sequence of bytes that R doesn't know
how to deal with. It tries to interpret it in your locale (UTF-8) just as a
guess, but that doesn't quite work. To illustrate, doing this in the C locale
yields a different result:
> x
[1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
> y
[1] "\202\261\202\361\202\311\202\277\202\315"

If you want a result that does not depend on your locale and is none of the
supported encodings, you have to declare it as bytes (back in UTF-8):
> Encoding(y)="bytes"
> y
[1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
> iconv(y, from="Shift-JIS", to="utf-8")
[1] "こんにちは"

But that has its own perils such as the fact that you cannot dput() byte-encoded
strings.

Cheers,
Simon
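
[Editorial sketch, not part of the original message.] The encoding marks discussed above can be inspected directly. The following is a minimal sketch, assuming the platform's iconv recognises the name "Shift-JIS" as it does in the thread (the accepted spelling varies between iconv implementations). It builds the input from Unicode escapes only, so the script itself does not depend on the session locale, and it converts back from a list of raw vectors rather than from the byte-marked string, which is an editorial choice and not something the thread prescribes.

```r
# Build the UTF-8 input from Unicode escapes only (no hex escapes mixed in).
x <- "\u3053\u3093\u306b\u3061\u306f"
Encoding(x)          # declared encoding of the input

# Convert to Shift-JIS. R has no mark for this encoding, so the result
# is just bytes with no declared encoding.
y <- iconv(x, from = "UTF-8", to = "Shift-JIS")
Encoding(y)

# Declare the result as opaque bytes so that printing it no longer
# involves guessing in the current locale.
Encoding(y) <- "bytes"
Encoding(y)

# Convert back from the raw bytes; iconv() accepts a list of raw vectors.
z <- iconv(list(charToRaw(y)), from = "Shift-JIS", to = "UTF-8")
identical(z, x)
```

If the conversion name is not recognised on a given system, iconv() may fail or return NA, in which case the final comparison will not hold.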

Hadley Wickham

2016-Jul-07 16:15 UTC

>>> I'm not sure what should happen here, but that's not a legal string in a
>>> UTF-8 locale, so it's not too surprising that things go wonky.
>>
>> Here's a bit more context on how I got that sequence of bytes:
>>
>> x <- "こんにちは"
>> y <- iconv(x, to = "Shift-JIS")
>> Encoding(y)
>> y
>>
>> I did this to create an example to demonstrate how to handle encoding
>> problems, and it's a bit frustrating that I have to manually mangle the
>> string in order to be able to re-use it in another session.  Maybe
>> strings with unknown encoding shouldn't use Unicode escapes?
>>
>
> The real issue is that the only supported string encodings in R are
> native (=current locale), latin1, and UTF-8. So unless you're running in a
> Shift-JIS locale, that encoding is not supported in your R, and the result
> of the iconv() above is not a valid R string, just a sequence of bytes that
> R doesn't know how to deal with. It tries to interpret it in your locale
> (UTF-8) just as a guess, but that doesn't quite work. To illustrate, doing
> this in the C locale yields a different result:
>
>> x
> [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
>> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
>> y
> [1] "\202\261\202\361\202\311\202\277\202\315"
>
> If you want a result that does not depend on your locale and is none of the
> supported encodings, you have to declare it as bytes (back in UTF-8):
>
>> Encoding(y)="bytes"
>> y
> [1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
>> iconv(y, from="Shift-JIS", to="utf-8")
> [1] "こんにちは"
>
> But that has its own perils such as the fact that you cannot dput()
> byte-encoded strings.

Right - I'm aware of that.  But to me, it doesn't seem correct to
print a string that is not a valid R string. Why is an unknown
encoding printed like UTF-8?

Hadley

-- 
http://hadley.nz
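
[Editorial sketch, not part of the original thread.] One possible way to carry such a string into another session without relying on how R prints it is to serialise the raw bytes instead of the character string. This is a minimal sketch under the same assumption as the thread, namely that iconv accepts the name "Shift-JIS" on the platform in use. The variable names `code` and `y2` are illustrative only.

```r
# Produce the Shift-JIS bytes from a UTF-8 string written with Unicode escapes.
x <- "\u3053\u3093\u306b\u3061\u306f"
y <- iconv(x, from = "UTF-8", to = "Shift-JIS")

# Deparse the raw vector rather than the string. The result is the text of an
# as.raw(c(...)) call containing only ASCII, so it can be pasted into a script
# and parsed in any locale.
code <- paste(deparse(charToRaw(y)), collapse = "")
cat(code, "\n")

# In the other session: evaluate that text and rebuild the string from bytes.
y2 <- rawToChar(eval(parse(text = code)))

# The rebuilt string holds exactly the same bytes as the original.
identical(charToRaw(y), charToRaw(y2))
```

The trade-off is readability: the deparsed form shows byte values, not characters, but it avoids both the mixed-escape parse error and the dput() limitation for byte-encoded strings mentioned earlier in the thread.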
