thr3ads.net - R help - [R] Converting two byte encoding to UTF-8 [Mar 2022]

Duncan Murdoch

2022-Mar-19 10:52 UTC

[R] Converting two byte encoding to UTF-8

I have a file that includes Japanese characters encoded using the 
"JIS_X0208-1997" encoding.  According to iconvlist(), an earlier 
revision "JIS_X0208-1990" is supported, so I'd like to try that to
decode them.

However, I can't seem to find how to provide input to iconv() to do it. 
This is a two-byte encoding, so one character has bytes

 > as.raw(result[[1]]$kanji)
[1] b0 a1

But this is being interpreted as two characters by iconv():

 > iconv(as.raw(result[[1]]$kanji), from = "JIS_X0208-1990", to = "UTF-8")
[1] "?" "?"

I can't seem to find any input that iconv() will accept to treat this as
a single character.  (I believe the answer should be 亜, if that helps.)
How do I tell it to use 0xb0a1 (or 0xa1b0, if that's the right byte
order)?  I just see NA:

 >  iconv(0xb0a1, from = "JIS_X0208-1990", to = "UTF-8")
[1] NA
 > iconv(0xa1b0, from = "JIS_X0208-1990", to = "UTF-8")
[1] NA
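
A likely reason for these results is that iconv() passes anything that is not a character vector or a list through as.character() before converting. The sketch below shows the strings that coercion produces; `kanji` is a hypothetical stand-in for result[[1]]$kanji, not the real data.

```r
# Stand-in for result[[1]]$kanji (assumed to hold the two byte values).
kanji <- c(0xb0, 0xa1)

# A bare raw vector is coerced to one hex string per byte,
# so iconv() sees two separate inputs.
raw_as_text <- as.character(as.raw(kanji))
print(raw_as_text)

# A number is coerced to its decimal digits, so iconv() is asked to
# decode the ASCII text of the number, not the bytes 0xb0 0xa1.
num_as_text <- as.character(0xb0a1)
print(num_as_text)
```

If that is what happens, the numeric calls are decoding a five-character string of digits, which presumably cannot be split into two-byte pairs, hence the NA.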

Duncan Murdoch

Duncan Murdoch

2022-Mar-19 12:35 UTC

I have solved it!

First, the bytes I have are offset by 0x80 from what they should
contain.  The actual encoding of 亜 is 0x30 0x21.  But subtracting 0x80
isn't enough; they are still treated as two characters:

 > iconv(as.raw(result[[1]]$kanji-0x80), from = "JIS_X0208-1990", to="UTF-8")
[1] "?" "?"

However, if I put those bytes in a list entry, it works:

 > iconv(list(as.raw(result[[1]]$kanji-0x80)), from = "JIS_X0208-1990", to="UTF-8")
[1] "亜"
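
The same steps as a self-contained sketch, for anyone who wants to try this without the original file. The `kanji` vector is a hypothetical stand-in for result[[1]]$kanji. The EUC-JP branch rests on an assumption about the file, not something confirmed here: bytes with the high bit set in this way match how EUC-JP stores JIS X 0208 characters. Both conversions are guarded because the encoding names available depend on the platform's iconv.

```r
# Stand-in for result[[1]]$kanji (assumed to hold the two byte values).
kanji <- c(0xb0, 0xa1)

# Subtract 0x80 from each byte to get the plain JIS X 0208 code.
jis_bytes <- as.raw(kanji - 0x80)
print(jis_bytes)

# Wrapping the raw vector in a list makes iconv() treat it as the bytes
# of one string instead of coercing it with as.character().
if ("JIS_X0208-1990" %in% iconvlist()) {
  print(iconv(list(jis_bytes), from = "JIS_X0208-1990", to = "UTF-8"))
}

# Assumption: the unmodified bytes are EUC-JP. If so, they can be
# converted directly, without the subtraction.
if ("EUC-JP" %in% iconvlist()) {
  print(iconv(list(as.raw(kanji)), from = "EUC-JP", to = "UTF-8"))
}
```

Each raw vector in the list becomes one string, so a longer run of two-byte characters can go into a single raw vector rather than one list entry per character.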

Duncan Murdoch


