Winston Chang
2016-Sep-29 16:38 UTC
[Rd] Problems with sub() due to inability to set encoding of ASCII strings
I'm encountering a problem using sub() on strings in R 3.3.1 in Windows, using both RGui and RStudio. The problem happens when the starting string is ASCII, but the replacement string is UTF-8.

If we create an ASCII string x1, its encoding is marked as "unknown", and doing a sub() on that string with a UTF-8 replacement results in weird characters:

  x1 <- "a b c"
  Encoding(x1)
  # [1] "unknown"
  replacement <- "??"
  Encoding(replacement)
  # [1] "UTF-8"
  (y1 <- sub("a", replacement, x1))
  # [1] "?????? b c"
  Encoding(y1)
  # [1] "unknown"

If the starting string x2 has Chinese characters, it'll be marked as UTF-8, and replacement works fine:

  x2 <- "a b c ??"
  Encoding(x2)
  # [1] "UTF-8"
  (y2 <- sub("a", replacement, x2))
  # [1] "?? b c ??"
  Encoding(y2)
  # [1] "UTF-8"

It seems like the solution should be to mark the starting string as UTF-8, but apparently it doesn't work if the string is ASCII, and so the sub() still gives weird characters:

  # Not possible to mark x1 as UTF-8
  Encoding(x1) <- "UTF-8"
  Encoding(x1)
  # [1] "unknown"
  (y3 <- sub("a", replacement, x1))
  # [1] "?????? b c"
  Encoding(y3)
  # [1] "unknown"

It is possible to tell R that the final string y3 is UTF-8, but it doesn't seem like this should be necessary:

  Encoding(y3) <- "UTF-8"
  y3
  # [1] "?? b c"

Is there some way to mark the starting string x1 as UTF-8 so that the result of sub() comes out marked as UTF-8? If the inputs are both UTF-8, it shouldn't be necessary to explicitly tell R that the output is also UTF-8.

-Winston
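
For illustration only, the workaround described above (marking the result as UTF-8 after the fact) could be wrapped in a small helper. This is a minimal sketch, not a fix for the underlying behaviour: the name sub_utf8 is hypothetical, the Chinese characters in the usage example are written as \u escapes chosen purely for illustration, and the helper assumes that both inputs can safely be converted to UTF-8 with enc2utf8().

```r
# Hypothetical helper (not part of base R): run sub() on UTF-8 inputs and
# explicitly mark the result as UTF-8.
sub_utf8 <- function(pattern, replacement, x, ...) {
  # enc2utf8() converts non-ASCII strings to UTF-8; pure ASCII strings pass
  # through unchanged and stay marked "unknown".
  out <- sub(pattern, enc2utf8(replacement), enc2utf8(x), ...)
  # Declare the result bytes to be UTF-8. This has no effect on elements
  # that are pure ASCII, since R never marks ASCII strings.
  Encoding(out) <- "UTF-8"
  out
}

# Illustrative replacement text, written with escapes so that the script
# itself stays ASCII.
replacement <- "\u4e2d\u6587"
y <- sub_utf8("a", replacement, "a b c")
Encoding(y)
```

Because the marking happens inside the helper, callers do not need to remember to call Encoding<- on every result. It does not address the question of why sub() leaves the result unmarked in the first place.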