Daniel Possenriede
2017-Aug-01 12:49 UTC
[Rd] special latin1 do not print as glyphs in current devel on windows
Thank you! My apologies again for not including the console output in my earlier message. I have since sent another e-mail with the output, so it should be a bit clearer now what I am seeing. Please let me know if I missed something.

Yes, I am using latin1 and cp1252 interchangeably here, mostly because Encoding() reports the encoding as "latin1". You presumed correctly that my current/default locale's encoding is CP1252. (I also mentioned before that my locale is LC_COLLATE=German_Germany.1252.)

> As you are changing encodings, you do not want to preserve encoding!

I am not interested in preserving encodings. What worries me is that the encoding is no longer marked, i.e. that Encoding() returns "unknown". In cp1252 encoding on Windows (note that I am using the cp1252 escape "\x80" and not the Unicode "\u20AC"):

> x_utf8 <- enc2utf8(c("€", "\x80"))
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown"

See also Kirill's message to this list: "ASCII strings are marked as ASCII internally, but this information doesn't seem to be available, e.g., Encoding() returns "unknown" for such strings" (http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-tp4733523.html).

>> Again, this is not the case with iconv()
>>
>> x_iutf8 <- iconv(x, to = "UTF-8")
>> Encoding(x_iutf8)
>> x_inat <- iconv(x_iutf8, from = "UTF-8")
>> Encoding(x_inat)
>
> iconv is converting from/to the current locale's encoding, presumably
> CP1252, not from the marked encoding (as the help page states explicitly.)

I am aware that iconv does not use the marked encoding: you either have to set the encoding explicitly, or it uses the current locale's default encoding. As I said, I am worried that the encoding markers get lost with the enc2* functions, or rather that they are not set correctly. I am only using the iconv example to show that iconv is able to set the encoding markers correctly, so it seems generally possible.

> x_iutf8 <- iconv(c("€", "\x80"), to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8"
> x_iutf8
[1] "€" "€"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1"
> x_inat
[1] "\u0080" "\u0080"
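The iconv() behaviour described in this message can be sketched in a way that does not depend on the session's locale. This is only an illustration under two assumptions that are not in the original transcripts: the byte is built with rawToChar() instead of being typed, and the source encoding is named explicitly as "CP1252" instead of being taken from the locale default.

```r
# Build the CP1252 euro sign from its raw byte so the example does not
# depend on how the script or the console is encoded.
x <- rawToChar(as.raw(0x80))
Encoding(x)                      # rawToChar() does not set an encoding mark

# Name the source encoding explicitly instead of relying on the locale.
x_utf8 <- iconv(x, from = "CP1252", to = "UTF-8")
Encoding(x_utf8)                 # iconv() marks non-ASCII results when to = "UTF-8"
charToRaw(x_utf8)                # UTF-8 bytes of U+20AC: e2 82 ac

# Round trip back to the single CP1252 byte.
x_back <- iconv(x_utf8, from = "UTF-8", to = "CP1252")
charToRaw(x_back)
```

Naming `from` explicitly sidesteps the question of which encoding R believes the locale uses, which is the part that differs between machines in this thread.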
Patrick Perry
2017-Aug-27 15:40 UTC
[Rd] special latin1 do not print as glyphs in current devel on windows
Regarding the Windows character encoding issues Daniel Possenriede posted about earlier this month, where non-Latin-1 strings were getting marked as "latin1" (https://stat.ethz.ch/pipermail/r-devel/2017-August/074731.html):

The issue is that on Windows, when the character locale is Windows-1252, R marks some (possibly all) native non-ASCII strings as "latin1". I posted a related bug report: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329. The bug report also includes a link to a fix for a related issue: converting strings from Windows native to UTF-8.

There is a work-around for this bug in the current development version of the 'corpus' package (not on CRAN yet). See https://github.com/patperry/r-corpus/issues/5.

I have tested this on a Windows-1252 install of R, but I have not tested it on a Windows install in another locale. It would be great if someone with such an install could test the fix and report back, either here or on the GitHub issue.

Patrick
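When reporting results from another locale, it may help to include what R itself believes about the session. The following is a small sketch using only base R and utils functions. It is not part of the fix or of the 'corpus' work-around. The `codepage` element is accessed defensively because it is assumed here to be reported only on Windows.

```r
# Summarise R's view of the current locale's character encoding.
info <- l10n_info()
info[["Latin-1"]]                # does R consider this a Latin-1 locale?
info[["UTF-8"]]                  # does R consider this a UTF-8 locale?
if (!is.null(info$codepage)) {
  info$codepage                  # code page number, where R reports one
}
Sys.getlocale("LC_CTYPE")        # the locale string itself
localeToCharset()                # charset R infers from the locale name
```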
Daniel Possenriede
2017-Sep-14 07:40 UTC
[Rd] special latin1 do not print as glyphs in current devel on windows
This is a follow-up on my initial posts regarding character encodings on Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and Patrick Perry's reply (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in particular (thank you for the links and the bug report!). My initial posts were quite chaotic (and partly wrong), so I am trying to clear things up a bit.

Actually, the title of my original message "special latin1 [characters] do not print as glyphs in current devel on windows" is already wrong, because the problem concerns characters in the 80-9F (hex) range of CP1252. As Brian Ripley rightly pointed out, latin1 != CP1252. The characters in the 80-9F code point range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for example https://en.wikipedia.org/wiki/Windows-1252. R treats them as if they were, however, and that is exactly the problem, IMHO.

Let me show you what I mean. (All output from R 3.5 r73238, see sessionInfo at the end)

> Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
> x <- c("€", "ž", "š", "ü")
> sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc

"€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also show the "ü" just as an example of a non-ASCII character outside that range (and because Patrick Perry used it in his bug report, which might be a (slightly) different problem, but I will get to that later.)

> print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€" for example should be \u20ac, not \u0080. (In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes. Apparently, as of v3.5, print() calls enc2utf8() (or its equivalent in C (translateCharUTF8?))?)

> print("\u20ac")
[1] "€"

The characters in x are marked as "latin1".

> Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is incorrect for "€", "ž", and "š", which simply do not exist in latin1.

As per the documentation, "enc2utf8 convert[s] elements of character vectors to [...] UTF-8 [...], taking any marked encoding into account." Since the marked encoding is wrong, so is the output of enc2utf8().

> enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

Now, when we set the encoding to "unknown", everything works fine.

> x_un <- x
> Encoding(x_un) <- "unknown"
> print(x_un)
[1] "€" "ž" "š" "ü"
> (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: the characters in the 80 to 9F range should not be marked as "latin1" on CP1252 locales, IMHO.

As a side note: the output of localeToCharset() is also problematic, since ISO8859-1 != CP1252.

> localeToCharset()
[1] "ISO8859-1"

Finally, on to Patrick Perry's bug report (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On Windows, enc2utf8("ü") yields "|".'

Unfortunately, I cannot reproduce this with the CP1252 locale, as can be seen above. This is probably because the bug applies to the C locale (sorry if this is apparent somewhere in the bug report and I missed it).

> Sys.setlocale("LC_CTYPE", "C")
[1] "C"
> enc2utf8("ü")
[1] "|"
> charToRaw("ü")
[1] fc
> Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string, so it seems to me that this is a different problem from the one above.

Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

> sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0
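The difference between latin1 and CP1252 for the four example bytes can be made visible without relying on the session locale. The sketch below is an illustration, not the code path R uses internally. It builds the bytes with rawToChar(), converts them once as "latin1" and once as "CP1252" via iconv(), and prints the resulting Unicode code points. The helper `cp()` is a hypothetical name introduced here for formatting only.

```r
# The four example bytes from the message: 80, 9e, 9a (CP1252-only) and fc.
bytes <- as.raw(c(0x80, 0x9e, 0x9a, 0xfc))
x <- rawToChar(bytes, multiple = TRUE)   # one single-byte string per byte

# Interpret the same bytes under the two encodings.
as_latin1 <- iconv(x, from = "latin1", to = "UTF-8")
as_cp1252 <- iconv(x, from = "CP1252", to = "UTF-8")

# Hypothetical helper: format each one-character string as its code point.
cp <- function(s) {
  sprintf("U+%04X", vapply(s, utf8ToInt, integer(1), USE.NAMES = FALSE))
}

data.frame(byte   = as.character(bytes),
           latin1 = cp(as_latin1),
           cp1252 = cp(as_cp1252))

# iconv() translates from `from` regardless of any declared mark, so a
# vector wrongly marked "latin1" still converts correctly when the source
# encoding is named explicitly.
x_marked <- x
Encoding(x_marked) <- "latin1"
identical(iconv(x_marked, from = "CP1252", to = "UTF-8"), as_cp1252)
```

Only the first three bytes differ between the two columns. The last one (fc) is U+00FC under both encodings, which matches the observation above that "ü" prints correctly while the 80-9F characters do not.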