thr3ads.net - R devel - [Rd] special latin1 do not print as glyphs in current devel on windows [Sep 2017]

Daniel Possenriede

2017-Sep-14 07:40 UTC

[Rd] special latin1 do not print as glyphs in current devel on windows

This is a follow-up on my initial posts regarding character encodings on 
Windows (stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) 
and Patrick Perry's reply 
(stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in 
particular (thank you for the links and the bug report!). My initial 
posts were quite chaotic (and partly wrong), so I am trying to clear 
things up a bit.

Actually, the title of my original message "special latin1 [characters] 
do not print as glyphs in current devel on windows" is already wrong, 
because the problem concerns characters in the 80-9F (hex) range of 
CP1252. As Brian Ripley rightly pointed out, latin1 != CP1252. The 
CP1252 characters in the 80-9F code point range are not even part of 
ISO/IEC 8859-1 a.k.a. latin1, see for example 
en.wikipedia.org/wiki/Windows-1252. R treats them as if they 
were, however, and that is exactly the problem, IMHO.

Let me show you what I mean. (All output from R 3.5 r73238, see 
sessionInfo at the end)

 > Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
 > x <- c("€", "ž", "š", "ü")
 > sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc

"€", "ž", and "š" serve as examples in the 80-9F range of CP1252. I also 
show the "ü" just as an example of a non-ASCII character outside that 
range (and because Patrick Perry used it in his bug report, which might 
be a (slightly) different problem, but I will get to that later).

 > print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for 
example, should be \u20ac, not \u0080.
(In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes. 
Apparently, as of v3.5, print() calls enc2utf8() or its equivalent in C 
(translateCharUTF8?).)

 > print("\u20ac")
[1] "€"

The characters in x are marked as "latin1".

 > Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is 
incorrect for "€", "ž", and "š", which simply do not exist in latin1.

As per the documentation, "enc2utf8 convert[s] elements of character 
vectors to [...] UTF-8 [...], taking any marked encoding into account." 
Since the marked encoding is wrong, so is the output of enc2utf8().

 > enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"
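
The same byte-level effect can be illustrated outside R. The following is a minimal Python sketch, not part of the original R session. It assumes Python's built-in "latin-1" and "cp1252" codecs as stand-ins for the two interpretations and reuses the four bytes from the charToRaw() output above; the helper name code_points is made up for this illustration.

```python
# Bytes taken from the charToRaw() output above.
raw = bytes([0x80, 0x9E, 0x9A, 0xFC])


def code_points(data: bytes, encoding: str) -> list:
    """Decode `data` with `encoding` and return the code points as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(encoding)]


as_latin1 = code_points(raw, "latin-1")  # what a "latin1" mark implies
as_cp1252 = code_points(raw, "cp1252")   # what the CP1252 locale actually means

print(as_latin1)  # ['U+0080', 'U+009E', 'U+009A', 'U+00FC']
print(as_cp1252)  # ['U+20AC', 'U+017E', 'U+0161', 'U+00FC']
```

Only the last byte (FC) maps to the same code point under both interpretations, which matches the escapes shown for the first three elements.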

Now, when we set the encoding to "unknown" everything works fine.

 > x_un <- x
 > Encoding(x_un) <- "unknown"
 > print(x_un)
[1] "€" "ž" "š" "ü"
 > (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: The characters in the 80 to 9F range should not be 
marked as "latin1" on CP1252 locales, IMHO.
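
One way to picture this proposal: only bytes in the 80-9F range decode differently under the two encodings, so a string without those bytes could carry a "latin1" mark without harm. A minimal Python sketch under that assumption follows. differs() and safe_to_mark_latin1() are hypothetical helpers for illustration, not anything R provides, and Python's cp1252 codec (which leaves five positions in that range undefined) is used as the reference table.

```python
def differs(byte: int) -> bool:
    """True if `byte` decodes differently, or not at all, under CP1252 vs. latin1."""
    b = bytes([byte])
    try:
        return b.decode("cp1252") != b.decode("latin-1")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined.
        return True


# Every byte value whose meaning depends on which of the two encodings is assumed.
mismatched = [b for b in range(256) if differs(b)]


def safe_to_mark_latin1(data: bytes) -> bool:
    """Hypothetical rule: a "latin1" mark is harmless if no byte is ambiguous."""
    return not any(b in mismatched for b in data)


print(hex(mismatched[0]), hex(mismatched[-1]), len(mismatched))  # 0x80 0x9f 32
print(safe_to_mark_latin1(b"\xfc"))  # True  (0xFC is the same character in both)
print(safe_to_mark_latin1(b"\x80"))  # False (0x80 is the euro sign only in CP1252)
```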

As a side-note: the output of localeToCharset() is also problematic, 
since ISO8859-1 != CP1252.

 > localeToCharset()
[1] "ISO8859-1"

Finally on to Patrick Perry's bug report 
(bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On 
Windows, enc2utf8("ü") yields "|".'

Unfortunately, I cannot reproduce this with the CP1252 locale, as can be 
seen above, probably because the bug applies to the C locale (sorry if 
this is apparent somewhere in the bug report and I missed it).

 > Sys.setlocale("LC_CTYPE", "C")
[1] "C"
 > enc2utf8("ü")
[1] "|"
 > charToRaw("ü")
[1] fc
 > Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string, 
so it seems to me that this is a different problem than the one above.

Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

 > sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

Patrick Perry

2017-Sep-14 11:47 UTC

This particular issue has a simple fix. Currently, the "R_check_locale" 
function includes the following code starting at line 244 in 
src/main/platform.c:

#ifdef Win32
     {
     char *ctype = setlocale(LC_CTYPE, NULL), *p;
     p = strrchr(ctype, '.');
     if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
     /* Not 100% correct, but CP1252 is a superset */
     known_to_be_latin1 = latin1locale = (localeCP == 1252);
     }
#endif

The "1252" should be "28591"; see 
msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx.
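
The effect of the proposed change can be sketched outside the R sources. The Python below is an illustrative transliteration of the excerpt's logic, not R's actual code: locale_code_page() mimics the strrchr()/isdigit()/atoi() steps, while the two is_latin1_* helpers and the locale name "German_Germany.28591" are hypothetical and introduced only for this example.

```python
import re

CP_WINDOWS_1252 = 1252  # Windows-1252 code page
CP_ISO_8859_1 = 28591   # Windows identifier for ISO 8859-1 (latin1)


def locale_code_page(ctype: str) -> int:
    """Return the number that follows the last '.', or 0 if there is none."""
    # Last dot, digits right after it, then anything without another dot.
    m = re.search(r"\.([0-9]+)[^.]*$", ctype)
    return int(m.group(1)) if m else 0


def is_latin1_current(ctype: str) -> bool:
    """Check as in the quoted excerpt: CP1252 is treated as latin1."""
    return locale_code_page(ctype) == CP_WINDOWS_1252


def is_latin1_proposed(ctype: str) -> bool:
    """Check with the suggested constant: only code page 28591 counts as latin1."""
    return locale_code_page(ctype) == CP_ISO_8859_1


for name in ("German_Germany.1252", "German_Germany.28591", "C"):
    print(name, locale_code_page(name), is_latin1_current(name), is_latin1_proposed(name))
```

Under the proposed check, a Windows-1252 locale would no longer be flagged as latin1, which is the situation in which the "unknown"-marked strings earlier in the thread converted correctly.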

Patrick Perry

2017-Nov-12 20:34 UTC

Just following up on this since the associated bug report 
(bugs.r-project.org/bugzilla/show_bug.cgi?id=17329) just got closed 
because my original bug report was incomplete and did not include 
sessionInfo() or LC_CTYPE.

Admittedly, my original bug report was a little confused. I have since 
gained a better understanding of the issue. I want to (a) confirm that 
this is a real bug in base R, not RStudio, and (b) provide more context. 
It looks like the real issue is that R marks native strings as "latin1" 
when the declared character locale is Windows-1252. This causes problems 
when converting to UTF-8. See Daniel Possenriede's email earlier in this 
thread for much more detail, including his sessionInfo() and a 
reproducible example.

The development version of the `stringi` package and the CRAN version of 
the `utf8` package both have workarounds for this bug. (See, e.g. 
github.com/gagolews/stringi/issues/287 and the links to the 
related issues).


Patrick
> Patrick Perry <mailto:pperry at stern.nyu.edu>
> August 27, 2017 at 11:40 AM
> Regarding the Windows character encoding issues Daniel Possenriede 
> posted about earlier this month, where non-Latin-1 strings were 
> getting marked as "latin1" 
> (stat.ethz.ch/pipermail/r-devel/2017-August/074731.html):
>
> The issue is that on Windows, when the character locale is 
> Windows-1252, R marks some (possibly all) native non-ASCII strings as 
> "latin1". I posted a related bug report: 
> bugs.r-project.org/bugzilla/show_bug.cgi?id=17329. The bug 
> report also includes a link to a fix for a related issue: converting 
> strings from Windows native to UTF-8.
>
> There is a work-around for this bug in the current development version 
> of the 'corpus' package (not on CRAN yet). See 
> github.com/patperry/r-corpus/issues/5. I have tested this on 
> a Windows-1252 install of R, but I have not tested it on a Windows 
> install in another locale. It'd be great if someone with such an 
> install would test the fix and report back, either here or on the 
> GitHub issue.
>
>
> Patrick
