thr3ads.net - R devel - [Rd] special latin1 do not print as glyphs in current devel on windows [Aug 2017]


Daniel Possenriede

2017-Aug-01 09:19 UTC

[Rd] special latin1 do not print as glyphs in current devel on windows

Upon further inspection, I think there are at least two problems.
First the issue with printing latin1/cp1252 characters in the "80" to "9F"
code range.

x <- c("€", "–", "‰")
Encoding(x)
print(x)

I assume that these are Unicode escapes!? (Given that Encoding(x) shows
"latin1" I'd rather expect latin1/cp1252 escapes here, but these would be
e.g. "\x80", right? My locale is LC_COLLATE=German_Germany.1252 btw.)
Now I don't know why print tries to convert to Unicode, but if these indeed
are Unicode escapes, then there is something wrong with the conversion from
cp1252 to Unicode.
In general, most cp1252 char codes translate to Unicode like CP1252: "00"
-> Unicode "0000", "01" -> "0001", "02" -> "0002", etc. see
http://www.cp1252.com/.
The exception is the cp1252 "80" to "9F" code range. E.g. the Euro sign is
"80" in cp1252 but "20AC" in Unicode, endash "96" in cp1252, "2013" in
Unicode.
The same error seems to happen with

enc2utf8(x)

Now with iconv() the result is as expected.

iconv(x, to = "UTF-8")
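
The difference between the two mappings can be checked directly with iconv() by naming the source encoding explicitly. This is only a minimal sketch and says nothing about what print() or enc2utf8() do internally. It assumes the local iconv implementation accepts the names "CP1252" and "ISO-8859-1"; the variable and helper names are illustrative.

```r
# Bytes 0x80, 0x96, 0x89 as three single-byte strings (no encoding mark).
x_raw <- rawToChar(as.raw(c(0x80, 0x96, 0x89)), multiple = TRUE)

# Interpret the same bytes as CP1252, then as strict Latin-1 (ISO-8859-1).
as_cp1252 <- iconv(x_raw, from = "CP1252", to = "UTF-8")
as_latin1 <- iconv(x_raw, from = "ISO-8859-1", to = "UTF-8")

# Unicode code point of each single-character result.
code_point <- function(s) vapply(s, utf8ToInt, integer(1), USE.NAMES = FALSE)

cp1252_points <- code_point(as_cp1252)
latin1_points <- code_point(as_latin1)

sprintf("U+%04X", cp1252_points)  # "U+20AC" "U+2013" "U+2030"
sprintf("U+%04X", latin1_points)  # "U+0080" "U+0096" "U+0089"
```

Read as CP1252, the bytes map to the Euro sign, en dash and per mille sign; read as strict Latin-1, they map to C1 control code points, which have no glyph and are shown as escapes.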


The second problem IMO is that encoding markers get lost with the enc2*
functions

x_utf8 <- enc2utf8(x)
Encoding(x_utf8)
x_nat <- enc2native(x_utf8)
Encoding(x_nat)

Again, this is not the case with iconv()

x_iutf8 <- iconv(x, to = "UTF-8")
Encoding(x_iutf8)
x_inat <- iconv(x_iutf8, from = "UTF-8")
Encoding(x_inat)
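
What Encoding() reports at each step can be inspected with a string whose mark is set by hand, as in the ?Encoding examples. This is a small sketch under that assumption; the mark after enc2native() depends on the session's locale, so no output is claimed for that step.

```r
x <- "fa\xE7ile"            # byte 0xE7 is c-cedilla in Latin-1
Encoding(x) <- "latin1"     # declare the encoding explicitly

x_utf8 <- enc2utf8(x)
Encoding(x_utf8)            # "UTF-8"

x_nat <- enc2native(x_utf8)
Encoding(x_nat)             # depends on the locale the session runs in

# Pure ASCII strings never carry a mark, whichever function produced them.
Encoding(enc2utf8("abc"))   # "unknown"
```

An "unknown" mark therefore does not by itself show that a conversion failed; it only means the string is ASCII or is assumed to be in the native encoding.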


Daniel Possenriede

2017-Aug-01 10:33 UTC


Sorry, I should have included my console output, obviously. So here we go:

Wrong Unicode escapes when using print() in v3.5.0 devel:

# R Under development (unstable) (2017-07-30 r73000) -- "Unsuffered Consequences"
# Platform: x86_64-w64-mingw32/x64 (64-bit)
> x <- c("€", "–", "‰")
> Encoding(x)
[1] "latin1" "latin1" "latin1"
> print(x)
[1] "\u0080" "\u0096" "\u0089"

Same output with enc2utf8()
> enc2utf8(x)
[1] "\u0080" "\u0096" "\u0089"

With iconv() the result is as expected.
> iconv(x, to = "UTF-8")
[1] "€" "–" "‰"

The second problem IMO is that encoding markers get lost with the enc2*
functions
> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown" "unknown"

This is not the case with iconv()
> x_iutf8 <- iconv(x, to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1" "latin1"


Prof Brian Ripley

2017-Aug-01 11:36 UTC


You seem confused about Latin-1: those characters are not in Latin-1.
(Microsoft code pages are proprietary encodings; some of them, such as
CP1252, are extensions of Latin-1.)

You have not given the 'at a minimum information' asked for in the 
posting guide so we have no way to reproduce this, and without showing 
us the output on your system, we have no idea what you saw.

[As a convenience to Windows users, R does in some cases assume that 
they are using Latin-1 encodings. If they use extensions to Latin-1 then 
there are no guarantees that code written for strict Latin-1 will work.]

On 01/08/2017 10:19, Daniel Possenriede wrote:
> Upon further inspection, I think there are at least two problems.
> First the issue with printing latin1/cp1252 characters in the "80" to "9F"
> code range.
>
> x <- c("€", "–", "‰")
> Encoding(x)
> print(x)
>
> I assume that these are Unicode escapes!? (Given that Encoding(x) shows
> "latin1" I'd rather expect latin1/cp1252 escapes here, but these would be
> e.g. "\x80", right? My locale is LC_COLLATE=German_Germany.1252 btw.)
> Now I don't know why print tries to convert to Unicode, but if these indeed
> are Unicode escapes, then there is something wrong with the conversion from
> cp1252 to Unicode.
> In general, most cp1252 char codes translate to Unicode like CP1252: "00"
> -> Unicode "0000", "01" -> "0001", "02" -> "0002", etc. see
> http://www.cp1252.com/.
> The exception is the cp1252 "80" to "9F" code range. E.g. the Euro sign is
> "80" in cp1252 but "20AC" in Unicode, endash "96" in cp1252, "2013" in
> Unicode.
> The same error seems to happen with
> 
> enc2utf8(x)
> 
> Now with iconv() the result is as expected.
> 
> iconv(x, to = "UTF-8")
> 
> 
> The second problem IMO is that encoding markers get lost with the enc2*
> functions
As you are changing encodings, you do not want to preserve encoding!
> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
In an actual Latin-1 locale on Linux

 > x_utf8 <- c("??", "\u20ac", "\u2013")
 > Encoding(x_utf8)
[1] "latin1" "UTF-8"  "UTF-8"
 > enc2native(x_utf8)
[1] "??"     "<U+20AC>" "<U+2013>"
 > Encoding(.Last.value)
[1] "latin1"  "unknown" "unknown"

as expected.
> Again, this is not the case with iconv()
> 
> x_iutf8 <- iconv(x, to = "UTF-8")
> Encoding(x_iutf8)
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
iconv is converting from/to the current locale's encoding, presumably 
CP1252, not from the marked encoding (as the help page states explicitly.)
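
Because iconv() takes its source encoding from the from argument (the current locale's encoding when from = "") rather than from the mark, a caller who wants the mark honoured has to pass it along. The following is a minimal sketch of that idea; to_utf8_by_mark() is a hypothetical helper, not part of base R, and it only handles elements marked "latin1".

```r
to_utf8_by_mark <- function(x) {
  is_latin1 <- Encoding(x) == "latin1"
  # Convert only the elements marked "latin1", naming the source encoding
  # explicitly instead of relying on the locale default (from = "").
  x[is_latin1] <- iconv(x[is_latin1], from = "latin1", to = "UTF-8")
  x
}

s <- "fa\xE7ile"
Encoding(s) <- "latin1"

y <- to_utf8_by_mark(c("abc", s))
Encoding(y)   # "unknown" "UTF-8"
```

Elements without a "latin1" mark are returned untouched, so ASCII input keeps its "unknown" mark.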

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford

Daniel Possenriede

2017-Aug-01 12:49 UTC


Thank you! My apologies again for not including the console output in my
message before. I sent another e-mail with the output in the meantime, so
it should be a bit clearer now what I am seeing. In case I missed
something, please let me know.

Yes, I am using latin1 and cp1252 interchangeably here, mostly because
Encoding() is reporting the encoding as "latin1". You presumed correctly
that my current/default locale's encoding is CP1252. (I also mentioned
before that my locale is LC_COLLATE=German_Germany.1252.)


> As you are changing encodings, you do not want to preserve encoding!
>
I am not interested in preserving encodings. What I am worried about is
that the encoding is not marked anymore, i.e. that Encoding() returns
"unknown".
In cp1252 encoding on Windows (note that I am using the cp1252 escape
"\x80" and not the Unicode "\u20AC")
> x_utf8 <- enc2utf8(c("€", "\x80"))
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown"

See also Kirill's message to this list: "ASCII strings are marked as ASCII
internally, but this information doesn't seem to be available, e.g.,
Encoding() returns "unknown" for such strings"
http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-tp4733523.html
>
> Again, this is not the case with iconv()
>>
>> x_iutf8 <- iconv(x, to = "UTF-8")
>> Encoding(x_iutf8)
>> x_inat <- iconv(x_iutf8, from = "UTF-8")
>> Encoding(x_inat)
>>
>
> iconv is converting from/to the current locale's encoding, presumably
> CP1252, not from the marked encoding (as the help page states explicitly.)
>
I am aware that iconv is not using the marked encoding, but that you either
have to set it explicitly or it uses the current locale's default encoding.
As I said I am worried about the fact that the encoding markers get lost
with the enc2* functions or rather they are not set correctly. I am just
using the iconv example to show that iconv is able to set the encoding
markers correctly. So it seems generally possible.
> x_iutf8 <- iconv(c("€", "\x80"), to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8"
> x_iutf8
[1] "€" "€"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1"
> x_inat
[1] "\u0080" "\u0080"

