thr3ads.net - R help - [R] Plotting the ASCII character set. [Jul 2021]


Rolf Turner

2021-Jul-04 01:59 UTC

[R] Plotting the ASCII character set.

On Sat, 3 Jul 2021 09:40:28 +0200
Ivan Krylov <krylov.r00t at gmail.com> wrote:
> Hello Rolf Turner,
> 
> On Sat, 3 Jul 2021 14:02:59 +1200
> Rolf Turner <r.turner at auckland.ac.nz> wrote:
> 
> > Can anyone suggest how I might get my plot_ascii() function working
> > again?  Basically, it seems to me, the question is:  how do I
> > persuade R to read in "\260" as "\ub0" rather than "\xb0"?
> 
> Part of the problem is that the "\xb0" byte is not in ASCII, which
> covers only the lower half of possible 8-bit bytes. I guess that the
> strings containing bytes with highest bit set used to be interpreted
> as Latin-1 on your machine, but now get interpreted as UTF-8, which
> changes their meaning (in UTF-8, the highest bit being set indicates
> that there will be more bytes to follow, making the string invalid if
> there is none).
> 
> The good news is, since it's Latin-1, which is natively supported by
> R, there are even multiple options:
> 
> 1. Mark the string as Latin-1 by setting Encoding(a) <- 'latin1' and
> let R do the re-encoding if and when Pango asks it for a UTF-8-encoded
> string.
> 
> 2. Decode Latin-1 into the locale encoding by using iconv(a, 'latin1',
> '') (or set the third parameter to 'UTF-8', which would give almost
> the same result on a machine with a UTF-8 locale). The result is,
> again, a string where Encoding(a) matches the truth. Explicitly
> setting UTF-8 may be preferable on Windows machines running pre-UCRT
> builds of R where the locale encoding may not contain all Latin-1
> characters, but that's not a problem for you, as far as I know.
> 
> For any encoding other than Latin-1 or UTF-8, option (2) is still
> valid.
> 
> I have verified that your example works on my GNU/Linux system with a
> UTF-8 locale if I use either option.
Thanks Ivan. That solves most of the problem, but there are still
glitches. I get a plot OK, but a substantial number of the characters
are displayed as a wee rectangle containing a 2 x 2 array of digits
such as
>   0 0
>   8 0
Also note that there is a bit of difference between the results of using
Encoding() and the results of using iconv(). E.g. if I do

a <- "\x80"
b <- iconv(a,"latin1","UTF-8")
Encoding(a) <- "latin1"

then when I type "a" I get the Euro symbol "€", but when I type "b"
I get the string "\u0080".
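
One way to see where the two results part company is to look at the stored bytes rather than the printed form. The following is only a sketch, assuming a system whose iconv() maps Latin-1 byte 0x80 to the same-valued code point, as reported above:

```r
# Same starting byte, two different treatments.
a <- "\x80"
b <- iconv(a, "latin1", "UTF-8")   # actually converts the bytes
Encoding(a) <- "latin1"            # only changes the declared encoding

charToRaw(a)        # still the single byte 80
charToRaw(b)        # the UTF-8 bytes produced by the conversion
Encoding(c(a, b))   # declared encodings of the two strings
utf8ToInt(b)        # code point held in b (128, i.e. U+0080, on the systems discussed here)
```

So `a` keeps its original byte and leaves the interpretation to whatever later decodes it, while `b` has already been fixed as the control character U+0080.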

But that doesn't really matter.  More problematic is the fact that if I
do either

    plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
    text(0.5,0.5,labels=a,cex=6)
or

    plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
    text(0.5,0.5,labels=b,cex=6)

then I get a wee rectangle with 0 0 8 0 arranged in a 2 x 2 array inside.
(Setting cex=6 makes it easier for my ageing eyes to see what the
digits are.)

Is there any way that I can get the Euro symbol to display correctly in
such a graphic?

Thanks.

cheers,

Rolf

-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

David Winsemius

2021-Jul-04 05:37 UTC


[R] Plotting the ASCII character set.

Sent from my iPhone
> On Jul 3, 2021, at 7:00 PM, Rolf Turner <r.turner at auckland.ac.nz> wrote:
> 
>> On Sat, 3 Jul 2021 09:40:28 +0200
>> Ivan Krylov <krylov.r00t at gmail.com> wrote:
>> 
>> Hello Rolf Turner,
>> 
>> On Sat, 3 Jul 2021 14:02:59 +1200
>> Rolf Turner <r.turner at auckland.ac.nz> wrote:
>> 
>>> Can anyone suggest how I might get my plot_ascii() function working
>>> again?  Basically, it seems to me, the question is:  how do I
>>> persuade R to read in "\260" as "\ub0" rather than "\xb0"?
>> 
>> Part of the problem is that the "\xb0" byte is not in ASCII, which
>> covers only the lower half of possible 8-bit bytes. I guess that the
>> strings containing bytes with highest bit set used to be interpreted
>> as Latin-1 on your machine, but now get interpreted as UTF-8, which
>> changes their meaning (in UTF-8, the highest bit being set indicates
>> that there will be more bytes to follow, making the string invalid if
>> there is none).
>> 
>> The good news is, since it's Latin-1, which is natively supported by
>> R, there are even multiple options:
>> 
>> 1. Mark the string as Latin-1 by setting Encoding(a) <- 'latin1' and
>> let R do the re-encoding if and when Pango asks it for a UTF-8-encoded
>> string.
>> 
>> 2. Decode Latin-1 into the locale encoding by using iconv(a, 'latin1',
>> '') (or set the third parameter to 'UTF-8', which would give almost
>> the same result on a machine with a UTF-8 locale). The result is,
>> again, a string where Encoding(a) matches the truth. Explicitly
>> setting UTF-8 may be preferable on Windows machines running pre-UCRT
>> builds of R where the locale encoding may not contain all Latin-1
>> characters, but that's not a problem for you, as far as I know.
>> 
>> For any encoding other than Latin-1 or UTF-8, option (2) is still
>> valid.
>> 
>> I have verified that your example works on my GNU/Linux system with a
>> UTF-8 locale if I use either option.
> 
> Thanks Ivan. That solves most of the problem, but there are still
> glitches. I get a plot OK, but a substantial number of the characters
> are displayed as a wee rectangle containing a 2 x 2 array of digits
> such as
> 
>>  0 0
>>  8 0
> 
> Also note that there is a bit of difference between the results of using
> Encoding() and the results of using iconv(). E.g. if I do
> 
> a <- "\x80"
> b <- iconv(a,"latin1","UTF-8")
> Encoding(a) <- "latin1"
> 
> then when I type "a" I get the Euro symbol "€", but when I type "b"
> I get the string "\u0080".
> 
> But that doesn't really matter.  More problematic is the fact that if I
> do either
> 
>    plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
>    text(0.5,0.5,labels=a,cex=6)
> or
> 
>    plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
>    text(0.5,0.5,labels=b,cex=6)
> 
> then I get a wee rectangle with 0 0 8 0 arranged in a 2 x 2 array inside.
> (Setting cex=6 makes it easier for my ageing eyes to see what the
> digits are.)
> 
> Is there any way that I can get the Euro symbol to display correctly in
> such a graphic?

Pick a font that is supported on your OS that has the desired glyph.
Also look at the examples in:

?points

-- 
David

> Thanks.
> 
> cheers,
> 
> Rolf
> 
> -- 
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
> 

Ivan Krylov

2021-Jul-04 11:01 UTC


[R] Plotting the ASCII character set.

On Sun, 4 Jul 2021 13:59:49 +1200
Rolf Turner <r.turner at auckland.ac.nz> wrote:
> a substantial number of the characters are displayed as a wee
> rectangle containing a 2 x 2 array of digits such as
> 
> >   0 0
> >   8 0  
Interesting. I didn't pay attention to it at first, but now I see that
a range of code points, U+0080 to U+009F, corresponds to control
characters (also, U+00A0 is a no-break space), not anything
printable. Also, Latin-1 doesn't define any meaning for bytes
0x80..0x9f, but here they are decoded to same-valued Unicode code
points. And the actual code point for € is U+20AC, not even close to
what we're working with.
> Also note that there is a bit of difference between the results of
> using Encoding() and the results of using iconv()
You are right. I didn't know that, but my reading of the function
translateToNative in src/main/sysutils.c suggests that R decodes
strings marked as 'latin1' as Windows-1252 (if it's available for the
system iconv()) and uses the actual Latin-1 as a fallback.

?Encoding does warn that 'latin1' is ambiguous and system-dependent
with regards to bytes 0x80..0x9f, so text() seems to be right to use
Latin-1 and not Windows-1252 when trying to plot byte 0x80 encoded as
CE_LATIN1 as U+0080. Although there's a /* FIXME: allow CP1252? */
comment in src/main/sysutils.c, function reEnc, which is used by text().
> Is there any way that I can get the Euro symbol to display correctly
> in such a graphic?
I think that iconv(a, 'CP1252', '', '\ufffd') should work for you. At
least it seems to work for the € sign. It does leave the following
bytes undefined, represented as U+FFFD REPLACEMENT CHARACTER:

as.raw(which(is.na(
 iconv(sapply(as.raw(1:255), rawToChar), 'CP1252', '')
)))
# [1] 81 8d 8f 90 9d

Not sure what can be done about those. With Latin-1, they would
correspond to unprintable control characters anyway.
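
To see what the CP1252 route does to the whole upper half of the byte range, something along these lines may help. It is only a sketch, assuming the system iconv() knows the name 'CP1252'; the variable names are made up for illustration, and which bytes (if any) come back unmapped depends on the platform:

```r
# Decode bytes 0x80..0xFF as Windows-1252 and record the code points.
bytes   <- as.raw(0x80:0xff)
chars   <- vapply(bytes, rawToChar, character(1))
decoded <- iconv(chars, from = "CP1252", to = "UTF-8")  # NA where unmapped

# One code point per byte; NA for bytes the converter could not map.
codepoints <- vapply(
  decoded,
  function(s) if (is.na(s)) NA_integer_ else utf8ToInt(s),
  integer(1),
  USE.NAMES = FALSE
)
names(codepoints) <- sprintf("0x%02X", as.integer(bytes))

sprintf("U+%04X", codepoints[["0x80"]])  # the Euro sign's code point
names(codepoints)[is.na(codepoints)]     # bytes left undefined, if any
```

The strings in `decoded` are marked as UTF-8, so they can be passed to text() as labels directly.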

-- 
Best regards,
Ivan

Duncan Murdoch

2021-Jul-04 11:15 UTC


[R] Plotting the ASCII character set.

On 03/07/2021 9:59 p.m., Rolf Turner wrote:

  ... deletia ...
> Also note that there is a bit of difference between the results of using
> Encoding() and the results of using iconv(). E.g. if I do
> 
> a <- "\x80"
> b <- iconv(a,"latin1","UTF-8")
> Encoding(a) <- "latin1"
> 
> then when I type "a" I get the Euro symbol "€", but when I type "b"
> I get the string "\u0080"
> 
> But that doesn't really matter.  More problematic is the fact that if I
> do either
> 
>     plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
>     text(0.5,0.5,labels=a,cex=6)
> or
> 
>     plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
>     text(0.5,0.5,labels=b,cex=6)
> 
> then I get a wee rectangle with 0 0 8 0 arranged in a 2 x 2 array inside.
> (Setting cex=6 makes it easier for my ageing eyes to see what the
> digits are.)
> 
> Is there any way that I can get the Euro symbol to display correctly in
> such a graphic?

The problem with the Euro symbol is that it was invented after the first 
8 bit encodings, so it was stuck in later.  If you want it, this seems 
helpful:

From https://web.stanford.edu/~laurik/fsmbook/faq/utf8.html:

"The proper Unicode code point for € [this may or may not display 
correctly as the Euro sign in your browser] is decimal 8364 (0x20AC). In 
Windows CP1252 € has the code 128 (0x80); in ISO-8859-15 (also known as 
Latin-9) the € code is 164 (0xA4); in Macintosh Roman it is 219 (0xDB)."

So a fairly portable way to display it would be "\u20ac".  That works in
a plot on my Mac; on other graphics devices it depends on whether the 
glyph is defined, but I'd expect it is fairly widespread.

The "\x80" character varies across 8 bit encodings.  In many of them 
it's a non-printable character, but not on Windows.
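
As a rough sketch of the "\u20ac" approach: the helper name plot_glyph() below is hypothetical, introduced only for illustration, and whether the glyph actually renders still depends on the graphics device and font.

```r
# The Unicode escape does not depend on any 8-bit code page.
euro <- "\u20ac"
utf8ToInt(euro)             # 8364, i.e. 0x20AC
charToRaw(enc2utf8(euro))   # UTF-8 bytes: e2 82 ac

# Hypothetical helper: draw one character in the middle of an empty plot.
plot_glyph <- function(ch, cex = 6) {
  plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0, 1),
       ann = FALSE, axes = FALSE)
  text(0.5, 0.5, labels = ch, cex = cex)
  invisible(ch)
}

# plot_glyph(euro)   # uncomment to draw on the current device
```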

Duncan Murdoch
