On Sat, 3 Jul 2021 09:40:28 +0200
Ivan Krylov <krylov.r00t at gmail.com> wrote:

> Hello Rolf Turner,
>
> On Sat, 3 Jul 2021 14:02:59 +1200
> Rolf Turner <r.turner at auckland.ac.nz> wrote:
>
> > Can anyone suggest how I might get my plot_ascii() function working
> > again? Basically, it seems to me, the question is: how do I
> > persuade R to read in "\260" as "\ub0" rather than "\xb0"?
>
> Part of the problem is that the "\xb0" byte is not in ASCII, which
> covers only the lower half of possible 8-bit bytes. I guess that the
> strings containing bytes with highest bit set used to be interpreted
> as Latin-1 on your machine, but now get interpreted as UTF-8, which
> changes their meaning (in UTF-8, the highest bit being set indicates
> that there will be more bytes to follow, making the string invalid if
> there is none).
>
> The good news is, since it's Latin-1, which is natively supported by
> R, there are even multiple options:
>
> 1. Mark the string as Latin-1 by setting Encoding(a) <- 'latin1' and
> let R do the re-encoding if and when Pango asks it for a UTF-8-encoded
> string.
>
> 2. Decode Latin-1 into the locale encoding by using iconv(a, 'latin1',
> '') (or set the third parameter to 'UTF-8', which would give almost
> the same result on a machine with a UTF-8 locale). The result is,
> again, a string where Encoding(a) matches the truth. Explicitly
> setting UTF-8 may be preferable on Windows machines running pre-UCRT
> builds of R where the locale encoding may not contain all Latin-1
> characters, but that's not a problem for you, as far as I know.
>
> For any encoding other than Latin-1 or UTF-8, option (2) is still
> valid.
>
> I have verified that your example works on my GNU/Linux system with a
> UTF-8 locale if I use either option.

Thanks Ivan. That solves most of the problem, but there are still
glitches. I get a plot OK, but a substantial number of the characters
are displayed as a wee rectangle containing a 2 x 2 array of digits
such as

> 0 0
> 8 0

Also note that there is a bit of difference between the results of using
Encoding() and the results of using iconv(). E.g. if I do

    a <- "\x80"
    b <- iconv(a,"latin1","UTF-8")
    Encoding(a) <- "latin1"

then when I type "a" I get the Euro symbol "€", but when I type "b"
I get the string "\u0080".

But that doesn't really matter. More problematic is the fact that if I
do either

    plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
    text(0.5,0.5,labels=a,cex=6)

or

    plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
    text(0.5,0.5,labels=b,cex=6)

then I get wee rectangle with 0 0 8 0 arranged in a 2 x 2 array inside.
(Setting cex=6 makes it easier for my ageing eyes to see what the
digits are.)

Is there any way that I can get the Euro symbol to display correctly in
such a graphic?

Thanks.

cheers,

Rolf

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
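The difference between "a" and "b" described in this message can be checked by looking at bytes and code points instead of at what the console prints. This is a minimal sketch, assuming a system iconv() that maps the Latin-1 byte 0x80 to U+0080 (which is what the message reports); the variable names simply follow the example.

```r
# Build the one-byte string from its raw value, so the result does not
# depend on how the parser treats the "\x80" escape.
a <- rawToChar(as.raw(0x80))

# Convert from Latin-1 to UTF-8: this produces a new string.
b <- iconv(a, "latin1", "UTF-8")

# Only declare the encoding: the stored byte is left unchanged.
Encoding(a) <- "latin1"

charToRaw(a)          # the single byte that is actually stored in a
charToRaw(b)          # the UTF-8 bytes of the code point iconv() chose
utf8ToInt(b)          # that code point as an integer
utf8ToInt("\u20ac")   # the Euro sign's code point, for comparison
```

If utf8ToInt(b) does not equal the Euro sign's code point, the plotted string does not contain a Euro sign at all, whatever the console shows for "a".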
On Jul 3, 2021, at 7:00 PM, Rolf Turner <r.turner at auckland.ac.nz> wrote:

> Is there any way that I can get the Euro symbol to display correctly in
> such a graphic?

Pick a font that is supported on your OS that has the desired glyph.
Also look at the examples in:

    ?points

David
On Sun, 4 Jul 2021 13:59:49 +1200
Rolf Turner <r.turner at auckland.ac.nz> wrote:

> a substantial number of the characters are displayed as a wee
> rectangle containing a 2 x 2 array of digits such as
>
> > 0 0
> > 8 0

Interesting. I didn't pay attention to it at first, but now I see that
a range of code points, U+0080 to U+009F, corresponds to control
characters (also, U+00A0 is non-breakable space), not anything
printable. Also, Latin-1 doesn't define any meaning for bytes
0x80..0x9f, but here they are decoded to same-valued Unicode code
points. And the actual code point for € is U+20AC, not even close to
what we're working with.

> Also note that there is a bit of difference between the results of
> using Encoding() and the results of using iconv()

You are right. I didn't know that, but my reading of the function
translateToNative in src/main/sysutils.c suggests that R decodes
strings marked as 'latin1' as Windows-1252 (if it's available for the
system iconv()) and uses the actual Latin-1 as a fallback.

?Encoding does warn that 'latin1' is ambiguous and system-dependent
with regards to bytes 0x80..0x9f, so text() seems to be right to use
Latin-1 and not Windows-1252 when trying to plot byte 0x80 encoded as
CE_LATIN1 as U+0080. Although there's a /* FIXME: allow CP1252? */
comment in src/main/sysutils.c, function reEnc, which is used by
text().

> Is there any way that I can get the Euro symbol to display correctly
> in such a graphic?

I think that iconv(a, 'CP1252', '', '\ufffd') should work for you. At
least it seems to work for the € sign. It does leave the following
bytes undefined, represented as U+FFFD REPLACEMENT CHARACTER:

    as.raw(which(is.na(
     iconv(sapply(as.raw(1:255), rawToChar), 'CP1252', '')
    )))
    # [1] 81 8d 8f 90 9d

Not sure what can be done about those. With Latin-1, they would
correspond to unprintable control characters anyway.

--
Best regards,
Ivan
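The CP1252 suggestion in this message can be wrapped in a small helper. What follows is only a sketch, assuming the system iconv() knows the name "CP1252"; decode_cp1252 is a hypothetical name, and the target is set to "UTF-8" explicitly instead of the locale encoding ('') used in the message.

```r
# Hypothetical helper: treat the input bytes as Windows-1252 and convert
# them to UTF-8. Bytes that CP1252 leaves undefined are replaced by the
# substitution string (U+FFFD) instead of turning the result into NA.
decode_cp1252 <- function(x) {
  iconv(x, from = "CP1252", to = "UTF-8", sub = "\ufffd")
}

# Byte 0x80 is the Euro sign in CP1252.
euro <- decode_cp1252(rawToChar(as.raw(0x80)))

utf8ToInt(euro)             # code point of the decoded character
identical(euro, "\u20ac")   # TRUE when 0x80 was decoded to the Euro sign
```

The decoded string can then be passed to text() as the label, in place of the string that was only marked or converted as Latin-1.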
On 03/07/2021 9:59 p.m., Rolf Turner wrote:

... deletia ...

> Also note that there is a bit of difference between the results of using
> Encoding() and the results of using iconv(). E.g. if I do
>
>     a <- "\x80"
>     b <- iconv(a,"latin1","UTF-8")
>     Encoding(a) <- "latin1"
>
> then when I type "a" I get the Euro symbol "€", but when I type "b"
> I get the string "\u0080"
>
> But that doesn't really matter. More problematic is the fact that if I
> do either
>
>     plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
>     text(0.5,0.5,labels=a,cex=6)
>
> or
>
>     plot(0,0,type="n",xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)
>     text(0.5,0.5,labels=b,cex=6)
>
> then I get wee rectangle with 0 0 8 0 arranged in a 2 x 2 array inside.
> (Setting cex=6 makes it easier for my ageing eyes to see what the
> digits are.)
>
> Is there any way that I can get the Euro symbol to display correctly in
> such a graphic?

The problem with the Euro symbol is that it was invented after the
first 8 bit encodings, so it was stuck in later. If you want it, this
seems helpful:

From https://web.stanford.edu/~laurik/fsmbook/faq/utf8.html:

"The proper Unicode code point for € [this may or may not display
correctly as the Euro sign in your browser] is decimal 8364 (0x20AC).
In Windows CP1252 € has the code 128 (0x80); in ISO-8859-15 (also known
as Latin-9) the € code is 164 (0xA4); in Macintosh Roman it is 219
(0xDB)."

So a fairly portable way to display it would be "\u20ac". That works
in a plot on my Mac; on other graphics devices it depends on whether
the glyph is defined, but I'd expect it is fairly widespread.

The "\x80" character varies across 8 bit encodings. In many of them
it's a non-printable character, but not on Windows.

Duncan Murdoch
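The "\u20ac" suggestion can be tried with the same plot() and text() calls used earlier in the thread. This is a sketch only: show_glyph is a hypothetical helper name, and whether a glyph actually appears still depends on the graphics device and its font, as noted above.

```r
# Hypothetical helper: draw a single label in the middle of an empty
# plot on whatever graphics device is current.
show_glyph <- function(label, cex = 6) {
  plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0, 1),
       ann = FALSE, axes = FALSE)
  text(0.5, 0.5, labels = label, cex = cex)
  invisible(label)
}

# A Unicode escape names the code point directly, so the string does not
# depend on any 8-bit encoding or on the locale.
euro <- "\u20ac"

# Only draw when running interactively, so that sourcing this file in a
# batch job does not open a graphics device.
if (interactive()) {
  show_glyph(euro)
}
```

Because the string is written with a Unicode escape, R marks it as UTF-8 and no guess about Latin-1 versus Windows-1252 is involved when it reaches text().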