Ivan Krylov
2023-Sep-23 20:43 UTC
[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device
On Wed, 20 Sep 2023 12:39:50 +0200
Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> The problem is that some pdf *viewers*,
> notably `evince` on Fedora Linux, for several years now,
> do *not* show *some* of the UTF-8 glyphs because they do not use
> the correct fonts

One more problem that makes it nontrivial to use Unicode with pdf() is
the graphics device not knowing some of the font metrics:

    x <- '\u410\u411\u412'
    pdf()
    plot(1:10, main = x)
    # Warning messages:
    # 1: In title(...) : font width unknown for character 0xb0
    # 2: In title(...) : font width unknown for character 0xe4
    # 3: In title(...) : font width unknown for character 0xfc
    # 4: In title(...) : font width unknown for character 0x7f
    dev.off()

In the resulting PDF file, the three letters are visible, at least in
Evince 3.38.2, but they are all positioned in the same space.

I understand that this is strictly speaking not pdf()'s fault
(grDevices contains the font metrics for all standard Adobe fonts and a
few more), but I'm not sure what to do as a user. Should I call
pdfFonts(...), declaring a font with all symbols I need? Where does one
even get Type-1 Cyrillic Helvetica (or any other font) with separate
font metrics files for use with pdf()?

Actually, the wrong number of sometimes random character codes reminds
me of stack garbage. In src/library/grDevices/src/devPS.c, function
static double PostScriptStringWidth, there's this bit of code:

    if(!strIsASCII((char *) str) &&
       /*
        * Every fifth font is a symbol font:
        * see postscriptFonts()
        */
       (face % 5) != 0) {
        R_CheckStack2(strlen((char *)str)+1);
        char buff[strlen((char *)str)+1];
        /* Output string cannot be longer */
        mbcsToSbcs((char *)str, buff, encoding, enc);
        str1 = (unsigned char *)buff;
    }

Later the characters in str1 are iterated over in order to calculate
the total width of the string. I didn't notice this myself until I saw
in the debugger that after a few iterations of the loop, the contents
of str1 are completely different from the result of mbcsToSbcs((char
*)str, buff, encoding, enc), and went to investigate. Only after the
debugger told me that there's no variable called "buff" I realised that
the VLA pointed to by str1 no longer exists.

--- src/library/grDevices/src/devPS.c (revision 85214)
+++ src/library/grDevices/src/devPS.c (working copy)
@@ -721,6 +721,8 @@
     unsigned char p1, p2;
 
     int status;
+    /* May be about to allocate */
+    void *alloc = vmaxget();
     if(!metrics && (face % 5) != 0) {
         /* This is the CID font case, and should only happen for
            non-symbol fonts.  So we assume monospaced with multipliers.
@@ -755,9 +757,8 @@
             * Every fifth font is a symbol font:
             * see postscriptFonts()
             */
-           (face % 5) != 0) {
-            R_CheckStack2(strlen((char *)str)+1);
-            char buff[strlen((char *)str)+1];
+           (face % 5) != 0 && metrics) {
+            char *buff = R_alloc(strlen((char *)str)+1, 1);
             /* Output string cannot be longer */
             mbcsToSbcs((char *)str, buff, encoding, enc);
             str1 = (unsigned char *)buff;
@@ -792,6 +793,7 @@
             }
         }
     }
+    vmaxset(alloc);
     return 0.001 * sum;
 }
 

After this patch, I'm consistently getting the right character codes in
the warnings, but I still don't know how to set up the font metrics.

--
Best regards,
Ivan
Paul Murrell
2023-Sep-25 23:39 UTC
[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device
Hi

Yes, you can set up your own font, and TeX installations are a good
source of Type 1 fonts.  Here is an example (paths obviously specific to
my [Ubuntu 20.04] OS and TeX installation) ...

    cmlgc <- Type1Font("cmlgc",
                       rep("/usr/share/texlive/texmf-dist/fonts/afm/public/cm-lgc/fcmr6z.afm", 4),
                       encoding="Cyrillic")
    pdfFonts(cmlgc=cmlgc)
    x <- '\u410\u411\u412'
    pdf("cmlgc.pdf", family="cmlgc", encoding="Cyrillic")
    plot(1:10, main = x)
    dev.off()
    embedFonts("cmlgc.pdf", out="cmlgc-embed.pdf",
               fontpaths="/usr/share/texlive/texmf-dist/fonts/type1/public/cm-lgc/")

Final result attached.

Thanks for the patch for the unrelated memory problem; I will take a
look at that.

Paul

--
Dr Paul Murrell
Te Kura Tatauranga | Department of Statistics
Waipapa Taumata Rau | The University of Auckland
Private Bag 92019, Auckland 1142, New Zealand
64 9 3737599 x85392
paul at stat.auckland.ac.nz
www.stat.auckland.ac.nz/~paul/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cmlgc-embed.pdf
Type: application/pdf
Size: 10746 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230926/c4365ddc/attachment.pdf>