Martin Maechler
2023-Sep-20 10:39 UTC
[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device
>>>>> Trevor Davis
>>>>>     on Thu, 31 Aug 2023 13:49:03 -0700 writes:

    > Hi,

    > It would be nice if `grDevices::dev.capabilities()` could also be used to
    > query whether the current graphics device supports Unicode. In such a case
    > I'd expect it to return `FALSE` if `pdf()` is the current graphics device
    > and something else for the Cairo or Quartz devices.

    > Thanks,
    > Trevor

I agree in principle that this would be a useful new feature for
dev.capabilities().

However, pdf() *does* support Unicode.

The problem is that some pdf *viewers*,
notably `evince` on Fedora Linux, for several years now,
do *not* show *some* of the UTF-8 glyphs because they do not use
the correct fonts {which *are* on the machine; good old `xpdf`
does in that case show the glyphs}.

Martin
Trevor Davis
2023-Sep-20 16:12 UTC
[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device
> However, pdf() *does* support Unicode.

When I run a simple Unicode example like:

```
f <- tempfile(fileext = ".pdf")
pdf(f)
# U+2665 ♥ is found in most (all?) "sans" fonts like Arial, Dejavu Sans, Arimo, etc.
# However, it is not in the Latin-1 encoding
grid::grid.text("\u2665")
dev.off()
```

I observe the following output:

```
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <e2>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <99>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <a5>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <e2>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <99>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <a5>
```

When I open up the pdf file I just see three dots and not a heart as I
expected, even if I open it up with `xpdf`. In contrast, the pdf generated by
`cairo_pdf()` has a heart and generates no warnings.

My main use case for such a new feature is avoiding such WARNINGs on certain
CRAN check machines when I have a Unicode graphics example that is worth
including in a package's examples (protected by an appropriate if statement).
However, a new feature like `dev.capabilities()$unicode` could certainly
return something more sophisticated than a crude `TRUE` or `FALSE` to
distinguish between the levels of Unicode support provided by different
graphics devices.
Thanks,
Trevor
Ivan Krylov
2023-Sep-23 20:43 UTC
[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device
On Wed, 20 Sep 2023 12:39:50 +0200
Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> The problem is that some pdf *viewers*,
> notably `evince` on Fedora Linux, for several years now,
> do *not* show *some* of the UTF-8 glyphs because they do not use
> the correct fonts

One more problem that makes it nontrivial to use Unicode with pdf() is
the graphics device not knowing some of the font metrics:

    x <- '\u410\u411\u412'
    pdf()
    plot(1:10, main = x)
    # Warning messages:
    # 1: In title(...) : font width unknown for character 0xb0
    # 2: In title(...) : font width unknown for character 0xe4
    # 3: In title(...) : font width unknown for character 0xfc
    # 4: In title(...) : font width unknown for character 0x7f
    dev.off()

In the resulting PDF file, the three letters are visible, at least in
Evince 3.38.2, but they are all positioned in the same space.

I understand that this is strictly speaking not pdf()'s fault
(grDevices contains the font metrics for all standard Adobe fonts and a
few more), but I'm not sure what to do as a user. Should I call
pdfFonts(...), declaring a font with all symbols I need? Where does one
even get Type-1 Cyrillic Helvetica (or any other font) with separate
font metrics files for use with pdf()?

Actually, the wrong number of sometimes random character codes reminds
me of stack garbage. In src/library/grDevices/src/devPS.c, function
static double PostScriptStringWidth, there's this bit of code:

    if(!strIsASCII((char *) str) &&
       /*
        * Every fifth font is a symbol font:
        * see postscriptFonts()
        */
       (face % 5) != 0) {
        R_CheckStack2(strlen((char *)str)+1);
        char buff[strlen((char *)str)+1];
        /* Output string cannot be longer */
        mbcsToSbcs((char *)str, buff, encoding, enc);
        str1 = (unsigned char *)buff;
    }

Later the characters in str1 are iterated over in order to calculate
the total width of the string.
I didn't notice this myself until I saw in the debugger
that after a few iterations of the loop, the contents of str1 are
completely different from the result of mbcsToSbcs((char *)str, buff,
encoding, enc), and went to investigate. Only after the debugger told
me that there's no variable called "buff" I realised that the VLA
pointed to by str1 no longer exists.

--- src/library/grDevices/src/devPS.c	(revision 85214)
+++ src/library/grDevices/src/devPS.c	(working copy)
@@ -721,6 +721,8 @@
     unsigned char p1, p2;
     int status;
+    /* May be about to allocate */
+    void *alloc = vmaxget();
     if(!metrics && (face % 5) != 0) {
 	/* This is the CID font case, and should only happen for
 	   non-symbol fonts.  So we assume monospaced with multipliers.
@@ -755,9 +757,8 @@
 	 * Every fifth font is a symbol font:
 	 * see postscriptFonts()
 	 */
-	(face % 5) != 0) {
-	R_CheckStack2(strlen((char *)str)+1);
-	char buff[strlen((char *)str)+1];
+	(face % 5) != 0 && metrics) {
+	char *buff = R_alloc(strlen((char *)str)+1, 1);
 	/* Output string cannot be longer */
 	mbcsToSbcs((char *)str, buff, encoding, enc);
 	str1 = (unsigned char *)buff;
@@ -792,6 +793,7 @@
 	    }
 	}
     }
+    vmaxset(alloc);
     return 0.001 * sum;
 }

After this patch, I'm consistently getting the right character codes in
the warnings, but I still don't know how to set up the font metrics.

-- 
Best regards,
Ivan
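
For readers following the patch above, here is a minimal standalone sketch of the lifetime problem it addresses. This is not the devPS.c code: `to_single_byte()`, `is_ascii()` and `string_width()` are hypothetical names, the "metric" is a toy one, and plain `malloc()`/`free()` stand in for R's `R_alloc()` and `vmaxget()`/`vmaxset()`. The point is only that the converted buffer must outlive the inner block in which it is filled, because the pointer to it is used afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for mbcsToSbcs(): copies src to dst, replacing
   every non-ASCII byte with '.'. The output is never longer than the input. */
static void to_single_byte(const char *src, char *dst)
{
    for (; *src; src++, dst++)
        *dst = ((unsigned char)*src < 0x80) ? *src : '.';
    *dst = '\0';
}

/* Returns 1 if every byte of s is below 0x80, otherwise 0. */
static int is_ascii(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)*s >= 0x80) return 0;
    return 1;
}

/* Toy width function (an assumption for illustration): a substituted dot
   counts as 2 units, any other byte as 1. Returns -1 if allocation fails. */
int string_width(const char *str)
{
    assert(str != NULL);
    const char *str1 = str;
    char *buff = NULL;  /* function scope: still valid in the loop below */

    if (!is_ascii(str)) {
        /* Declaring `char buff[strlen(str) + 1];` here instead would end
           the array's lifetime at the closing brace and leave str1
           dangling, which is the situation described in the message. */
        buff = malloc(strlen(str) + 1);
        if (buff == NULL) return -1;
        to_single_byte(str, buff);
        str1 = buff;
    }

    int sum = 0;
    for (const char *p = str1; *p; p++)
        sum += (*p == '.') ? 2 : 1;

    free(buff);  /* free(NULL) is a no-op for the pure-ASCII case */
    return sum;
}
```

Under this toy metric, `string_width("abc")` is 3, and `string_width("a\xe2\x99\xa5")` (the letter a followed by the three UTF-8 bytes of U+2665) is 7, since each of the three non-ASCII bytes becomes a dot.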