thr3ads.net - R devel - [Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device [Sep 2023]


Martin Maechler

2023-Sep-20 10:39 UTC

[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device

>>>>> Trevor Davis 
>>>>>     on Thu, 31 Aug 2023 13:49:03 -0700 writes:
    > Hi,

    > It would be nice if `grDevices::dev.capabilities()` could also be used to
    > query whether the current graphics device supports Unicode.  In such a case
    > I'd expect it to return `FALSE` if `pdf()` is the current graphics device
    > and something else for the Cairo or Quartz devices.

    > Thanks,
    > Trevor

I agree in principle that this would be a useful new feature for
dev.capabilities().

However, pdf()   *does*  support  Unicode.

The problem is that some pdf *viewers*,
notably `evince` on Fedora Linux, for several years now,
do *not* show *some* of the UTF-8 glyphs because they do not use
the correct fonts {which *are* on the machine; good old `xpdf`
does in that case show the glyphs}.

Martin

Trevor Davis

2023-Sep-20 16:12 UTC


> However, pdf()   *does*  support  Unicode.
When I run a simple Unicode example like:

```
f <- tempfile(fileext = ".pdf")
pdf(f)
# U+2665 (♥) is found in most (all?) "sans" fonts like Arial, DejaVu Sans,
# Arimo, etc.
# However, it is not in the Latin-1 encoding
grid::grid.text("\u2665")
dev.off()
```

I observe the following output:

```
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <e2>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <99>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <a5>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <e2>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <99>
Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  conversion failure on '♥' in 'mbcsToSbcs': dot substituted for <a5>
```

When I open up the pdf file I just see three dots and not a heart as I
expected even if I open it up with `xpdf`.

In contrast the pdf generated by `cairo_pdf()` has a heart without
generating any warnings.

Avoiding such WARNINGs on certain CRAN check machines when I have a Unicode
graphics example that is worth including in a package's examples (if
protected by an appropriate if statement) is my main use case for such a
new feature.  However, a new feature like `dev.capabilities()$unicode`
could certainly return something more sophisticated than a crude `TRUE` and
`FALSE` to distinguish between levels of Unicode support provided by
different graphics devices.

Thanks,

Trevor


Ivan Krylov

2023-Sep-23 20:43 UTC


On Wed, 20 Sep 2023 12:39:50 +0200
Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> The problem is that some pdf *viewers*,
> notably `evince` on Fedora Linux, for several years now,
> do *not* show *some* of the UTF-8 glyphs because they do not use
> the correct fonts 
One more problem that makes it nontrivial to use Unicode with pdf() is
the graphics device not knowing some of the font metrics:

x <- '\u410\u411\u412'
pdf()
plot(1:10, main = x)
# Warning messages:
# 1: In title(...) : font width unknown for character 0xb0
# 2: In title(...) : font width unknown for character 0xe4
# 3: In title(...) : font width unknown for character 0xfc
# 4: In title(...) : font width unknown for character 0x7f
dev.off()

In the resulting PDF file, the three letters are visible, at least in
Evince 3.38.2, but they are all positioned in the same space.

I understand that this is strictly speaking not pdf()'s fault
(grDevices contains the font metrics for all standard Adobe fonts and a
few more), but I'm not sure what to do as a user. Should I call
pdfFonts(...), declaring a font with all symbols I need? Where does one
even get Type-1 Cyrillic Helvetica (or any other font) with separate
font metrics files for use with pdf()?

Actually, the wrong number of sometimes random character codes reminds
me of stack garbage. In src/library/grDevices/src/devPS.c, function
static double PostScriptStringWidth, there's this bit of code:

	if(!strIsASCII((char *) str) &&
	   /*
	    * Every fifth font is a symbol font:
	    * see postscriptFonts()
	    */
	   (face % 5) != 0) {
	    R_CheckStack2(strlen((char *)str)+1);
	    char buff[strlen((char *)str)+1];
	    /* Output string cannot be longer */
	    mbcsToSbcs((char *)str, buff, encoding, enc);
	    str1 = (unsigned char *)buff;
	}

Later the characters in str1 are iterated over in order to calculate
the total width of the string. I didn't notice this myself until I saw
in the debugger that after a few iterations of the loop, the contents
of str1 are completely different from the result of mbcsToSbcs((char
*)str, buff, encoding, enc), and went to investigate. Only after the
debugger told me that there's no variable called "buff" I realised that
the VLA pointed to by str1 no longer exists.

--- src/library/grDevices/src/devPS.c	(revision 85214)
+++ src/library/grDevices/src/devPS.c	(working copy)
@@ -721,6 +721,8 @@
     unsigned char p1, p2;

     int status;
+    /* May be about to allocate */
+    void *alloc = vmaxget();
     if(!metrics && (face % 5) != 0) {
 	/* This is the CID font case, and should only happen for
 	   non-symbol fonts.  So we assume monospaced with multipliers.
@@ -755,9 +757,8 @@
 	    * Every fifth font is a symbol font:
 	    * see postscriptFonts()
 	    */
-	   (face % 5) != 0) {
-	    R_CheckStack2(strlen((char *)str)+1);
-	    char buff[strlen((char *)str)+1];
+	   (face % 5) != 0 && metrics) {
+	    char *buff = R_alloc(strlen((char *)str)+1, 1);
 	    /* Output string cannot be longer */
 	    mbcsToSbcs((char *)str, buff, encoding, enc);
 	    str1 = (unsigned char *)buff;
@@ -792,6 +793,7 @@
 		}
 	}
     }
+    vmaxset(alloc);
     return 0.001 * sum;
 }

After this patch, I'm consistently getting the right character codes in
the warnings, but I still don't know how to set up the font metrics.

-- 
Best regards,
Ivan
