thr3ads.net - R devel - [Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device [Sep 2023]

Ivan Krylov

2023-Sep-23 20:43 UTC

[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device

On Wed, 20 Sep 2023 12:39:50 +0200
Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> The problem is that some pdf *viewers*,
> notably `evince` on Fedora Linux, for several years now,
> do *not* show *some* of the UTF-8 glyphs because they do not use
> the correct fonts

One more problem that makes it nontrivial to use Unicode with pdf() is
the graphics device not knowing some of the font metrics:

x <- '\u410\u411\u412'
pdf()
plot(1:10, main = x)
# Warning messages:
# 1: In title(...) : font width unknown for character 0xb0
# 2: In title(...) : font width unknown for character 0xe4
# 3: In title(...) : font width unknown for character 0xfc
# 4: In title(...) : font width unknown for character 0x7f
dev.off()

In the resulting PDF file, the three letters are visible, at least in
Evince 3.38.2, but they are all positioned in the same space.

I understand that this is strictly speaking not pdf()'s fault
(grDevices contains the font metrics for all standard Adobe fonts and a
few more), but I'm not sure what to do as a user. Should I call
pdfFonts(...), declaring a font with all symbols I need? Where does one
even get Type-1 Cyrillic Helvetica (or any other font) with separate
font metrics files for use with pdf()?

Actually, the wrong number of warnings (four for a three-letter title)
and the sometimes random character codes remind me of stack garbage.
In src/library/grDevices/src/devPS.c, in the function
static double PostScriptStringWidth, there's this bit of code:

	if(!strIsASCII((char *) str) &&
	   /*
	    * Every fifth font is a symbol font:
	    * see postscriptFonts()
	    */
	   (face % 5) != 0) {
	    R_CheckStack2(strlen((char *)str)+1);
	    char buff[strlen((char *)str)+1];
	    /* Output string cannot be longer */
	    mbcsToSbcs((char *)str, buff, encoding, enc);
	    str1 = (unsigned char *)buff;
	}

Later the characters in str1 are iterated over in order to calculate
the total width of the string. I didn't notice the problem myself until
I saw in the debugger that after a few iterations of the loop, the
contents of str1 are completely different from the result of
mbcsToSbcs((char *)str, buff, encoding, enc), and went to investigate.
Only after the debugger told me that there's no variable called "buff"
did I realise that the VLA pointed to by str1 no longer exists.
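
To illustrate the lifetime problem: in C, a variable-length array
declared inside an if block ceases to exist at the block's closing
brace, so a pointer saved from it dangles afterwards. The following is
a minimal, self-contained sketch of the safe pattern, not the code in
devPS.c. The names has_non_ascii(), to_single_byte() and string_width()
are hypothetical, to_single_byte() is only a crude stand-in for
mbcsToSbcs(), the width of 500 units per character is made up, and
malloc()/free() stand in for R's allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if the string contains any byte outside 7-bit ASCII. */
static int has_non_ascii(const char *s)
{
    for (; *s != '\0'; s++)
        if ((unsigned char) *s >= 0x80)
            return 1;
    return 0;
}

/* Crude, hypothetical stand-in for mbcsToSbcs(): copy ASCII bytes and
   replace every other byte with '?'.  The output is never longer than
   the input. */
static void to_single_byte(const char *in, char *out)
{
    size_t i;
    for (i = 0; in[i] != '\0'; i++)
        out[i] = ((unsigned char) in[i] < 0x80) ? in[i] : '?';
    out[i] = '\0';
}

/* Sum made-up character widths: 500 units per byte, 0 for '?'
   (including a literal '?' in the input). */
double string_width(const char *str)
{
    const unsigned char *str1 = (const unsigned char *) str;
    char *buff = NULL;   /* function scope: outlives the if block below */
    double sum = 0.0;

    assert(str != NULL);
    if (has_non_ascii(str)) {
        /* Heap storage stays valid after the closing brace; a VLA
           declared here would not. */
        buff = malloc(strlen(str) + 1);
        if (buff == NULL)
            return -1.0;   /* signal allocation failure */
        to_single_byte(str, buff);
        str1 = (const unsigned char *) buff;
    }
    for (const unsigned char *p = str1; *p != '\0'; p++)
        sum += (*p == '?') ? 0.0 : 500.0;
    free(buff);            /* free(NULL) is a no-op */
    return 0.001 * sum;
}
```

The only point of the sketch is that the buffer is released after the
loop that reads through str1, not at the end of the if block.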

--- src/library/grDevices/src/devPS.c	(revision 85214)
+++ src/library/grDevices/src/devPS.c	(working copy)
@@ -721,6 +721,8 @@
     unsigned char p1, p2;

     int status;
+    /* May be about to allocate */
+    void *alloc = vmaxget();
     if(!metrics && (face % 5) != 0) {
 	/* This is the CID font case, and should only happen for
 	   non-symbol fonts.  So we assume monospaced with multipliers.
@@ -755,9 +757,8 @@
 	    * Every fifth font is a symbol font:
 	    * see postscriptFonts()
 	    */
-	   (face % 5) != 0) {
-	    R_CheckStack2(strlen((char *)str)+1);
-	    char buff[strlen((char *)str)+1];
+	   (face % 5) != 0 && metrics) {
+	    char *buff = R_alloc(strlen((char *)str)+1, 1);
 	    /* Output string cannot be longer */
 	    mbcsToSbcs((char *)str, buff, encoding, enc);
 	    str1 = (unsigned char *)buff;
@@ -792,6 +793,7 @@
 		}
 	}
     }
+    vmaxset(alloc);
     return 0.001 * sum;
 }
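
The vmaxget()/vmaxset() pair in the patch follows a mark-and-release
discipline: remember the allocation state on entry, allocate with
R_alloc() as needed, and restore the state before returning. Below is a
toy sketch of that idea using a fixed-size bump allocator. The names
arena_mark(), arena_alloc() and arena_reset() are hypothetical, the
sketch ignores alignment, and it is not how R manages this memory
internally.

```c
#include <assert.h>
#include <stddef.h>

static char arena[1024];      /* backing storage for the toy allocator */
static size_t arena_top = 0;  /* offset of the first free byte */

/* Analogous in spirit to vmaxget(): remember the current state. */
size_t arena_mark(void)
{
    return arena_top;
}

/* Analogous in spirit to R_alloc(): hand out n bytes, or NULL if full. */
char *arena_alloc(size_t n)
{
    if (n > sizeof arena - arena_top)
        return NULL;
    char *p = arena + arena_top;
    arena_top += n;
    return p;
}

/* Analogous in spirit to vmaxset(): release everything allocated
   since the mark was taken. */
void arena_reset(size_t mark)
{
    assert(mark <= arena_top);
    arena_top = mark;
}
```

A function that takes a mark on entry and resets to it just before
returning leaves the allocator exactly as it found it, which is the
role vmaxget() and vmaxset() play around the R_alloc() call above.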

After this patch, I'm consistently getting the right character codes in
the warnings, but I still don't know how to set up the font metrics.

-- 
Best regards,
Ivan

Paul Murrell

2023-Sep-25 23:39 UTC

Hi

Yes, you can set up your own font, and TeX installations are a good
source of Type 1 fonts.  Here is an example (the paths are specific to
my [Ubuntu 20.04] OS and TeX installation) ...


cmlgc <- Type1Font("cmlgc",
                   rep("/usr/share/texlive/texmf-dist/fonts/afm/public/cm-lgc/fcmr6z.afm",
                       4),
                   encoding="Cyrillic")
pdfFonts(cmlgc=cmlgc)

x <- '\u410\u411\u412'
pdf("cmlgc.pdf", family="cmlgc",
    encoding="Cyrillic")
plot(1:10, main = x)
dev.off()

embedFonts("cmlgc.pdf", out="cmlgc-embed.pdf",
           fontpaths="/usr/share/texlive/texmf-dist/fonts/type1/public/cm-lgc/")


Final result attached.

Thanks for the patch for the unrelated memory problem;  I will take a 
look at that.

Paul

-- 
Dr Paul Murrell
Te Kura Tatauranga | Department of Statistics
Waipapa Taumata Rau | The University of Auckland
Private Bag 92019, Auckland 1142, New Zealand
64 9 3737599 x85392
paul at stat.auckland.ac.nz
www.stat.auckland.ac.nz/~paul/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cmlgc-embed.pdf
Type: application/pdf
Size: 10746 bytes
Desc: not available
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20230926/c4365ddc/attachment.pdf>
