thr3ads.net - R devel - [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF [Jan 2023]

If this information is useful, please help other people find it:
Share via:

Henrik Bengtsson

2023-Jan-30 22:01 UTC

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

/Hello.

SUMMARY:
$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv()"
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'

$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv('BOOM')"
[1] "\xff"

BACKGROUND:
I launch R through an Son of Grid Engine (SGE) scheduler, where the R
process is launched on a compute host via  'qrsh', which part of SGE.
Without going into details, 'mpirun' is also involved. Regardless, in
this process, an 'qrsh'-specific environment variable
'QRSH_COMMAND'
is set automatically.  The value of this variable comprise of a string
with \xff (ASCII 255) injected between the words.  This is by design
of SGE [1].  Here is an example of what this environment variable may
look like:

QRSH_COMMAND=
orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"

where each \xff is a single byte 255=0xFF=\xFF.


ISSUE:
An environment variable with embedded 0xFF bytes in its value causes
calls to Sys.getenv() to produce an error when running R in a UTF-8
locale. Here is a minimal example on Linux:

$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv()"
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
Calls: Sys.getenv -> substring
In addition: Warning message:
In regexpr("=", x, fixed = TRUE) :
  input string 134 is invalid in this locale
Execution halted


WORKAROUND:
The workaround is to (1) identify any environment variables with
invalid UTF-8 symbols, and (2) prune or unset those variables before
launching R, e.g. in my SGE case, launching R using:

QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"

avoid the problem.  Having to unset/modify environment variables
because R doesn't like them, see a bit of an ad-hoc hack to me.  Also,
if you are not aware of this problem, or not a savvy R user, it can be
quite tricky to track down the above error message, especially if
Sys.getenv() is called deep down in some package dependency.


DISCUSSION/SUGGESTION/ASK:
My suggestion would be to make Sys.getenv() robust against any type of
byte values in environment variable strings.

The error occurs in Sys.getenv() from:

        x <- .Internal(Sys.getenv(character(), ""))
        m <- regexpr("=", x, fixed = TRUE)  ## produces a warning
        n <- substring(x, 1L, m - 1L)
        v <- substring(x, m + 1L)  ## produces the error

I know too little about string encodings, so I'm not sure what the
best approach would be here, but maybe falling back to parsing strings
that are invalid in the current locale using the C locale would be
reasonable?  Maybe Sys.getenv() should always use the C locale for
this. It looks like Sys.getenv(name) does this, e.g.

$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv('BOOM')"
[1] "\xff"

I'd appreciate any comments and suggestions. I'm happy to file a bug
report on BugZilla, if this is a bug.

Henrik

[1]
https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466

Tomas Kalibera

2023-Jan-31 00:04 UTC

head link

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

On 1/30/23 23:01, Henrik Bengtsson wrote:> /Hello.
>
> SUMMARY:
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv()"
> Error in substring(x, m + 1L) : invalid multibyte string at
'<ff>'
>
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv('BOOM')"
> [1] "\xff"
>
> BACKGROUND:
> I launch R through an Son of Grid Engine (SGE) scheduler, where the R
> process is launched on a compute host via  'qrsh', which part of
SGE.
> Without going into details, 'mpirun' is also involved. Regardless,
in
> this process, an 'qrsh'-specific environment variable
'QRSH_COMMAND'
> is set automatically.  The value of this variable comprise of a string
> with \xff (ASCII 255) injected between the words.  This is by design
> of SGE [1].  Here is an example of what this environment variable may
> look like:
>
> QRSH_COMMAND=
orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"
>
> where each \xff is a single byte 255=0xFF=\xFF.
>
>
> ISSUE:
> An environment variable with embedded 0xFF bytes in its value causes
> calls to Sys.getenv() to produce an error when running R in a UTF-8
> locale. Here is a minimal example on Linux:
>
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv()"
> Error in substring(x, m + 1L) : invalid multibyte string at
'<ff>'
> Calls: Sys.getenv -> substring
> In addition: Warning message:
> In regexpr("=", x, fixed = TRUE) :
>    input string 134 is invalid in this locale
> Execution halted
>
>
> WORKAROUND:
> The workaround is to (1) identify any environment variables with
> invalid UTF-8 symbols, and (2) prune or unset those variables before
> launching R, e.g. in my SGE case, launching R using:
>
> QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"
>
> avoid the problem.  Having to unset/modify environment variables
> because R doesn't like them, see a bit of an ad-hoc hack to me.  Also,
> if you are not aware of this problem, or not a savvy R user, it can be
> quite tricky to track down the above error message, especially if
> Sys.getenv() is called deep down in some package dependency.
>
>
> DISCUSSION/SUGGESTION/ASK:
> My suggestion would be to make Sys.getenv() robust against any type of
> byte values in environment variable strings.
>
> The error occurs in Sys.getenv() from:
>
>          x <- .Internal(Sys.getenv(character(), ""))
>          m <- regexpr("=", x, fixed = TRUE)  ## produces a
warning
>          n <- substring(x, 1L, m - 1L)
>          v <- substring(x, m + 1L)  ## produces the error
>
> I know too little about string encodings, so I'm not sure what the
> best approach would be here, but maybe falling back to parsing strings
> that are invalid in the current locale using the C locale would be
> reasonable?  Maybe Sys.getenv() should always use the C locale for
> this. It looks like Sys.getenv(name) does this, e.g.
>
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e
"Sys.getenv('BOOM')"
> [1] "\xff"
>
> I'd appreciate any comments and suggestions. I'm happy to file a
bug
> report on BugZilla, if this is a bug.This discussion comes from Python: https://bugs.python.org/issue4006
(it says Python skips such environment variables)

The problem of invalid strings in environment variables is a similar to 
the problem of invalid strings in file names. Both variables and file 
names are something people want to use as strings in programs, scripts, 
texts, but at the same time these may in theory not be valid strings. 
Working with (potentially) invalid strings (almost) transparently is 
much harder than with valid strings; even if R decided to do that, it 
would be hard to implement and take long and only work for some 
operations, most will still throw errors. In addition, in practice 
invalid strings are almost always due to an error, particularly so in 
file names or environment variables. Such errors are often worth 
catching (wrong encoding declaration, etc), even though perhaps not always.

In practice, this instance can only be properly fixed at the source, [1] 
should not do this. split_command() will run into problems with 
different software, not just R.

There should be a way to split the commands in ASCII (using some sort of 
quoting/escaping). Using \xFF is flawed also simply because it may be 
present in the commands, if we followed the same logic of that every 
byte is fine. So the code is buggy even regardless of multi-byte encodings.

Re difficulty to debug, I think the error message is clear and if 
packages catch and hide errors, that'd be bad design of such packages, R 
couldn't really do much about that. This needs to be fixed at [1].

Tomas
>
> Henrik
>
> [1]
https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Ivan Krylov

2023-Jan-31 08:48 UTC

head link

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Can we use the "bytes" encoding for such environment variables invalid
in the current locale? The following patch preserves CE_NATIVE for
strings valid in the current UTF-8 or multibyte locale (or
non-multibyte strings) but sets CE_BYTES for those that are invalid:

Index: src/main/sysutils.c
==================================================================---
src/main/sysutils.c	(revision 83731)
+++ src/main/sysutils.c	(working copy)
@@ -393,8 +393,16 @@
 	char **e;
 	for (i = 0, e = environ; *e != NULL; i++, e++);
 	PROTECT(ans = allocVector(STRSXP, i));
-	for (i = 0, e = environ; *e != NULL; i++, e++)
-	    SET_STRING_ELT(ans, i, mkChar(*e));
+	for (i = 0, e = environ; *e != NULL; i++, e++) {
+	    cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+	        known_to_be_utf8 ? CE_UTF8 :
+	        CE_NATIVE;
+	    if (
+	        (utf8locale && !utf8Valid(*e))
+	        || (mbcslocale && !mbcsValid(*e))
+	    ) enc = CE_BYTES;
+	    SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
+	}
 #endif
     } else {
 	PROTECT(ans = allocVector(STRSXP, i));
@@ -416,11 +424,14 @@
 	    if (s == NULL)
 		SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));
 	    else {
-		SEXP tmp;
-		if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1);
-		else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8);
-		else tmp = mkChar(s);
-		SET_STRING_ELT(ans, j, tmp);
+		cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+		    known_to_be_utf8 ? CE_UTF8 :
+		    CE_NATIVE;
+		if (
+		    (utf8locale && !utf8Valid(s))
+		    || (mbcslocale && !mbcsValid(s))
+		) enc = CE_BYTES;
+		SET_STRING_ELT(ans, j, mkCharCE(s, enc));
 	    }
 #endif
 	}

Here are the potential problems with this approach:

 * I don't know whether known_to_be_utf8 can disagree with utf8locale.
   known_to_be_utf8 was the original condition for setting CE_UTF8 on
   the string. I also need to detect non-UTF-8 multibyte locales, so
   I'm checking for utf8locale and mbcslocale. Perhaps I should be more
   careful and test for (enc == CE_UTF8) || (utf8locale && enc =  
CE_NATIVE) instead of just utf8locale.

 * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
   strings in the environment with this patch applied, but now
   print.Dlist does, because formatDL wants to compute the width of the
   string which has the 'bytes' encoding. If this is a good way to
   solve the problem, I can work on suggesting a fix for formatDL to
   avoid the error.

-- 
Best regards,
Ivan

R devel - Jan 2023 - Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF