thr3ads.net - R devel - [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF [Jan 2023]

If this information is useful, please help other people find it:
Share via:

Ivan Krylov

2023-Jan-31 08:48 UTC

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Can we use the "bytes" encoding for such environment variables invalid
in the current locale? The following patch preserves CE_NATIVE for
strings valid in the current UTF-8 or multibyte locale (or
non-multibyte strings) but sets CE_BYTES for those that are invalid:

Index: src/main/sysutils.c
==================================================================---
src/main/sysutils.c	(revision 83731)
+++ src/main/sysutils.c	(working copy)
@@ -393,8 +393,16 @@
 	char **e;
 	for (i = 0, e = environ; *e != NULL; i++, e++);
 	PROTECT(ans = allocVector(STRSXP, i));
-	for (i = 0, e = environ; *e != NULL; i++, e++)
-	    SET_STRING_ELT(ans, i, mkChar(*e));
+	for (i = 0, e = environ; *e != NULL; i++, e++) {
+	    cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+	        known_to_be_utf8 ? CE_UTF8 :
+	        CE_NATIVE;
+	    if (
+	        (utf8locale && !utf8Valid(*e))
+	        || (mbcslocale && !mbcsValid(*e))
+	    ) enc = CE_BYTES;
+	    SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
+	}
 #endif
     } else {
 	PROTECT(ans = allocVector(STRSXP, i));
@@ -416,11 +424,14 @@
 	    if (s == NULL)
 		SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));
 	    else {
-		SEXP tmp;
-		if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1);
-		else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8);
-		else tmp = mkChar(s);
-		SET_STRING_ELT(ans, j, tmp);
+		cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+		    known_to_be_utf8 ? CE_UTF8 :
+		    CE_NATIVE;
+		if (
+		    (utf8locale && !utf8Valid(s))
+		    || (mbcslocale && !mbcsValid(s))
+		) enc = CE_BYTES;
+		SET_STRING_ELT(ans, j, mkCharCE(s, enc));
 	    }
 #endif
 	}

Here are the potential problems with this approach:

 * I don't know whether known_to_be_utf8 can disagree with utf8locale.
   known_to_be_utf8 was the original condition for setting CE_UTF8 on
   the string. I also need to detect non-UTF-8 multibyte locales, so
   I'm checking for utf8locale and mbcslocale. Perhaps I should be more
   careful and test for (enc == CE_UTF8) || (utf8locale && enc =  
CE_NATIVE) instead of just utf8locale.

 * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
   strings in the environment with this patch applied, but now
   print.Dlist does, because formatDL wants to compute the width of the
   string which has the 'bytes' encoding. If this is a good way to
   solve the problem, I can work on suggesting a fix for formatDL to
   avoid the error.

-- 
Best regards,
Ivan

Tomas Kalibera

2023-Jan-31 09:53 UTC

head link

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

On 1/31/23 09:48, Ivan Krylov wrote:> Can we use the "bytes" encoding for such environment variables
invalid
> in the current locale? The following patch preserves CE_NATIVE for
> strings valid in the current UTF-8 or multibyte locale (or
> non-multibyte strings) but sets CE_BYTES for those that are invalid:
>
> Index: src/main/sysutils.c
> ==================================================================> ---
src/main/sysutils.c	(revision 83731)
> +++ src/main/sysutils.c	(working copy)
> @@ -393,8 +393,16 @@
>   	char **e;
>   	for (i = 0, e = environ; *e != NULL; i++, e++);
>   	PROTECT(ans = allocVector(STRSXP, i));
> -	for (i = 0, e = environ; *e != NULL; i++, e++)
> -	    SET_STRING_ELT(ans, i, mkChar(*e));
> +	for (i = 0, e = environ; *e != NULL; i++, e++) {
> +	    cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> +	        known_to_be_utf8 ? CE_UTF8 :
> +	        CE_NATIVE;
> +	    if (
> +	        (utf8locale && !utf8Valid(*e))
> +	        || (mbcslocale && !mbcsValid(*e))
> +	    ) enc = CE_BYTES;
> +	    SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
> +	}
>   #endif
>       } else {
>   	PROTECT(ans = allocVector(STRSXP, i));
> @@ -416,11 +424,14 @@
>   	    if (s == NULL)
>   		SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));
>   	    else {
> -		SEXP tmp;
> -		if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1);
> -		else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8);
> -		else tmp = mkChar(s);
> -		SET_STRING_ELT(ans, j, tmp);
> +		cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> +		    known_to_be_utf8 ? CE_UTF8 :
> +		    CE_NATIVE;
> +		if (
> +		    (utf8locale && !utf8Valid(s))
> +		    || (mbcslocale && !mbcsValid(s))
> +		) enc = CE_BYTES;
> +		SET_STRING_ELT(ans, j, mkCharCE(s, enc));
>   	    }
>   #endif
>   	}
>
> Here are the potential problems with this approach:
>
>   * I don't know whether known_to_be_utf8 can disagree with utf8locale.
>     known_to_be_utf8 was the original condition for setting CE_UTF8 on
>     the string. I also need to detect non-UTF-8 multibyte locales, so
>     I'm checking for utf8locale and mbcslocale. Perhaps I should be
more
>     careful and test for (enc == CE_UTF8) || (utf8locale && enc
=>     CE_NATIVE) instead of just utf8locale.
>
>   * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
>     strings in the environment with this patch applied, but now
>     print.Dlist does, because formatDL wants to compute the width of the
>     string which has the 'bytes' encoding. If this is a good way to
>     solve the problem, I can work on suggesting a fix for formatDL to
>     avoid the error.
Thanks, indeed, type instability is a big problem of the approach "turn 
invalid strings to bytes". It is something what is historically being 
done in regular expression operations, but it is brittle and not user 
friendly: writing code to be agnostic to whether we are dealing with 
"bytes" or a regular string is very tedious. Pieces of type
instability
come also from that ASCII strings are always flagged "native" (never 
"bytes"). Last year I had to revert a big change which broke existing 
code by introducing some more of this instability due to better dealing 
with invalid strings in regular expressions. I've made some additions to 
R-devel allowing to better deal with such instability but it is still a 
pain and existing code has to be changed (and made more complicated).

So, I don't think this is the way to go.

Tomas

>

Seemingly Similar Threads

Search for more maybe matching threads

R devel - Jan 2023 - Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Seemingly Similar Threads