Tomas Kalibera
2023-Jan-31 09:53 UTC
[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
On 1/31/23 09:48, Ivan Krylov wrote:> Can we use the "bytes" encoding for such environment variables invalid > in the current locale? The following patch preserves CE_NATIVE for > strings valid in the current UTF-8 or multibyte locale (or > non-multibyte strings) but sets CE_BYTES for those that are invalid: > > Index: src/main/sysutils.c > ==================================================================> --- src/main/sysutils.c (revision 83731) > +++ src/main/sysutils.c (working copy) > @@ -393,8 +393,16 @@ > char **e; > for (i = 0, e = environ; *e != NULL; i++, e++); > PROTECT(ans = allocVector(STRSXP, i)); > - for (i = 0, e = environ; *e != NULL; i++, e++) > - SET_STRING_ELT(ans, i, mkChar(*e)); > + for (i = 0, e = environ; *e != NULL; i++, e++) { > + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 : > + known_to_be_utf8 ? CE_UTF8 : > + CE_NATIVE; > + if ( > + (utf8locale && !utf8Valid(*e)) > + || (mbcslocale && !mbcsValid(*e)) > + ) enc = CE_BYTES; > + SET_STRING_ELT(ans, i, mkCharCE(*e, enc)); > + } > #endif > } else { > PROTECT(ans = allocVector(STRSXP, i)); > @@ -416,11 +424,14 @@ > if (s == NULL) > SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0)); > else { > - SEXP tmp; > - if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1); > - else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8); > - else tmp = mkChar(s); > - SET_STRING_ELT(ans, j, tmp); > + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 : > + known_to_be_utf8 ? CE_UTF8 : > + CE_NATIVE; > + if ( > + (utf8locale && !utf8Valid(s)) > + || (mbcslocale && !mbcsValid(s)) > + ) enc = CE_BYTES; > + SET_STRING_ELT(ans, j, mkCharCE(s, enc)); > } > #endif > } > > Here are the potential problems with this approach: > > * I don't know whether known_to_be_utf8 can disagree with utf8locale. > known_to_be_utf8 was the original condition for setting CE_UTF8 on > the string. I also need to detect non-UTF-8 multibyte locales, so > I'm checking for utf8locale and mbcslocale. Perhaps I should be more > careful and test for (enc == CE_UTF8) || (utf8locale && enc => CE_NATIVE) instead of just utf8locale. > > * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid > strings in the environment with this patch applied, but now > print.Dlist does, because formatDL wants to compute the width of the > string which has the 'bytes' encoding. If this is a good way to > solve the problem, I can work on suggesting a fix for formatDL to > avoid the error.Thanks, indeed, type instability is a big problem of the approach "turn invalid strings to bytes". It is something what is historically being done in regular expression operations, but it is brittle and not user friendly: writing code to be agnostic to whether we are dealing with "bytes" or a regular string is very tedious. Pieces of type instability come also from that ASCII strings are always flagged "native" (never "bytes"). Last year I had to revert a big change which broke existing code by introducing some more of this instability due to better dealing with invalid strings in regular expressions. I've made some additions to R-devel allowing to better deal with such instability but it is still a pain and existing code has to be changed (and made more complicated). So, I don't think this is the way to go. Tomas>
Martin Maechler
2023-Jan-31 10:50 UTC
[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
>>>>> Tomas Kalibera >>>>> on Tue, 31 Jan 2023 10:53:21 +0100 writes:> On 1/31/23 09:48, Ivan Krylov wrote: >> Can we use the "bytes" encoding for such environment variables invalid >> in the current locale? The following patch preserves CE_NATIVE for >> strings valid in the current UTF-8 or multibyte locale (or >> non-multibyte strings) but sets CE_BYTES for those that are invalid: >> >> Index: src/main/sysutils.c >> ================================================================== >> --- src/main/sysutils.c (revision 83731) ..... >> >> Here are the potential problems with this approach: >> >> * I don't know whether known_to_be_utf8 can disagree with utf8locale. >> known_to_be_utf8 was the original condition for setting CE_UTF8 on >> the string. I also need to detect non-UTF-8 multibyte locales, so >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more >> careful and test for (enc == CE_UTF8) || (utf8locale && enc = >> CE_NATIVE) instead of just utf8locale. >> >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid >> strings in the environment with this patch applied, but now >> print.Dlist does, because formatDL wants to compute the width of the >> string which has the 'bytes' encoding. If this is a good way to >> solve the problem, I can work on suggesting a fix for formatDL to >> avoid the error. > Thanks, indeed, type instability is a big problem of the approach "turn > invalid strings to bytes". It is something what is historically being > done in regular expression operations, but it is brittle and not user > friendly: writing code to be agnostic to whether we are dealing with > "bytes" or a regular string is very tedious. Pieces of type instability > come also from that ASCII strings are always flagged "native" (never > "bytes"). Last year I had to revert a big change which broke existing > code by introducing some more of this instability due to better dealing > with invalid strings in regular expressions. I've made some additions to > R-devel allowing to better deal with such instability but it is still a > pain and existing code has to be changed (and made more complicated). > So, I don't think this is the way to go. > Tomas hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, but of course you are right that type stability is really a valid goal.... In general, what about behaving close to "old R" and replace all such strings by NA_character_ (and typically raising one warning)? This would keep the result a valid character vector, just with some NA entries. Specifically for Sys.getenv(), I still think Simon has a very valid point of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`} should be the same as Sys.getenv("BOOM") Also, as typical R user, I'd definitely want to be able to get all the valid environment variables, even if there are one or more invalid ones. ... and similarly in other cases, it may be a cheap strategy to replace invalid strings ("string" in the sense of length 1 STRSXP, i.e., in R, a "character" of length 1) by NA_character_ and keep all valid parts of the character vector in a valid encoding. - - - - - - In addition to the above cheap "replace by NA"-strategy, and at least once R is "all UTF-8", we could also consider a more expensive strategy that would try to replace invalid characters/byte-sequences by one specific valid UTF-8 character, i.e., glyph (think of a special version of "?") that we would designate as replacement for "invalid-in-current-encoding". Probably such a glyph already exists and we have seen it used in some console's output when having to print such things.
Possibly Parallel Threads
- Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
- Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
- Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
- Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
- Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF