Martin Maechler
2023-Jan-31 10:50 UTC
[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
>>>>> Tomas Kalibera
>>>>>     on Tue, 31 Jan 2023 10:53:21 +0100 writes:

> On 1/31/23 09:48, Ivan Krylov wrote:
>> Can we use the "bytes" encoding for such environment variables invalid
>> in the current locale? The following patch preserves CE_NATIVE for
>> strings valid in the current UTF-8 or multibyte locale (or
>> non-multibyte strings) but sets CE_BYTES for those that are invalid:
>>
>> Index: src/main/sysutils.c
>> ==================================================================
>> --- src/main/sysutils.c (revision 83731)
.....
>>
>> Here are the potential problems with this approach:
>>
>> * I don't know whether known_to_be_utf8 can disagree with utf8locale.
>>   known_to_be_utf8 was the original condition for setting CE_UTF8 on
>>   the string. I also need to detect non-UTF-8 multibyte locales, so
>>   I'm checking for utf8locale and mbcslocale. Perhaps I should be more
>>   careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
>>   CE_NATIVE) instead of just utf8locale.
>>
>> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
>>   strings in the environment with this patch applied, but now
>>   print.Dlist does, because formatDL wants to compute the width of the
>>   string which has the 'bytes' encoding. If this is a good way to
>>   solve the problem, I can work on suggesting a fix for formatDL to
>>   avoid the error.

> Thanks, indeed, type instability is a big problem of the approach "turn
> invalid strings to bytes". It is something that has historically been
> done in regular expression operations, but it is brittle and not user
> friendly: writing code to be agnostic to whether we are dealing with
> "bytes" or a regular string is very tedious. Pieces of type instability
> also come from the fact that ASCII strings are always flagged "native"
> (never "bytes"). Last year I had to revert a big change which broke
> existing code by introducing some more of this instability while better
> dealing with invalid strings in regular expressions. I've made some
> additions to R-devel allowing to better deal with such instability, but
> it is still a pain and existing code has to be changed (and made more
> complicated).
>
> So, I don't think this is the way to go.
>
> Tomas

hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
but of course you are right that type stability is really a valid goal.

In general, what about behaving close to "old R" and replacing all such
strings by NA_character_ (and typically raising one warning)? This would
keep the result a valid character vector, just with some NA entries.

Specifically for Sys.getenv(), I still think Simon has a very valid point
of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`}
should be the same as Sys.getenv("BOOM").

Also, as a typical R user, I'd definitely want to be able to get all the
valid environment variables, even if there are one or more invalid ones.
Similarly, in other cases it may be a cheap strategy to replace invalid
strings ("string" in the sense of a length-1 STRSXP, i.e., in R, a
"character" of length 1) by NA_character_ and keep all valid parts of the
character vector in a valid encoding.

- - - - - -

In addition to the above cheap "replace by NA"-strategy, and at least once
R is "all UTF-8", we could also consider a more expensive strategy that
would try to replace invalid characters/byte-sequences by one specific
valid UTF-8 character, i.e., glyph (think of a special version of "?")
that we would designate as replacement for "invalid-in-current-encoding".
Probably such a glyph already exists and we have seen it used in some console's output when having to print such things.
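At the R level, the cheap "replace by NA" strategy described above can be
sketched roughly as follows, applied to an ordinary character vector. This
is only an illustration, not part of any actual proposal: the helper name
dropInvalid() is made up, validUTF8() is base R, and a UTF-8 locale is
assumed.

## Minimal sketch of the "replace invalid strings by NA" idea
## (hypothetical helper, UTF-8 locale assumed).
dropInvalid <- function(x) {
    bad <- !validUTF8(x) & !is.na(x)   # bytes not valid as UTF-8
    if (any(bad)) {
        x[bad] <- NA_character_
        warning(sum(bad),
                " string(s) invalid in the current encoding replaced by NA")
    }
    x
}

x <- c(HOME = "/home/user", BOOM = "\xFF")  # second value is not valid UTF-8
dropInvalid(x)
## would give c(HOME = "/home/user", BOOM = NA), plus one warning

All valid elements are kept unchanged, which is the point of the strategy:
the result stays a plain character vector with NA entries where the input
could not be represented.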
Duncan Murdoch
2023-Jan-31 11:23 UTC
[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
On 31/01/2023 5:50 a.m., Martin Maechler wrote:
> In addition to the above cheap "replace by NA"-strategy, and at least once
> R is "all UTF-8", we could also consider a more expensive strategy that
> would try to replace invalid characters/byte-sequences by one specific
> valid UTF-8 character, i.e., glyph (think of a special version of "?")
> that we would designate as replacement for "invalid-in-current-encoding".
>
> Probably such a glyph already exists and we have seen it used in some
> console's output when having to print such things.

I think it is U+FFFD, described here:

https://www.fileformat.info/info/unicode/char/fffd/index.htm

Duncan Murdoch
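For what it's worth, that kind of substitution can already be approximated
in R code with iconv(), whose sub argument replaces non-convertible bytes.
A rough sketch, again assuming a UTF-8 locale, and not a statement about
what Sys.getenv() itself should do:

## Replace bytes that are not valid UTF-8 by U+FFFD, the Unicode
## REPLACEMENT CHARACTER that Duncan mentions.
x <- "caf\xe9"                              # trailing \xe9 is invalid UTF-8
iconv(x, from = "UTF-8", to = "UTF-8", sub = "\uFFFD")
## each invalid byte is replaced by the replacement glyph; how that glyph
## is displayed depends on the console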
Tomas Kalibera
2023-Jan-31 11:51 UTC
[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
On 1/31/23 11:50, Martin Maechler wrote:
> In general, what about behaving close to "old R" and replacing all such
> strings by NA_character_ (and typically raising one warning)? This would
> keep the result a valid character vector, just with some NA entries.
>
> Specifically for Sys.getenv(), I still think Simon has a very valid point
> of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`}
> should be the same as Sys.getenv("BOOM").
>
> Also, as a typical R user, I'd definitely want to be able to get all the
> valid environment variables, even if there are one or more invalid ones.
> Similarly, in other cases it may be a cheap strategy to replace invalid
> strings ("string" in the sense of a length-1 STRSXP, i.e., in R, a
> "character" of length 1) by NA_character_ and keep all valid parts of the
> character vector in a valid encoding.

In the case of getenv() specifically, yes, we could return NA for variables
containing invalid strings, both when obtaining the value of a single
variable and when obtaining all of them. That would partially match the
undocumented and unintentional behavior R had before 4.1, and it would make
getenv(var) and getenv()[[var]] consistent even with invalid strings. Once
we decide how to deal with invalid strings in general, we can change this
again accordingly, breaking code for people who depend on these things (but
so far I only know about this one case).

Perhaps this would be a logical alternative to the Python approach, one
more suitable for R (given that we have NAs and given that we happened to
have this somewhat similar alternative before). Conceptually it is about
the same thing as omitting the variable in Python: R users would not be
able to use such variables, but as long as they don't touch them, they
could still be inherited by child processes, etc.

> In addition to the above cheap "replace by NA"-strategy, and at least once
> R is "all UTF-8", we could also consider a more expensive strategy that
> would try to replace invalid characters/byte-sequences by one specific
> valid UTF-8 character, i.e., glyph (think of a special version of "?")
> that we would designate as replacement for "invalid-in-current-encoding".
>
> Probably such a glyph already exists and we have seen it used in some
> console's output when having to print such things.

This is part of the big discussion where we couldn't find a consensus, and
I think that happened so recently that it is too early to open it again.
Also, we are not yet at the point where we could assume UTF-8 is always the
native encoding.

As you know, my currently preferred solution is to allow invalid portions
of UTF-8 and treat them in individual functions: often one can proceed with
invalid UTF-8 bytes as usual (e.g. copying, concatenation, search), in some
functions one has to use a special algorithm (e.g. regular expressions;
PCRE2 supports this), in some cases one would substitute characters, and in
some cases one would throw an error. As strings are immutable in R,
validity checks could be done once for each string. This would in many
cases allow UTF-8 strings with bits of errors to "go through" R without
triggering errors. We didn't have such options with other multi-byte
encodings, but UTF-8 is designed so that this is possible.

Substituting by various escapes ("<U+..>", "??", etc.) is something we
already do in R in some cases, but it cannot be the default mechanism for
dealing with invalid strings; file names are a prime example of where it
doesn't work, and indeed there are libraries for dealing with file names as
special entities.

As you probably know, part of the problem is also that invalid strings
"tolerated" by OSes, namely file names and environment variables, can
"originally" be in different "encodings". Sorry for all the quotes, but,
simplifying a lot, on Unix a file system often uses byte sequences for file
names (even though applications often assume they are UTF-8), while on
Windows it uses sequences of 16-bit words (sometimes understood as UCS-2,
but not necessarily valid UTF-16). With environment variables, this goes so
far that you can have a narrow and a wide version which may not directly
correspond (I have not investigated the details yet; see the documentation
for _wenviron etc.).

In R, particularly on Windows, we can never get completely rid of UTF-16
and the wide-character interface, so in some cases we would want to
"convert invalid strings" between encodings, which of course doesn't make
sense and most likely will be the case where invalid strings cause errors.

These things are so complicated that, while it is not my currently
preferred solution, I understand the position that we ban invalid strings
fully and keep them banned in R even when UTF-8 is the only supported
native encoding.

Tomas
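As a footnote to the remark about escape-style substitution: the "<xx>" and
"<U+xxxx>" escapes mentioned above are what iconv() already produces with
sub = "byte" and sub = "Unicode". This is documented base R behaviour,
shown here purely to illustrate what such escapes look like; as noted, it
cannot be the default way of handling invalid strings, file names being the
prime example where it breaks down.

## Escape-style substitution of invalid or non-convertible bytes via
## iconv()'s 'sub' argument (UTF-8 locale assumed).
x <- "caf\xe9"                                   # not valid UTF-8
iconv(x, "UTF-8", "UTF-8", sub = "byte")         # something like "caf<e9>"
iconv("caf\u00e9", "UTF-8", "ASCII", sub = "Unicode")  # e.g. "caf<U+00E9>"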