thr3ads.net - R devel - [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF [Jan 2023]

If this information is useful, please help other people find it:
Share via:

Tomas Kalibera

2023-Jan-31 11:51 UTC

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

On 1/31/23 11:50, Martin Maechler wrote:>>>>>> Tomas Kalibera
>>>>>>      on Tue, 31 Jan 2023 10:53:21 +0100 writes:
>      > On 1/31/23 09:48, Ivan Krylov wrote:
>      >> Can we use the "bytes" encoding for such
environment variables invalid
>      >> in the current locale? The following patch preserves
CE_NATIVE for
>      >> strings valid in the current UTF-8 or multibyte locale (or
>      >> non-multibyte strings) but sets CE_BYTES for those that are
invalid:
>      >>
>      >> Index: src/main/sysutils.c
>      >>
==================================================================>     
>> --- src/main/sysutils.c	(revision 83731)
> .....
>      >>
>      >> Here are the potential problems with this approach:
>      >>
>      >> * I don't know whether known_to_be_utf8 can disagree with
utf8locale.
>      >> known_to_be_utf8 was the original condition for setting
CE_UTF8 on
>      >> the string. I also need to detect non-UTF-8 multibyte
locales, so
>      >> I'm checking for utf8locale and mbcslocale. Perhaps I
should be more
>      >> careful and test for (enc == CE_UTF8) || (utf8locale
&& enc =>      >> CE_NATIVE) instead of just utf8locale.
>      >>
>      >> * I have verified that Sys.getenv() doesn't crash with
UTF-8-invalid
>      >> strings in the environment with this patch applied, but now
>      >> print.Dlist does, because formatDL wants to compute the width
of the
>      >> string which has the 'bytes' encoding. If this is a
good way to
>      >> solve the problem, I can work on suggesting a fix for
formatDL to
>      >> avoid the error.
>
>      > Thanks, indeed, type instability is a big problem of the approach
"turn
>      > invalid strings to bytes". It is something what is
historically being
>      > done in regular expression operations, but it is brittle and not
user
>      > friendly: writing code to be agnostic to whether we are dealing
with
>      > "bytes" or a regular string is very tedious. Pieces of
type instability
>      > come also from that ASCII strings are always flagged
"native" (never
>      > "bytes"). Last year I had to revert a big change which
broke existing
>      > code by introducing some more of this instability due to better
dealing
>      > with invalid strings in regular expressions. I've made some
additions to
>      > R-devel allowing to better deal with such instability but it is
still a
>      > pain and existing code has to be changed (and made more
complicated).
>
>      > So, I don't think this is the way to go.
>
>      > Tomas
>
> hmm.., that's a pity; I had hoped it was a pragmatic and valid
strategy,
> but of course you are right that type stability is really a
> valid goal....
>
> In general, what about behaving close to "old R" and replace all
such
> strings by  NA_character_  (and typically raising one warning)?
> This would keep the result a valid character vector, just with some NA
entries.
>
> Specifically for  Sys.getenv(),  I still think Simon has a very
> valid point of "requiring" (of our design) that
> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
> Sys.getenv("BOOM")
>
> Also, as typical R user, I'd definitely want to be able to get all the
valid
> environment variables, even if there are one or more invalid
> ones. ... and similarly in other cases, it may be a cheap
> strategy to replace invalid strings ("string" in the sense of
> length 1 STRSXP, i.e., in R, a "character" of length 1) by
> NA_character_  and keep all valid parts of the character vector
> in a valid encoding.In case of specifically getenv(), yes, we could return NA for variables 
containing invalid strings, both when obtaining a value for a single 
variable and for multiple, partially matching undocumented and 
unintentional behavior R had before 4.1, and making getenv(var) and 
getenv()[[var]] consistent even with invalid strings.? Once we decide on 
how to deal with invalid strings in general, we can change this again 
accordingly, breaking code for people who depend on these things (but so 
far I only know about this one case). Perhaps this would be a logical 
alternative to the Python approach that would be more suitable for R 
(given we have NAs and given that we happened to have that somewhat 
similar alternative before). Conceptually it is about the same thing as 
omitting the variable in Python: R users would not be able to use such 
variables, but if they don't touch them, they could be inherited to 
child processes, etc.
> - - - - - -
>
> In addition to the above cheap "replace by NA"-strategy,
> and at least once R is "all UTF-8", we could
> also consider a more expensive strategy that would try to
> replace invalid characters/byte-sequences by one specific valid UTF-8
> character, i.e., glyph (think of a special version of "?") that
> we would designate as replacement for
"invalid-in-current-encoding".
>
> Probably such a glyph already exists and we have seen it used in
> some console's output when having to print such things.
This is part of the big discussion where we couldn't find a consensus, 
and I think this happened so recently that it is too early for opening 
it again. Also, we are not yet to where we could assume UTF-8 is always 
the native encoding. As you know my currently preferred solution is to 
allow invalid portions of UTF-8 and treat them in individual functions: 
often one can proceed with invalid UTF-8 bytes as usual (e.g. copying, 
concatenation, search), in some function one has to use a special 
algorithm (e.g. regular expressions, PCRE2 supports that, in some cases 
one would substitute characters, in some cases one would throw an 
error). As strings are immutable in R, validity checks could be done 
once for each string. This would in many cases allow for UTF-8 strings 
with bits of errors to "go through" R without triggering errors. We 
didn't have such options with other multi-byte encodings, but UTF-8 is 
designed so that this is possible.

Substituting by various escapes "<U+..>", "??", etc,
is something we
already do in R in some cases, but it cannot be the default mechanism 
for dealing with invalid strings, e.g. file names are a prime example of 
where it doesn't work and indeed there are libraries for dealing with 
file names as special entities.

As you probably know part of the problem is also that invalid strings 
that are "tolerated" by OSes, file names and environment variables,
can
"originally" be in different "encodings". Sorry for these
many quotes,
but simplifying a lot on Unix file systems often uses byte-sequences for 
file names (even though applications often assume it is UTF-8) and on 
Windows sequences of 16-bit words (sometimes understood as UCS-2, but 
not necessarily valid UTF-16). With environment variables, this came so 
far that you can have a narrow and wide version which may not directly 
correspond (and I've not investigated the details yet, see documentation 
for _wenviron etc). In R, particularly on Windows, we can never get 
completely rid of the UTF-16 and the wide character interface, so in 
some cases, we would want to "convert invalid strings" between 
encodings, which of course doesn't make sense and most likely will be 
the case where invalid strings would cause errors. These things are so 
complicated that, while it is not my currently preferred solution, I 
understand the position that we ban invalid strings fully and keep them 
banned in R even when UTF-8 is the only supported native encoding.

Tomas

peter dalgaard

2023-Jan-31 13:37 UTC

head link

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

> On 31 Jan 2023, at 12:51 , Tomas Kalibera <tomas.kalibera at
gmail.com> wrote:
> 
> 
> On 1/31/23 11:50, Martin Maechler wrote:
<snippage>>> hmm.., that's a pity; I had hoped it was a pragmatic and valid
strategy,
>> but of course you are right that type stability is really a
>> valid goal....
>> 
>> In general, what about behaving close to "old R" and replace
all such
>> strings by  NA_character_  (and typically raising one warning)?
>> This would keep the result a valid character vector, just with some NA
entries.
>> 
>> Specifically for  Sys.getenv(),  I still think Simon has a very
>> valid point of "requiring" (of our design) that
>> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
>> Sys.getenv("BOOM")
>> 
>> Also, as typical R user, I'd definitely want to be able to get all
the valid
>> environment variables, even if there are one or more invalid
>> ones. ... and similarly in other cases, it may be a cheap
>> strategy to replace invalid strings ("string" in the sense of
>> length 1 STRSXP, i.e., in R, a "character" of length 1) by
>> NA_character_  and keep all valid parts of the character vector
>> in a valid encoding.
> In case of specifically getenv(), yes, we could return NA for variables
containing invalid strings, both when obtaining a value for a single variable
and for multiple, partially matching undocumented and unintentional behavior R
had before 4.1, and making getenv(var) and getenv()[[var]] consistent even with
invalid strings.  Once we decide on how to deal with invalid strings in general,
we can change this again accordingly, breaking code for people who depend on
these things (but so far I only know about this one case). Perhaps this would be
a logical alternative to the Python approach that would be more suitable for R
(given we have NAs and given that we happened to have that somewhat similar
alternative before). Conceptually it is about the same thing as omitting the
variable in Python: R users would not be able to use such variables, but if they
don't touch them, they could be inherited to child processes, etc.<more snippage>

Hum, I'm out of my waters here, but offhand I would be wary about approaches
that lead to loss of information. Presumably someone will sooner or later
actually want to deal with the content of an environment variable with invalid
bytes inside. I.e. it would be preferable to keep the content and mark the
encoding as something not-multibyte.

In fact this is almost what happens (for me...) if I just add Encoding(x) <-
"bytes" for the return value of .Internal(Sys.getenv(character(),
"")):
> Sys.getenv()[["BOOM"]]
[1] "\\xff"> Encoding(Sys.getenv()) [1] "unknown" "unknown" "bytes"  
"unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown"
"unknown" "unknown" "unknown" "unknown"
...

but I suppose that breaks if I have environment variables that actually _are_
utf8, because only plain-ASCII becomes "unknown"? And
nchar(Sys.getenv()) also does not work.

(And of course I agree that the QRSH thing is Just Wrong; people using 0xff as a
separator between utf8 strings deserve the same fate as those who used comma
separation between numbers with decimal commas.)

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Apparently Analagous Threads

Search for more possibly parallel threads

R devel - Jan 2023 - Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Apparently Analagous Threads