Displaying 4 results from an estimated 4 matches for "utf8valid".
2013 Sep 09
2
Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
...3e3965"
gsub("a", "", "\U3e3965", perl=TRUE)
# Error in gsub("a", "", "\U3e3965", perl = TRUE) :
# input string 1 is invalid UTF-8
The error message in the second command seems to come from
src/main/grep.c:1640 (in do_gsub):
if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
described in src/extra/pcre/pcre_valid_utf8.c.
Even more problematic/interesting is the fact that iconv() does not
consider the above character as invalid, as it does...
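To illustrate the kind of check being discussed, here is a minimal sketch of a strict UTF-8 validator in the RFC 3629 sense. It is not R's utf8Valid() or PCRE's valid_utf8(); the name sketch_utf8_valid is hypothetical and the rules (no lead bytes above 0xF4, no overlong forms, no surrogates) are an assumption about what "valid" means here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT R's utf8Valid() or PCRE's valid_utf8().
   Returns true if the NUL-terminated string s is well-formed UTF-8
   under RFC 3629: code points U+0000..U+10FFFF only, no surrogates,
   no overlong encodings. */
bool sketch_utf8_valid(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *) s;

    while (*p) {
        unsigned int c = *p++;
        unsigned int cp, min;
        int extra;                     /* continuation bytes expected */

        if (c < 0x80)
            continue;                  /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;              /* stray continuation, 0xC0/0xC1, or 0xF5..0xFF */

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return false;          /* also catches a premature NUL */
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        if (cp < min) return false;                     /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
        if (cp > 0x10FFFF) return false;                /* beyond Unicode */
    }
    return true;
}
```

Under these rules a value such as 0x3e3965 cannot be encoded at all, because it exceeds U+10FFFF. Its old-style five-byte form (F8 8F A3 A5 A5, assumed here to be what the string holds) is rejected at the lead byte. A more lenient converter that still accepts five- and six-byte sequences would let it through.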
2023 Jan 31
1
Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
...for (i = 0, e = environ; *e != NULL; i++, e++)
- SET_STRING_ELT(ans, i, mkChar(*e));
+ for (i = 0, e = environ; *e != NULL; i++, e++) {
+ cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+ known_to_be_utf8 ? CE_UTF8 :
+ CE_NATIVE;
+ if (
+ (utf8locale && !utf8Valid(*e))
+ || (mbcslocale && !mbcsValid(*e))
+ ) enc = CE_BYTES;
+ SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
+ }
#endif
} else {
PROTECT(ans = allocVector(STRSXP, i));
@@ -416,11 +424,14 @@
if (s == NULL)
SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));...
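The idea in the patch excerpt, tagging an environment entry as raw bytes when it does not decode in the current locale, can be sketched outside of R as follows. All names here (sketch_enc, sketch_mbcs_valid, sketch_pick_encoding) are hypothetical stand-ins for the CE_* constants and validity helpers in the diff. The sketch also collapses the two locale-specific tests into a single mbstowcs() probe, which relies on the POSIX behaviour of a NULL destination and is an assumption, not a statement about how R implements mbcsValid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical tags standing in for the CE_* values in the patch. */
typedef enum {
    SKETCH_NATIVE,
    SKETCH_UTF8,
    SKETCH_LATIN1,
    SKETCH_BYTES
} sketch_enc;

/* True if s converts cleanly in the current LC_CTYPE locale.
   Assumption of this sketch: POSIX mbstowcs() with a NULL destination
   returns (size_t)-1 when it meets an invalid multibyte sequence. */
static bool sketch_mbcs_valid(const char *s)
{
    return mbstowcs(NULL, s, 0) != (size_t) -1;
}

/* Pick an encoding tag for one "NAME=value" entry: start from what is
   known about the locale, then fall back to "bytes" if the entry does
   not decode. */
sketch_enc sketch_pick_encoding(const char *entry,
                                bool known_latin1, bool known_utf8)
{
    assert(entry != NULL);
    sketch_enc enc = known_latin1 ? SKETCH_LATIN1
                   : known_utf8   ? SKETCH_UTF8
                   :                SKETCH_NATIVE;
    if (!sketch_mbcs_valid(entry))
        enc = SKETCH_BYTES;   /* keep the raw bytes instead of failing later */
    return enc;
}
```

With the fallback in place, an entry that cannot be decoded is kept as opaque bytes, so the failure no longer has to surface later in string-handling code.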
2023 Jan 31
1
Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
...L; i++, e++)
> - SET_STRING_ELT(ans, i, mkChar(*e));
> + for (i = 0, e = environ; *e != NULL; i++, e++) {
> + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> + known_to_be_utf8 ? CE_UTF8 :
> + CE_NATIVE;
> + if (
> + (utf8locale && !utf8Valid(*e))
> + || (mbcslocale && !mbcsValid(*e))
> + ) enc = CE_BYTES;
> + SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
> + }
> #endif
> } else {
> PROTECT(ans = allocVector(STRSXP, i));
> @@ -416,11 +424,14 @@
> if (s == NULL)
> S...
2023 Jan 30
2
Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
Hello.
SUMMARY:
$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
[1] "\xff"
BACKGROUND:
I launch R through a Son of Grid Engine (SGE) scheduler, where the R