Displaying 2 results from an estimated 2 matches for "valid_utf8".
2020 Apr 04
Possible Bug In Validation of UTF-8 Sequences
As per `?intToUtf8`, and in the comments to `valid_utf8`[1], R
intends to prevent illegal UTF-8 such as UTF-8 encoded
UTF-16 surrogate pairs. `R_nchar`, invoked via `base::nchar`,
explicitly validates UTF-8 strings[2], but allows the surrogate:
    > Encoding('\ud800')
    [1] "UTF-8"
    > nchar('\ud800')  // should b...
2013 Sep 09
Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
..."a", "", "\U3e3965", perl = TRUE) :
# input string 1 is invalid UTF-8
The error message in the second command seems to come from
src/main/grep.c:1640 (in do_gsub):
    if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
described in src/extra/pcre/pcre_valid_utf8.c.
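
For comparison, a strict check in the sense of RFC 3629 could look roughly like the C sketch below. This is only an illustration, not the code in src/extra/pcre/pcre_valid_utf8.c and not R's utf8Valid(); the function name `is_strict_utf8` is hypothetical. It rejects overlong forms, surrogate code points (U+D800 to U+DFFF), and anything above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of R or PCRE).
 * Returns 1 if the n bytes at s are well-formed UTF-8 per RFC 3629,
 * and 0 otherwise. */
int is_strict_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);

    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp;   /* decoded code point */
        size_t extra;       /* number of continuation bytes expected */

        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0)     { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else return 0;  /* stray continuation byte, C0/C1, or F5..FF */

        if (n - i - 1 < extra) return 0;  /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (extra == 2 && cp < 0x800)   return 0;    /* overlong 3-byte form */
        if (extra == 3 && cp < 0x10000) return 0;    /* overlong 4-byte form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */

        i += extra + 1;
    }
    return 1;
}
```

Under these rules the bytes ED A0 80 (the UTF-8-style encoding of U+D800) are rejected, as is any sequence whose lead byte is F5 or above.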
Even more problematic/interesting is the fact that iconv() does not
consider the above character as invalid, as it does not replace it when
using the sub argument.
> iconv("a\U3e3965", sub="")
[...