Milan Bouchet-Valat
2013-Sep-09 08:49 UTC
[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
Hi!

I'm getting an error when an invalid UTF-8 character is passed to
gsub(..., perl=TRUE); the interesting point is that with perl=FALSE (the
default) no error happens. (The character itself was read from an
invalid HTML file.) Illustration of the error:

gsub("a", "", "\U3e3965", perl=FALSE)
# [1] "\U3e3965"
gsub("a", "", "\U3e3965", perl=TRUE)
# Error in gsub("a", "", "\U3e3965", perl = TRUE) :
#   input string 1 is invalid UTF-8

The error message in the second command seems to come from
src/main/grep.c:1640 (in do_gsub):

if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);

utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
described in src/extra/pcre/pcre_valid_utf8.c.

Even more problematic/interesting is the fact that iconv() does not
consider the above character invalid: it does not replace it when the
sub argument is used.

iconv("a\U3e3965", sub="")
# [1] "a\U003e3965"

On the contrary, an invalid sequence such as \xff is substituted:

iconv("a\xff", sub="")
# [1] "a"

This makes it difficult to sanitize the string before passing it to
gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
and where. Should iconv() and PCRE be made to agree on the definition of
an invalid UTF-8 sequence?

Regards
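For reference, a minimal sketch of why the two libraries might disagree
here (hedged; the exact bytes depend on how a given build of R encodes
code points above U+10FFFF): U+3E3965 lies above the Unicode maximum
U+10FFFF, which RFC 3629 forbids, so PCRE's validity check rejects the
string even though glibc's iconv() apparently lets it through.

## U+3E3965 exceeds the Unicode maximum U+10FFFF (RFC 3629), so R likely
## stores it as an old-style 5-byte sequence, which PCRE's valid_utf8()
## rejects.  Inspect what your build actually produces:
x <- "\U3e3965"
charToRaw(x)              # the raw UTF-8 bytes R stored for the string
nchar(x, type = "bytes")  # byte length; valid UTF-8 characters use at most 4 bytes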
Milan Bouchet-Valat
2013-Sep-09 09:00 UTC
[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
...and of course I forgot to add the relevant information. This is with
Fedora 19, R 3.0.1 and a UTF-8 locale. On Windows 7 the problem does not
appear, i.e. the gsub(perl=TRUE) call does not generate any error and
\U3e3965 prints a Chinese character (AFAICT).

R version 3.0.1 (2013-05-16)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=fr_FR.utf8       LC_NUMERIC=C
 [3] LC_TIME=fr_FR.utf8        LC_COLLATE=fr_FR.utf8
 [5] LC_MONETARY=fr_FR.utf8    LC_MESSAGES=fr_FR.utf8
 [7] LC_PAPER=C                LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.utf8 LC_IDENTIFICATION=C

On Monday, 9 September 2013 at 10:49 +0200, Milan Bouchet-Valat wrote:
> [...]
Prof Brian Ripley
2013-Sep-09 12:59 UTC
[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
On 09/09/2013 09:49, Milan Bouchet-Valat wrote:
> [...]
>
> This makes it difficult to sanitize the string before passing it to
> gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
> and where. Should iconv() and PCRE be made to agree on the definition of
> an invalid UTF-8 sequence?

iconv() is using a system service: read its help page. So you know where
to report this ....

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
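One possible stopgap, sketched here under the assumption that the pattern
means the same thing under both regex engines (the wrapper name is
illustrative, not something proposed in the thread): retry with the default
TRE engine when PCRE rejects the input as invalid UTF-8, since the thread
shows the non-PCRE engine accepts such strings.

## Try the PCRE engine first; if it refuses the input, fall back to the
## default engine.  Only use this when pattern and replacement behave the
## same way with perl = TRUE and perl = FALSE.
gsub_utf8_fallback <- function(pattern, replacement, x, ...) {
  tryCatch(
    gsub(pattern, replacement, x, perl = TRUE, ...),
    error = function(e) gsub(pattern, replacement, x, perl = FALSE, ...)
  )
}

gsub_utf8_fallback("a", "", "\U3e3965")
# [1] "\U3e3965"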