Milan Bouchet-Valat
2013-Sep-09 08:49 UTC
[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
Hi!

I'm getting an error when an invalid UTF-8 character is passed to
gsub(..., perl=TRUE); the interesting point is that with perl=FALSE (the
default) no error happens. (The character itself was read from an
invalid HTML file.) Illustration of the error:

gsub("a", "", "\U3e3965", perl=FALSE)
# [1] "\U3e3965"
gsub("a", "", "\U3e3965", perl=TRUE)
# Error in gsub("a", "", "\U3e3965", perl = TRUE) :
#   input string 1 is invalid UTF-8

The error message in the second command seems to come from
src/main/grep.c:1640 (in do_gsub):

if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);

utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
described in src/extra/pcre/pcre_valid_utf8.c.

Even more problematic/interesting is the fact that iconv() does not
consider the above character invalid: it does not replace it when the
sub argument is used.

iconv("a\U3e3965", sub="")
# [1] "a\U003e3965"

On the contrary, an invalid sequence such as \xff is substituted:

iconv("a\xff", sub="")
# [1] "a"

This makes it difficult to sanitize the string before passing it to
gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
and where. Should iconv() and PCRE be made to agree on the definition of
an invalid UTF-8 sequence?

Regards
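For reference, a minimal sketch of why the two libraries might disagree
here (hedged; the exact bytes depend on how a given build of R encodes
code points above U+10FFFF): U+3E3965 lies above the Unicode maximum
U+10FFFF, which RFC 3629 forbids, so PCRE's validity check rejects the
string even though glibc's iconv() apparently lets it through.

## U+3E3965 exceeds the Unicode maximum U+10FFFF (RFC 3629), so R likely
## stores it as an old-style 5-byte sequence, which PCRE's valid_utf8()
## rejects.  Inspect what your build actually produces:
x <- "\U3e3965"
charToRaw(x)              # the raw UTF-8 bytes R stored for the string
nchar(x, type = "bytes")  # byte length; valid UTF-8 characters use at most 4 bytes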
Milan Bouchet-Valat
2013-Sep-09 09:00 UTC
[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
...and of course I forgot to add the relevant information. This is with
Fedora 19, R 3.0.1 and a UTF-8 locale. On Windows 7 the problem does not
appear, i.e. the gsub(perl=TRUE) call does not generate any error and
\U3e3965 prints a Chinese character (AFAICT).

R version 3.0.1 (2013-05-16)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=fr_FR.utf8       LC_NUMERIC=C
 [3] LC_TIME=fr_FR.utf8        LC_COLLATE=fr_FR.utf8
 [5] LC_MONETARY=fr_FR.utf8    LC_MESSAGES=fr_FR.utf8
 [7] LC_PAPER=C                LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.utf8 LC_IDENTIFICATION=C

On Monday, 9 September 2013 at 10:49 +0200, Milan Bouchet-Valat wrote:
> [...]
Prof Brian Ripley
2013-Sep-09 12:59 UTC
[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
On 09/09/2013 09:49, Milan Bouchet-Valat wrote:
> [...]
>
> This makes it difficult to sanitize the string before passing it to
> gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
> and where. Should iconv() and PCRE be made to agree on the definition of
> an invalid UTF-8 sequence?

iconv() is using a system service: read its help page. So you know where
to report this ....

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
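One possible stopgap, sketched here under the assumption that the pattern
means the same thing under both regex engines (the wrapper name is
illustrative, not something proposed in the thread): retry with the default
TRE engine when PCRE rejects the input as invalid UTF-8, since the thread
shows the non-PCRE engine accepts such strings.

## Try the PCRE engine first; if it refuses the input, fall back to the
## default engine.  Only use this when pattern and replacement behave the
## same way with perl = TRUE and perl = FALSE.
gsub_utf8_fallback <- function(pattern, replacement, x, ...) {
  tryCatch(
    gsub(pattern, replacement, x, perl = TRUE, ...),
    error = function(e) gsub(pattern, replacement, x, perl = FALSE, ...)
  )
}

gsub_utf8_fallback("a", "", "\U3e3965")
# [1] "\U3e3965"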