thr3ads.net - R devel - [Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings [Mar 2015]

Winston Chang

2015-Mar-02 19:14 UTC

[Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings

On Windows, grep(fixed=TRUE) throws errors with some UTF-8 strings.
Here's an example (must be run on Windows to reproduce the error):

Sys.setlocale("LC_CTYPE", "chinese")
y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97)))
Encoding(y) <- "UTF-8"
y
# [1] "渗"
grep("\n", y, fixed = TRUE)
# Error in grep("\n", y, fixed = TRUE) :
#   invalid multibyte string at '<97>'


In my particular case, I'm using parse() on a string that contains
characters like this, and it triggers the same error, because parse()
calls srcfilecopy(), which calls grepl():

parse(text=y)
# Error in grepl("\n", lines, fixed = TRUE) :
#   invalid multibyte string at '<97>'


Am I right in assuming that this isn't the expected behavior?

-Winston

Winston Chang

2015-Mar-04 02:12 UTC

After a bit more investigation, I think I've found the cause of the bug,
and I have a patch.

This bug happens with grep(), when:
* Running on Windows.
* The search uses fixed=TRUE.
* The search pattern is a single byte.
* The current locale has a multibyte encoding.

======================
Here's an example that demonstrates the bug:

# First, create a 3-byte UTF-8 character
y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97)))
Encoding(y) <- "UTF-8"
y
# [1] "?"

# In my default locale, grep with a single-char pattern and fixed=TRUE
# returns integer(0), as expected.
Sys.getlocale("LC_CTYPE")
# [1] "English_United States.1252"
grep("a", y, fixed = TRUE)
# integer(0)

# When using a multibyte locale, grep with a single-char
# pattern and fixed=TRUE results in an error.
Sys.setlocale("LC_CTYPE", "chinese")
grep("a", y, fixed = TRUE)
# Error in grep("a", y, fixed = TRUE) :
#   invalid multibyte string at '<97>'


======================
I believe the problem is in the fgrep_one function in src/main/grep.c.
It tests first for a multibyte locale (`mbcslocale`), and only then for
`use_UTF8`, like so:

    if (!useBytes && mbcslocale) {
        ...
    } else if (!useBytes && use_UTF8) {
        ...
    } else ...

This can be seen at
https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L668-L692

A similar pattern occurs in the fgrep_one_bytes function, at
https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L718-L736


I believe that the test order should be reversed; it should test first
for `use_UTF8`, and then for `mbcslocale`. That order already occurs in
a few places in grep.c. It looks like this:

    if (!useBytes && use_UTF8) {
        ...
    } else if (!useBytes && mbcslocale) {
        ...
    } else ...


======================
This patch does what I described; it simply tests for `use_UTF8` first,
and then `mbcslocale`, in both fgrep_one and fgrep_one_bytes. I made
this patch against the 3.1.2 sources, and tested the example code above.
In both cases, grep() returned integer(0), as expected.

(I made this change against 3.1.2 because I had problems getting the
current trunk to compile on either Linux or Windows.)


diff --git src/main/grep.c src/main/grep.c
index 6e6ec3e..348c63d 100644
--- src/main/grep.c
+++ src/main/grep.c
@@ -664,27 +664,27 @@ static int fgrep_one(const char *pat, const char *target,
 	    }
 	return -1;
     }
-    if (!useBytes && mbcslocale) { /* skip along by chars */
-	mbstate_t mb_st;
+    if (!useBytes && use_UTF8) {
 	int ib, used;
-	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) {
 		if (next != NULL) *next = ib + plen;
 		return i;
 	    }
-	    used = (int) Mbrtowc(NULL,  target+ib, MB_CUR_MAX, &mb_st);
+	    used = utf8clen(target[ib]);
 	    if (used <= 0) break;
 	    ib += used;
 	}
-    } else if (!useBytes && use_UTF8) {
+    } else if (!useBytes && mbcslocale) { /* skip along by chars */
+	mbstate_t mb_st;
 	int ib, used;
+	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) {
 		if (next != NULL) *next = ib + plen;
 		return i;
 	    }
-	    used = utf8clen(target[ib]);
+	    used = (int) Mbrtowc(NULL,  target+ib, MB_CUR_MAX, &mb_st);
 	    if (used <= 0) break;
 	    ib += used;
 	}
@@ -714,21 +714,21 @@ static int fgrep_one_bytes(const char *pat, const char *target, int len,
 	    if (*p == pat[0]) return i;
 	return -1;
     }
-    if (!useBytes && mbcslocale) { /* skip along by chars */
-	mbstate_t mb_st;
+    if (!useBytes && use_UTF8) { /* not really needed */
 	int ib, used;
-	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) return ib;
-	    used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
+	    used = utf8clen(target[ib]);
 	    if (used <= 0) break;
 	    ib += used;
 	}
-    } else if (!useBytes && use_UTF8) { /* not really needed */
+    } else if (!useBytes && mbcslocale) { /* skip along by chars */
+	mbstate_t mb_st;
 	int ib, used;
+	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) return ib;
-	    used = utf8clen(target[ib]);
+	    used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
 	    if (used <= 0) break;
 	    ib += used;
 	}


-Winston

suimong

2015-Jul-20 07:13 UTC

Thank you, Winston, for the solution! The only workaround I have come up
with is to set options(encoding = "UTF-8"), which is generally undesirable.

Is there any chance this patch will be included in a future version of R?
I run into this problem from time to time, and the latest release, R 3.2.1,
still doesn't handle it properly.

