Winston Chang
2015-Mar-02 19:14 UTC
[Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings
On Windows, grep(fixed=TRUE) throws errors with some UTF-8 strings. Here's an example (must be run on Windows to reproduce the error): Sys.setlocale("LC_CTYPE", "chinese") y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97))) Encoding(y) <- "UTF-8" y # [1] "?" grep("\n", y, fixed = TRUE) # Error in grep("\n", y, fixed = TRUE) : invalid multibyte string at '<97>' In my particular case, I'm using parse() on a string that contains characters like this, and it triggers the same error, because parse() calls srcfilecopy(), which calls grepl(): parse(text=y) # Error in grepl("\n", lines, fixed = TRUE) : # invalid multibyte string at '<97>' Am I right in assuming that this isn't the expected behavior? -Winston
Winston Chang
2015-Mar-04 02:12 UTC
[Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings
After a bit more investigation, I think I've found the cause of the bug, and I have a patch. This bug happens with grep(), when: * Running on Windows. * The search uses fixed=TRUE. * The search pattern is a single byte. * The current locale has a multibyte encoding. ======================Here's an example that demonstrates the bug: # First, create a 3-byte UTF-8 character y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97))) Encoding(y) <- "UTF-8" y # [1] "?" # In my default locale, grep with a single-char pattern and fixed=TRUE # returns integer(0), as expected. Sys.getlocale("LC_CTYPE") # [1] "English_United States.1252" grep("a", y, fixed = TRUE) # integer(0) # When the using a multibyte locale, grep with a single-char # pattern and fixed=TRUE results in an error. Sys.setlocale("LC_CTYPE", "chinese") grep("a", y, fixed = TRUE) # Error in grep("a", y, fixed = TRUE) : invalid multibyte string at '<97>' ====================== I believe the problem is in the main/grep.c file, in the fgrep_one function. It tests for a multi-byte character string locale `mbcslocale`, and then for the `use_UTF8`, like so: if (!useBytes && mbcslocale) { ... } else if (!useBytes && use_UTF8) { ... } else ... This can be seen at https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L668-L692 A similar pattern occurs in the fgrep_one_bytes function, at https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L718-L736 I believe that the test order should be reversed; it should test first for `use_UTF8`, and then for `mbcslocale`. This pattern occurs in a few places in grep.c. It looks like this: if (!useBytes && use_UTF8) { ... } else if (!useBytes && mbcslocale) { ... } else ... ======================This patch does what I described; it simply tests for `use_UTF8` first, and then `mbcslocale`, in both fgrep_one and fgrep_one_bytes. I made this patch against the 3.1.2 sources, and tested the example code above. In both cases, grep() returned integer(0), as expected. (The reason I made this change against 3.1.2 is because I had problems getting the current trunk to compile on both Linux or Windows.) diff --git src/main/grep.c src/main/grep.c index 6e6ec3e..348c63d 100644 --- src/main/grep.c +++ src/main/grep.c @@ -664,27 +664,27 @@ static int fgrep_one(const char *pat, const char *target, } return -1; } - if (!useBytes && mbcslocale) { /* skip along by chars */ - mbstate_t mb_st; + if (!useBytes && use_UTF8) { int ib, used; - mbs_init(&mb_st); for (ib = 0, i = 0; ib <= len-plen; i++) { if (strncmp(pat, target+ib, plen) == 0) { if (next != NULL) *next = ib + plen; return i; } - used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); + used = utf8clen(target[ib]); if (used <= 0) break; ib += used; } - } else if (!useBytes && use_UTF8) { + } else if (!useBytes && mbcslocale) { /* skip along by chars */ + mbstate_t mb_st; int ib, used; + mbs_init(&mb_st); for (ib = 0, i = 0; ib <= len-plen; i++) { if (strncmp(pat, target+ib, plen) == 0) { if (next != NULL) *next = ib + plen; return i; } - used = utf8clen(target[ib]); + used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); if (used <= 0) break; ib += used; } @@ -714,21 +714,21 @@ static int fgrep_one_bytes(const char *pat, const char *target, int len, if (*p == pat[0]) return i; return -1; } - if (!useBytes && mbcslocale) { /* skip along by chars */ - mbstate_t mb_st; + if (!useBytes && use_UTF8) { /* not really needed */ int ib, used; - mbs_init(&mb_st); for (ib = 0, i = 0; ib <= len-plen; i++) { if (strncmp(pat, target+ib, plen) == 0) return ib; - used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); + used = utf8clen(target[ib]); if (used <= 0) break; ib += used; } - } else if (!useBytes && use_UTF8) { /* not really needed */ + } else if (!useBytes && mbcslocale) { /* skip along by chars */ + mbstate_t mb_st; int ib, used; + mbs_init(&mb_st); for (ib = 0, i = 0; ib <= len-plen; i++) { if (strncmp(pat, target+ib, plen) == 0) return ib; - used = utf8clen(target[ib]); + used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); if (used <= 0) break; ib += used; } -Winston
suimong
2015-Jul-20 07:13 UTC
[Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings
Thank you Winston for the solution! The only workaround I come up with is to set options(encoding = "UTF-8"), which is generally undesirable. I'm wondering is there any chance this patch will be included in future R version? I have been running into this problem from time to time and the latest R 3.2.1 still hasn't handled this issue properly. Winston Chang wrote> After a bit more investigation, I think I've found the cause of the bug, > and I have a patch. > > This bug happens with grep(), when: > * Running on Windows. > * The search uses fixed=TRUE. > * The search pattern is a single byte. > * The current locale has a multibyte encoding. > > ======================> Here's an example that demonstrates the bug: > > # First, create a 3-byte UTF-8 character > y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97))) > Encoding(y) <- "UTF-8" > y > # [1] "?" > > # In my default locale, grep with a single-char pattern and fixed=TRUE > # returns integer(0), as expected. > Sys.getlocale("LC_CTYPE") > # [1] "English_United States.1252" > grep("a", y, fixed = TRUE) > # integer(0) > > # When the using a multibyte locale, grep with a single-char > # pattern and fixed=TRUE results in an error. > Sys.setlocale("LC_CTYPE", "chinese") > grep("a", y, fixed = TRUE) > # Error in grep("a", y, fixed = TRUE) : invalid multibyte string at '<97>' > > > ======================> > I believe the problem is in the main/grep.c file, in the fgrep_one > function. It tests for a multi-byte character string locale > `mbcslocale`, and then for the `use_UTF8`, like so: > > if (!useBytes && mbcslocale) { > ... > } else if (!useBytes && use_UTF8) { > ... > } else ... > > This can be seen at > https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L668-L692 > > A similar pattern occurs in the fgrep_one_bytes function, at > https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L718-L736 > > > I believe that the test order should be reversed; it should test first > for `use_UTF8`, and then for `mbcslocale`. This pattern occurs in a few > places in grep.c. It looks like this: > > if (!useBytes && use_UTF8) { > ... > } else if (!useBytes && mbcslocale) { > ... > } else ... > > > ======================> This patch does what I described; it simply tests for `use_UTF8` first, > and then `mbcslocale`, in both fgrep_one and fgrep_one_bytes. I made > this patch against the 3.1.2 sources, and tested the example code above. > In both cases, grep() returned integer(0), as expected. > > (The reason I made this change against 3.1.2 is because I had problems > getting the current trunk to compile on both Linux or Windows.) > > > diff --git src/main/grep.c src/main/grep.c > index 6e6ec3e..348c63d 100644 > --- src/main/grep.c > +++ src/main/grep.c > @@ -664,27 +664,27 @@ static int fgrep_one(const char *pat, const char > *target, > } > return -1; > } > - if (!useBytes && mbcslocale) { /* skip along by chars */ > - mbstate_t mb_st; > + if (!useBytes && use_UTF8) { > int ib, used; > - mbs_init(&mb_st); > for (ib = 0, i = 0; ib <= len-plen; i++) { > if (strncmp(pat, target+ib, plen) == 0) { > if (next != NULL) *next = ib + plen; > return i; > } > - used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); > + used = utf8clen(target[ib]); > if (used <= 0) break; > ib += used; > } > - } else if (!useBytes && use_UTF8) { > + } else if (!useBytes && mbcslocale) { /* skip along by chars */ > + mbstate_t mb_st; > int ib, used; > + mbs_init(&mb_st); > for (ib = 0, i = 0; ib <= len-plen; i++) { > if (strncmp(pat, target+ib, plen) == 0) { > if (next != NULL) *next = ib + plen; > return i; > } > - used = utf8clen(target[ib]); > + used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); > if (used <= 0) break; > ib += used; > } > @@ -714,21 +714,21 @@ static int fgrep_one_bytes(const char *pat, const > char *target, int len, > if (*p == pat[0]) return i; > return -1; > } > - if (!useBytes && mbcslocale) { /* skip along by chars */ > - mbstate_t mb_st; > + if (!useBytes && use_UTF8) { /* not really needed */ > int ib, used; > - mbs_init(&mb_st); > for (ib = 0, i = 0; ib <= len-plen; i++) { > if (strncmp(pat, target+ib, plen) == 0) return ib; > - used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); > + used = utf8clen(target[ib]); > if (used <= 0) break; > ib += used; > } > - } else if (!useBytes && use_UTF8) { /* not really needed */ > + } else if (!useBytes && mbcslocale) { /* skip along by chars */ > + mbstate_t mb_st; > int ib, used; > + mbs_init(&mb_st); > for (ib = 0, i = 0; ib <= len-plen; i++) { > if (strncmp(pat, target+ib, plen) == 0) return ib; > - used = utf8clen(target[ib]); > + used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st); > if (used <= 0) break; > ib += used; > } > > > -Winston > > ______________________________________________> R-devel@> mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel-- View this message in context: http://r.789695.n4.nabble.com/Errors-on-Windows-with-grep-fixed-TRUE-on-UTF-8-strings-tp4704073p4710059.html Sent from the R devel mailing list archive at Nabble.com.