thr3ads.net - R devel - [Rd] Potential issue with perl-based pattern matching with Unicode characters on Windows R 4.0 and above [Jun 2020]

Carson Sievert

2020-Jun-08 22:09 UTC

[Rd] Potential issue with perl-based pattern matching with Unicode characters on Windows R 4.0 and above

Hi everyone,

I've noticed new behavior in `regexpr(..., perl = TRUE)` on Windows with
R 4.0 and above with Unicode characters. Here's a minimal example where
I'd expect to see a start value of `5` (as R 3.6.2 and below gives), but
R 4.0.0 (and R 4.0.1) now returns:

```
> regexpr("b", "foo\U0001F937bar", perl = TRUE)
#> [1] 6
#> attr(,"match.length")
#> [1] 1
```

Perhaps this change in behavior could be explained by R4.0's migration to
PCRE2? Here is some relevant output from my R4.0 session:

```
> pcre_config()
#>              UTF-8 Unicode properties                JIT              stack
#>               TRUE               TRUE               TRUE              FALSE
```

```
> extSoftVersion()
#>                      zlib                     bzlib                        xz
#>                  "1.2.11"      "1.0.8, 13-Jul-2019"                   "5.2.4"
#>                      PCRE                       ICU                       TRE
#>        "10.33 2019-04-16"                    "58.2" "TRE 0.8.0 R_fixes (BSD)"
#>                     iconv                  readline                      BLAS
#>               "win_iconv"                        ""                        ""
```

Let me know if there's any more information I can provide to help replicate
and isolate the issue. Also, if this happens to be the expected behavior,
I'd be keen to learn about why that's the case.

Thank you,

-Carson

-- 
Carson Sievert, PhD
Software Engineer at RStudio
Website <https://cpsievert.me> | Twitter <https://twitter.com/cpsievert> |
GitHub <https://github.com/cpsievert>

Tomas Kalibera

2020-Jun-09 15:01 UTC

Hi Carson,

Thanks for the report. This is a bug in R, specific to Windows and to
characters that need surrogate pairs. Other characters work fine, and so
do other recent operating systems where R runs (all those where a single
wchar_t holds a complete Unicode character). It is now fixed in R-devel.
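
To make the surrogate-pair point concrete, here is a minimal sketch in R. It assumes (the thread does not spell out the exact mechanism) that the reported `6` corresponds to an offset counted in UTF-16 code units rather than in characters. The variable names are only illustrative.

```r
# Everything that precedes the "b" in the example string "foo\U0001F937bar"
prefix <- "foo\U0001F937"

# Number of Unicode characters (code points) before the "b"
chars <- length(utf8ToInt(prefix))

# Number of UTF-16 code units before the "b": U+1F937 lies outside the
# Basic Multilingual Plane, so UTF-16 encodes it as a surrogate pair
# (two 16-bit units, i.e. four bytes)
utf16_bytes <- iconv(prefix, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
units <- length(utf16_bytes) / 2

# 1-based start of "b" under each way of counting
c(by_character = chars + 1, by_utf16_unit = units + 1)
```

Counting by character gives the start value R 3.6.2 reports, while counting by UTF-16 unit gives the value seen on Windows with R 4.0.0 and 4.0.1.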

If handling of surrogate pairs (e.g. emoji characters) is important to
you, it would help if you could systematically stress-test R for them. A
number of related bugs have been fixed, but some may well remain, since
these characters rarely appear in test data.
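
A stress test along these lines could start from something like the sketch below. It is only a starting point under stated assumptions: `check_offsets()` is a hypothetical helper (not part of R), the list of code points is an arbitrary sample, and the expected start of `5` simply mirrors the example in the original report.

```r
# Hypothetical helper: embed each code point in "foo<char>bar" and check
# that regexpr() reports "b" at character position 5 with match length 1.
check_offsets <- function(code_points, perl = TRUE) {
  vapply(code_points, function(cp) {
    x <- paste0("foo", intToUtf8(cp), "bar")
    m <- regexpr("b", x, perl = perl)
    as.integer(m) == 5L && attr(m, "match.length") == 1L
  }, logical(1))
}

# Arbitrary sample: two BMP characters followed by several characters
# that need surrogate pairs in UTF-16
cps <- c(0x00E9, 0x4E2D, 0x10000, 0x1D11E, 0x1F600, 0x1F937, 0x20000)

res <- check_offsets(cps)
names(res) <- sprintf("U+%04X", cps)
res
```

Any `FALSE` entry flags a character for which the reported offset or match length is off. The same helper can be called with `perl = FALSE` to compare against the default (TRE) engine.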

Also, fixing bugs sometimes, ironically, introduces new problems. This
regression was caused by a correct fix of a bug related to surrogate
pairs in R 4.0. That bug had been cancelling out the one reported here,
an older bug in the post-processing of PCRE results.

Best
Tomas

