thr3ads.net - R devel - [Rd] Bug in perl=TRUE regexp matching? [Jul 2023]

Brodie Gaslam

2023-Jul-24 01:01 UTC

[Rd] Bug in perl=TRUE regexp matching?

On 7/23/23 4:29 PM, Duncan Murdoch wrote:
> The help page for `?gsub` says (in the context of performance
> considerations):
> 
> 
> "... just one UTF-8 string will force all the matching to be done in 
> Unicode"

It's been a little while since I looked at the code but IIRC this just 
means that strings are converted to UTF-8 before matching.  The problem 
here seems to be more about the interpretation of the "\\w+" token by 
PCRE.  I think this makes it a little clearer what's going on:

     gsub("\\w", "a", "?", perl=TRUE)
     [1] "?"

So no match.  The PCRE docs 
https://www.pcre.org/original/doc/html/pcrepattern.html (this might be 
the old docs, but it works for our purposes here) mention we can turn on 
unicode property matching with the "(*UCP)" token:

      gsub("(*UCP)\\w", "a", "?", perl=TRUE)
      [1] "a"

So there are two layers at play here.  The first is whether R 
converts strings to UTF-8, which I think is what the documentation is 
about.  The other is whether the PCRE engine is configured to recognize 
Unicode properties, which, at least in both of our configurations, it 
does not appear to be for this specific case.
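
To make the two layers concrete, here is a minimal sketch that applies the regex from the original report with and without the "(*UCP)" prefix. The variable names `plain` and `ucp` are just illustrative, the Cyrillic example is left out, and the result assumes R's PCRE was built with Unicode property support:

```r
# Inputs adapted from the original report; \u00e7 is the c-cedilla.
strings <- c("89 562", "John Smith", "Jean-Fran\u00e7ois Dupuis")
regex <- "\\B\\w+| +"

# Layer 1 only: R hands PCRE UTF-8 strings, but \w may still be ASCII-only.
# In this thread that left the non-ASCII letter (and the "o" after it) behind.
plain <- gsub(regex, "", strings, perl = TRUE)

# Layer 2: ask PCRE to use Unicode properties for \w, \B and friends.
ucp <- gsub(paste0("(*UCP)", regex), "", strings, perl = TRUE)

print(plain)
print(ucp)
```

If the configuration matches what we both saw, only the second call should reduce the last string to its initials.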

Best,

B.

> 
> 
> However, this thread on SO: https://stackoverflow.com/q/76749529 gives 
> some indication that this is not true for `perl = TRUE`. Specifically:
> 
>  > strings <- c("89 562", "John Smith", "??????? ????????????",
> "Jean-François Dupuis")
>  > Encoding(strings)
> [1] "unknown" "unknown" "UTF-8"   "UTF-8"
>  > regex <- "\\B\\w+| +"
>  > gsub(regex, "", strings)
> [1] "85"   "JS"   "??"   "J-FD"
> 
>  > gsub(regex, "", strings, perl = TRUE)
> [1] "85"                  "JS"                  "???????????????????"
> "J-FçoD"
> 
> and the website https://regex101.com/r/QDFrOE/1 gives the first answer 
> when the regex option /u ("match with full Unicode") is specified, but 
> the second answer when it is not.
> 
> Now I'm not at all sure that that website is authoritative, but this 
> looks like a flag may have been missed in the `perl = TRUE` case.
> 
> Duncan Murdoch
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Duncan Murdoch

2023-Jul-24 08:10 UTC

[Rd] Bug in perl=TRUE regexp matching?

On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:
> 
> 
> On 7/23/23 4:29 PM, Duncan Murdoch wrote:
>> The help page for `?gsub` says (in the context of performance
>> considerations):
>>
>>
>> "... just one UTF-8 string will force all the matching to be done in
>> Unicode"
> 
> It's been a little while since I looked at the code but IIRC this just
> means that strings are converted to UTF-8 before matching.  The problem
> here seems to be more about the interpretation of the "\\w+" token by
> PCRE.  I think this makes it a little clearer what's going on:
> 
>       gsub("\\w", "a", "?", perl=TRUE)
>       [1] "?"
> 
> So no match.  The PCRE docs
> https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
> the old docs, but it works for our purposes here) mention we can turn on
> unicode property matching with the "(*UCP)" token:
> 
>        gsub("(*UCP)\\w", "a", "?", perl=TRUE)
>        [1] "a"
> 
> So there are two layers at play here.  The first one is whether R
> converts strings to UTF-8, which I think is what the documentation is
> about.  The other is whether the PCRE engine is configured to recognize
> Unicode properties, which at least in both of our configurations for
> this specific case it appears like it is not.

From the surrounding context, I think the docs are talking about more 
than just conversion to UTF-8.  The full paragraph reads like this:

"If you are working in a single-byte locale (though not common since R 
4.2) and have marked UTF-8 strings that are representable in that 
locale, convert them first as just one UTF-8 string will force all the 
matching to be done in Unicode, which attracts a penalty of around 
3× for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a factor 
of 3, so it's faster to convert everything to the local encoding.  If it 
was just conversion, I don't think that would be true.

But maybe "for the default POSIX 1003.2 mode" applies to the whole 
paragraph, not just to the penalty, so this is intentional.
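
For reference, the conversion that paragraph recommends could look something like the sketch below. The names `x` and `x_native` are only illustrative, and in a UTF-8 locale the conversion is essentially a no-op, so this only matters in the single-byte locales the help page has in mind:

```r
# One plain ASCII string and one string that R marks as UTF-8.
x <- c("John Smith", "Jean-Fran\u00e7ois Dupuis")
print(Encoding(x))

# Convert to the session's native encoding before matching, as ?gsub advises
# for single-byte locales where the strings are representable.
x_native <- enc2native(x)
print(Encoding(x_native))

# Default (non-perl) matching, i.e. the mode the 3x penalty is quoted for.
print(gsub("\\B\\w+| +", "", x_native))
```

What `Encoding(x_native)` reports depends on the locale the session runs in, so I have not listed any expected output.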

Duncan Murdoch
