On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:
>
> On 7/23/23 4:29 PM, Duncan Murdoch wrote:
>> The help page for `?gsub` says (in the context of performance
>> considerations):
>>
>>
>> "... just one UTF-8 string will force all the matching to be done in
>> Unicode"
>
> It's been a little while since I looked at the code but IIRC this just
> means that strings are converted to UTF-8 before matching. The problem
> here seems to be more about the interpretation of the "\\w+" token by
> PCRE. I think this makes it a little clearer what's going on:
>
> gsub("\\w", "a", "?", perl=TRUE)
> [1] "?"
>
> So no match. The PCRE docs
> https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
> the old docs, but it works for our purposes here) mention we can turn on
> unicode property matching with the "(*UCP)" token:
>
> gsub("(*UCP)\\w", "a", "?", perl=TRUE)
> [1] "a"
>
> So there are two layers at play here. The first one is whether R
> converts strings to UTF-8, which I think is what the documentation is
> about. The other is whether the PCRE engine is configured to recognize
> Unicode properties, which, at least in both of our configurations and for
> this specific case, it appears not to be.

From the surrounding context, I think the docs are talking about more
than just conversion to UTF-8. The full paragraph reads like this:

"If you are working in a single-byte locale (though not common since R
4.2) and have marked UTF-8 strings that are representable in that
locale, convert them first as just one UTF-8 string will force all the
matching to be done in Unicode, which attracts a penalty of around
3x for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a factor
of 3, so it's faster to convert everything to the local encoding. If it
was just conversion, I don't think that would be true.

But maybe "for the default POSIX 1003.2 mode" applies to the whole
paragraph, not just to the penalty, so this is intentional.

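A minimal sketch of how the two layers discussed above could be checked side by side. The test string "\u00e9" is an assumed stand-in for the non-ASCII character in the examples, and the variable names are only illustrative; results may differ with the R version, locale, and PCRE build.

```r
# Sketch only: results depend on the R version, locale, and PCRE build.
# "\u00e9" (e-acute) is an assumed stand-in for the garbled test character.
x <- "\u00e9"
Encoding(x)  # how R has marked the string

# Layer 1: R decides which encoding to hand to the regex engine.
# Layer 2: PCRE decides whether \w uses Unicode properties.
res_default <- gsub("\\w", "a", x)                     # default (TRE) engine
res_perl    <- gsub("\\w", "a", x, perl = TRUE)        # PCRE without (*UCP)
res_ucp     <- gsub("(*UCP)\\w", "a", x, perl = TRUE)  # PCRE with (*UCP)

print(c(default = res_default, perl = res_perl, ucp = res_ucp))
```

If the second and third results differ, that would suggest the difference comes from how PCRE is configured rather than from the conversion to UTF-8.
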
Duncan Murdoch

> Best,
>
> B.
>
>
>>
>>
>> However, this thread on SO: https://stackoverflow.com/q/76749529 gives
>> some indication that this is not true for `perl = TRUE`. Specifically:
>>
>> > strings <- c("89 562", "John Smith", "??????? ????????????",
>> "Jean-François Dupuis")
>> > Encoding(strings)
>> [1] "unknown" "unknown" "UTF-8"   "UTF-8"
>> > regex <- "\\B\\w+| +"
>> > gsub(regex, "", strings)
>> [1] "85"   "JS"   "??"   "J-FD"
>>
>> > gsub(regex, "", strings, perl = TRUE)
>> [1] "85"                  "JS"                  "???????????????????"
>> "J-FçoD"
>>
>> and the website https://regex101.com/r/QDFrOE/1 gives the first answer
>> when the regex option /u ("match with full Unicode") is specified, but
>> the second answer when it is not.
>>
>> Now I'm not at all sure that that website is authoritative, but this
>> looks like a flag may have been missed in the `perl = TRUE` case.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel