Thanks, in R-devel 76138, I confirm it speeds up gregexpr() with pcre in
Bill Dunlap's example from
https://stat.ethz.ch/pipermail/r-devel/2017-January/073577.html
(RegExprPCRE column)
The performance problem of StrSplitPCRE does not seem to be due to strlen().
Best
Tomas
On 2/19/19 9:37 PM, Toby Hocking wrote:> Hi all,
>
> Several people have noticed that gregexpr is very slow for large subject
> strings when perl=TRUE is specified.
> -
>
https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
> -
>
http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
> - https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html
>
> I figured out the issue, which is fixed by changing 1 line of code in
> src/main/grep.c -- there is a strlen function call which is currently
> inside of the while loop over matches, and the patch moves it before the
> loop.
>
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch
>
> I made some figures that show the quadratic time complexity before applying
> the patch, and the linear time complexity after applying the patch
> https://github.com/tdhock/namedCapture-article#19-feb-2019
>
> I would have posted a bug report on bugs.r-project.org but I do not have an
> account. So can an R-devel person please either (1) accept this patch, or
> (2) give me an account so I can post the patch on the bug tracker?
>
> Finally I would like to mention that Bill Dunlap noticed a similar problem
> (time complexity which is quadratic in subject size) for strsplit with
> perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix
> could be accomplished (because I see that strlen is being called in a while
> loop in do_strsplit as well).
>
> Thanks
> Toby Dylan Hocking
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel