thr3ads.net - R devel - [Rd] patch for gregexpr(perl=TRUE) [Feb 2019]

If this information is useful, please help other people find it:
Share via:

Toby Hocking

2019-Feb-19 20:37 UTC

[Rd] patch for gregexpr(perl=TRUE)

Hi all,

Several people have noticed that gregexpr is very slow for large subject
strings when perl=TRUE is specified.
-
https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
-
http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

I figured out the issue, which is fixed by changing 1 line of code in
src/main/grep.c -- there is a strlen function call which is currently
inside of the while loop over matches, and the patch moves it before the
loop.
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch

I made some figures that show the quadratic time complexity before applying
the patch, and the linear time complexity after applying the patch
https://github.com/tdhock/namedCapture-article#19-feb-2019

I would have posted a bug report on bugs.r-project.org but I do not have an
account. So can an R-devel person please either (1) accept this patch, or
(2) give me an account so I can post the patch on the bug tracker?

Finally I would like to mention that Bill Dunlap noticed a similar problem
(time complexity which is quadratic in subject size) for strsplit with
perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix
could be accomplished (because I see that strlen is being called in a while
loop in do_strsplit as well).

Thanks
Toby Dylan Hocking

	[[alternative HTML version deleted]]

Tomas Kalibera

2019-Feb-20 14:56 UTC

head link

[Rd] patch for gregexpr(perl=TRUE)

Thanks, in R-devel 76138, I confirm it speeds up gregexpr() with pcre in 
Bill Dunlap's example from
https://stat.ethz.ch/pipermail/r-devel/2017-January/073577.html
(RegExprPCRE column)

The performance problem of StrSplitPCRE does not seem to be due to strlen().

Best
Tomas

On 2/19/19 9:37 PM, Toby Hocking wrote:> Hi all,
>
> Several people have noticed that gregexpr is very slow for large subject
> strings when perl=TRUE is specified.
> -
>
https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
> -
>
http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
> - https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html
>
> I figured out the issue, which is fixed by changing 1 line of code in
> src/main/grep.c -- there is a strlen function call which is currently
> inside of the while loop over matches, and the patch moves it before the
> loop.
>
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch
>
> I made some figures that show the quadratic time complexity before applying
> the patch, and the linear time complexity after applying the patch
> https://github.com/tdhock/namedCapture-article#19-feb-2019
>
> I would have posted a bug report on bugs.r-project.org but I do not have an
> account. So can an R-devel person please either (1) accept this patch, or
> (2) give me an account so I can post the patch on the bug tracker?
>
> Finally I would like to mention that Bill Dunlap noticed a similar problem
> (time complexity which is quadratic in subject size) for strsplit with
> perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix
> could be accomplished (because I see that strlen is being called in a while
> loop in do_strsplit as well).
>
> Thanks
> Toby Dylan Hocking
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Reasonably Related Threads

Search for more maybe matching threads

R devel - Feb 2019 - patch for gregexpr(perl=TRUE)

[Rd] patch for gregexpr(perl=TRUE)

[Rd] patch for gregexpr(perl=TRUE)

Reasonably Related Threads