thr3ads.net - R devel - [Rd] strcapture performance when perl = TRUE [Jan 2024]

If this information is useful, please help other people find it:
Share via:

Tim Taylor

2024-Jan-29 14:04 UTC

[Rd] strcapture performance when perl = TRUE

I wanted to raise the possibility of improving strcapture performance in
cases where perl = TRUE. I believe we can do this in a non-breaking way
by calling regexpr instead of regexec (conditionally when perl = TRUE).
To illustrate this I've put together a 'proof of concept' function
called
strcapture2 that utilises output from regexpr directly (following a very
nice substring approach that I've seen implemented by Toby Hocking
in the nc package - nc::capture_first_vec).

strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) {
    if (isTRUE(perl)) {
        m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes =
useBytes)
        nomatch <- is.na(m) | m == -1L
        ntokens <- length(proto)
        if (any(!nomatch)) {
            length <- attr(m, "match.length")
            start <- attr(m, "capture.start")
            length <- attr(m, "capture.length")
            end <- start + length - 1L
            end[nomatch, ] <- start[nomatch, ] <- NA
            res <- substring(x, start, end)
            out <- matrix(res, length(m))
            if (ncol(out) != ntokens) {
                stop("The number of captures in 'pattern' !=
'length(proto)'")
            }
        } else {
            out <- matrix(NA_character_, length(m), ntokens)
        }
        utils:::conformToProto(out,proto)
    } else {
        strcapture(pattern,x,proto,perl,useBytes)
    }
}

Now comparing with strcapture we can expand the named capture example
from the grep documentation:

notables <- c(
    "  Ben Franklin and Jefferson Davis",
    "\tMillard Fillmore",
    "Bob",
    NA_character_
)

regex <- "(?<first>[[:upper:]][[:lower:]]+)
(?<last>[[:upper:]][[:lower:]]+)"
proto = data.frame("", "")

(strcapture(regex, notables, proto, perl = TRUE))
      X..    X...1
1     Ben Franklin
2 Millard Fillmore
3    <NA>     <NA>
4    <NA>     <NA>

(strcapture2(regex, notables, proto, perl = TRUE))
      X..    X...1
1     Ben Franklin
2 Millard Fillmore
3    <NA>     <NA>
4    <NA>     <NA>

Now to compare timings over multiple reps:

lengths <- sort(outer(c(1, 2, 5), 10^(1:4)))
reps <- 20 

time_strcapture <- function(text, length, regex, proto, reps) {
    text <- rep_len(text, length)
    str <- system.time(for (i in seq_len(reps)) strcapture(regex, text,
proto, perl = TRUE))
    str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text,
proto, perl = TRUE))
    c(strcapture = str[["user.self"]], strcapture2 =
str2[["user.self"]])
}
timings <- sapply(
    lengths,
    time_strcapture,
    text = notables, regex = regex, reps = reps, proto = proto
)
cbind(lengths, t(timings))
      lengths strcapture strcapture2
 [1,]      10      0.005       0.003
 [2,]      20      0.005       0.002
 [3,]      50      0.008       0.003
 [4,]     100      0.012       0.002
 [5,]     200      0.021       0.003
 [6,]     500      0.051       0.003
 [7,]    1000      0.097       0.004
 [8,]    2000      0.171       0.005
 [9,]    5000      0.517       0.011
[10,]   10000      1.203       0.018
[11,]   20000      2.563       0.037
[12,]   50000      7.276       0.090

I've attached a plot of these timings in case helpful.

I appreciate that changing strcapture in this way does make it more
complicated but I think the performance improvements make it worth
considering. Note that I've not thoroughly tested the above implementation
as wanted to get feedback from the list before proceeding further.

Hope all this make sense. Cheers

Tim


-------------- next part --------------
A non-text attachment was scrubbed...
Name: strcapture.png
Type: image/png
Size: 28123 bytes
Desc: not available
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20240129/261d61e7/attachment.png>

Toby Hocking

2024-Feb-14 03:51 UTC

head link

[Rd] strcapture performance when perl = TRUE

Thanks Tim.
I confirm the proposed solution is over 10x faster, see
https://github.com/tdhock/atime/issues/29#issuecomment-1943037753 for
figure and source code.

On Mon, Jan 29, 2024 at 7:05?AM Tim Taylor
<tim.taylor at hiddenelephants.co.uk> wrote:>
> I wanted to raise the possibility of improving strcapture performance in
> cases where perl = TRUE. I believe we can do this in a non-breaking way
> by calling regexpr instead of regexec (conditionally when perl = TRUE).
> To illustrate this I've put together a 'proof of concept'
function called
> strcapture2 that utilises output from regexpr directly (following a very
> nice substring approach that I've seen implemented by Toby Hocking
> in the nc package - nc::capture_first_vec).
>
> strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes =
FALSE) {
>     if (isTRUE(perl)) {
>         m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes
= useBytes)
>         nomatch <- is.na(m) | m == -1L
>         ntokens <- length(proto)
>         if (any(!nomatch)) {
>             length <- attr(m, "match.length")
>             start <- attr(m, "capture.start")
>             length <- attr(m, "capture.length")
>             end <- start + length - 1L
>             end[nomatch, ] <- start[nomatch, ] <- NA
>             res <- substring(x, start, end)
>             out <- matrix(res, length(m))
>             if (ncol(out) != ntokens) {
>                 stop("The number of captures in 'pattern' !=
'length(proto)'")
>             }
>         } else {
>             out <- matrix(NA_character_, length(m), ntokens)
>         }
>         utils:::conformToProto(out,proto)
>     } else {
>         strcapture(pattern,x,proto,perl,useBytes)
>     }
> }
>
> Now comparing with strcapture we can expand the named capture example
> from the grep documentation:
>
> notables <- c(
>     "  Ben Franklin and Jefferson Davis",
>     "\tMillard Fillmore",
>     "Bob",
>     NA_character_
> )
>
> regex <- "(?<first>[[:upper:]][[:lower:]]+)
(?<last>[[:upper:]][[:lower:]]+)"
> proto = data.frame("", "")
>
> (strcapture(regex, notables, proto, perl = TRUE))
>       X..    X...1
> 1     Ben Franklin
> 2 Millard Fillmore
> 3    <NA>     <NA>
> 4    <NA>     <NA>
>
> (strcapture2(regex, notables, proto, perl = TRUE))
>       X..    X...1
> 1     Ben Franklin
> 2 Millard Fillmore
> 3    <NA>     <NA>
> 4    <NA>     <NA>
>
> Now to compare timings over multiple reps:
>
> lengths <- sort(outer(c(1, 2, 5), 10^(1:4)))
> reps <- 20
>
> time_strcapture <- function(text, length, regex, proto, reps) {
>     text <- rep_len(text, length)
>     str <- system.time(for (i in seq_len(reps)) strcapture(regex, text,
proto, perl = TRUE))
>     str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex,
text, proto, perl = TRUE))
>     c(strcapture = str[["user.self"]], strcapture2 =
str2[["user.self"]])
> }
> timings <- sapply(
>     lengths,
>     time_strcapture,
>     text = notables, regex = regex, reps = reps, proto = proto
> )
> cbind(lengths, t(timings))
>       lengths strcapture strcapture2
>  [1,]      10      0.005       0.003
>  [2,]      20      0.005       0.002
>  [3,]      50      0.008       0.003
>  [4,]     100      0.012       0.002
>  [5,]     200      0.021       0.003
>  [6,]     500      0.051       0.003
>  [7,]    1000      0.097       0.004
>  [8,]    2000      0.171       0.005
>  [9,]    5000      0.517       0.011
> [10,]   10000      1.203       0.018
> [11,]   20000      2.563       0.037
> [12,]   50000      7.276       0.090
>
> I've attached a plot of these timings in case helpful.
>
> I appreciate that changing strcapture in this way does make it more
> complicated but I think the performance improvements make it worth
> considering. Note that I've not thoroughly tested the above
implementation
> as wanted to get feedback from the list before proceeding further.
>
> Hope all this make sense. Cheers
>
> Tim
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Seemingly Similar Threads

Search for more possibly parallel threads

R devel - Jan 2024 - strcapture performance when perl = TRUE

[Rd] strcapture performance when perl = TRUE

[Rd] strcapture performance when perl = TRUE

Seemingly Similar Threads