Thanks Tim.
I confirm the proposed solution is over 10x faster, see
https://github.com/tdhock/atime/issues/29#issuecomment-1943037753 for
figure and source code.
On Mon, Jan 29, 2024 at 7:05?AM Tim Taylor
<tim.taylor at hiddenelephants.co.uk> wrote:>
> I wanted to raise the possibility of improving strcapture performance in
> cases where perl = TRUE. I believe we can do this in a non-breaking way
> by calling regexpr instead of regexec (conditionally when perl = TRUE).
> To illustrate this I've put together a 'proof of concept'
function called
> strcapture2 that utilises output from regexpr directly (following a very
> nice substring approach that I've seen implemented by Toby Hocking
> in the nc package - nc::capture_first_vec).
>
> strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes =
FALSE) {
> if (isTRUE(perl)) {
> m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes
= useBytes)
> nomatch <- is.na(m) | m == -1L
> ntokens <- length(proto)
> if (any(!nomatch)) {
> length <- attr(m, "match.length")
> start <- attr(m, "capture.start")
> length <- attr(m, "capture.length")
> end <- start + length - 1L
> end[nomatch, ] <- start[nomatch, ] <- NA
> res <- substring(x, start, end)
> out <- matrix(res, length(m))
> if (ncol(out) != ntokens) {
> stop("The number of captures in 'pattern' !=
'length(proto)'")
> }
> } else {
> out <- matrix(NA_character_, length(m), ntokens)
> }
> utils:::conformToProto(out,proto)
> } else {
> strcapture(pattern,x,proto,perl,useBytes)
> }
> }
>
> Now comparing with strcapture we can expand the named capture example
> from the grep documentation:
>
> notables <- c(
> " Ben Franklin and Jefferson Davis",
> "\tMillard Fillmore",
> "Bob",
> NA_character_
> )
>
> regex <- "(?<first>[[:upper:]][[:lower:]]+)
(?<last>[[:upper:]][[:lower:]]+)"
> proto = data.frame("", "")
>
> (strcapture(regex, notables, proto, perl = TRUE))
> X.. X...1
> 1 Ben Franklin
> 2 Millard Fillmore
> 3 <NA> <NA>
> 4 <NA> <NA>
>
> (strcapture2(regex, notables, proto, perl = TRUE))
> X.. X...1
> 1 Ben Franklin
> 2 Millard Fillmore
> 3 <NA> <NA>
> 4 <NA> <NA>
>
> Now to compare timings over multiple reps:
>
> lengths <- sort(outer(c(1, 2, 5), 10^(1:4)))
> reps <- 20
>
> time_strcapture <- function(text, length, regex, proto, reps) {
> text <- rep_len(text, length)
> str <- system.time(for (i in seq_len(reps)) strcapture(regex, text,
proto, perl = TRUE))
> str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex,
text, proto, perl = TRUE))
> c(strcapture = str[["user.self"]], strcapture2 =
str2[["user.self"]])
> }
> timings <- sapply(
> lengths,
> time_strcapture,
> text = notables, regex = regex, reps = reps, proto = proto
> )
> cbind(lengths, t(timings))
> lengths strcapture strcapture2
> [1,] 10 0.005 0.003
> [2,] 20 0.005 0.002
> [3,] 50 0.008 0.003
> [4,] 100 0.012 0.002
> [5,] 200 0.021 0.003
> [6,] 500 0.051 0.003
> [7,] 1000 0.097 0.004
> [8,] 2000 0.171 0.005
> [9,] 5000 0.517 0.011
> [10,] 10000 1.203 0.018
> [11,] 20000 2.563 0.037
> [12,] 50000 7.276 0.090
>
> I've attached a plot of these timings in case helpful.
>
> I appreciate that changing strcapture in this way does make it more
> complicated but I think the performance improvements make it worth
> considering. Note that I've not thoroughly tested the above
implementation
> as wanted to get feedback from the list before proceeding further.
>
> Hope all this make sense. Cheers
>
> Tim
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel