William Dunlap
2017-Jan-06 05:15 UTC
[Rd] strsplit(perl=TRUE), gregexpr(perl=TRUE) very slow for long strings
While doing some speed testing I noticed that in R-3.2.3 the perl=TRUE variants of strsplit() and gregexpr() took time proportional to the square of the number of pattern matches in their input strings. E.g., the attached test function times gsub, strsplit, and gregexpr, with perl TRUE (PCRE) and FALSE (TRE), when the input string contains 'n' matches to the given pattern. Notice the quadratic (in n) time growth for the StrSplitPCRE and RegExprPCRE columns.> regex.perf.test(N=2^(11:20))N SubTRE SubPCRE StrSplitTRE StrSplitPCRE RegExprTRE RegExprPCRE elapsed 2048 0.00 0.00 0.00 0.00 0.00 0.00 elapsed 4096 0.00 0.00 0.01 0.00 0.00 0.00 elapsed 8192 0.00 0.00 0.00 0.01 0.00 0.01 elapsed 16384 0.02 0.00 0.00 0.05 0.02 0.08 elapsed 32768 0.00 0.00 0.01 0.16 0.00 0.29 elapsed 65536 0.02 0.01 0.04 0.59 0.01 0.96 elapsed 131072 0.03 0.02 0.08 2.37 0.05 2.43 elapsed 262144 0.06 0.04 0.17 9.58 0.10 9.61 elapsed 524288 0.14 0.05 0.36 39.14 0.21 38.58 elapsed 1048576 0.30 0.08 0.52 155.50 0.40 155.43 I have not looked at R's code, but it is possible that the problem is caused by PCRE repeatedly scanning (once per match) the entire input string to make sure it is valid UTF-8. If so, adding PCRE_NO_UTF8_CHECK to the flags given to pcre_exec would solve the problem. Perhaps R is already doing that in gsub(perl=TRUE). Here is the test function: regex.perf.test <- function(N=c(1e4, 2e4, 4e4, 8e4)) { makeTestString <- function(n) paste(collapse="", rep("ab", n)) s <- lapply(N, makeTestString) fns <- list(SubTRE=function(si) gsub("a", "", si, perl=FALSE), SubPCRE=function(si) gsub("a", "", si, perl=TRUE), StrSplitTRE=function(si) strsplit(si, "a", perl=FALSE), StrSplitPCRE=function(si) strsplit(si, "a", perl=TRUE), RegExprTRE=function(si) gregexpr("a", si, perl=FALSE), RegExprPCRE=function(si) gregexpr("a", si, perl=TRUE)) times <- lapply(fns, function(fn) sapply(s, function(si) system.time(fn(si))["elapsed"])) do.call("cbind", c(list(N=N), times)) } Bill Dunlap TIBCO Software wdunlap tibco.com