Emmanuel Levy
2008-Oct-31 01:01 UTC
[R] gregexpr slow and increases exponentially with string length --> how to speed it up?
Dear All, I have a long string and need to search for regular expressions in there. However it becomes horribly slow as the string length increases. Below is an example: when "i" increases by 5, the time spent increases by more! (my string is 11,000,000 letters long!) I also noticed that - the search time increases dramatically with the number of matches found. - the perl=T option slows down the search Any idea to speed this up would be greatly appreciated! Best, Emmanuel> for (i in c(10000, 50000, 100000, 500000)){+ aa = as.character(sample(1:9, i, replace=T)) + aa = paste(aa, collapse='') + print(i) + print(system.time(gregexpr("[367]2[1-9][129]",aa))) + } [1] 10000 user system elapsed 0.004 0.000 0.003 [1] 50000 user system elapsed 0.060 0.000 0.061 [1] 1e+05 user system elapsed 0.240 0.000 0.238 [1] 5e+05 user system elapsed 5.733 0.000 5.732>
Charles C. Berry
2008-Oct-31 01:55 UTC
[R] gregexpr slow and increases exponentially with string length --> how to speed it up?
On Thu, 30 Oct 2008, Emmanuel Levy wrote:> Dear All, > > I have a long string and need to search for regular expressions in > there. However it becomes horribly slow as the string length > increases. > > Below is an example: when "i" increases by 5, the time spent increases > by more! (my string is 11,000,000 letters long!) > > I also noticed that > - the search time increases dramatically with the number of matches found. > - the perl=T option slows down the search > > Any idea to speed this up would be greatly appreciated!The pattern looks for 4 adjacent elements (x, say) satisfying x[1] %in% c(3,6,7) x[2] == 2 x[3] %in% 1-9 x[4] %in c(1,2,9) This is a much more specialized problem than gregexpr is tooled for. You can find all such matches (not just the disjoint ones that gregexpr finds) using something like this: twomatch <-function(x,y) intersect(x+1,y) match.list <- list( which( vec %in% c(3,6,7) ), which( vec == 2 ), which( vec %in% 1:9 ), which( vec %in% c(1,2,9) ) ) res <- Reduce( twomatch, match.list ) - length(match.list) + 1 If you want to precisely match the gregexpr results, you'll need to filter out the overlapping matches. HTH, Chuck> > Best, > > Emmanuel > > >> for (i in c(10000, 50000, 100000, 500000)){ > + aa = as.character(sample(1:9, i, replace=T)) > + aa = paste(aa, collapse='') > + print(i) > + print(system.time(gregexpr("[367]2[1-9][129]",aa))) > + } > [1] 10000 > user system elapsed > 0.004 0.000 0.003 > [1] 50000 > user system elapsed > 0.060 0.000 0.061 > [1] 1e+05 > user system elapsed > 0.240 0.000 0.238 > [1] 5e+05 > user system elapsed > 5.733 0.000 5.732 >> > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901