thr3ads.net - R help - [R] gregexpr slow and increases exponentially with string length --> how to speed it up? [Oct 2008]

If this information is useful, please help other people find it:
Share via:

Emmanuel Levy

2008-Oct-31 01:01 UTC

[R] gregexpr slow and increases exponentially with string length --> how to speed it up?

Dear All,

I have a long string and need to search for regular expressions in
there. However it becomes horribly slow as the string length
increases.

Below is an example: when "i" increases by 5, the time spent increases
by more! (my string is 11,000,000 letters long!)

I also noticed that
- the search time increases dramatically with the number of matches found.
- the perl=T option slows down the search

Any idea to speed this up would be greatly appreciated!

Best,

Emmanuel

> for (i in c(10000, 50000, 100000, 500000)){+   aa = as.character(sample(1:9, i, replace=T))
+   aa = paste(aa, collapse='')
+   print(i)
+   print(system.time(gregexpr("[367]2[1-9][129]",aa)))
+ }
[1] 10000
   user  system elapsed
  0.004   0.000   0.003
[1] 50000
   user  system elapsed
  0.060   0.000   0.061
[1] 1e+05
   user  system elapsed
  0.240   0.000   0.238
[1] 5e+05
   user  system elapsed
  5.733   0.000   5.732>

Charles C. Berry

2008-Oct-31 01:55 UTC

head link

[R] gregexpr slow and increases exponentially with string length --> how to speed it up?

On Thu, 30 Oct 2008, Emmanuel Levy wrote:
> Dear All,
>
> I have a long string and need to search for regular expressions in
> there. However it becomes horribly slow as the string length
> increases.
>
> Below is an example: when "i" increases by 5, the time spent
increases
> by more! (my string is 11,000,000 letters long!)
>
> I also noticed that
> - the search time increases dramatically with the number of matches found.
> - the perl=T option slows down the search
>
> Any idea to speed this up would be greatly appreciated!
The pattern looks for 4 adjacent elements (x, say) satisfying

 	x[1] %in% c(3,6,7)
 	x[2] == 2
 	x[3] %in% 1-9
 	x[4] %in c(1,2,9)

This is a much more specialized problem than gregexpr is tooled for.

You can find all such matches (not just the disjoint ones that gregexpr 
finds) using something like this:

 	twomatch <-function(x,y) intersect(x+1,y)
 	match.list <-
 		list(
 			which( vec %in% c(3,6,7) ),
 			which( vec == 2 ),
 			which( vec %in% 1:9 ),
 			which( vec %in% c(1,2,9) ) )
 	res <- Reduce( twomatch, match.list ) - length(match.list) + 1

If you want to precisely match the gregexpr results, you'll need to filter 
out the overlapping matches.

HTH,

Chuck
>
> Best,
>
> Emmanuel
>
>
>> for (i in c(10000, 50000, 100000, 500000)){
> +   aa = as.character(sample(1:9, i, replace=T))
> +   aa = paste(aa, collapse='')
> +   print(i)
> +   print(system.time(gregexpr("[367]2[1-9][129]",aa)))
> + }
> [1] 10000
>   user  system elapsed
>  0.004   0.000   0.003
> [1] 50000
>   user  system elapsed
>  0.060   0.000   0.061
> [1] 1e+05
>   user  system elapsed
>  0.240   0.000   0.238
> [1] 5e+05
>   user  system elapsed
>  5.733   0.000   5.732
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

Maybe Matching Threads

Search for more apparently analagous threads

R help - Oct 2008 - gregexpr slow and increases exponentially with string length --> how to speed it up?

[R] gregexpr slow and increases exponentially with string length --> how to speed it up?

[R] gregexpr slow and increases exponentially with string length --> how to speed it up?

Maybe Matching Threads