Hi,

I am using R's grep function to find patterns in vectors of strings. The number of patterns I would like to match is 7,700 (of different sizes). I noticed that I get an error message when I do the following:

    data <- array()
    for (j in 1:length(x))
    {
    array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j], value = T))
    }

When I break this up into 4 chunks of patterns it works:

    data <- array()
    for (j in 1:length(x))
    {
    array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"), x[j], value = T))
    array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"), x[j], value = T))
    array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"), x[j], value = T))
    array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"), x[j], value = T))
    }

My questions: what's the maximum size of the patterns argument in grep? Is there a way to do this faster? It is very slow.

Thanks.

Math

Sorry for not providing a reproducible example. It's a size issue which makes it difficult to provide an example.
Hi,

Given that you can't provide a full example, please at least provide str() on your data, more complete information on the problem, and ideally a small toy example that demonstrates precisely what you are doing.

For instance, you tell us that you "get an error message" but you never tell us what it is. Don't you think we might need to know what the error is to be able to diagnose and fix it?

Also, note that your "working" example simply overwrites array$chunk1[j] four times.

Sarah

On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdevaan at gmail.com> wrote:
> I noticed that I get an error message when I do the following:
> [...]

--
Sarah Goslee
http://www.functionaldiversity.org
On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdevaan at gmail.com> wrote:
> [...]
> My questions: what's the maximum size of the patterns argument in grep? Is
> there a way to do this faster? It is very slow.

Try strapplyc in gsubfn and see http://gsubfn.googlecode.com for more info.

    # test data
    x <- c("abcd", "z", "dbef")

    # re is regexp with 7700 alternatives
    # to test with
    g <- expand.grid(letters, letters, letters)
    gp <- do.call("paste0", g)
    gp7700 <- head(gp, 7700)
    re <- paste(gp7700, collapse = "|")

    # grep gives error message
    grep.out <- grep(re, x)

    # strapplyc works
    library(gsubfn)
    which(sapply(strapplyc(x, re), length) > 0)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
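
For comparison, the chunking workaround from the original question can be written without the per-string loop and without overwriting its own results. This is only a sketch in base R: the toy data is assumed (it is not the poster's real data), the chunk size of 2500 is taken from the question and may not be the actual limit, and no speed comparison with strapplyc has been made. Like the original length(grep(...)) call on a single string, it only records whether any pattern matches each string.

```r
# Toy data (assumed for illustration; not the poster's real data)
x <- c("abcd", "z", "dbef")
patterns <- head(do.call(paste0, expand.grid(letters, letters, letters)), 7700)

# Split the patterns into chunks of at most 2500
# (2500 is the chunk size the question reports as working; it is not a documented limit)
chunk_size <- 2500
chunks <- split(patterns, ceiling(seq_along(patterns) / chunk_size))

# One vectorised grepl() call per chunk over all of x,
# then combine the per-chunk logical vectors with elementwise OR
hits <- Reduce(`|`, lapply(chunks, function(ch) {
  grepl(paste(ch, collapse = "|"), x)
}))

# Indices of the strings that match at least one pattern
which(hits)
```

Because grepl() is vectorised over x, this makes one call per chunk rather than one call per string per chunk. If the patterns are literal strings rather than regular expressions, any regex metacharacters in them would need escaping before being joined with "|".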