I am working on a simulation where I need to count the number of matches for an arbitrary pattern in a large sequence of binomial factors. My current code is

    for(indx in 1:(length(bin.05)-3))
      if ((bin.05[indx] == test.pattern[1]) &&
          (bin.05[indx+1] == test.pattern[2]) &&
          (bin.05[indx+2] == test.pattern[3]))
        return.values$count.match.pattern[1] = return.values$count.match.pattern[1] + 1

Since I am running the above code for each simulation multiple times on sequences of 10,000,000 factors, the code is taking longer than I would like. Is there a better (more "R") way of achieving the same answer?

Walter Anderson

There's almost certainly a better way, but I'd be more inclined to look for it if you'd provide a small reproducible example so I could actually try it. Without knowing the structure of your data, it's very hard to offer alternatives.

Sarah

On Fri, Mar 16, 2012 at 12:59 PM, Walter Anderson <wandrson01 at gmail.com> wrote:
> [...]

--
Sarah Goslee
http://www.functionaldiversity.org

You didn't show your complete code, but the following may help you speed things up. Compare a function, f0, structured like your code, and one, f1, that calls sum once instead of counting length(x)-3 times.

    f0 <- function(x, test.pattern) {
      count <- 0
      for(indx in seq_len(length(x)-3)) {
        if ((x[indx] == test.pattern[1]) &&
            (x[indx+1] == test.pattern[2]) &&
            (x[indx+2] == test.pattern[3])) {
          count <- count + 1
        }
      }
      count
    }

    f1 <- function(x, test.pattern) {
      indx <- seq_len(length(x)-3)
      sum((x[indx] == test.pattern[1]) &
          (x[indx+1] == test.pattern[2]) &
          (x[indx+2] == test.pattern[3]))
    }

    > bin.05 <- round((log10(1:10000000)%%1e-3 - log10(1:10000000)%%1e-4) * 1e4) # quasi-random sample of 10^7 from {0,...,9}
    > system.time(print(f0(bin.05, c(2,3,3))))
    [1] 3194
       user  system elapsed
      14.35    0.00   14.35
    > system.time(print(f1(bin.05, c(2,3,3))))
    [1] 3194
       user  system elapsed
       0.70    0.21    0.90

You are probably also slowing things down by doing

    yourList$yourCounts[1] <- yourList$yourCounts[1] + 1

many times instead of

    count <- yourList$yourCounts[1]

once and

    count <- count + 1

many times. The former evaluates $, [, $<-, and [<- many times, and the $<- and [<- in particular may use a fair bit of time.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Walter Anderson
> Sent: Friday, March 16, 2012 10:00 AM
> To: R Help
> Subject: [R] Faster way to implement this search?
>
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
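
To see why the single sum() call in f1 gives the same count as the loop, it can help to run the shifted comparisons on a vector small enough to check by hand. The following is only a sketch: the toy vector x is made up for illustration, and it uses length(x) - 2 so that the final window is included (the thread later notes that the 3 should be 2). The last lines illustrate the second point, reading the list element once and writing it back once; the list return.values is a stand-in built here, not the poster's actual object.

```r
# Hypothetical toy data, small enough to verify by eye
x <- c(2, 3, 3, 1, 2, 3, 3, 2)
test.pattern <- c(2, 3, 3)

# Start position of every complete window of length 3
indx <- seq_len(length(x) - 2)

# Element-wise & (not &&) compares all windows at once
hits <- (x[indx]     == test.pattern[1]) &
        (x[indx + 1] == test.pattern[2]) &
        (x[indx + 2] == test.pattern[3])

print(which(hits))  # 1 5: the pattern starts at positions 1 and 5
print(sum(hits))    # 2: TRUE counts as 1 when summed

# Read the list element once, update a local, write back once
return.values <- list(count.match.pattern = c(0, 0))  # stand-in object (assumption)
count <- return.values$count.match.pattern[1]
count <- count + sum(hits)
return.values$count.match.pattern[1] <- count
```

The element-wise & is what makes this work: && only inspects a single value, which is why the loop version needs one iteration per position.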

> My current question is there a way to perform the same count, but with
> an arbitrary size pattern. In other words, instead of a fixed pattern
> size of 3, could I have a pattern size of 4, 5, 6, ..., 30 any of which
> that could be run without changing the script?

Of course you cannot do this without changing your script. However, if you make a function out of it, then you can change the function definition to be more flexible and not have to change any calls to it. Change your function from

    f <- function(x, test.pattern) {
      indx <- seq_len(length(x)-3) # 3 should be 2
      sum((x[indx] == test.pattern[1]) &
          (x[indx+1] == test.pattern[2]) &
          (x[indx+2] == test.pattern[3]))
    }

to

    f <- function (x, test.pattern) {
      if (length(x) < length(test.pattern)) {
        0 # degenerate cases
      } else {
        indx <- seq_len(length(x) - length(test.pattern) + 1)
        match <- x[indx] == test.pattern[1]
        for (i in seq_len(length(test.pattern) - 1)) {
          match <- match & x[indx + i] == test.pattern[1 + i]
        }
        sum(match)
      }
    }

Give the function a name that is meaningful and memorable to you, and use it instead of copying the idiom in it when you need to do a search.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Walter Anderson
> Sent: Saturday, March 17, 2012 5:56 AM
> To: Jeff Newmiller
> Cc: R Help
> Subject: Re: [R] Faster way to implement this search?
>
> On 03/17/2012 12:53 AM, Jeff Newmiller wrote:
> > >>> for(indx in 1:(length(bin.05)-3))
> > >>>   if ((bin.05[indx] == test.pattern[1]) && (bin.05[indx+1] ==
> > >>> test.pattern[2]) && (bin.05[indx+2] == test.pattern[3]))
> > >>>     return.values$count.match.pattern[1] =
> > >>> return.values$count.match.pattern[1] + 1
>
> Ok, sorry for not understanding the first time, here is my example with
> the type of data I am working with in this simulation
>
>     test.pattern <- c("T", "T", "O")
>     bin.05 <- cut(runif(10000000), breaks=c(-0.01,0.05,1), labels=c("T", "O"))
>     count <- 0
>     for(indx in 1:(length(bin.05)-3))
>       if (
>         (bin.05[indx] == test.pattern[1]) &&
>         (bin.05[indx+1] == test.pattern[2]) &&
>         (bin.05[indx+2] == test.pattern[3]))
>           count <- count + 1
>
> Now the approach provided by William Dunlap sped up my simulation
> tremendously;
>
>     indx <- seq_len(length(bin.05)-3)
>     count <- sum((bin.05[indx] == test.pattern[1]) &
>                  (bin.05[indx+1] == test.pattern[2]) &
>                  (bin.05[indx+2] == test.pattern[3]))
>
> My current question is there a way to perform the same count, but with
> an arbitrary size pattern. In other words, instead of a fixed pattern
> size of 3, could I have a pattern size of 4, 5, 6, ..., 30 any of which
> that could be run without changing the script?
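
As a rough check of the generalized function on the kind of factor data shown in the quoted message, the sketch below runs it with patterns of several lengths on a short factor that can be counted by hand. The name count_pattern, the vector small, and the extra guard for an empty pattern are assumptions added for illustration; the body otherwise follows the generalized f from the reply.

```r
# Hypothetical name; the logic follows the generalized f in the reply
count_pattern <- function(x, test.pattern) {
  n <- length(test.pattern)
  # Degenerate cases (the n == 0 guard is an addition, not in the thread)
  if (n == 0 || length(x) < n) return(0L)
  starts <- seq_len(length(x) - n + 1)   # start of every full window
  match <- x[starts] == test.pattern[1]
  for (i in seq_len(n - 1)) {
    match <- match & (x[starts + i] == test.pattern[1 + i])
  }
  sum(match)
}

# Short factor with the same levels as bin.05 in the thread
small <- factor(c("T", "T", "O", "O", "T", "T", "T", "O"),
                levels = c("T", "O"))

print(count_pattern(small, c("T", "T", "O")))       # 2 (starts at 1 and 6)
print(count_pattern(small, c("T", "T", "T", "O")))  # 1 (starts at 5)
print(count_pattern(small, rep("O", 9)))            # 0 (pattern longer than x)
```

Comparing a factor with a character string via == compares against the level labels, so the same function should work for both the numeric bin.05 in the timing example and the factor version built with cut().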