Tim Churches
2007-Feb-15 01:24 UTC
[R] How to speed up or avoid the for-loops in this example?
Any advice, tips, clues or pointers to resources on how best to speed up or, better still, avoid the loops in the following example code much appreciated. My actual dataset has several tens of thousands of rows and lots of columns, and these loops take a rather long time to run. Everything else which I need to do is done using vectors and those parts all run very quickly indeed. I spent quite a while doing searches on r-help and re-reading the various manuals, but couldn't find any existing relevant advice. I am sure the solution is obvious, but it escapes me.

Tim C

# create an example data frame, multiple events per subject

year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
subject <- c(1,1,1,2,2,3,3,3,3,4,4)
df <- data.frame(cbind(subject,year,event.of.interest))

# add a per-subject sequence number

df$subject.seq <- 1
for (i in 2:nrow(df)) {
  if (df$subject[i-1] == df$subject[i]) df$subject.seq[i] <- df$subject.seq[i-1] + 1
}
df

# add an event sequence number which is zero until the first
# event of interest for that subject happens, and then increments
# thereafter

df$event.seq <- 0
for (i in 1:nrow(df)) {
  if (df$subject.seq[i] == 1 ) {
    current.event.seq <- 0
  }
  if (event.of.interest[i] == 1 | current.event.seq > 0) current.event.seq <- current.event.seq + 1
  df$event.seq[i] <- current.event.seq
}
df
jim holtman
2007-Feb-15 02:25 UTC
[R] How to speed up or avoid the for-loops in this example?
On 2/14/07, Tim Churches <tchur@optushome.com.au> wrote:
> Any advice, tips, clues or pointers to resources on how best to speed up
> or, better still, avoid the loops in the following example code much
> appreciated.
> [...]

try:

> df <- data.frame(cbind(subject,year,event.of.interest))
> df <- do.call(rbind, by(df, df$subject, function(z){z$subject.seq <- seq(nrow(z)); z}))
> df
     subject year event.of.interest subject.seq
1.1        1 1980                 0           1
1.2        1 1982                 1           2
1.3        1 1996                 1           3
2.4        2 1985                 0           1
2.5        2 1987                 0           2
3.6        3 1990                 0           1
3.7        3 1991                 1           2
3.8        3 1992                 0           3
3.9        3 1999                 1           4
4.10       4 1972                 1           1
4.11       4 1983                 0           2
>
> # determine first event
> df <- do.call(rbind, by(df, df$subject, function(x){
+     # determine first event
+     .first <- cumsum(x$event.of.interest)
+     # create sequence after first non-zero
+     .first <- cumsum(.first > 0)
+     x$event.seq <- .first
+     x
+ }))
> df
       subject year event.of.interest subject.seq event.seq
1.1.1        1 1980                 0           1         0
1.1.2        1 1982                 1           2         1
1.1.3        1 1996                 1           3         2
2.2.4        2 1985                 0           1         0
2.2.5        2 1987                 0           2         0
3.3.6        3 1990                 0           1         0
3.3.7        3 1991                 1           2         1
3.3.8        3 1992                 0           3         2
3.3.9        3 1999                 1           4         3
4.4.10       4 1972                 1           1         1
4.4.11       4 1983                 0           2         2

--
Jim Holtman
Cincinnati, OH

What is the problem you are trying to solve?
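Note: the two by()/rbind() passes above can also be written with base R's ave(), which applies a function within each group and returns the result in the original row order, so the row names are left untouched. The following is only a minimal sketch on the example data from the thread. It assumes the rows are already in chronological order within each subject, and it builds the data frame without cbind() so that event.of.interest stays logical.

```r
# Example data from the original post
year <- c(1980, 1982, 1996, 1985, 1987, 1990, 1991, 1992, 1999, 1972, 1983)
event.of.interest <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
                       TRUE, FALSE, TRUE, TRUE, FALSE)
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
df <- data.frame(subject, year, event.of.interest)

# Per-subject running row number: seq_along() is applied to each
# subject's block of row indices
df$subject.seq <- ave(seq_along(df$subject), df$subject, FUN = seq_along)

# Event sequence: the inner cumsum() counts events so far within the
# subject, "> 0" flags every row from the first event onwards, and the
# outer cumsum() numbers those flagged rows 1, 2, 3, ...
df$event.seq <- ave(as.numeric(df$event.of.interest), df$subject,
                    FUN = function(e) cumsum(cumsum(e) > 0))

df
```

On the example data this should give the same subject.seq and event.seq columns as the output shown above. Because ave() groups by value rather than by adjacency, it does not require the subjects to be stored in contiguous blocks, although the order of rows within a subject still determines the numbering.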
Marc Schwartz
2007-Feb-15 02:48 UTC
[R] How to speed up or avoid the for-loops in this example?
On Thu, 2007-02-15 at 12:24 +1100, Tim Churches wrote:
> Any advice, tips, clues or pointers to resources on how best to speed up
> or, better still, avoid the loops in the following example code much
> appreciated.
> [...]

OK, here is one possible solution, though with a bit more time there may be more optimal approaches.

Using your example data above, but first noting that you do not want to use:

df <- data.frame(cbind(subject,year,event.of.interest))

Calling cbind() first creates a matrix, which coerces all columns to a common data type and so defeats the benefit of data frames, namely that they can hold multiple data types. For example:

> str(df)
'data.frame':   11 obs. of  3 variables:
 $ subject          : num  1 1 1 2 2 3 3 3 3 4 ...
 $ year             : num  1980 1982 1996 1985 1987 ...
 $ event.of.interest: num  0 1 1 0 0 0 1 0 1 1 ...

Note that your column "event.of.interest" is coerced to numeric, rather than staying logical.

Thus, use:

df <- data.frame(subject, year, event.of.interest)

> str(df)
'data.frame':   11 obs. of  3 variables:
 $ subject          : num  1 1 1 2 2 3 3 3 3 4 ...
 $ year             : num  1980 1982 1996 1985 1987 ...
 $ event.of.interest: logi  FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE

So, now on to the solution:

# First, order the data frame by increasing order of
# subject number and decreasing order for event.of.interest
# This ensures that these columns are properly sorted
# to facilitate the subsequent code.

df <- df[order(df$subject, -df$event.of.interest), ]

So, 'df' will look like:

> df
   subject year event.of.interest
2        1 1982              TRUE
3        1 1996              TRUE
1        1 1980             FALSE
4        2 1985             FALSE
5        2 1987             FALSE
7        3 1991              TRUE
9        3 1999              TRUE
6        3 1990             FALSE
8        3 1992             FALSE
10       4 1972              TRUE
11       4 1983             FALSE

# Now use the combinations of sapply(), rle(), seq() and unlist() to
# generate per subject sequences. Note that rle() returns:
#
# > rle(df$subject)
# Run Length Encoding
#   lengths: int [1:4] 3 2 4 2
#   values : num [1:4] 1 2 3 4
#
# See ?rle, ?seq, ?sapply and ?unlist

df$subject.seq <- unlist(sapply(rle(df$subject)$lengths, function(x) seq(x)))

So, 'df' now looks like:

> df
   subject year event.of.interest subject.seq
2        1 1982              TRUE           1
3        1 1996              TRUE           2
1        1 1980             FALSE           3
4        2 1985             FALSE           1
5        2 1987             FALSE           2
7        3 1991              TRUE           1
9        3 1999              TRUE           2
6        3 1990             FALSE           3
8        3 1992             FALSE           4
10       4 1972              TRUE           1
11       4 1983             FALSE           2

# Now set event.seq to all 0's

df$event.seq <- 0

So, 'df' now looks like:

> df
   subject year event.of.interest subject.seq event.seq
2        1 1982              TRUE           1         0
3        1 1996              TRUE           2         0
1        1 1980             FALSE           3         0
4        2 1985             FALSE           1         0
5        2 1987             FALSE           2         0
7        3 1991              TRUE           1         0
9        3 1999              TRUE           2         0
6        3 1990             FALSE           3         0
8        3 1992             FALSE           4         0
10       4 1972              TRUE           1         0
11       4 1983             FALSE           2         0

# Get the unique subject id's
# See ?unique

subj.id <- unique(df$subject)

# Now get the indices for each subject where event.of.interest
# is TRUE. See ?which

events <- sapply(subj.id, function(x) which(df$subject == x & df$event.of.interest))

So, 'events' looks like:

> events
[[1]]
[1] 1 2

[[2]]
integer(0)

[[3]]
[1] 6 7

[[4]]
[1] 10

# Now use sapply() on the above list to create
# individual sequences per list element:

seq <- sapply(events, function(x) seq(along = x))

So 'seq' looks like:

> seq
[[1]]
[1] 1 2

[[2]]
integer(0)

[[3]]
[1] 1 2

[[4]]
[1] 1

# So, for the final step, assign the event sequence values in 'seq' to
# the row indices in 'events':

df$event.seq[unlist(events)] <- unlist(seq)

So, 'df' now looks like this:

> df
   subject year event.of.interest subject.seq event.seq
2        1 1982              TRUE           1         1
3        1 1996              TRUE           2         2
1        1 1980             FALSE           3         0
4        2 1985             FALSE           1         0
5        2 1987             FALSE           2         0
7        3 1991              TRUE           1         1
9        3 1999              TRUE           2         2
6        3 1990             FALSE           3         0
8        3 1992             FALSE           4         0
10       4 1972              TRUE           1         1
11       4 1983             FALSE           2         0

HTH,

Marc Schwartz
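Note: the sorted approach above numbers only the rows where event.of.interest is TRUE, and it reorders rows within a subject, so its event.seq differs from what the loops in the original post produce (there the counter keeps incrementing on every row after a subject's first event). If the loop output is what is wanted, the rle() idea can still be used without any per-subject function calls. The following is a minimal sketch under that assumption. The helper name group.cumsum is hypothetical, introduced here for illustration only, and the rows are assumed to be sorted by subject and year.

```r
# Example data from the original post
year <- c(1980, 1982, 1996, 1985, 1987, 1990, 1991, 1992, 1999, 1972, 1983)
event.of.interest <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
                       TRUE, FALSE, TRUE, TRUE, FALSE)
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
df <- data.frame(subject, year, event.of.interest)

# Sort by subject and year only, keeping rows chronological per subject
df <- df[order(df$subject, df$year), ]

# Run lengths of the contiguous subject blocks, here 3 2 4 2
run.lengths <- rle(df$subject)$lengths

# sequence() expands the run lengths into 1:3, 1:2, 1:4, 1:2
df$subject.seq <- sequence(run.lengths)

# Hypothetical helper: cumulative sum that restarts at each block.
# It takes one global cumsum() and subtracts the running total
# reached just before each block begins.
group.cumsum <- function(x, run.lengths) {
  total  <- cumsum(x)
  starts <- cumsum(c(1, head(run.lengths, -1)))  # first row of each block
  offset <- rep(c(0, total)[starts], run.lengths)
  total - offset
}

# TRUE from each subject's first event onwards
started <- group.cumsum(df$event.of.interest, run.lengths) > 0

# Number the started rows 1, 2, 3, ... within each subject
df$event.seq <- group.cumsum(started, run.lengths)

df
```

The design choice here is to replace the per-subject sapply() calls with two whole-column cumsum() passes, which should scale well to tens of thousands of rows. The cost is that it relies on each subject occupying one contiguous block of rows, which the order() call is there to guarantee.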