I have two large data frames with the following structure:

> df1
  id       date test1.result
1  a 2009-08-28            1
2  a 2009-09-16            1
3  b 2008-08-06            0
4  c 2012-02-02            1
5  c 2010-08-03            1
6  c 2012-08-02            0

> df2
  id       date test2.result
1  a 2011-02-03            1
2  b 2011-09-27            0
3  b 2011-09-01            1
4  c 2009-07-16            0
5  c 2009-04-15            0
6  c 2010-08-10            1

I need to match items in df2 to those in df1 with specific matching criteria. I have written a looped matching algorithm that works, but it is very slow with my large datasets. I am requesting help on making a version of this code that is faster and "vectorized" so to speak.

My algorithm is currently something like this code. It works but is damn slow.

findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30, lagdays=30){
  # Function to find, within subjects, two tests that occur within a timeframe
  #
  # test1 = the reference test result for which matching second tests are sought
  # test2 = the second test result
  # date1 = the date of test1
  # date2 = the date of test2
  # id1 = unique identifier for subject undergoing test 1
  # id2 = unique identifier for subject undergoing test 2
  # predays = maximum number of days prior to test1 date that test2 date might occur
  # lagdays = maximum number of days after test1 date that test2 date might occur

  result <- data.frame(matrix(ncol=5, nrow=length(test1)))
  colnames(result) <- c('id','test1','date','test2count','test2lag.result')
  result$id <- id1
  result$test1 <- test1
  result$date <- date1

  for(i in 1:length(test1)){
    l <- 0     # Counter of test2 results that matches test1 within lag interval
    m <- NA    # Indicator of positive test2 within lag interval

    for(j in 1:length(test2)){
      if(id1[i] == id2[j]){     # STEP1: Match IDs
        interval <- date2[j] - date1[i]
        intmatch <- ifelse(interval >= predays && interval <= lagdays, 1, 0)

        if(intmatch == 1){     # STEP2: Does test2 fall within lag interval?
          l <- l+1             # If test2 within lag interval, count it

          if(test2[j] == 1) {  # STEP3: Is test 2 positive?
            m <- 1             # If test2 is positive, set indicator to 1
          } else {
            m <- 0
          }
        }
      }
    }
    result$test2count[i] <- l
    result$test2lag.result[i] <- m
  }
  return(result)
}

I would appreciate help on building a faster matching algorithm. I am pretty certain that R functions can be used to do this but I do not have a good grasp of how to make it work.

Brant Inman
Hi Brant,
I'm a bit confused about which data frame is the one to match to, but the following, while still including loops, should run much faster than the above as it only matches dates within id matches.

df1<-read.table(text="id date test1.result
a 2009-08-28 1
a 2009-09-16 1
b 2008-08-06 0
c 2012-02-02 1
c 2010-08-03 1
c 2012-08-02 0",header=TRUE)

df2<-read.table(text="id date test2.result
a 2011-02-03 1
b 2011-09-27 0
b 2011-09-01 1
c 2009-07-16 0
c 2009-04-15 0
c 2010-08-10 1",header=TRUE)

bi.match<-function(x1,x2,maxdaydiff=30) {
 # convert the character strings to dates (may not be necessary)
 x1$dates<-as.Date(x1$date,"%Y-%m-%d")
 x2$dates<-as.Date(x2$date,"%Y-%m-%d")
 # initialize the l and m variables
 x1$l<-x1$m<-0
 # get all the id codes
 ids<-unique(x2$id)
 # step through the id codes
 for(id1 in ids) {
  x1ind<-which(x1$id == id1)
  x2ind<-which(x2$id == id1)
  # seq_along() skips ids that occur in x2 but not in x1
  for(id2 in seq_along(x1ind)) {
   # get the indices of the x2 dates that are within maxdaydiff days of this x1 date
   diffok<-which(abs(x1$dates[x1ind[id2]]-x2$dates[x2ind])<=maxdaydiff)
   # set the date diff match indicator to 1
   x1$l[x1ind[id2]]<-length(diffok) > 0
   # set the positive test indicator to 1
   x1$m[x1ind[id2]]<-any(x2$test2.result[x2ind[diffok]] > 0)
  }
 }
 return(x1)
}

bi.match(df1,df2)

Jim

On Sat, Apr 18, 2015 at 2:14 PM, Brant Inman <brant.inman at me.com> wrote:
> I would appreciate help on building a faster matching algorithm. I am pretty certain that R functions can be used to do this but I do not have a good grasp of how to make it work.
>
> Brant Inman

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
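The within-id matching described above can also be sketched without explicit loops in base R: merge() pairs every test1 row with every test2 row that shares its id, and the day difference is then filtered in one vectorized step. This is only a sketch under assumptions. The function name match_tests, the helper column row, and the output column test2any.positive are hypothetical names chosen for illustration, and the indicator is defined here as "any positive test2 in the window", which is not identical to the original loop (that loop keeps the result of the last matching test2).

```r
# Example data from the thread, with dates stored as Date objects
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0)
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1)
)

# Hypothetical helper: not from the thread, illustration only
match_tests <- function(df1, df2, predays = -30, lagdays = 30) {
  df1$row <- seq_len(nrow(df1))   # key to map pairs back to df1 rows

  # All within-id (test1, test2) pairs; the two date columns
  # become date.1 and date.2
  pairs <- merge(df1, df2, by = "id", suffixes = c(".1", ".2"))

  # Days from test1 to test2, then keep only pairs inside the window
  lag <- as.numeric(pairs$date.2 - pairs$date.1)
  pairs <- pairs[lag >= predays & lag <= lagdays, , drop = FALSE]

  # Number of in-window test2 results per df1 row (0 when none)
  count <- tabulate(pairs$row, nbins = nrow(df1))

  # 1 if any in-window test2 is positive, 0 if none is,
  # NA if there is no test2 in the window at all
  anypos <- rep(NA_real_, nrow(df1))
  if (nrow(pairs) > 0) {
    pos <- tapply(pairs$test2.result == 1, pairs$row, any)
    anypos[as.integer(names(pos))] <- as.numeric(pos)
  }

  data.frame(id = df1$id, test1 = df1$test1.result, date = df1$date,
             test2count = count, test2any.positive = anypos)
}

res <- match_tests(df1, df2)
print(res)
```

Because merge() materialises every within-id pair, memory use grows with the product of the per-id row counts, so this approach may not suit data where single ids have very many tests.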
On Sat, 18 Apr 2015, Brant Inman wrote:

> I need to match items in df2 to those in df1 with specific matching
> criteria. I have written a looped matching algorithm that works, but it
> is very slow with my large datasets. I am requesting help on making a
> version of this code that is faster and "vectorized" so to speak.

As I see in your posted code, you match ids exactly, dates according to a range, and count the number of positive test results in the second data.frame.

For this, the countOverlaps() function of the GenomicRanges package will do the trick with suitably defined GRanges objects.

Something like:

require(GenomicRanges)

date1 <- as.integer( as.Date( df1$date, "%Y-%m-%d" ))
date2 <- as.integer( as.Date( df2$date, "%Y-%m-%d" ))

lagdays <- 30L
predays <- -30L

gr1 <- GRanges(seqnames=df1$id, IRanges(start=date1,width=1),strand="*")
gr2 <- GRanges(seqnames=df2$id,
               IRanges(start=date2+predays,end=date2+lagdays),
               strand="*")[ df2$test2.result==1,]

df1$test2.count <- countOverlaps(gr1,gr2)

For the example data.frames (as rendered by Jim Lemon's code), this yields

> df1
  id       date test1.result test2.count
1  a 2009-08-28            1           0
2  a 2009-09-16            1           0
3  b 2008-08-06            0           0
4  c 2012-02-02            1           0
5  c 2010-08-03            1           1
6  c 2012-08-02            0           0

The GenomicRanges package is at

http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

where you will find installation instructions and links to vignettes.

HTH,

Chuck
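Where installing a Bioconductor package is not an option, the same count of positive test2 results per test1 row can be approximated in base R by sorting the positive test2 dates within each id and counting with findInterval(). The following is a sketch under assumptions, not a replacement for countOverlaps(): the function name count_positive_in_window is hypothetical, and the window is taken from the original post (date2 minus date1 between predays and lagdays), which coincides with the GRanges construction above for the symmetric defaults of -30 and 30.

```r
df1 <- read.table(text = "id date test1.result
a 2009-08-28 1
a 2009-09-16 1
b 2008-08-06 0
c 2012-02-02 1
c 2010-08-03 1
c 2012-08-02 0", header = TRUE)

df2 <- read.table(text = "id date test2.result
a 2011-02-03 1
b 2011-09-27 0
b 2011-09-01 1
c 2009-07-16 0
c 2009-04-15 0
c 2010-08-10 1", header = TRUE)

# Hypothetical helper: not from the thread, illustration only
count_positive_in_window <- function(df1, df2, predays = -30L, lagdays = 30L) {
  # Dates as integer day numbers so that "lo - 1" is an exact lower bound
  d1 <- as.integer(as.Date(df1$date))
  pos <- df2$test2.result == 1

  # Sorted day numbers of the positive test2 results, one vector per id
  d2_by_id <- lapply(split(as.integer(as.Date(df2$date[pos])),
                           as.character(df2$id[pos])), sort)

  counts <- integer(nrow(df1))
  for (id in names(d2_by_id)) {   # one pass per id, vectorized inside
    rows <- which(df1$id == id)
    s <- d2_by_id[[id]]
    # findInterval(v, s) is the number of elements of s that are <= v,
    # so the difference counts elements in [d1 + predays, d1 + lagdays]
    counts[rows] <- findInterval(d1[rows] + lagdays, s) -
      findInterval(d1[rows] + predays - 1L, s)
  }
  counts
}

counts <- count_positive_in_window(df1, df2)
print(counts)
```

On the example data this gives a count of 1 for row 5 and 0 for every other row, in agreement with the test2.count column shown above.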