Hi, I have encountered an issue about finding a date closest to another date So this is how the data frame looks like: PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY 13 4549 2002-08-21 2002-08-20 -1 183 2 14 4549 2002-08-21 2002-11-14 85 91 1 15 4549 2002-08-21 2003-02-18 181 89 1 16 4549 2002-08-21 2003-05-15 267 109 2 17 4549 2002-08-21 2003-12-16 482 96 1 128 4839 2006-11-28 2006-11-28 0 179 2 I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. I was thinking about using abs(), and I got this: baseline<- function(x){ + + #remove all uncessary variables + baseline<- x[,c("PT_ID","DAYS_DIFF")] + + #get a list of every unique ID + uniqueID <- unique(baseline$PT_ID) + + #make a vector that will contain the smallest DAYS_DIFF + first <- rep(-99,length(uniqueID)) + + i = 1 + #loop through each unique ID + for (PT_ID in uniqueID){ + + #for each iteration get the smallest DAYS_DIFF for that ID + first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) + + #up the iteration counter + i = i + 1 + + } + #make a data frame with the lowest DAYS_DIFF and ID + newdata <- data.frame(uniqueID,first) + names(newdata) <- c("PT_ID","DAYS_DIFF") + + #return the data frame containing the lowest GPI for each ID + return(newdata) + }> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : undefined columns selected Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? Thanks a lot [[alternative HTML version deleted]]
Hello, Try the following. dat <- read.table(text=" PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY 13 4549 2002-08-21 2002-08-20 -1 183 2 14 4549 2002-08-21 2002-11-14 85 91 1 15 4549 2002-08-21 2003-02-18 181 89 1 16 4549 2002-08-21 2003-05-15 267 109 2 17 4549 2002-08-21 2003-12-16 482 96 1 128 4839 2006-11-28 2006-11-28 0 179 2 ", header=TRUE) spl <- split(dat, dat$PT_ID) idx <- sapply(spl, function(x) which.min(x$DAYS_DIFF)) res <- lapply(names(idx), function(nm) spl[[ nm ]][ idx[nm], ]) do.call(rbind, res) And assign the return value of do.call to your result (reuse 'res'). Hope this helps, Rui Barradas Em 01-09-2012 18:10, WANG WEIJIA escreveu:> Hi, > > I have encountered an issue about finding a date closest to another date > > So this is how the data frame looks like: > > PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY > 13 4549 2002-08-21 2002-08-20 -1 183 2 > 14 4549 2002-08-21 2002-11-14 85 91 1 > 15 4549 2002-08-21 2003-02-18 181 89 1 > 16 4549 2002-08-21 2003-05-15 267 109 2 > 17 4549 2002-08-21 2003-12-16 482 96 1 > 128 4839 2006-11-28 2006-11-28 0 179 2 > > I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. > > For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. > > I was thinking about using abs(), and I got this: > > baseline<- function(x){ > + > + #remove all uncessary variables > + baseline<- x[,c("PT_ID","DAYS_DIFF")] > + > + #get a list of every unique ID > + uniqueID <- unique(baseline$PT_ID) > + > + #make a vector that will contain the smallest DAYS_DIFF > + first <- rep(-99,length(uniqueID)) > + > + i = 1 > + #loop through each unique ID > + for (PT_ID in uniqueID){ > + > + #for each iteration get the smallest DAYS_DIFF for that ID > + first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) > + > + #up the iteration counter > + i = i + 1 > + > + } > + #make a data frame with the lowest DAYS_DIFF and ID > + newdata <- data.frame(uniqueID,first) > + names(newdata) <- c("PT_ID","DAYS_DIFF") > + > + #return the data frame containing the lowest GPI for each ID > + return(newdata) > + } >> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique# > Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : > undefined columns selected > > Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? > > Thanks a lot > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Hi, Try this: dat1 <- read.table(text=" ? PT_ID??? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY 13? 4549 2002-08-21 2002-08-20??????? -1????? 183??????? 2 14? 4549 2002-08-21 2002-11-14??????? 85??????? 91??????? 1 15? 4549 2002-08-21 2003-02-18????? 181??????? 89??????? 1 16? 4549 2002-08-21 2003-05-15????? 267????? 109??????? 2 17? 4549 2002-08-21 2003-12-16????? 482??????? 96??????? 1 128? 4839 2006-11-28 2006-11-28??????? 0????? 179??????? 2 ", header=TRUE) dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) merge(dat1,dat3) #? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY #1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2 #2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2 #or, dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min) dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2) ?row.names(dat4)<-1:nrow(dat4) merge(dat1,dat4) #? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY #1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2 #2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2 A.K. ----- Original Message ----- From: WANG WEIJIA <wwang.nyu at gmail.com> To: "r-help at R-project.org" <r-help at r-project.org> Cc: Sent: Saturday, September 1, 2012 1:10 PM Subject: [R] R_closest date Hi, I have encountered an issue about finding a date closest to another date So this is how the data frame looks like: ? ? PT_ID? ? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY 13? 4549 2002-08-21 2002-08-20? ? ? ? -1? ? ? 183? ? ? ? 2 14? 4549 2002-08-21 2002-11-14? ? ? ? 85? ? ? ? 91? ? ? ? 1 15? 4549 2002-08-21 2003-02-18? ? ? 181? ? ? ? 89? ? ? ? 1 16? 4549 2002-08-21 2003-05-15? ? ? 267? ? ? 109? ? ? ? 2 17? 4549 2002-08-21 2003-12-16? ? ? 482? ? ? ? 96? ? ? ? 1 128? 4839 2006-11-28 2006-11-28? ? ? ? 0? ? ? 179? ? ? ? 2 I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. I was thinking about using abs(), and I got this: baseline<- function(x){ +? +? #remove all uncessary variables +? baseline<- x[,c("PT_ID","DAYS_DIFF")] +? +? #get a list of every unique ID +? uniqueID <- unique(baseline$PT_ID) +? +? #make a vector that will contain the smallest DAYS_DIFF +? first <- rep(-99,length(uniqueID)) +? +? i = 1 +? #loop through each unique ID +? for (PT_ID in uniqueID){ +? +? #for each iteration get the smallest DAYS_DIFF for that ID +? first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) +? +? #up the iteration counter +? i = i + 1 +? +? } +? #make a data frame with the lowest DAYS_DIFF and ID +? newdata <- data.frame(uniqueID,first) +? names(newdata) <- c("PT_ID","DAYS_DIFF") +? +? #return the data frame containing the lowest GPI for each ID +? return(newdata) +? }> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : ? undefined columns selected Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? Thanks a lot ??? [[alternative HTML version deleted]] ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi, No problem. If you use join() instead of merge(), the original order of columns may not get altered. dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) library(plyr) ?join(dat1,dat3,type="inner") #Joining by: PT_ID, DAYS_DIFF ?# PT_ID???? IDX_DT?? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY #1? 4549 2002-08-21 2002-08-20??????? -1?????? 183??????? 2 #2? 4839 2006-11-28 2006-11-28???????? 0?????? 179??????? 2 A.K. ________________________________ From: Weijia Wang <wwang.nyu at gmail.com> To: arun <smartpink111 at yahoo.com> Sent: Saturday, September 1, 2012 5:11 PM Subject: Re: [R] R_closest date Thank you Arun, for your help again. Best ______________________________ WANG WEIJIA Graudate Research and Teaching?Assistant Department of Environmental?Medicine New York University, School of?Medicine wwang.nyu at gmail.com On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote: Hi,>Try this: >dat1 <- read.table(text=" >? PT_ID??? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >13? 4549 2002-08-21 2002-08-20??????? -1????? 183??????? 2 >14? 4549 2002-08-21 2002-11-14??????? 85??????? 91??????? 1 >15? 4549 2002-08-21 2003-02-18????? 181??????? 89??????? 1 >16? 4549 2002-08-21 2003-05-15????? 267????? 109??????? 2 >17? 4549 2002-08-21 2003-12-16????? 482??????? 96??????? 1 >128? 4839 2006-11-28 2006-11-28??????? 0????? 179??????? 2 >", header=TRUE) >dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) >merge(dat1,dat3) >#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY >#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2 >#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2 > >#or, >dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min) >dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2) >?row.names(dat4)<-1:nrow(dat4) >merge(dat1,dat4) >#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY >#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2 >#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2 >A.K. > > > > > >----- Original Message ----- >From: WANG WEIJIA <wwang.nyu at gmail.com> >To: "r-help at R-project.org" <r-help at r-project.org> >Cc: >Sent: Saturday, September 1, 2012 1:10 PM >Subject: [R] R_closest date > >Hi, > >I have encountered an issue about finding a date closest to another date > >So this is how the data frame looks like: > >? ? PT_ID? ? ?IDX_DT? ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >13? ?4549 2002-08-21 2002-08-20? ? ? ? -1? ? ? ?183? ? ? ? 2 >14? ?4549 2002-08-21 2002-11-14? ? ? ? 85? ? ? ? 91? ? ? ? 1 >15? ?4549 2002-08-21 2003-02-18? ? ? ?181? ? ? ? 89? ? ? ? 1 >16? ?4549 2002-08-21 2003-05-15? ? ? ?267? ? ? ?109? ? ? ? 2 >17? ?4549 2002-08-21 2003-12-16? ? ? ?482? ? ? ? 96? ? ? ? 1 >128? 4839 2006-11-28 2006-11-28? ? ? ? ?0? ? ? ?179? ? ? ? 2 > >I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. > >For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. > >I was thinking about using abs(), and I got this: > >baseline<- function(x){ >+? >+? #remove all uncessary variables >+? baseline<- x[,c("PT_ID","DAYS_DIFF")] >+? >+? #get a list of every unique ID >+? uniqueID <- unique(baseline$PT_ID) >+? >+? #make a vector that will contain the smallest DAYS_DIFF >+? first <- rep(-99,length(uniqueID)) >+? >+? i = 1 >+? #loop through each unique ID >+? for (PT_ID in uniqueID){ >+? >+? #for each iteration get the smallest DAYS_DIFF for that ID >+? first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) >+? >+? #up the iteration counter >+? i = i + 1 >+? >+? } >+? #make a data frame with the lowest DAYS_DIFF and ID >+? newdata <- data.frame(uniqueID,first) >+? names(newdata) <- c("PT_ID","DAYS_DIFF") >+? >+? #return the data frame containing the lowest GPI for each ID >+? return(newdata) >+? } > >ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique# >>Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : >? undefined columns selected > >Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? > >Thanks a lot >??? [[alternative HTML version deleted]] > >______________________________________________ >R-help at r-project.org mailing list >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code. > >
HI, For your question to split by two variables, try this: dat1<-read.table(text=" ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF?? LDL_BASE? rf 118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20 2.5 118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20 2.5 118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20 2.5 118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20 2.5 118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20 2.5 118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25 2.5 118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25 2.5 118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25 2.5 118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26 2.5 118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26 2.5 118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26 2.5 118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27 2.5 118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27 2.5 118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27 2.5 118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27 2.5 118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27 2.5 ?",sep="",header=TRUE) dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID)) dat3<-list() ?for(i in seq_along(dat2)){ ?dat3[[i]]<-list() ?dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE)) ?} dat3 do.call(rbind,dat3) #?? DAYS_DIFF? Mean #1??????? -52 208.0 #2??????? 462? 78.2 #3??????? 505 123.8 #4??????? 819 157.6 #5???????? 98? 86.0 #6??????? 223 107.2 #7??????? 658 131.2 #8??????? -52 208.0 #9??????? 505 123.8 #10?????? 819 157.6 #11??????? 98? 86.0 #12?????? 223 107.2 #13?????? 462? 78.2 #14?????? 658 131.2 #Not sure whether this will work or not in your huge dataset.? May be you can try lapply() also.? A.K. ________________________________ From: Weijia Wang <wwang.nyu at gmail.com> To: arun <smartpink111 at yahoo.com> Sent: Thursday, September 6, 2012 12:21 PM Subject: Re: [R] R_closest date Hi, Arun Do you have idea about good package that split HUGE dataframe by two variables? I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for every patient.? Therefore, I need to break down my dataframe, first by patient ID, then by the date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day. The example is: ? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ? LDL_BASE ?rf 118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days 2006-11-20 2.5 118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days 2006-11-20 2.5 118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days 2006-11-20 2.5 118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days 2006-11-20 2.5 118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days 2006-11-20 2.5 118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days 2006-11-20 2.5 118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days 2006-11-20 2.5 118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days 2006-11-20 2.5 ? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE 118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL 118486 -0.26674787 PRE NOT AT GOAL NOT FAIL 118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL 118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL 118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL 118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL 118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL 118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of these 17.4 and 139.0. 'ddply' did give me a mean when running on a test dataframe, but when I used it on my dataframe with 200,000ish observations, the computer run for like 5 hours and return error. Do you have idea about other good function, that focuses on split and apply function, and rbind? Best Weijia On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote: That is actually a great idea, thanks again!> >Weijia Wang > > >On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote: > >> Hi, >> No problem. >> >> If you use join() instead of merge(), the original order of columns may not get altered. >> >> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) >> library(plyr) >> ?join(dat1,dat3,type="inner") >> #Joining by: PT_ID, DAYS_DIFF >> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2 >> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2 >> A.K. >> >> >> >> >> >> >> ________________________________ >> From: Weijia Wang <wwang.nyu at gmail.com> >> To: arun <smartpink111 at yahoo.com> >> Sent: Saturday, September 1, 2012 5:11 PM >> Subject: Re: [R] R_closest date >> >> >> Thank you Arun, for your help again. >> >> Best >> ______________________________ >> WANG WEIJIA >> Graudate Research and Teaching Assistant >> Department of Environmental Medicine >> New York University, School of Medicine >> wwang.nyu at gmail.com >> >> >> >> >> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote: >> >> Hi, >>> Try this: >>> dat1 <- read.table(text=" >>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2 >>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1 >>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1 >>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2 >>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1 >>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2 >>> ", header=TRUE) >>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) >>> merge(dat1,dat3) >>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY >>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2 >>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2 >>> >>> #or, >>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min) >>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2) >>> ?row.names(dat4)<-1:nrow(dat4) >>> merge(dat1,dat4) >>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY >>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2 >>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2 >>> A.K. >>> >>> >>> >>> >>> >>> ----- Original Message ----- >>> From: WANG WEIJIA <wwang.nyu at gmail.com> >>> To: "r-help at R-project.org" <r-help at r-project.org> >>> Cc: >>> Sent: Saturday, September 1, 2012 1:10 PM >>> Subject: [R] R_closest date >>> >>> Hi, >>> >>> I have encountered an issue about finding a date closest to another date >>> >>> So this is how the data frame looks like: >>> >>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2 >>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1 >>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1 >>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2 >>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1 >>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2 >>> >>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. >>> >>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. >>> >>> I was thinking about using abs(), and I got this: >>> >>> baseline<- function(x){ >>> + >>> + ?#remove all uncessary variables >>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")] >>> + >>> + ?#get a list of every unique ID >>> + ?uniqueID <- unique(baseline$PT_ID) >>> + >>> + ?#make a vector that will contain the smallest DAYS_DIFF >>> + ?first <- rep(-99,length(uniqueID)) >>> + >>> + ?i = 1 >>> + ?#loop through each unique ID >>> + ?for (PT_ID in uniqueID){ >>> + >>> + ?#for each iteration get the smallest DAYS_DIFF for that ID >>> + ?first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) >>> + >>> + ?#up the iteration counter >>> + ?i = i + 1 >>> + >>> + ?} >>> + ?#make a data frame with the lowest DAYS_DIFF and ID >>> + ?newdata <- data.frame(uniqueID,first) >>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF") >>> + >>> + ?#return the data frame containing the lowest GPI for each ID >>> + ?return(newdata) >>> + ?} >>> >>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique# >>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : >>> ? undefined columns selected >>> >>> Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? >>> >>> Thanks a lot >>> ? ? [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >
Hi Weija, You can also try this instead of the loop solution in my previous reply. dat1<-read.table(text=" ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF?? LDL_BASE? rf 118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20 2.5 118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20 2.5 118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20 2.5 118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20 2.5 118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20 2.5 118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25 2.5 118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25 2.5 118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25 2.5 118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26 2.5 118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26 2.5 118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26 2.5 118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27 2.5 118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27 2.5 118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27 2.5 118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27 2.5 118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27 2.5 ?",sep="",header=TRUE) dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID)) dat3<-data.frame(do.call(rbind,lapply(dat2,`[`,c(6,8)))) dat4<-data.frame(LDL_Date.PTID=row.names(dat3),dat3) row.names(dat4)<-1:nrow(dat4) dat4$LDL_Date.PTID<-gsub("(.*..*)\\.\\d+","\\1",dat4$LDL_Date.PTID) aggregate(dat4$OBS_VALUE,list(dat4$DAYS_DIFF,dat4$LDL_Date.PTID),mean) Group.1??????????? Group.2???? x #1????? -52 2006-11-20.9624295 208.0 #2????? 462 2006-11-20.9624295? 78.2 #3????? 505 2006-11-20.9624295 123.8 #4????? 819 2006-11-20.9624295 157.6 #5?????? 98 2006-11-25.9624295? 86.0 #6????? 223 2006-11-25.9624295 107.2 #7????? 658 2006-11-25.9624295 131.2 #8????? -52 2006-11-26.9624296 208.0 #9????? 505 2006-11-26.9624296 123.8 #10???? 819 2006-11-26.9624296 157.6 #11????? 98 2006-11-27.9624296? 86.0 #12???? 223 2006-11-27.9624296 107.2 #13???? 462 2006-11-27.9624296? 78.2 #14???? 658 2006-11-27.9624296 131.2 A.K. ________________________________ From: Weijia Wang <wwang.nyu at gmail.com> To: arun <smartpink111 at yahoo.com> Sent: Thursday, September 6, 2012 12:21 PM Subject: Re: [R] R_closest date Hi, Arun Do you have idea about good package that split HUGE dataframe by two variables? I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for every patient.? Therefore, I need to break down my dataframe, first by patient ID, then by the date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day. The example is: ? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ? LDL_BASE ?rf 118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days 2006-11-20 2.5 118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days 2006-11-20 2.5 118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days 2006-11-20 2.5 118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days 2006-11-20 2.5 118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days 2006-11-20 2.5 118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days 2006-11-20 2.5 118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days 2006-11-20 2.5 118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days 2006-11-20 2.5 ? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE 118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL 118486 -0.26674787 PRE NOT AT GOAL NOT FAIL 118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL 118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL 118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL 118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL 118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL 118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of these 17.4 and 139.0. 'ddply' did give me a mean when running on a test dataframe, but when I used it on my dataframe with 200,000ish observations, the computer run for like 5 hours and return error. Do you have idea about other good function, that focuses on split and apply function, and rbind? Best Weijia On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote: That is actually a great idea, thanks again!> >Weijia Wang > > >On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote: > >> Hi, >> No problem. >> >> If you use join() instead of merge(), the original order of columns may not get altered. >> >> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) >> library(plyr) >> ?join(dat1,dat3,type="inner") >> #Joining by: PT_ID, DAYS_DIFF >> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2 >> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2 >> A.K. >> >> >> >> >> >> >> ________________________________ >> From: Weijia Wang <wwang.nyu at gmail.com> >> To: arun <smartpink111 at yahoo.com> >> Sent: Saturday, September 1, 2012 5:11 PM >> Subject: Re: [R] R_closest date >> >> >> Thank you Arun, for your help again. >> >> Best >> ______________________________ >> WANG WEIJIA >> Graudate Research and Teaching Assistant >> Department of Environmental Medicine >> New York University, School of Medicine >> wwang.nyu at gmail.com >> >> >> >> >> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote: >> >> Hi, >>> Try this: >>> dat1 <- read.table(text=" >>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2 >>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1 >>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1 >>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2 >>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1 >>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2 >>> ", header=TRUE) >>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) >>> merge(dat1,dat3) >>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY >>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2 >>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2 >>> >>> #or, >>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min) >>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2) >>> ?row.names(dat4)<-1:nrow(dat4) >>> merge(dat1,dat4) >>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY >>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2 >>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2 >>> A.K. >>> >>> >>> >>> >>> >>> ----- Original Message ----- >>> From: WANG WEIJIA <wwang.nyu at gmail.com> >>> To: "r-help at R-project.org" <r-help at r-project.org> >>> Cc: >>> Sent: Saturday, September 1, 2012 1:10 PM >>> Subject: [R] R_closest date >>> >>> Hi, >>> >>> I have encountered an issue about finding a date closest to another date >>> >>> So this is how the data frame looks like: >>> >>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2 >>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1 >>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1 >>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2 >>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1 >>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2 >>> >>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. >>> >>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. >>> >>> I was thinking about using abs(), and I got this: >>> >>> baseline<- function(x){ >>> + >>> + ?#remove all uncessary variables >>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")] >>> + >>> + ?#get a list of every unique ID >>> + ?uniqueID <- unique(baseline$PT_ID) >>> + >>> + ?#make a vector that will contain the smallest DAYS_DIFF >>> + ?first <- rep(-99,length(uniqueID)) >>> + >>> + ?i = 1 >>> + ?#loop through each unique ID >>> + ?for (PT_ID in uniqueID){ >>> + >>> + ?#for each iteration get the smallest DAYS_DIFF for that ID >>> + ?first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) >>> + >>> + ?#up the iteration counter >>> + ?i = i + 1 >>> + >>> + ?} >>> + ?#make a data frame with the lowest DAYS_DIFF and ID >>> + ?newdata <- data.frame(uniqueID,first) >>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF") >>> + >>> + ?#return the data frame containing the lowest GPI for each ID >>> + ?return(newdata) >>> + ?} >>> >>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique# >>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : >>> ? undefined columns selected >>> >>> Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? >>> >>> Thanks a lot >>> ? ? [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >