Hi,
I have encountered an issue about finding a date closest to another date
So this is how the data frame looks like:
PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13 4549 2002-08-21 2002-08-20 -1 183 2
14 4549 2002-08-21 2002-11-14 85 91 1
15 4549 2002-08-21 2003-02-18 181 89 1
16 4549 2002-08-21 2003-05-15 267 109 2
17 4549 2002-08-21 2003-12-16 482 96 1
128 4839 2006-11-28 2006-11-28 0 179 2
I need to find, the single observation, which has the closest date of
'OBS_DATE' to 'IDX_DT'.
For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE
is just one day away from IDX_DT.
I was thinking about using abs(), and I got this:
baseline<- function(x){
+
+ #remove all uncessary variables
+ baseline<- x[,c("PT_ID","DAYS_DIFF")]
+
+ #get a list of every unique ID
+ uniqueID <- unique(baseline$PT_ID)
+
+ #make a vector that will contain the smallest DAYS_DIFF
+ first <- rep(-99,length(uniqueID))
+
+ i = 1
+ #loop through each unique ID
+ for (PT_ID in uniqueID){
+
+ #for each iteration get the smallest DAYS_DIFF for that ID
+ first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
+
+ #up the iteration counter
+ i = i + 1
+
+ }
+ #make a data frame with the lowest DAYS_DIFF and ID
+ newdata <- data.frame(uniqueID,first)
+ names(newdata) <- c("PT_ID","DAYS_DIFF")
+
+ #return the data frame containing the lowest GPI for each ID
+ return(newdata)
+ }> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs,
all unique#
Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
undefined columns selected
Can anyone help me in figuring out how to get the minimum value of the absolute
value of DAYS_DIFF for unique ID?
Thanks a lot
[[alternative HTML version deleted]]
Hello, Try the following. dat <- read.table(text=" PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY 13 4549 2002-08-21 2002-08-20 -1 183 2 14 4549 2002-08-21 2002-11-14 85 91 1 15 4549 2002-08-21 2003-02-18 181 89 1 16 4549 2002-08-21 2003-05-15 267 109 2 17 4549 2002-08-21 2003-12-16 482 96 1 128 4839 2006-11-28 2006-11-28 0 179 2 ", header=TRUE) spl <- split(dat, dat$PT_ID) idx <- sapply(spl, function(x) which.min(x$DAYS_DIFF)) res <- lapply(names(idx), function(nm) spl[[ nm ]][ idx[nm], ]) do.call(rbind, res) And assign the return value of do.call to your result (reuse 'res'). Hope this helps, Rui Barradas Em 01-09-2012 18:10, WANG WEIJIA escreveu:> Hi, > > I have encountered an issue about finding a date closest to another date > > So this is how the data frame looks like: > > PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY > 13 4549 2002-08-21 2002-08-20 -1 183 2 > 14 4549 2002-08-21 2002-11-14 85 91 1 > 15 4549 2002-08-21 2003-02-18 181 89 1 > 16 4549 2002-08-21 2003-05-15 267 109 2 > 17 4549 2002-08-21 2003-12-16 482 96 1 > 128 4839 2006-11-28 2006-11-28 0 179 2 > > I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. > > For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. > > I was thinking about using abs(), and I got this: > > baseline<- function(x){ > + > + #remove all uncessary variables > + baseline<- x[,c("PT_ID","DAYS_DIFF")] > + > + #get a list of every unique ID > + uniqueID <- unique(baseline$PT_ID) > + > + #make a vector that will contain the smallest DAYS_DIFF > + first <- rep(-99,length(uniqueID)) > + > + i = 1 > + #loop through each unique ID > + for (PT_ID in uniqueID){ > + > + #for each iteration get the smallest DAYS_DIFF for that ID > + first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) > + > + #up the iteration counter > + i = i + 1 > + > + } > + #make a data frame with the lowest DAYS_DIFF and ID > + newdata <- data.frame(uniqueID,first) > + names(newdata) <- c("PT_ID","DAYS_DIFF") > + > + #return the data frame containing the lowest GPI for each ID > + return(newdata) > + } >> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique# > Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : > undefined columns selected > > Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? > > Thanks a lot > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Hi,
Try this:
dat1 <- read.table(text="
? PT_ID??? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13? 4549 2002-08-21 2002-08-20??????? -1????? 183??????? 2
14? 4549 2002-08-21 2002-11-14??????? 85??????? 91??????? 1
15? 4549 2002-08-21 2003-02-18????? 181??????? 89??????? 1
16? 4549 2002-08-21 2003-05-15????? 267????? 109??????? 2
17? 4549 2002-08-21 2003-12-16????? 482??????? 96??????? 1
128? 4839 2006-11-28 2006-11-28??????? 0????? 179??????? 2
", header=TRUE)
dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
merge(dat1,dat3)
#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY
#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2
#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2
#or,
dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
?row.names(dat4)<-1:nrow(dat4)
merge(dat1,dat4)
#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY
#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2
#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2
A.K.
----- Original Message -----
From: WANG WEIJIA <wwang.nyu at gmail.com>
To: "r-help at R-project.org" <r-help at r-project.org>
Cc:
Sent: Saturday, September 1, 2012 1:10 PM
Subject: [R] R_closest date
Hi,
I have encountered an issue about finding a date closest to another date
So this is how the data frame looks like:
? ? PT_ID? ? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13? 4549 2002-08-21 2002-08-20? ? ? ? -1? ? ? 183? ? ? ? 2
14? 4549 2002-08-21 2002-11-14? ? ? ? 85? ? ? ? 91? ? ? ? 1
15? 4549 2002-08-21 2003-02-18? ? ? 181? ? ? ? 89? ? ? ? 1
16? 4549 2002-08-21 2003-05-15? ? ? 267? ? ? 109? ? ? ? 2
17? 4549 2002-08-21 2003-12-16? ? ? 482? ? ? ? 96? ? ? ? 1
128? 4839 2006-11-28 2006-11-28? ? ? ? 0? ? ? 179? ? ? ? 2
I need to find, the single observation, which has the closest date of
'OBS_DATE' to 'IDX_DT'.
For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE
is just one day away from IDX_DT.
I was thinking about using abs(), and I got this:
baseline<- function(x){
+?
+? #remove all uncessary variables
+? baseline<- x[,c("PT_ID","DAYS_DIFF")]
+?
+? #get a list of every unique ID
+? uniqueID <- unique(baseline$PT_ID)
+?
+? #make a vector that will contain the smallest DAYS_DIFF
+? first <- rep(-99,length(uniqueID))
+?
+? i = 1
+? #loop through each unique ID
+? for (PT_ID in uniqueID){
+?
+? #for each iteration get the smallest DAYS_DIFF for that ID
+? first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
+?
+? #up the iteration counter
+? i = i + 1
+?
+? }
+? #make a data frame with the lowest DAYS_DIFF and ID
+? newdata <- data.frame(uniqueID,first)
+? names(newdata) <- c("PT_ID","DAYS_DIFF")
+?
+? #return the data frame containing the lowest GPI for each ID
+? return(newdata)
+? }> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs,
all unique#
Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
? undefined columns selected
Can anyone help me in figuring out how to get the minimum value of the absolute
value of DAYS_DIFF for unique ID?
Thanks a lot
??? [[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi, No problem. If you use join() instead of merge(), the original order of columns may not get altered. dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) library(plyr) ?join(dat1,dat3,type="inner") #Joining by: PT_ID, DAYS_DIFF ?# PT_ID???? IDX_DT?? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY #1? 4549 2002-08-21 2002-08-20??????? -1?????? 183??????? 2 #2? 4839 2006-11-28 2006-11-28???????? 0?????? 179??????? 2 A.K. ________________________________ From: Weijia Wang <wwang.nyu at gmail.com> To: arun <smartpink111 at yahoo.com> Sent: Saturday, September 1, 2012 5:11 PM Subject: Re: [R] R_closest date Thank you Arun, for your help again. Best ______________________________ WANG WEIJIA Graudate Research and Teaching?Assistant Department of Environmental?Medicine New York University, School of?Medicine wwang.nyu at gmail.com On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote: Hi,>Try this: >dat1 <- read.table(text=" >? PT_ID??? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >13? 4549 2002-08-21 2002-08-20??????? -1????? 183??????? 2 >14? 4549 2002-08-21 2002-11-14??????? 85??????? 91??????? 1 >15? 4549 2002-08-21 2003-02-18????? 181??????? 89??????? 1 >16? 4549 2002-08-21 2003-05-15????? 267????? 109??????? 2 >17? 4549 2002-08-21 2003-12-16????? 482??????? 96??????? 1 >128? 4839 2006-11-28 2006-11-28??????? 0????? 179??????? 2 >", header=TRUE) >dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) >merge(dat1,dat3) >#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY >#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2 >#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2 > >#or, >dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min) >dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2) >?row.names(dat4)<-1:nrow(dat4) >merge(dat1,dat4) >#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY >#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2 >#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2 >A.K. > > > > > >----- Original Message ----- >From: WANG WEIJIA <wwang.nyu at gmail.com> >To: "r-help at R-project.org" <r-help at r-project.org> >Cc: >Sent: Saturday, September 1, 2012 1:10 PM >Subject: [R] R_closest date > >Hi, > >I have encountered an issue about finding a date closest to another date > >So this is how the data frame looks like: > >? ? PT_ID? ? ?IDX_DT? ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY >13? ?4549 2002-08-21 2002-08-20? ? ? ? -1? ? ? ?183? ? ? ? 2 >14? ?4549 2002-08-21 2002-11-14? ? ? ? 85? ? ? ? 91? ? ? ? 1 >15? ?4549 2002-08-21 2003-02-18? ? ? ?181? ? ? ? 89? ? ? ? 1 >16? ?4549 2002-08-21 2003-05-15? ? ? ?267? ? ? ?109? ? ? ? 2 >17? ?4549 2002-08-21 2003-12-16? ? ? ?482? ? ? ? 96? ? ? ? 1 >128? 4839 2006-11-28 2006-11-28? ? ? ? ?0? ? ? ?179? ? ? ? 2 > >I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'. > >For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT. > >I was thinking about using abs(), and I got this: > >baseline<- function(x){ >+? >+? #remove all uncessary variables >+? baseline<- x[,c("PT_ID","DAYS_DIFF")] >+? >+? #get a list of every unique ID >+? uniqueID <- unique(baseline$PT_ID) >+? >+? #make a vector that will contain the smallest DAYS_DIFF >+? first <- rep(-99,length(uniqueID)) >+? >+? i = 1 >+? #loop through each unique ID >+? for (PT_ID in uniqueID){ >+? >+? #for each iteration get the smallest DAYS_DIFF for that ID >+? first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) >+? >+? #up the iteration counter >+? i = i + 1 >+? >+? } >+? #make a data frame with the lowest DAYS_DIFF and ID >+? newdata <- data.frame(uniqueID,first) >+? names(newdata) <- c("PT_ID","DAYS_DIFF") >+? >+? #return the data frame containing the lowest GPI for each ID >+? return(newdata) >+? } > >ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique# >>Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) : >? undefined columns selected > >Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID? > >Thanks a lot >??? [[alternative HTML version deleted]] > >______________________________________________ >R-help at r-project.org mailing list >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code. > >
HI,
For your question to split by two variables, try this:
dat1<-read.table(text="
ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF??
LDL_BASE? rf
118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20
2.5
118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20
2.5
118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20
2.5
118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20
2.5
118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20
2.5
118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25
2.5
118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25
2.5
118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25
2.5
118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26
2.5
118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26
2.5
118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26
2.5
118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27
2.5
118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27
2.5
118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27
2.5
118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27
2.5
118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27
2.5
?",sep="",header=TRUE)
dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-list()
?for(i in seq_along(dat2)){
?dat3[[i]]<-list()
?dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE))
?}
dat3
do.call(rbind,dat3)
#?? DAYS_DIFF? Mean
#1??????? -52 208.0
#2??????? 462? 78.2
#3??????? 505 123.8
#4??????? 819 157.6
#5???????? 98? 86.0
#6??????? 223 107.2
#7??????? 658 131.2
#8??????? -52 208.0
#9??????? 505 123.8
#10?????? 819 157.6
#11??????? 98? 86.0
#12?????? 223 107.2
#13?????? 462? 78.2
#14?????? 658 131.2
#Not sure whether this will work or not in your huge dataset.? May be you can
try lapply() also.?
A.K.
________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date
Hi, Arun
Do you have idea about good package that split HUGE dataframe by two variables?
I was trying to use 'ddply' to calculate a mean of LDL-C values which
had same date for every patient.?
Therefore, I need to break down my dataframe, first by patient ID, then by the
date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C
on a same day.
The example is:
? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ?
LDL_BASE ?rf
118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days
2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days
2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days
2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days
2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days
2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days
2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days
2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days
2006-11-20 2.5
? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE
118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL
118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL
118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL
118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL
118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL
118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL
So, this patient has two LDL-C readings on '462 days', therefore, I want
to get a mean of these 17.4 and 139.0.
'ddply' did give me a mean when running on a test dataframe, but when I
used it on my dataframe with 200,000ish observations, the computer run for like
5 hours and return error. Do you have idea about other good function, that
focuses on split and apply function, and rbind?
Best
Weijia
On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
That is actually a great idea, thanks again!>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may
not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>> ?join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com>
wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2
>>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1
>>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2
>>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>>
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>> ?row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at
r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another
date
>>>
>>> So this is how the data frame looks like:
>>>
>>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1
>>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2
>>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>>>
>>> I need to find, the single observation, which has the closest date
of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which
the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> + ?#remove all uncessary variables
>>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> + ?#get a list of every unique ID
>>> + ?uniqueID <- unique(baseline$PT_ID)
>>> +
>>> + ?#make a vector that will contain the smallest DAYS_DIFF
>>> + ?first <- rep(-99,length(uniqueID))
>>> +
>>> + ?i = 1
>>> + ?#loop through each unique ID
>>> + ?for (PT_ID in uniqueID){
>>> +
>>> + ?#for each iteration get the smallest DAYS_DIFF for that ID
>>> + ?first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> + ?#up the iteration counter
>>> + ?i = i + 1
>>> +
>>> + ?}
>>> + ?#make a data frame with the lowest DAYS_DIFF and ID
>>> + ?newdata <- data.frame(uniqueID,first)
>>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> + ?#return the data frame containing the lowest GPI for each ID
>>> + ?return(newdata)
>>> + ?}
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total
11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID ==
PT_ID), abs(baseline$DAYS_DIFF)) :
>>> ? undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of
the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>> ? ? [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>
Hi Weija,
You can also try this instead of the loop solution in my previous reply.
dat1<-read.table(text="
ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF??
LDL_BASE? rf
118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20
2.5
118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20
2.5
118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20
2.5
118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20
2.5
118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20
2.5
118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25
2.5
118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25
2.5
118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25
2.5
118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26
2.5
118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26
2.5
118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26
2.5
118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27
2.5
118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27
2.5
118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27
2.5
118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27
2.5
118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27
2.5
?",sep="",header=TRUE)
dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-data.frame(do.call(rbind,lapply(dat2,`[`,c(6,8))))
dat4<-data.frame(LDL_Date.PTID=row.names(dat3),dat3)
row.names(dat4)<-1:nrow(dat4)
dat4$LDL_Date.PTID<-gsub("(.*..*)\\.\\d+","\\1",dat4$LDL_Date.PTID)
aggregate(dat4$OBS_VALUE,list(dat4$DAYS_DIFF,dat4$LDL_Date.PTID),mean)
Group.1??????????? Group.2???? x
#1????? -52 2006-11-20.9624295 208.0
#2????? 462 2006-11-20.9624295? 78.2
#3????? 505 2006-11-20.9624295 123.8
#4????? 819 2006-11-20.9624295 157.6
#5?????? 98 2006-11-25.9624295? 86.0
#6????? 223 2006-11-25.9624295 107.2
#7????? 658 2006-11-25.9624295 131.2
#8????? -52 2006-11-26.9624296 208.0
#9????? 505 2006-11-26.9624296 123.8
#10???? 819 2006-11-26.9624296 157.6
#11????? 98 2006-11-27.9624296? 86.0
#12???? 223 2006-11-27.9624296 107.2
#13???? 462 2006-11-27.9624296? 78.2
#14???? 658 2006-11-27.9624296 131.2
A.K.
________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date
Hi, Arun
Do you have idea about good package that split HUGE dataframe by two variables?
I was trying to use 'ddply' to calculate a mean of LDL-C values which
had same date for every patient.?
Therefore, I need to break down my dataframe, first by patient ID, then by the
date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C
on a same day.
The example is:
? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ?
LDL_BASE ?rf
118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days
2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days
2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days
2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days
2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days
2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days
2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days
2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days
2006-11-20 2.5
? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE
118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL
118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL
118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL
118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL
118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL
118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL
So, this patient has two LDL-C readings on '462 days', therefore, I want
to get a mean of these 17.4 and 139.0.
'ddply' did give me a mean when running on a test dataframe, but when I
used it on my dataframe with 200,000ish observations, the computer run for like
5 hours and return error. Do you have idea about other good function, that
focuses on split and apply function, and rbind?
Best
Weijia
On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
That is actually a great idea, thanks again!>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may
not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>> ?join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com>
wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2
>>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1
>>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2
>>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>>
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>> ?row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at
r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another
date
>>>
>>> So this is how the data frame looks like:
>>>
>>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1
>>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2
>>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>>>
>>> I need to find, the single observation, which has the closest date
of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which
the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> + ?#remove all uncessary variables
>>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> + ?#get a list of every unique ID
>>> + ?uniqueID <- unique(baseline$PT_ID)
>>> +
>>> + ?#make a vector that will contain the smallest DAYS_DIFF
>>> + ?first <- rep(-99,length(uniqueID))
>>> +
>>> + ?i = 1
>>> + ?#loop through each unique ID
>>> + ?for (PT_ID in uniqueID){
>>> +
>>> + ?#for each iteration get the smallest DAYS_DIFF for that ID
>>> + ?first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> + ?#up the iteration counter
>>> + ?i = i + 1
>>> +
>>> + ?}
>>> + ?#make a data frame with the lowest DAYS_DIFF and ID
>>> + ?newdata <- data.frame(uniqueID,first)
>>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> + ?#return the data frame containing the lowest GPI for each ID
>>> + ?return(newdata)
>>> + ?}
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total
11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID ==
PT_ID), abs(baseline$DAYS_DIFF)) :
>>> ? undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of
the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>> ? ? [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>