thr3ads.net - R help - [R] R_closest date [Sep 2012]

If this information is useful, please help other people find it:
Share via:

WANG WEIJIA

2012-Sep-01 17:10 UTC

[R] R_closest date

Hi, 

I have encountered an issue about finding a date closest to another date

So this is how the data frame looks like:

    PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13   4549 2002-08-21 2002-08-20        -1       183        2
14   4549 2002-08-21 2002-11-14        85        91        1
15   4549 2002-08-21 2003-02-18       181        89        1
16   4549 2002-08-21 2003-05-15       267       109        2
17   4549 2002-08-21 2003-12-16       482        96        1
128  4839 2006-11-28 2006-11-28         0       179        2

I need to find, the single observation, which has the closest date of
'OBS_DATE' to 'IDX_DT'.

For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE
is just one day away from IDX_DT.

I was thinking about using abs(), and I got this:

baseline<- function(x){
+  
+  #remove all uncessary variables
+  baseline<- x[,c("PT_ID","DAYS_DIFF")]
+  
+  #get a list of every unique ID
+  uniqueID <- unique(baseline$PT_ID)
+  
+  #make a vector that will contain the smallest DAYS_DIFF
+  first <- rep(-99,length(uniqueID))
+  
+  i = 1
+  #loop through each unique ID
+  for (PT_ID in uniqueID){
+  
+  #for each iteration get the smallest DAYS_DIFF for that ID
+  first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
+  
+  #up the iteration counter
+  i = i + 1
+  
+  }
+  #make a data frame with the lowest DAYS_DIFF and ID
+  newdata <- data.frame(uniqueID,first)
+  names(newdata) <- c("PT_ID","DAYS_DIFF")
+  
+  #return the data frame containing the lowest GPI for each ID
+  return(newdata)
+  }> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs,
all unique#Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
  undefined columns selected
 
Can anyone help me in figuring out how to get the minimum value of the absolute
value of DAYS_DIFF for unique ID?

Thanks a lot
	[[alternative HTML version deleted]]

Rui Barradas

2012-Sep-01 18:17 UTC

head link

[R] R_closest date

Hello,

Try the following.

dat <- read.table(text="
  PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13   4549 2002-08-21 2002-08-20        -1       183        2
14   4549 2002-08-21 2002-11-14        85        91        1
15   4549 2002-08-21 2003-02-18       181        89        1
16   4549 2002-08-21 2003-05-15       267       109        2
17   4549 2002-08-21 2003-12-16       482        96        1
128  4839 2006-11-28 2006-11-28         0       179        2
", header=TRUE)

spl <- split(dat, dat$PT_ID)
idx <- sapply(spl, function(x) which.min(x$DAYS_DIFF))
res <- lapply(names(idx), function(nm) spl[[ nm ]][ idx[nm], ])
do.call(rbind, res)

And assign the return value of do.call to your result (reuse 'res').

Hope this helps,

Rui Barradas
Em 01-09-2012 18:10, WANG WEIJIA escreveu:> Hi,
>
> I have encountered an issue about finding a date closest to another date
>
> So this is how the data frame looks like:
>
>      PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> 13   4549 2002-08-21 2002-08-20        -1       183        2
> 14   4549 2002-08-21 2002-11-14        85        91        1
> 15   4549 2002-08-21 2003-02-18       181        89        1
> 16   4549 2002-08-21 2003-05-15       267       109        2
> 17   4549 2002-08-21 2003-12-16       482        96        1
> 128  4839 2006-11-28 2006-11-28         0       179        2
>
> I need to find, the single observation, which has the closest date of
'OBS_DATE' to 'IDX_DT'.
>
> For example, for 'PT_ID' of 4549, I need row 13, of which the
OBS_DATE is just one day away from IDX_DT.
>
> I was thinking about using abs(), and I got this:
>
> baseline<- function(x){
> +
> +  #remove all uncessary variables
> +  baseline<- x[,c("PT_ID","DAYS_DIFF")]
> +
> +  #get a list of every unique ID
> +  uniqueID <- unique(baseline$PT_ID)
> +
> +  #make a vector that will contain the smallest DAYS_DIFF
> +  first <- rep(-99,length(uniqueID))
> +
> +  i = 1
> +  #loop through each unique ID
> +  for (PT_ID in uniqueID){
> +
> +  #for each iteration get the smallest DAYS_DIFF for that ID
> +  first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
> +
> +  #up the iteration counter
> +  i = i + 1
> +
> +  }
> +  #make a data frame with the lowest DAYS_DIFF and ID
> +  newdata <- data.frame(uniqueID,first)
> +  names(newdata) <- c("PT_ID","DAYS_DIFF")
> +
> +  #return the data frame containing the lowest GPI for each ID
> +  return(newdata)
> +  }
>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368
obs, all unique#
> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
>    undefined columns selected
>   
> Can anyone help me in figuring out how to get the minimum value of the
absolute value of DAYS_DIFF for unique ID?
>
> Thanks a lot
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

arun

2012-Sep-01 21:04 UTC

head link

[R] R_closest date

Hi,
Try this:
dat1 <- read.table(text="
? PT_ID??? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13? 4549 2002-08-21 2002-08-20??????? -1????? 183??????? 2
14? 4549 2002-08-21 2002-11-14??????? 85??????? 91??????? 1
15? 4549 2002-08-21 2003-02-18????? 181??????? 89??????? 1
16? 4549 2002-08-21 2003-05-15????? 267????? 109??????? 2
17? 4549 2002-08-21 2003-12-16????? 482??????? 96??????? 1
128? 4839 2006-11-28 2006-11-28??????? 0????? 179??????? 2
", header=TRUE)
dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
merge(dat1,dat3)
#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY
#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2
#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2

#or,
dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
?row.names(dat4)<-1:nrow(dat4)
merge(dat1,dat4)
#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY
#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2
#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2
A.K.





----- Original Message -----
From: WANG WEIJIA <wwang.nyu at gmail.com>
To: "r-help at R-project.org" <r-help at r-project.org>
Cc: 
Sent: Saturday, September 1, 2012 1:10 PM
Subject: [R] R_closest date

Hi, 

I have encountered an issue about finding a date closest to another date

So this is how the data frame looks like:

? ? PT_ID? ?  IDX_DT?  OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13?  4549 2002-08-21 2002-08-20? ? ? ? -1? ? ?  183? ? ? ? 2
14?  4549 2002-08-21 2002-11-14? ? ? ? 85? ? ? ? 91? ? ? ? 1
15?  4549 2002-08-21 2003-02-18? ? ?  181? ? ? ? 89? ? ? ? 1
16?  4549 2002-08-21 2003-05-15? ? ?  267? ? ?  109? ? ? ? 2
17?  4549 2002-08-21 2003-12-16? ? ?  482? ? ? ? 96? ? ? ? 1
128? 4839 2006-11-28 2006-11-28? ? ? ?  0? ? ?  179? ? ? ? 2

I need to find, the single observation, which has the closest date of
'OBS_DATE' to 'IDX_DT'.

For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE
is just one day away from IDX_DT.

I was thinking about using abs(), and I got this:

baseline<- function(x){
+? 
+? #remove all uncessary variables
+? baseline<- x[,c("PT_ID","DAYS_DIFF")]
+? 
+? #get a list of every unique ID
+? uniqueID <- unique(baseline$PT_ID)
+? 
+? #make a vector that will contain the smallest DAYS_DIFF
+? first <- rep(-99,length(uniqueID))
+? 
+? i = 1
+? #loop through each unique ID
+? for (PT_ID in uniqueID){
+? 
+? #for each iteration get the smallest DAYS_DIFF for that ID
+? first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
+? 
+? #up the iteration counter
+? i = i + 1
+? 
+? }
+? #make a data frame with the lowest DAYS_DIFF and ID
+? newdata <- data.frame(uniqueID,first)
+? names(newdata) <- c("PT_ID","DAYS_DIFF")
+? 
+? #return the data frame containing the lowest GPI for each ID
+? return(newdata)
+? }> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs,
all unique#Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
? undefined columns selected

Can anyone help me in figuring out how to get the minimum value of the absolute
value of DAYS_DIFF for unique ID?

Thanks a lot
??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

arun

2012-Sep-02 16:12 UTC

head link

[R] R_closest date

Hi,
No problem.

If you use join() instead of merge(), the original order of columns may not get
altered.

dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
library(plyr)
?join(dat1,dat3,type="inner")
#Joining by: PT_ID, DAYS_DIFF
?# PT_ID???? IDX_DT?? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
#1? 4549 2002-08-21 2002-08-20??????? -1?????? 183??????? 2
#2? 4839 2006-11-28 2006-11-28???????? 0?????? 179??????? 2
A.K.






________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Sent: Saturday, September 1, 2012 5:11 PM
Subject: Re: [R] R_closest date


Thank you Arun, for your help again.

Best
______________________________
WANG WEIJIA
Graudate Research and Teaching?Assistant
Department of Environmental?Medicine
New York University, School of?Medicine
wwang.nyu at gmail.com




On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:

Hi,>Try this:
>dat1 <- read.table(text="
>? PT_ID??? IDX_DT? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>13? 4549 2002-08-21 2002-08-20??????? -1????? 183??????? 2
>14? 4549 2002-08-21 2002-11-14??????? 85??????? 91??????? 1
>15? 4549 2002-08-21 2003-02-18????? 181??????? 89??????? 1
>16? 4549 2002-08-21 2003-05-15????? 267????? 109??????? 2
>17? 4549 2002-08-21 2003-12-16????? 482??????? 96??????? 1
>128? 4839 2006-11-28 2006-11-28??????? 0????? 179??????? 2
>", header=TRUE)
>dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>merge(dat1,dat3)
>#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY
>#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2
>#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2
>
>#or,
>dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>?row.names(dat4)<-1:nrow(dat4)
>merge(dat1,dat4)
>#? PT_ID DAYS_DIFF???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY
>#1? 4549??????? -1 2002-08-21 2002-08-20?????? 183??????? 2
>#2? 4839???????? 0 2006-11-28 2006-11-28?????? 179??????? 2
>A.K.
>
>
>
>
>
>----- Original Message -----
>From: WANG WEIJIA <wwang.nyu at gmail.com>
>To: "r-help at R-project.org" <r-help at r-project.org>
>Cc: 
>Sent: Saturday, September 1, 2012 1:10 PM
>Subject: [R] R_closest date
>
>Hi, 
>
>I have encountered an issue about finding a date closest to another date
>
>So this is how the data frame looks like:
>
>? ? PT_ID? ? ?IDX_DT? ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>13? ?4549 2002-08-21 2002-08-20? ? ? ? -1? ? ? ?183? ? ? ? 2
>14? ?4549 2002-08-21 2002-11-14? ? ? ? 85? ? ? ? 91? ? ? ? 1
>15? ?4549 2002-08-21 2003-02-18? ? ? ?181? ? ? ? 89? ? ? ? 1
>16? ?4549 2002-08-21 2003-05-15? ? ? ?267? ? ? ?109? ? ? ? 2
>17? ?4549 2002-08-21 2003-12-16? ? ? ?482? ? ? ? 96? ? ? ? 1
>128? 4839 2006-11-28 2006-11-28? ? ? ? ?0? ? ? ?179? ? ? ? 2
>
>I need to find, the single observation, which has the closest date of
'OBS_DATE' to 'IDX_DT'.
>
>For example, for 'PT_ID' of 4549, I need row 13, of which the
OBS_DATE is just one day away from IDX_DT.
>
>I was thinking about using abs(), and I got this:
>
>baseline<- function(x){
>+? 
>+? #remove all uncessary variables
>+? baseline<- x[,c("PT_ID","DAYS_DIFF")]
>+? 
>+? #get a list of every unique ID
>+? uniqueID <- unique(baseline$PT_ID)
>+? 
>+? #make a vector that will contain the smallest DAYS_DIFF
>+? first <- rep(-99,length(uniqueID))
>+? 
>+? i = 1
>+? #loop through each unique ID
>+? for (PT_ID in uniqueID){
>+? 
>+? #for each iteration get the smallest DAYS_DIFF for that ID
>+? first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>+? 
>+? #up the iteration counter
>+? i = i + 1
>+? 
>+? }
>+? #make a data frame with the lowest DAYS_DIFF and ID
>+? newdata <- data.frame(uniqueID,first)
>+? names(newdata) <- c("PT_ID","DAYS_DIFF")
>+? 
>+? #return the data frame containing the lowest GPI for each ID
>+? return(newdata)
>+? }
>
>ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs,
all unique#
>>Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
>? undefined columns selected
>
>Can anyone help me in figuring out how to get the minimum value of the
absolute value of DAYS_DIFF for unique ID?
>
>Thanks a lot
>??? [[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
>

arun

2012-Sep-06 19:20 UTC

head link

[R] R_closest date

HI,

For your question to split by two variables, try this:
dat1<-read.table(text="
ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF??
LDL_BASE? rf
118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20
2.5
118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20
2.5
118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20
2.5
118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20
2.5
118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20
2.5
118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25
2.5
118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25
2.5
118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25
2.5
118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26
2.5
118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26
2.5
118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26
2.5
118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27
2.5
118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27
2.5
118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27
2.5
118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27
2.5
118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27
2.5
?",sep="",header=TRUE)

dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-list()
?for(i in seq_along(dat2)){
?dat3[[i]]<-list()
?dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE))
?}
dat3
do.call(rbind,dat3)
#?? DAYS_DIFF? Mean
#1??????? -52 208.0
#2??????? 462? 78.2
#3??????? 505 123.8
#4??????? 819 157.6
#5???????? 98? 86.0
#6??????? 223 107.2
#7??????? 658 131.2
#8??????? -52 208.0
#9??????? 505 123.8
#10?????? 819 157.6
#11??????? 98? 86.0
#12?????? 223 107.2
#13?????? 462? 78.2
#14?????? 658 131.2


#Not sure whether this will work or not in your huge dataset.? May be you can
try lapply() also.?

A.K.
________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date


Hi, Arun

Do you have idea about good package that split HUGE dataframe by two variables?

I was trying to use 'ddply' to calculate a mean of LDL-C values which
had same date for every patient.?

Therefore, I need to break down my dataframe, first by patient ID, then by the
date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C
on a same day.

The example is:

? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ?
LDL_BASE ?rf
118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days
2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days
2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days
2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days
2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days
2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days
2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days
2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days
2006-11-20 2.5
? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE
118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL
118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL
118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL
118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL
118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL
118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL

So, this patient has two LDL-C readings on '462 days', therefore, I want
to get a mean of these 17.4 and 139.0.

'ddply' did give me a mean when running on a test dataframe, but when I
used it on my dataframe with 200,000ish observations, the computer run for like
5 hours and return error. Do you have idea about other good function, that
focuses on split and apply function, and rbind?

Best
Weijia


On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:

That is actually a great idea, thanks again!>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may
not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>> ?join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com>
wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2
>>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1
>>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2
>>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>>
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>> ?row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at
r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another
date
>>>
>>> So this is how the data frame looks like:
>>>
>>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1
>>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2
>>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>>>
>>> I need to find, the single observation, which has the closest date
of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which
the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> + ?#remove all uncessary variables
>>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> + ?#get a list of every unique ID
>>> + ?uniqueID <- unique(baseline$PT_ID)
>>> +
>>> + ?#make a vector that will contain the smallest DAYS_DIFF
>>> + ?first <- rep(-99,length(uniqueID))
>>> +
>>> + ?i = 1
>>> + ?#loop through each unique ID
>>> + ?for (PT_ID in uniqueID){
>>> +
>>> + ?#for each iteration get the smallest DAYS_DIFF for that ID
>>> + ?first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> + ?#up the iteration counter
>>> + ?i = i + 1
>>> +
>>> + ?}
>>> + ?#make a data frame with the lowest DAYS_DIFF and ID
>>> + ?newdata <- data.frame(uniqueID,first)
>>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> + ?#return the data frame containing the lowest GPI for each ID
>>> + ?return(newdata)
>>> + ?}
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total
11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID ==
PT_ID), abs(baseline$DAYS_DIFF)) :
>>> ? undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of
the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>> ? ? [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>

arun

2012-Sep-06 21:32 UTC

head link

[R] R_closest date

Hi Weija,

You can also try this instead of the loop solution in my previous reply.

dat1<-read.table(text="
ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF??
LDL_BASE? rf
118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20
2.5
118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20
2.5
118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20
2.5
118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20
2.5
118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20
2.5
118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25
2.5
118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25
2.5
118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25
2.5
118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26
2.5
118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26
2.5
118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26
2.5
118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27
2.5
118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27
2.5
118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27
2.5
118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27
2.5
118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27
2.5
?",sep="",header=TRUE)

dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-data.frame(do.call(rbind,lapply(dat2,`[`,c(6,8))))
dat4<-data.frame(LDL_Date.PTID=row.names(dat3),dat3)
row.names(dat4)<-1:nrow(dat4)
dat4$LDL_Date.PTID<-gsub("(.*..*)\\.\\d+","\\1",dat4$LDL_Date.PTID)
aggregate(dat4$OBS_VALUE,list(dat4$DAYS_DIFF,dat4$LDL_Date.PTID),mean)
Group.1??????????? Group.2???? x
#1????? -52 2006-11-20.9624295 208.0
#2????? 462 2006-11-20.9624295? 78.2
#3????? 505 2006-11-20.9624295 123.8
#4????? 819 2006-11-20.9624295 157.6
#5?????? 98 2006-11-25.9624295? 86.0
#6????? 223 2006-11-25.9624295 107.2
#7????? 658 2006-11-25.9624295 131.2
#8????? -52 2006-11-26.9624296 208.0
#9????? 505 2006-11-26.9624296 123.8
#10???? 819 2006-11-26.9624296 157.6
#11????? 98 2006-11-27.9624296? 86.0
#12???? 223 2006-11-27.9624296 107.2
#13???? 462 2006-11-27.9624296? 78.2
#14???? 658 2006-11-27.9624296 131.2
A.K.






________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date


Hi, Arun

Do you have idea about good package that split HUGE dataframe by two variables?

I was trying to use 'ddply' to calculate a mean of LDL-C values which
had same date for every patient.?

Therefore, I need to break down my dataframe, first by patient ID, then by the
date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C
on a same day.

The example is:

? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ?
LDL_BASE ?rf
118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days
2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days
2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days
2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days
2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days
2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days
2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days
2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days
2006-11-20 2.5
? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE
118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL
118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL
118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL
118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL
118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL
118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL

So, this patient has two LDL-C readings on '462 days', therefore, I want
to get a mean of these 17.4 and 139.0.

'ddply' did give me a mean when running on a test dataframe, but when I
used it on my dataframe with 200,000ish observations, the computer run for like
5 hours and return error. Do you have idea about other good function, that
focuses on split and apply function, and rbind?

Best
Weijia


On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:

That is actually a great idea, thanks again!>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may
not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>> ?join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com>
wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2
>>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1
>>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2
>>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>>
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>> ?row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at
r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another
date
>>>
>>> So this is how the data frame looks like:
>>>
>>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1
>>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2
>>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>>>
>>> I need to find, the single observation, which has the closest date
of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which
the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> + ?#remove all uncessary variables
>>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> + ?#get a list of every unique ID
>>> + ?uniqueID <- unique(baseline$PT_ID)
>>> +
>>> + ?#make a vector that will contain the smallest DAYS_DIFF
>>> + ?first <- rep(-99,length(uniqueID))
>>> +
>>> + ?i = 1
>>> + ?#loop through each unique ID
>>> + ?for (PT_ID in uniqueID){
>>> +
>>> + ?#for each iteration get the smallest DAYS_DIFF for that ID
>>> + ?first[i] <-
min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> + ?#up the iteration counter
>>> + ?i = i + 1
>>> +
>>> + ?}
>>> + ?#make a data frame with the lowest DAYS_DIFF and ID
>>> + ?newdata <- data.frame(uniqueID,first)
>>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> + ?#return the data frame containing the lowest GPI for each ID
>>> + ?return(newdata)
>>> + ?}
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total
11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID ==
PT_ID), abs(baseline$DAYS_DIFF)) :
>>> ? undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of
the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>> ? ? [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>

Apparently Analagous Threads

Search for more seemingly similar threads

R help - Sep 2012 - R_closest date

[R] R_closest date

[R] R_closest date

[R] R_closest date

[R] R_closest date

[R] R_closest date

[R] R_closest date

Apparently Analagous Threads