Dear R-listers,
I am a relatively inexperienced R-user currently migrating from Stata. I
am deeply frustrated by this data manipulation question: I know how I
could do it in Stata, but I cannot make it work in R.
I have a data frame of hospitalization data where each row represents an
admission. I need to know when patients were first discharged, but the
problem is that patients were sometimes transferred between hospital
departments. In my data a transfer looks like a new admission, except
that it has a 'start' date equal to the previous admission's 'stop' date.
Here is an example:
id <- c(rep("a",4),rep("b",2), rep("c",5), rep("d",1))
start <- c(c(0,6,17,20),c(0,1),c(0,5,10,11,50),c(0))
stop <- c(c(6,12,20,30),c(1,10),c(3,10,11,30,55),c(6))
data <- as.data.frame(cbind(id,start,stop))
data
# id start stop
# 1 a 0 6
# 2 a 6 12
# 3 a 17 20
# 4 a 20 30
# 5 b 0 1
# 6 b 1 10
# 7 c 0 3
# 8 c 5 10
# 9 c 10 11
# 10 c 11 30
# 11 c 50 55
# 12 d 0 6
So, what I want to end up with is this:
id start stop
a 0 12 # This patient was transferred at time 6 and discharged at time 12.
# The admission starting at time 17 is therefore irrelevant.
b 0 10
c 0 3
d 0 6
I have tried tons of variations on lapply, sapply, split, for, etc.,
all to no avail.
Thank you in advance for any assistance.
Best regards,
Peter Jepsen, MD.
On Thu, Nov 6, 2008 at 4:23 PM, Peter Jepsen <PJ at dce.au.dk> wrote:
>
> Here is an example:
>
> id <- c(rep("a",4),rep("b",2), rep("c",5), rep("d",1))
> start <- c(c(0,6,17,20),c(0,1),c(0,5,10,11,50),c(0))
> stop <- c(c(6,12,20,30),c(1,10),c(3,10,11,30,55),c(6))
> data <- as.data.frame(cbind(id,start,stop))
> data
> # id start stop
> # 1 a 0 6
> # 2 a 6 12
> # 3 a 17 20
> # 4 a 20 30
> # 5 b 0 1
> # 6 b 1 10
> # 7 c 0 3
> # 8 c 5 10
> # 9 c 10 11
> # 10 c 11 30
> # 11 c 50 55
> # 12 d 0 6
>
> So, what I want to end up with is this:
>
> id start stop
> a 0 12 # This patient was transferred at time 6 and discharged at
> time 12. The admission starting at time 17 is therefore irrelevant.
> b 0 10
> c 0 3
> d 0 6
>
Try this:
result <- list()
num <- length(levels(factor(data$id)))
length(result) <- 3*num
dim(result) <- c(3,num)
result <- data[data$start == 0,]
Y <- as.integer(row.names(result))
for (i in 1:num) {
  if (Y[i] == dim(data)[1]) (result[i,3] <- data[dim(data)[1],3])
  else (result[i,3] <- data[Y[i]+1,3])
}
result
Sorry it is ugly cuz i am new too but hopefully it gives you some ideas.
How about:
id <- c(rep("a",4),rep("b",2), rep("c",5), rep("d",1))
start <- c(c(0,6,17,20),c(0,1),c(0,5,10,11,50),c(0))
stop <- c(c(6,12,20,30),c(1,10),c(3,10,11,30,55),c(6))
data <- data.frame(id,start,stop)
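f <- function(data){
  # m[j]: index of the row after the first row whose stop equals start[j] (NA if none)
  m <- match(data$start,data$stop) + 1
  # Single admission: no transfer is possible, so use row 1
  if (length(m)==1 && is.na(m)) m <- 1
  # Second admission is not a transfer: the first stop is the discharge
  if (length(m) > 1 && is.na(m[2])) m <- 1
  # Otherwise return the stop of the earliest transfer row
  data$stop[min(m,na.rm=TRUE)]
}
# Apply f to each patient's rows separately
by(data,data$id,f)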
The if statements in the function handle some special cases (a patient
with a single admission, or a second admission that is not a transfer);
in all other cases the first line will do the trick.
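A possible alternative, offered only as a sketch: instead of special-casing
the result of match, walk each patient's admissions in start order and keep
moving forward while the next row starts exactly when the current one stops.
The helper name first_discharge and the data frame name adm are made up for
this illustration, and the sketch assumes numeric start and stop columns
(built with data.frame rather than cbind).

```r
# Hypothetical helper, not part of the original thread.
# Assumes d holds one patient's admissions with numeric start/stop.
first_discharge <- function(d) {
  d <- d[order(d$start), ]
  i <- 1
  # Advance while the next admission starts exactly when this one stops
  # (i.e. the next row is a transfer, not a new admission).
  while (i < nrow(d) && d$start[i + 1] == d$stop[i]) {
    i <- i + 1
  }
  data.frame(id = d$id[1], start = d$start[1], stop = d$stop[i])
}

# Same example data as in the question, under a different name.
adm <- data.frame(
  id    = c(rep("a", 4), rep("b", 2), rep("c", 5), "d"),
  start = c(0, 6, 17, 20, 0, 1, 0, 5, 10, 11, 50, 0),
  stop  = c(6, 12, 20, 30, 1, 10, 3, 10, 11, 30, 55, 6)
)

# One result row per patient, stacked into a single data frame.
res <- do.call(rbind, lapply(split(adm, adm$id), first_discharge))
res
```

On the example data this gives stops of 12, 10, 3 and 6 for a, b, c and d.
Unlike the match version it also follows a chain of several transfers
(0-6, 6-12, 12-20 gives 20), which may or may not be what is wanted.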
I would like to add that using data as a variable name is somewhat bad
practice, as it masks R's built-in data function.
And I changed the way you built the data frame, as your method (cbind
followed by as.data.frame) would convert everything to factors.
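To see the difference for yourself, something like the following should
work. It is only a small sketch on made-up two-row data; the names m,
via_cbind and direct are chosen just for this illustration.

```r
# cbind() builds a matrix, and a matrix holds a single type,
# so mixing letters and numbers turns every value into a string.
m <- cbind(id = c("a", "b"), start = c(0, 1), stop = c(6, 10))
is.character(m)

# Converting that matrix keeps the columns non-numeric
# (factor or character, depending on the R version).
via_cbind <- as.data.frame(m)

# data.frame() keeps each column's own type.
direct <- data.frame(id = c("a", "b"), start = c(0, 1), stop = c(6, 10))

sapply(via_cbind, class)
sapply(direct, class)
```

If you are stuck with the cbind version, as.numeric(as.character(x)) turns
such a column back into numbers.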
Good luck
Bart