Hi,

I am new to R, coming from a few years of using Stata. I've been twisting my brain and checking several R and S references over the last few days to try to solve this data management problem: I have a data set with a unique patient identifier that is repeated across multiple rows, a variable with the month of the patient encounter, and a continuous variable for the cost of each individual encounter. The data look like this:

ID  date       cost
1   "2001-01"  200.00
1   "2001-01"  123.94
1   "2001-03"  100.23
1   "2001-04"  150.34
2   "2001-03"  296.34
2   "2002-05"  156.36

I would like to obtain the median costs and boxplots for the sum of encounters happening in the first six months after the index encounter (first patient encounter) for each patient, then the mean and median costs for the costs happening from 6 to 12 months after the index encounter, and so on. Notice that the first ID has two encounters during the index date, making it more difficult to define a single row with the index encounter.

Any help would be appreciated,

Ricardo

Ricardo Pietrobon, MD
Assistant Professor of Surgery
Duke University Medical Center
Durham, NC 27710 US
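For anyone who wants to follow along, the sample rows above can be entered as a small data frame. This is only a sketch: the name `patients` is the one used later in the thread, and storing `date` as a character column is an assumption about how the real data are held.

```r
# Sample rows from the post; column types are an assumption
patients <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2),
  date = c("2001-01", "2001-01", "2001-03", "2001-04", "2001-03", "2002-05"),
  cost = c(200.00, 123.94, 100.23, 150.34, 296.34, 156.36),
  stringsAsFactors = FALSE
)

# Quick check of the structure: 6 rows, 3 columns
str(patients)
```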
Ricardo Pietrobon <rpietro at duke.edu> writes:

> ID date cost
> 1 "2001-01" 200.00
> 1 "2001-01" 123.94
> 1 "2001-03" 100.23
> 1 "2001-04" 150.34
> 2 "2001-03" 296.34
> 2 "2002-05" 156.36
>
> I would like to obtain the median costs and boxplots for the sum of
> encounters happening in the first six months after the index encounter
> (first patient encounter) for each patient, then the mean and median costs
> for the costs happening from 6 to 12 months after the index encounter, and
> so on. Notice that the first ID has two encounters during the index date,
> making it more difficult to define a single row with the index encounter.
>
> Any help would be appreciated,

Let's see... You're going to need a bit of slight ugliness to convert the date to a numeric month number. Something like (NB: That's a code that means "I didn't actually try this"...)

attach(yourdata)
# as.character() guards against date having been read in as a factor
monthnum <- sapply(strsplit(as.character(date), "-"),
                   function(x) sum(as.numeric(x) * c(12, 1)))

Then we need a table of the index dates for each person

tbl <- tapply(monthnum, ID, min)

Now subtract the index date from monthnum

# look up by ID name rather than position, so IDs need not be 1, 2, 3, ...
months.post.index <- monthnum - tbl[as.character(ID)]

Then you probably want to look at the subset of your original data frame and do the sums

total.cost.6mo <- with(subset(yourdata, months.post.index < 6),
                       tapply(cost, ID, sum))

and finally

boxplot(total.cost.6mo)
median(total.cost.6mo)

(You could elaborate by converting months.post.index with cut() and use lapply(names(period),.....) to give you a list of tables, which boxplot() might actually know how to plot directly.)

-- 
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Blegdamsvej 3, 2200 Cph. N, Denmark
Ph: (+45) 35327918   FAX: (+45) 35327907
(p.dalgaard at biostat.ku.dk)
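The elaboration hinted at in the closing parenthesis might look roughly like the sketch below. It is untested against real data and makes a few assumptions: the data frame is called `patients`, integer division (`%/%`) stands in for cut(), and the names `index.month`, `period`, `totals`, `period.median` and `period.mean` are invented for illustration.

```r
# Sample rows from the original post (data frame name is an assumption)
patients <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2),
  date = c("2001-01", "2001-01", "2001-03", "2001-04", "2001-03", "2002-05"),
  cost = c(200.00, 123.94, 100.23, 150.34, 296.34, 156.36),
  stringsAsFactors = FALSE
)

# Month number on a continuous scale: 12 * year + month
monthnum <- sapply(strsplit(as.character(patients$date), "-"),
                   function(x) sum(as.numeric(x) * c(12, 1)))

# Index (first) month per patient, looked up by ID name rather than position
index.month <- tapply(monthnum, patients$ID, min)
months.post.index <- monthnum - as.vector(index.month[as.character(patients$ID)])

# Integer division instead of cut(): 0 = months 0-5, 1 = months 6-11, ...
period <- months.post.index %/% 6

# Total cost per patient (rows) and period (columns); NA = no encounters
totals <- tapply(patients$cost, list(ID = patients$ID, period = period), sum)

# Median and mean of the per-patient totals within each period
period.median <- apply(totals, 2, median, na.rm = TRUE)
period.mean   <- apply(totals, 2, mean, na.rm = TRUE)
print(totals)
```

Periods in which no patient has an encounter (period 1 in the sample rows) simply do not appear as columns. Passing as.data.frame(totals) to boxplot() should then draw one box per period that does appear.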
Try this. The function takes a vector of dates of the form yyyy-mm and produces a new character vector of dates of the same form, except that each output date is the beginning of the 6-month period in which the input date lies. The 6-month intervals are measured from the minimum date.

date.grouping <- function(d) {
  # for each date in d, compute the start of the 6-month period containing it
  mat <- matrix(as.numeric(unlist(strsplit(as.character(d), "-"))), nrow=2)
  f <- function(x) do.call( "ISOdate", as.list(x) )
  POSIXct.dates <- apply(rbind(mat,1), 2, f) + ISOdate(1970,1,1)
  breaks <- c(seq(from=min(POSIXct.dates), along=POSIXct.dates, by="6 mo"), Inf)
  format( as.POSIXct( cut( POSIXct.dates, breaks, include.lowest=TRUE )), "%Y-%m" )
}

patients2 <- with( patients, tapply( cost, list(ID, date.grouping(date)), sum ) )
patients2 <- as.data.frame( patients2 )

summary(patients2)

boxplot(patients2)

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Sorry, but there was an error in the seq statement. Here it is again.

date.grouping <- function(d) {
  # for each date in d, compute the start of the 6-month period containing it
  mat <- matrix(as.numeric(unlist(strsplit(as.character(d), "-"))), nrow=2)
  f <- function(x) do.call( "ISOdate", as.list(x) )
  POSIXct.dates <- apply(rbind(mat,1), 2, f) + ISOdate(1970,1,1)
  breaks <- c(seq(from=min(POSIXct.dates), to=max(POSIXct.dates), by="6 months"), Inf)
  format( as.POSIXct( cut( POSIXct.dates, breaks, include.lowest=TRUE )), "%Y-%m" )
}

patients2 <- with( patients, tapply( cost, list(ID, date.grouping(date)), sum ) )
patients2 <- as.data.frame( patients2 )

summary(patients2)

boxplot(patients2)

--- Gabor Grothendieck <ggrothendieck@volcanomail.com> wrote:
> Try this. The function takes a vector of dates of the form yyyy-mm and
> produces a new character vector of dates of the same form except the
> output date is the beginning of the 6 month period in which the input
> date lies. The 6 month intervals are measured from the minimum date.
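The corrected call differs from the first version only in how the sequence of breaks is bounded: `along=` ties the number of breaks to the number of input rows, while `to=` ties it to the span of the dates. The sketch below contrasts the two on a made-up pair of dates. It uses the Date class rather than POSIXct to keep time zones out of the picture, and the object names are hypothetical.

```r
# Hypothetical example: only two encounter months, three years apart
dates <- as.Date(c("2001-01-01", "2004-01-01"))

# Length tied to the number of rows (what along= does): two breaks only
by.count <- seq(from = min(dates), by = "6 months", along.with = dates)

# Length tied to the date span: breaks run up to the latest date
by.range <- seq(from = min(dates), to = max(dates), by = "6 months")

print(by.count)
print(by.range)
```

With the row-count version, the last break falls well before the latest date, so once Inf is appended every later encounter lands in the same open-ended final bin.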
And here is a simplification I just noticed:

date.grouping <- function(d) {
  # for each date in d, compute the start of the 6-month period containing it
  POSIXct.dates <- as.POSIXct(paste(as.character(d), "01", sep="-"))
  breaks <- c(seq(from=min(POSIXct.dates), to=max(POSIXct.dates), by="6 months"), Inf)
  format( as.POSIXct( cut( POSIXct.dates, breaks, include.lowest=TRUE )), "%Y-%m" )
}

# copy the data, including the header row, to the clipboard first
patients <- read.table("clipboard", header=TRUE)
patients2 <- with( patients, tapply( cost, list(ID, date.grouping(date)), sum ) )
patients2 <- as.data.frame( patients2 )

summary(patients2)

boxplot(patients2)

--- Gabor Grothendieck <ggrothendieck@volcanomail.com> wrote:
> Sorry but there was an error in the seq statement. Here it is again.
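To see what this kind of grouping returns on the sample rows, here is a rough variant that can be run without the clipboard. It is not the code posted above: it swaps POSIXct and cut() for the Date class and findInterval(), and the names `date.grouping.Date`, `months` and `grp` are made up for the example.

```r
# Sample rows from the original post
patients <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2),
  date = c("2001-01", "2001-01", "2001-03", "2001-04", "2001-03", "2002-05"),
  cost = c(200.00, 123.94, 100.23, 150.34, 296.34, 156.36),
  stringsAsFactors = FALSE
)

# Variant of date.grouping() using Date and findInterval() (an assumption,
# not the code posted in the thread)
date.grouping.Date <- function(d, months = 6) {
  dates  <- as.Date(paste(as.character(d), "01", sep = "-"))
  breaks <- seq(from = min(dates), to = max(dates), by = paste(months, "months"))
  # findInterval() returns, for each date, the index of the last break <= it
  format(breaks[findInterval(as.numeric(dates), as.numeric(breaks))], "%Y-%m")
}

grp <- date.grouping.Date(patients$date)

# Total cost per patient (rows) and 6-month period (columns)
patients2 <- with(patients, tapply(cost, list(ID, grp), sum))
print(patients2)
```

As in the posted function, the periods are anchored at the earliest date in the whole data set, not at each patient's own index encounter.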