I am fitting a coxph model on a large dataset (approx 100,000 patients), and then trying to estimate the survival curves for several new patients based on the coxph object using survfit. When I run coxph I get the coxph object back fairly quickly however when I try to run survfit it does not come back. I am wondering if their is a more efficient way to get predicted survival curves from a coxph object.predict.coxph does not seem to generate survival curves. here is some sample code that mirrors what I am trying to do with my dataset, I get results using this code but it still takes a long time, my dataset includes quite a few more covariates, so any suggestions on speeding this up would be greatly appreciated. library(survival) ### generate sample data time <- rexp(100000,(1/180)) ag <- rnorm(100000,38,12) sx <- sample(x=c(0,1),100000,replace=TRUE) ac <- factor(sample(x=c(1,2,3,4,5),100000,replace=TRUE),levels=c(1:5)) ev <- sample(x=c(0,1),100000,replace=TRUE) c1 <- as.data.frame(cbind(ag,sx,ac)) #generate newdata ts <- as.data.frame (cbind(ag[23:24],sx[1000:1001],factor(ac[9000:9001],levels=c(1:5)))) colnames(ts) <- c("ag","sx","ac") cph <- coxph(Surv(time,ev)~ ag+sx+ac,data=c1) survfit(cph,newdata=ts,individual=F) thanks, Spencer [[alternative HTML version deleted]]
sj wrote:> I am fitting a coxph model on a large dataset (approx 100,000 patients), and > then trying to estimate the survival curves for several new patients based > on the coxph object using survfit. When I run coxph I get the coxph object > back fairly quickly however when I try to run survfit it does not come > back. I am wondering if their is a more efficient way to get predicted > survival curves from a coxph object.predict.coxph does not seem to generate > survival curves. > > here is some sample code that mirrors what I am trying to do with my > dataset, I get results using this code but it still takes a long time, my > dataset includes quite a few more covariates, so any suggestions on speeding > this up would be greatly appreciated. > > > library(survival) > ### generate sample data > time <- rexp(100000,(1/180)) > ag <- rnorm(100000,38,12) > sx <- sample(x=c(0,1),100000,replace=TRUE) > ac <- factor(sample(x=c(1,2,3,4,5),100000,replace=TRUE),levels=c(1:5)) > ev <- sample(x=c(0,1),100000,replace=TRUE) > c1 <- as.data.frame(cbind(ag,sx,ac))cl <- data.frame(ag, sex, ac)> > #generate newdata > ts <- as.data.frame > (cbind(ag[23:24],sx[1000:1001],factor(ac[9000:9001],levels=c(1:5)))) > colnames(ts) <- c("ag","sx","ac") > > > cph <- coxph(Surv(time,ev)~ ag+sx+ac,data=c1)Don't need data= since everything is already available.> > survfit(cph,newdata=ts,individual=F)The delay is probably due to computations of confidence limits. If you don't need them or don't mind using approximate confidence limits you can get very quick estimates using library(Design) f <- cph(Surv(time,ev) ~ ..., surv=TRUE) # do ?cph survplot(f, ...) # do ?survplot survest(f, ...) # do ?survest survfit(f, ...) # do ?survfit Or create a nomogram that gives users almost instant computations of survival probabilities at selected times without using a computer. There are examples of this in my book Regression Modeling Strategies. Do ?nomogram which has an example of drawing a nomogram that shows median, 3-month, and 6-month survival estimates from a parametric survival model fit, which you can also do with the Cox model using cph (a wrapper for coxph). Frank Harrell> > > > thanks, > > Spencer >-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
> When I run coxph I get the coxph object back fairly quickly, > however when I try to run survfit it does not come back.If you are very, very patient the routine will come back eventually. Unfortunately, for some very large data sets this could be months... The reason is that the algorithms for coxph have been carefully optimized over the years, but survfit is used so much less frequently that I have not propogated many of these improvements forward to that routine. In particular, there is a computation which is O(d*n) if done in the obvious way, but O(2n) when approached more cleverly; where d=number of events and n= number of observations in the data set. Your example has d ~ 50,000 and n~ 100,000, so I would expect survfit.coxph to be roughly 20000 times slower than coxph. The long term solution is for me to fix this. It's a couple of week's work, if I can only find the weeks to do it. The mid term one is to take Frank Harrell's suggestion. If standard errors are not needed, there is an O(n) algorithm, which he has implemented as part of his additions to the coxph suite. Terry Therneau