Ravi Varadhan
2010-Aug-31 03:40 UTC
[R] Speeding up prediction of survival estimates when using `survifit'
Hi, I fit a Cox PH model to estimate the cause-specific hazards (in a competing risks setting). Then , I compute the survival estimates for all the individuals in my data set using the `survfit' function. I am currently playing with a data set that has about 6000 observations and 12 covariates. I am finding that the survfit function is very slow. Here is a simple simulation example (modified from Frank Harrell's example for `cph') that illustrates the problem: #n <- 500 set.seed(4321) age <- 50 + 12*rnorm(n) sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4))) cens <- 5 * runif(n) h <- 0.02 * exp(0.04 * (age-50) + 0.8 * (sex=='Female')) dt <- -log(runif(n))/h e <- ifelse(dt <= cens, 1, 0) dt <- pmin(dt, cens) Srv <- Surv(dt, e) f <- coxph(Srv ~ age + sex, x=TRUE, y=TRUE) system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x)) When I run the above code with sample sizes, n, taking on values of 500, 1000, 2000, and 4000, the time it takes for survfit to run are as follows: # n <- 500> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))user system elapsed 0.16 0.00 0.15 # n <- 1000> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))user system elapsed 1.45 0.00 1.48 # n <- 2000> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))user system elapsed 10.19 0.00 10.25 # n <- 4000> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))user system elapsed 72.87 0.05 74.87 I eventually want to use `survfit' on a data set with roughly 50K observations, which I am afraid is going to be painfully slow. I would much appreciate hints/suggestions on how to make `survfit' faster or any other faster alternatives. Thanks. Best, Ravi. ____________________________________________________________________ Ravi Varadhan, Ph.D. Assistant Professor, Division of Geriatric Medicine and Gerontology School of Medicine Johns Hopkins University Ph. (410) 502-2619 email: rvaradhan at jhmi.edu
Frank Harrell
2010-Aug-31 12:51 UTC
[R] Speeding up prediction of survival estimates when using `survifit'
Frank E Harrell Jr Professor and Chairman School of Medicine Department of Biostatistics Vanderbilt University On Mon, 30 Aug 2010, Ravi Varadhan wrote:> Hi, > > I fit a Cox PH model to estimate the cause-specific hazards (in a competing risks setting). Then , I compute the survival estimates for all the individuals in my data set using the `survfit' function. I am currently playing with a data set that has about 6000 observations and 12 covariates. I am finding that the survfit function is very slow. > > Here is a simple simulation example (modified from Frank Harrell's example for `cph') that illustrates the problem: > > #n <- 500 > set.seed(4321) > > age <- 50 + 12*rnorm(n) > > sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4))) > > cens <- 5 * runif(n) > > h <- 0.02 * exp(0.04 * (age-50) + 0.8 * (sex=='Female')) > > dt <- -log(runif(n))/h > > e <- ifelse(dt <= cens, 1, 0) > > dt <- pmin(dt, cens) > > Srv <- Surv(dt, e) > > f <- coxph(Srv ~ age + sex, x=TRUE, y=TRUE) > > system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x)) > > > When I run the above code with sample sizes, n, taking on values of 500, 1000, 2000, and 4000, the time it takes for survfit to run are as follows: > > # n <- 500 >> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x)) > user system elapsed > 0.16 0.00 0.15 > > > # n <- 1000 >> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x)) > user system elapsed > 1.45 0.00 1.48 > > > # n <- 2000 >> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x)) > user system elapsed > 10.19 0.00 10.25 > > > # n <- 4000 >> system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x)) > user system elapsed > 72.87 0.05 74.87 > > > I eventually want to use `survfit' on a data set with roughly 50K observations, which I am afraid is going to be painfully slow. I would much appreciate hints/suggestions on how to make `survfit' faster or any other faster alternatives.Ravi, If you don't need standard errors/confidence limits, the rms package's survest and related functions can speed things up greatly if you fit the model using cph(...., surv=TRUE). [cph calls coxph, and calls survfit once to estimate the underlying survival curve]. Frank> > Thanks. > > Best, > Ravi. > ____________________________________________________________________ > > Ravi Varadhan, Ph.D. > Assistant Professor, > Division of Geriatric Medicine and Gerontology > School of Medicine > Johns Hopkins University > > Ph. (410) 502-2619 > email: rvaradhan at jhmi.edu