Since I read the list in digest form (and was out ill yesterday) I'm late to the discussion.

There are 3 steps for predicting survival, using a Cox model:

1. Fit the data
   fit <- coxph(Surv(time, status) ~ age + ph.ecog, data=lung)

The biggest question to answer here is what covariates you wish to base the prediction on. There is the usual tradeoff between too few (leaving out something important) and too many (including unimportant things).

2. Get survival curves
   curves <- survfit(fit, newdata= _____)
The newdata needs to include all the covariates in your model.

3. Summarize
Note that you don't get a single number prediction for each subject, you get a set of survival curves. plot(curves[1]) for instance shows you the first one, plot(curves[2]) the second.
print(curves) will give a 1 line per curve summary including the median, and optionally one of several versions of the mean. See the discussion in help(print.survfit). The mean is rarely used as a summary because we don't see the whole distribution. (Use temp <- summary(curves); temp$table to use the printout values in further calculations.)

-------------------

The same process applies for parametric survival using survreg. In return for specifying a distributional form, the predicted survival curve for a particular subject is completely defined. This includes the mean and all quantiles. Reliability analysis (survival analysis in industry) uses parametric models almost exclusively, since the tail of the distribution is of greatest interest. Your use of predict(,type='response') is almost correct; there is just the math detail that the Weibull fits on a log scale, so the returned value is a geometric mean time to death rather than an arithmetic mean.

The suggestion to use ordinary regression on the observed times is wrong. Censored data is more complex than that.

Terry Therneau
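The three steps above, plus the survreg analogue, can be sketched roughly as follows. This is only an illustration: the data frame newsubj and its covariate values are made up to fill the newdata slot, and the choice of dist = "weibull" and of the quantiles 0.25, 0.5, 0.75 is an assumption, not something taken from the message.

```r
library(survival)

# 1. Fit the model; the lung data set ships with the survival package.
fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)

# 2. Survival curves for two hypothetical subjects (illustrative values only).
#    newdata must contain every covariate used in the model.
newsubj <- data.frame(age = c(50, 70), ph.ecog = c(0, 2))
curves <- survfit(fit, newdata = newsubj)

# 3. Summarize: one line per curve, including the median.
print(curves)
temp <- summary(curves)
temp$table            # the printed values, available for further calculation

# Parametric analogue: a Weibull fit gives every quantile for each subject.
wfit <- survreg(Surv(time, status) ~ age + ph.ecog, data = lung,
                dist = "weibull")
pq <- predict(wfit, newdata = newsubj, type = "quantile",
              p = c(0.25, 0.5, 0.75))
pq                    # one row per subject, one column per quantile
```

plot(curves[1]) and plot(curves[2]) then draw the individual curves, as described in the message.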
Terry,

My point was that if you are asking the question "What is the average time to death based on a set of variables?", the only logical approach for calculating actual time to death is to use uncensored cases, because we do not know the time to death for the censored cases and can only estimate it. While the actual time to death for uncensored cases may not be a very useful piece of information, it can indeed be calculated. As you point out, though, predicted values for time to death can be estimated using the survival function, which incorporates both censored and uncensored data. However, the assumption of proportional hazards is rarely defensible.

Best,

Jim

On Fri, Nov 12, 2010 at 12:09 PM, Terry Therneau <therneau@mayo.edu> wrote:
> [...]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
James C. Whanger
Research Consultant
Jim,
  I respectfully disagree, and there are 5 decades of literature to back me up. Berkson and Gage (1950) is a response to medical papers that summarized surgical outcomes using only the observed deaths, and it shows important failings of the method. Ignoring the censored cases usually gives biased answers, often so badly biased that they are misleading and worse than no answer at all.

The PH model is surprisingly accurate in acute disease (I work in areas like multiple myeloma and liver transplant so see a lot of this) and is also used in economics (duration of unemployment for instance). The accelerated failure time models have proven very reliable predictors in industry work. Censored linear regression (e.g. the "Tobit" model) is not uncommon. I am not aware of any cases where ignoring the censored cases gives a competitive answer.

Blindly using a coxph model without checking into or at least thinking about the proportional hazards assumption is dangerous, but so is blind use of any other model.

Terry T.
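The bias from dropping censored cases can be seen in a small simulation. The sketch below is an illustration under made-up assumptions (exponential death times with true mean 10, independent uniform censoring on 0 to 15, n = 2000); none of these numbers come from the thread. It compares the average of the observed deaths alone with the mean estimated by an intercept-only accelerated failure time fit that uses all cases.

```r
library(survival)

set.seed(42)
n <- 2000
true_time <- rexp(n, rate = 1 / 10)   # assumed true mean time to death: 10
cens_time <- runif(n, 0, 15)          # assumed independent censoring times

obs_time <- pmin(true_time, cens_time)         # what is actually observed
died     <- as.numeric(true_time <= cens_time) # 1 = death seen, 0 = censored

# Naive summary: average of the uncensored cases only.
naive_mean <- mean(obs_time[died == 1])

# Exponential AFT model using censored and uncensored cases alike;
# with survreg's parameterization the mean is exp(intercept).
afit <- survreg(Surv(obs_time, died) ~ 1, dist = "exponential")
aft_mean <- unname(exp(coef(afit)))

round(c(true = 10, naive = naive_mean, aft = aft_mean), 2)
```

Under these assumptions the deaths-only average falls well below the true mean, because long survivors are the ones most likely to be censored, while the censoring-aware fit lands close to it.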
----------------------------------------
> Date: Fri, 12 Nov 2010 16:08:57 -0600
> From: therneau at mayo.edu
> To: james.whanger at gmail.com
> CC: r-help at r-project.org; haenlein at escpeurope.eu
> Subject: Re: [R] predict.coxph
>
> Jim,
> I respectfully disagree, and there is 5 decades of literature to back
> me up. Berkson and Gage (1950) is in response to medical papers that
> summarized surgical outcomes using only the observed deaths, and shows
> important failings of the method. Ignoring the censored cases usually
> gives biased answers, often so badly so that they are misleading and
> worse than no answer at all. The PH model is surprisingly accurate in

(Yes, I read all the way through and noted your caveats below, but I am curious about the reality of what you encounter, and about what would make sense to consider in the future, as a better understanding of causality can remove random events.)

If you are looking at radioactive decay, maybe, but how often do you actually see exponential KM curves in real life? Certainly, depending on the MOA of the drug or the disease/enrollment criteria, you could expect qualitative changes in disease trajectory and consequently in survival curves. A trial design could in fact try to get all of the control sample to "event" at the same time, if enough were known about prognostic factors and natural trajectory, as this should make drug effects quite clear; a step function of course is not a constant hazard. (Writing a label based on such a trial may annoy the FDA ["indicated for patients with exactly 6 months of life expectancy based on XYZ paper", LOL], but from a statistical standpoint it would seem like a good idea to consider, to get power with few patients.) At minimum, there could be some initial plateau, as almost-dead patients may be excluded, etc.

> acute disease (I work in areas like multiple myeloma and liver

On the R-related topic, do you know anything about results with VLA-4 inhibitors in MM?

> transplant so see a lot of this) and is also used in economics (duration
> of unemployment for instance), the accelerated failure time models have
> proven very reliable predictors in industry work. Censored linear
> regression (e.g. "Tobit" model) is not uncommon. I am not aware of any
> cases where ignoring the censored cases gives a competitive answer.

Are you talking about right censored? These points would seem to be informative, as they have survived this long, and simply ignoring them would create bias. Certainly lost to follow-up should be unbiased if just ignored, no? Personally, I think I finally decided that comparing integral measures may be more helpful (patient-months of excess survival, for example) rather than asking about things like means or medians. So basically your conversation is about calculating things like average survival time with many data points yet to event?

> Blindly using a coxph model without checking into or at least thinking
> about the proportional hazards assumption is dangerous, but so is blind
> use of any other model.

As noted above, I wasn't trying to take your earlier statement out of context...

>
> Terry T.
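One way to read the "integral measure" idea above is as a restricted mean survival time: the area under a Kaplan-Meier curve up to a fixed horizon, compared between groups. The sketch below is only one possible interpretation. The helper rmst(), the one-year horizon tau = 365, and the comparison by sex in the lung data are all illustrative assumptions, not something proposed in the thread.

```r
library(survival)

# Hypothetical helper: area under a Kaplan-Meier step function from 0 to tau.
rmst <- function(km, tau) {
  keep <- km$time < tau
  t <- c(0, km$time[keep], tau)   # interval endpoints
  s <- c(1, km$surv[keep])        # survival level on each interval
  sum(diff(t) * s)
}

# Separate Kaplan-Meier fits for the two groups (lung: sex 1 = male, 2 = female).
km_m <- survfit(Surv(time, status) ~ 1, data = subset(lung, sex == 1))
km_f <- survfit(Surv(time, status) ~ 1, data = subset(lung, sex == 2))

tau <- 365                        # assumed horizon: one year, in days
rmst_m <- rmst(km_m, tau)
rmst_f <- rmst(km_f, tau)

# Difference in expected days alive during the first year.
c(male = rmst_m, female = rmst_f, excess = rmst_f - rmst_m)
```

Censored subjects still contribute here through the Kaplan-Meier estimate, so this summary does not require dropping them.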