thr3ads.net - R help - [R] predict.coxph [Nov 2010]


Terry Therneau

2010-Nov-12 17:09 UTC

[R] predict.coxph

Since I read the list in digest form (and was out ill yesterday) I'm
late to the discussion.

There are 3 steps for predicting survival, using a Cox model:

1. Fit the data
 fit <- coxph(Surv(time, status) ~ age + ph.ecog, data=lung)

The biggest question to answer here is what covariates you wish to base
the prediction on.  There is the usual tradeoff between too few (leaving
out something important) and too many (including unimportant things).

2. Get survival curves
  curves <- survfit(fit, newdata= _____)
The newdata needs to include all the covariates in your model.  

3. Summarize
 Note that you don't get a single number prediction for each subject,
you get a set of survival curves.  plot(curves[1]) for instance shows
you the first one, plot(curves[2]) the second. 
  print(curves) will give a 1 line per curve summary including the
median, and optionally one of several versions of the mean. See the
discussion in help(print.survfit).  The mean is rarely used as a summary
due to the fact that we don't see the whole distribution.  (Use temp<-
summary(curves); temp$table to use the printout values in further
calculations.)
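
The three steps above can be sketched end to end as follows. This is only an illustration: the two hypothetical subjects in new_subjects (ages 50 and 70, ECOG scores 0 and 2) are assumptions made for the example, not values from the original post.

```r
library(survival)

# Step 1: fit the Cox model on the lung data shipped with the survival package
fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)

# Step 2: one predicted survival curve per row of newdata.
# The rows below are hypothetical subjects chosen for illustration;
# newdata must contain every covariate used in the model.
new_subjects <- data.frame(age = c(50, 70), ph.ecog = c(0, 2))
curves <- survfit(fit, newdata = new_subjects)

# Step 3: summarize. print() gives one line per curve, including the median.
print(curves)

# Keep the printed values for further calculations
temp <- summary(curves)
temp$table
```

plot(curves[1]) and plot(curves[2]) would then draw the individual curves.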

-------------------

  The same process applies for parametric survival using survreg.  In
return for specifying a distributional form, the predicted survival
curve for a particular subject is completely defined.  This includes the
mean and all quantiles.  Reliability analysis (survival analysis in
industry) uses parametric models almost exclusively, since the tail of the
distribution is of greatest interest.  Your use of
predict(,type='response') is almost correct; there is just the math
detail that the Weibull fits on a log scale, so the returned value is a
geometric mean time to death rather than an arithmetic mean.
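
A minimal sketch of the parametric version, assuming the same lung covariates and the same two hypothetical subjects as in the Cox example. In survreg's Weibull parameterization the time is exp(linear predictor) times a standard exponential variable raised to the power scale, so the arithmetic mean differs from the type='response' value by a factor of gamma(1 + scale). The names wfit, resp, arith_mean and med are illustrative only.

```r
library(survival)

# Weibull accelerated failure time model on the lung data
wfit <- survreg(Surv(time, status) ~ age + ph.ecog, data = lung,
                dist = "weibull")

# Hypothetical subjects, chosen only for illustration
new_subjects <- data.frame(age = c(50, 70), ph.ecog = c(0, 2))

# type = "response" returns exp(linear predictor), a log-scale location value
resp <- predict(wfit, newdata = new_subjects, type = "response")

# Arithmetic mean under the fitted Weibull: exp(lp) * gamma(1 + scale)
arith_mean <- resp * gamma(1 + wfit$scale)

# Any quantile is also defined; the median is exp(lp) * log(2)^scale
med <- predict(wfit, newdata = new_subjects, type = "quantile", p = 0.5)

cbind(response = resp, mean = arith_mean, median = med)
```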

 The suggestion to use ordinary regression on the observed times is
wrong.  Censored data is more complex than that.
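
A small simulation can illustrate why censoring matters. It is a sketch under assumed settings (exponential survival with true mean 10, uniform censoring on 0 to 15, n = 5000), none of which come from the thread. Averaging only the observed deaths is compared with a model that uses the censored cases as well.

```r
library(survival)

set.seed(1)
n <- 5000

# Assumed simulation settings: true mean survival time is 10
true_time <- rexp(n, rate = 1 / 10)
cens_time <- runif(n, 0, 15)

# What is actually observed: the earlier of death and censoring
time   <- pmin(true_time, cens_time)
status <- as.integer(true_time <= cens_time)  # 1 = death observed

# Naive summary: average the observed deaths only
naive_mean <- mean(time[status == 1])

# Censoring-aware estimate: exponential survreg fit using all subjects
efit <- survreg(Surv(time, status) ~ 1, dist = "exponential")
model_mean <- unname(exp(coef(efit)))

c(true = 10, naive = naive_mean, model = model_mean)
```

The naive average can only see deaths that happened before censoring, so it is pulled toward short times.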

Terry Therneau

James C. Whanger

2010-Nov-12 19:44 UTC

[R] predict.coxph

Terry,

My point was that if you are asking the question "What is the average time
to death based on a set of variables?", the only logical approach for
calculating actual time to death is to use uncensored cases, because we do
not know the time to death for the censored cases and can only estimate
it.  While actual time to death for uncensored cases may not be a very
useful piece of information, it can indeed be calculated.  As you point
out, though, predicted values for time to death can be estimated using the
survival function, which incorporates both censored and uncensored data.
However, the assumption of proportional hazards is rarely defensible.

Best,

Jim





Therneau, Terry M., Ph.D.

2010-Nov-12 22:08 UTC

[R] predict.coxph

Jim,
  I respectfully disagree, and there are 5 decades of literature to back
me up.  Berkson and Gage (1950) is in response to medical papers that
summarized surgical outcomes using only the observed deaths, and shows
important failings of the method.  Ignoring the censored cases usually
gives biased answers, often so badly that they are misleading and
worse than no answer at all.  The PH model is surprisingly accurate in
acute disease (I work in areas like multiple myeloma and liver
transplant so see a lot of this) and is also used in economics (duration
of unemployment for instance); the accelerated failure time models have
proven very reliable predictors in industry work.  Censored linear
regression (e.g. the "Tobit" model) is not uncommon.  I am not aware of any
cases where ignoring the censored cases gives a competitive answer.
Blindly using a coxph model without checking into or at least thinking
about the proportional hazards assumption is dangerous, but so is blind
use of any other model.
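
One common way to check into the proportional hazards assumption is cox.zph from the survival package, which tests for time trends in the scaled Schoenfeld residuals. The sketch below assumes the same lung model used earlier in the thread; it is one possible check, not a complete diagnostic workflow.

```r
library(survival)

# Same Cox model as in the first message of the thread
fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)

# Test each covariate, plus a global test, for departure from
# proportional hazards; small p-values suggest a time-varying effect
zph <- cox.zph(fit)
print(zph)
```

plot(zph) draws the smoothed residuals against time, which often says more than the p-values alone.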

Terry T.


Mike Marchywka

2010-Nov-13 12:39 UTC

[R] predict.coxph

----------------------------------------
> Date: Fri, 12 Nov 2010 16:08:57 -0600
> From: therneau at mayo.edu
> To: james.whanger at gmail.com
> CC: r-help at r-project.org; haenlein at escpeurope.eu
> Subject: Re: [R] predict.coxph
>
> Jim,
> I respectfully disagree, and there is 5 decades of literature to back
> me up. Berkson and Gage (1950) is in response to medical papers that
> summarized surgical outcomes using only the observed deaths, and shows
> important failings of the method. Ignoring the censored cases usually
> gives biased answers, often so badly so that they are misleading and
> worse than no answer at all. The PH model is surprisingly accurate in
(Yes, I read all the way through and noted your caveats below, but I am
curious about the reality of what you encounter and what would make sense
to consider in the future, as better understanding of causality can remove
random events.)

If you are looking at radioactive decay, maybe, but how often do
you actually see exponential KM curves in real life? Certainly,
depending on the MOA of the drug or the disease/enrollment criteria, you
could expect qualitative changes in disease trajectory and consequently in
survival curves.  A trial design could in fact try to get all of the
control sample to "event" at the same time, if enough was known about
prognostic factors and natural trajectory, as this should make drug effects
quite clear; a step function of course is not a constant hazard.  (Writing
a label based on this trial may annoy the FDA [ "indicated for patients
with exactly 6 months of life expectancy based on XYZ paper" LOL ], but
from a statistical standpoint it would seem like a good idea to consider,
to get power with few patients.)  At minimum, there could be some initial
plateau, as almost-dead patients may be excluded, etc.



> acute disease (I work in areas like multiple myeloma and liver
On the R-related topic, do you know anything about results
with VLA-4 inhibitors in MM?
> transplant so see a lot of this) and is also used in economics (duration
> of unemployment for instance), the accelerated failure time models have
> proven very reliable predictors in industry work. Censored linear
> regression (e.g. "Tobit" model) is not uncommon. I am not aware of any
> cases where ignoring the censored cases gives a competitive answer.
Are you talking about right censored? These points would seem to be
informative, as they have survived this long, and simply ignoring them
would create bias. Certainly lost to follow-up should be unbiased if just
ignored, no? Personally, I think I finally decided that comparing integral
measures may be more helpful (patient-months of excess survival, for
example) rather than asking about things like means or medians.
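
An integral measure of this kind can be sketched as a restricted mean survival time: the area under each Kaplan-Meier curve up to a fixed horizon, with the difference between groups read as excess survival time. The helper rmst, the one-year horizon tau and the comparison by sex in the lung data are all assumptions made for illustration.

```r
library(survival)

# Kaplan-Meier curves by sex in the lung data (1 = male, 2 = female)
km <- survfit(Surv(time, status) ~ sex, data = lung)

# Hypothetical helper: area under a single KM step function from 0 to tau
rmst <- function(curve, tau) {
  keep <- curve$time < tau
  t <- c(0, curve$time[keep], tau)  # interval boundaries
  s <- c(1, curve$surv[keep])       # survival level on each interval
  sum(diff(t) * s)
}

tau <- 365  # horizon in days, chosen for illustration

rm_male   <- rmst(km[1], tau)
rm_female <- rmst(km[2], tau)

# Excess survival time (days per patient) within the first year
excess_days <- rm_female - rm_male
c(male = rm_male, female = rm_female, excess = excess_days)
```

print(km, rmean = 365) reports a comparable restricted mean directly.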

So basically your conversation is about calculating things like average 
survival time with many data points yet to event? 
> Blindly using a coxph model without checking into or at least thinking
> about the proportional hazards assumption is dangerous, but so is blind
> use of any other model.
As noted above, I wasn't trying to take your earlier statement out of
context...
>
> Terry T.
>
