Mayeul KAUFFMANN
2004-Aug-13 01:03 UTC
[R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?
Hello,

coxph does not use any information that is in the dataset between event
times (or "death times"), since the computation only occurs at event times.
For instance, removing the observations taken at times when no event occurs
anywhere in the dataset does not change the results:

> library(survival)
> set.seed(1)
> data <- as.data.frame(cbind(start = c(1:5, 1:5, 1:4), stop = c(2:6, 2:6, 2:5),
+     status = c(rep(0, 7), 1, rep(0, 5), 1),
+     id = c(rep(1, 5), rep(2, 5), rep(3, 4)), x1 = rnorm(14)))
> data
   start stop status id         x1
1      1    2      0  1 -0.6264538
2      2    3      0  1  0.1836433
3      3    4      0  1 -0.8356286
4      4    5      0  1  1.5952808
5      5    6      0  1  0.3295078
6      1    2      0  2 -0.8204684
7      2    3      0  2  0.4874291
8      3    4      1  2  0.7383247
9      4    5      0  2  0.5757814
10     5    6      0  2 -0.3053884
11     1    2      0  3  1.5117812
12     2    3      0  3  0.3898432
13     3    4      0  3 -0.6212406
14     4    5      1  3 -2.2146999
> coxfit <- coxph(Surv(start, stop, status) ~ cluster(id) + x1,
+     data = data, robust = TRUE)
> coxfit
> coxph(Surv(start, stop, status) ~ cluster(id) + x1,
+     data = subset(data, stop %in% 4:5), robust = TRUE)  # the same!!! (except n)

First, some data are lost. Second, this loss can be a serious problem when a
time-varying covariate changes faster than events occur. Specifically, I have
a covariate which has low values most of the time. It sometimes jumps to high
values, and that is hypothesized to greatly increase the risk of an event.
With rare events, the effect of this covariate will only be measured at event
times. Chances are that the only times such a covariate is recorded at a high
level are times at which the individual concerned is having an event. This
may bias the estimated coefficient.

Here is my question: how can I make full use of the dataset? That is, how can
I have truly _time-varying_ covariates (even if they change step by step, not
continuously), rather than covariates whose changes are seen only at event
times?

Ideally, the full dataset would be used to estimate the parameters, or at
least to estimate the standard errors of the estimated parameters. Any
ideas???

A second best (which might require less work) would be to use the whole
dataset to assess the predictive power of the model, perhaps by taking the
expected number of events for an individual over the time interval during
which they were observed to be at risk,

> predict(coxfit, type = "expected")

and comparing it with the observed number of events. (Does this use all the
data and take into account the whole baseline hazard, even between events?)

Or, if not, following Brian D. Ripley's suggestion about the baseline hazard
("As an approximation you can smooth the fitted baseline cumulative hazard
(e.g. by package pspline) and ask for its derivative",
https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html), the following
code could be used to estimate (and plot) a smooth baseline hazard:

> library(pspline)
> t <- seq(min(data$start), max(data$stop), length = 100)
> lines(t, predict(sm.spline(x = basehaz(coxfit)[, 2],
+     y = basehaz(coxfit)[, 1], norder = 2), t, 1))  # adds to an existing plot

There is a problem with this code: one should add the constraint that the
baseline hazard cannot be negative.

The following computes the parametric part of the Cox model:

> risk <- predict(coxfit, type = "risk")   # gives exp(X'b)

Something like

> baseline.hazard * risk

would give the true risk at any time (but it would probably be much harder to
compute), which could help assess the predictive power of the model. (Still a
lot of work.)

Thanks in advance for any help or comment.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France
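A minimal sketch of the observed-versus-expected comparison described above,
reusing the data frame and coxph fit from the example; summing
predict(type = "expected") by id with tapply is only one possible way to do
it, not something prescribed in the message:

## observed vs. expected number of events per individual
## (assumes 'data' as constructed above)
library(survival)
coxfit <- coxph(Surv(start, stop, status) ~ cluster(id) + x1,
                data = data, robust = TRUE)
expected <- predict(coxfit, type = "expected")   # one expected count per row
cbind(observed = tapply(data$status, data$id, sum),
      expected = tapply(expected, data$id, sum))

And the negativity problem flagged for the smoothed baseline hazard could be
handled crudely by truncating the estimated derivative at zero, one simple
possibility among several; this reuses t, coxfit and sm.spline from the code
above:

h0 <- predict(sm.spline(x = basehaz(coxfit)[, 2], y = basehaz(coxfit)[, 1],
                        norder = 2), t, 1)
h0 <- pmax(h0, 0)   # crude fix: a hazard cannot be negative
plot(t, h0, type = "l", xlab = "time", ylab = "smoothed baseline hazard")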
Prof Brian Ripley
2004-Aug-13 06:40 UTC
[R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?
This is a consequence of the use of partial likelihood in the Cox model. You
should read the literature on this point (for example, have you read Cox,
1972, and all its discussion, or Andersen, Borgan, Gill & Keiding?). It is
not an R question. You need to make more assumptions, such as a smooth
baseline hazard, and you can always use parametric models and a full
likelihood (but you may have to program them yourself).

On Fri, 13 Aug 2004, Mayeul KAUFFMANN wrote:

> Hello,
>
> coxph does not use any information that is in the dataset between event
> times (or "death times"), since the computation only occurs at event times.
>
> [...]
>
> Thanks in advance for any help or comment.
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
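As an illustration of the parametric, full-likelihood route suggested above,
here is a rough sketch of an exponential (constant baseline hazard) model for
the counting-process data from the first message, fitted with glm() via the
standard Poisson-likelihood formulation; the exponential baseline is an
assumption of the sketch, not something prescribed in the thread. Because it
maximizes a full likelihood, every (start, stop] interval contributes through
its exposure, including intervals in which no event occurs anywhere in the
data.

## exponential model via the Poisson trick: each interval with event
## indicator 'status' and constant hazard contributes a Poisson term with
## mean hazard * exposure, so a log-linear Poisson GLM with an offset of
## log(exposure) gives the maximum likelihood estimates
## (assumes the 'data' frame built in the first message)
data$exposure <- data$stop - data$start
exp.fit <- glm(status ~ x1 + offset(log(exposure)),
               family = poisson, data = data)
summary(exp.fit)   # the x1 coefficient is a log hazard ratio

## a piecewise-constant baseline could be obtained by adding interval
## dummies, e.g. '+ factor(stop)', at the cost of one parameter per interval

Unlike the partial-likelihood fit, this estimate does change when the rows
with no event at their stop time are dropped, since those rows still
contribute exposure to the likelihood.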