Mayeul KAUFFMANN
2004-Jul-26 11:24 UTC
[R] covariate selection in cox model (counting process)
Hello everyone,

I am searching for a covariate selection procedure for a Cox model formulated as a counting process. I use intervals; my formula looks like

  coxph(Surv(start, stop, status) ~ x1 + x2 + ... + cluster(id), robust = TRUE)

where id is a country code (I study the occurrence of civil wars from 1962 to 1997).

I'd like something not based on p-values, since they have several flaws for this purpose. I turned to other criteria, but all the articles I have read seem to apply to the classical formulation of the Cox model, not the counting process one (or they apply to both and I am not aware of it).

I've tried AIC with

  step(cox.fit)
or
  stepAIC(cox.fit)

and BIC using

  step(cox.fit, k = log(n))

but there seem to be two theoretical problems to address:

(1) These values are based on the partial log-likelihood ("loglik"). I wonder whether this is correct for the Cox model formulated as a *counting process*, with many (consecutive) observations for a given individual, and hence some observations not being independent. Since "the likelihood ratio and score tests assume independence of observations within a cluster, the Wald and robust score tests do not" (R warning), and the likelihood ratio test is based on loglik, can I use loglik in BIC with some dependent observations?

[I have 170 individuals (namely, countries) over 36 years, some single countries having up to 140 very short observation intervals, others having (at the other extreme) only 1 long interval per year. That is because I created artificial covariates measuring the proximity of some past events: exp(-days.since.event/a.chosen.parameter). I split every yearly interval in which these covariates change rapidly (i.e. when the events are recent), yielding up to 11 intervals a year.]

(2) What penalty term should be used? It seems natural to include the number of covariates, k. What about the number of observations? I found several definitions:

  AIC = -2 loglik(b) + 2k
  Schwarz Bayesian information criterion: SBIC = -2 loglik(b) + k ln(n)

Frédérique Letué (author of the PhD thesis "COX MODEL: ESTIMATION VIA MODEL SELECTION AND BIVARIATE SHOCK MODEL", http://www-lmc.imag.fr/lmc-sms/Frederique.Letue/These3.ps) suggested to me

  AIC = -loglik(b) + 2k/n
  BIC = -loglik(b) + 2 ln(n) k/n

with other possible values for the parameter "2" in the latter case (see her thesis p. 100, but this section is in French).

None of these say *what to take for n*. There are 3 main possibilities:

a) Taking the number of observations (including censored ones) gives a huge n (around 6000 to 8000), which may seem meaningless since some observations are only a few days long. With n in the denominator (Letué's criteria), the penalty term would be so low that it is almost like not having one:

  > log(7000)/7000
  [1] 0.001264809

(where loglik from summary(cox.fit) ranges from -155 to -175, depending on the model)

b) Volinsky & Raftery "propose a revision of the penalty term in BIC so that the penalty is defined in terms of the number of uncensored events instead of the number of observations." (Volinsky & Raftery, Bayesian Information Criterion for Censored Survival Models, June 16, 1999, http://www.research.att.com/~volinsky/papers/biocs.ps) This could be computed with

  sum(coxph.detail(cox.interac.dsi6mois)$nevent)

Letué's BIC penalty term with 50 events would then be

  > 2*log(50)/50
  [1] 0.1564809

which has more effect. However, adding or removing a country which has data for the 36 years but no event (hence censored) does not change this BIC. Thus, it cannot account for missing data that do not reduce the number of events.
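For concreteness, this event-count BIC could be wrapped up as follows (a minimal sketch; cox.fit stands for any fitted counting-process model as above):

  library(survival)

  # Event-based BIC in the spirit of Volinsky & Raftery: penalise by the
  # number of uncensored events d rather than the number of intervals.
  bic.events <- function(fit) {
    d  <- sum(fit$y[, 3])    # status column of the counting-process Surv
    k  <- length(coef(fit))  # number of covariates
    ll <- fit$loglik[2]      # partial log-likelihood at the estimates
    -2 * ll + k * log(d)
  }

  bic.events(cox.fit)        # same penalty as step(cox.fit, k = log(d))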
I'd like the criterion to take this into account, because the covariates do not all have the same missing data. The question is: when I have the choice between adding covariate x10 or x11, which have different (not nested) sets of missing values, which one is best? Estimating all subsets of the full model (full model = all covariates) on a dataset containing no missing data for the full model would be a solution, but it would more than halve the dataset for many subsets of the covariates. I should mention that step(cox.fit) gives a warning and stops:

  Error in step(cox.fit) : number of rows in use has changed: remove missing values?

which makes me wonder whether the whole procedure is valid with models of different sample sizes.

c) "For discrete time event history analysis, the same choice has been made, while the total number of exposure time units has also been used, for consistency with logistic regression (Raftery, Lewis, Aghajanian and Kahn, 1993; Raftery, Lewis and Aghajanian, 1994)" (Raftery, Bayesian Model Selection in Social Research, 1994, http://www.stat.washington.edu/tech.reports/bic.ps)

I am not sure what "exposure time units" means. But since I could have used a logit model with yearly observations [though with many flaws...], I suggest I could use the number of years (sum of the lengths of the intervals, in years):

  > sum((fit$y)[,2] - (fit$y)[,1]) / 365.25
  [1] 3759.537

This may still be too high. Since I have data from 1962 to 1997 (36 years), I have the following number of "complete cases" equivalents:

  > sum((fit$y)[,2] - (fit$y)[,1]) / (365.25 * 36)
  [1] 104.4316

This seems more reasonable and would account for NAs in different models. However, it might still be too high, because some countries are not at risk during the whole period: some did not exist because they gained independence near the end of the period (e.g. ex-USSR countries in the early 1990s), or because they were already experiencing an event (new wars in countries already experiencing a war are not taken into account). I could take the *proportion* of available data relative to time at risk to adjust for this: a country at risk during 1 year and for which data are available for that entire year would increase n by 1, not by 1/36. If the data frame "dataset" contains all countries at risk (including some with NAs), assuming all countries have at least one complete observation, this can be computed by

  > attach(dataset)
  > ok <- complete.cases(dataset)
  > sum(tapply(stop[ok] - start[ok], CCODE[ok], sum)
        / tapply(stop - start, CCODE, sum))

But this would be a rather empirical adjustment, maybe with no theoretical basis. And I don't think I can pass this as argument k to step()...

Thank you for having read this; I hope I was not too long. And thank you a lot for any help, comments, etc. (Sorry for mistakes in English, as I am a non-native English speaker.)

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France
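PS: For comparison, the candidate values of n discussed above can be computed side by side (a sketch; fit, dataset, start, stop and CCODE are the objects named in this message, and every country is assumed to have at least one complete observation):

  # (a) number of observation intervals, (b) number of uncensored events,
  # (c) total exposure time in years, (c') exposure adjusted for the
  # proportion of at-risk time covered by complete data, per country.
  n.obs    <- nrow(fit$y)
  n.events <- sum(fit$y[, 3])
  n.years  <- sum(fit$y[, 2] - fit$y[, 1]) / 365.25

  ok    <- complete.cases(dataset)
  n.adj <- sum(tapply(dataset$stop[ok] - dataset$start[ok],
                      dataset$CCODE[ok], sum) /
               tapply(dataset$stop - dataset$start, dataset$CCODE, sum),
               na.rm = TRUE)

  c(n.obs = n.obs, n.events = n.events, n.years = n.years, n.adj = n.adj)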
Mayeul KAUFFMANN
2004-Jul-26 22:28 UTC
[R] covariate selection in cox model (counting process)
Thank you a lot for your time and your answer, Thomas. Like all good answers, it raised new questions for me ;-)

> In the case of recurrent events coxph() is not using maximum likelihood
> or even maximum partial likelihood. It is maximising the quantity that
> (roughly speaking) would be the partial likelihood if the covariates
> explained all the cluster differences.

I could have non-repeating events by removing countries once they have experienced a war. But I'm not sure it would change the estimation procedure, since this changes the dataset only, not the formula

  coxph(Surv(start, stop, status) ~ x1 + x2 + ... + cluster(id), robust = TRUE)

I am not sure I understood you well: do you really mean "recurrent events" alone, or "any counting process notation (including allowing for recurrent events)"? I thought the counting process notation did not really differ from the Cox model in R, since Terry M. Therneau (A Package for Survival Analysis in S, April 22, 1996) concludes his mathematical section "3.3 Cox Model" with: "The above notation is derived from the counting process representation [...] It allows very naturally for several extensions to the original Cox model formulation: multiple events per subject, discontinuous intervals of risk [...], left truncation." (I used it to introduce 1. time-dependent covariates, some covariates changing yearly, others irregularly, and 2. left truncation: not all countries existed at the beginning of the study.)

> In the case of recurrent events coxph() is not using maximum likelihood
> or even maximum partial likelihood.

Then what does fit$loglik give in this case? Still a likelihood, or a valid criterion to maximise? If not, how do I get ("manually") the criterion that was maximised? This matters to me since

> I created artificial covariates measuring the proximity of some events:
> exp(-days.since.event/a.chosen.parameter)

...and I used fit$loglik to choose a.chosen.parameter from 8 values, for 3 types of events:

  la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)  # values to choose from
  z <- NULL
  for (a1 in la) for (a2 in la) for (a3 in la) {
    coxtmp <- coxph(Surv(start, stop, status) ~
                      I(exp(-days.since.event.of.type.one/a1))
                    + I(exp(-days.since.event.of.type.two/a2))
                    + I(exp(-days.since.event.of.type.three/a3))
                    + other.time.dependent.covariates
                    + cluster(id),
                    data = x, robust = TRUE)
    z <- rbind(z, c(a1, a2, a3, coxtmp$wald.test, coxtmp$rscore,
                    coxtmp$loglik, coxtmp$score))
  }
  z <- data.frame(z)
  names(z) <- c("a1", "a2", "a3", "wald.test", "rscore",
                "NULLloglik", "loglik", "score")
  z[which.max(z$rscore), ]
  z[which.max(z$loglik), ]

The last two commands almost always gave me the same set c(a1, a2, a3), but on some models they differed noticeably. Which criterion (if any?!) should I use to select the best set c(a1, a2, a3)?

(If you wish to see what the proximity variables look like, run the following code. The dashed lines show the "half-life" of the proximity variable, here 6 months, which is determined by a.chosen.parameter, e.g. a1 = la[1]:

  # start of code
  curve(exp(-x/263.5), 0, 8*365.25,
        xlab = "number of days since last political regime change (dsrc)",
        ylab = "Proximity of political regime change exp(-dsrc/263.5)", las = 1)
  axis(1, at = 365.25/2, labels = "(6 months)")
  axis(2, at = seq(0, 1, .1), las = 1)
  lines(c(365.25/2, 365.25/2, -110), c(-.05, 0.5, 0.5), lty = "dashed")
  # end of code
)

Thanks a lot again.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France
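PS: The values in la correspond to half-lives of 6 months and 1, 2, 4, 8 and 16 years (up to rounding). Since exp(-t/a) = 0.5 exactly when t = a*log(2), the decay parameter for a desired half-life follows directly (a one-line sketch):

  # Decay parameters a for half-lives of 6 months to 16 years:
  # exp(-t/a) = 1/2 when t = a * log(2), hence a = half.life / log(2).
  half.lives <- 365.25 * c(0.5, 1, 2, 4, 8, 16)   # in days
  half.lives / log(2)                             # reproduces la, up to rounding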
Mayeul KAUFFMANN
2004-Jul-28 02:06 UTC
[R] covariate selection in cox model (counting process)
> No, I mean recurrent events. With counting process notation but no
> recurrent events the partial likelihood is still valid, and the approach
> of treating it as a real likelihood for AIC (and presumably BIC) makes
> sense.
>
> Roughly speaking, you can't tell there is dependence until you see
> multiple events.

Thanks a lot, I got it (well, I hope so)!

I've read in several places that events in the Andersen-Gill model must be "conditionally independent", which is sometimes written more precisely as "conditionally independent given the covariates", or even more precisely: "the Andersen-Gill (AG) model assumes that each [individual] has a multi-event counting process with independent increments. The observed increments must be conditionally independent given the history of all observable information up to the event times." (http://www.stat.umu.se/egna/danardono/licdd.pdf)

Then there is still another option. In fact, I already modelled explicitly the influence of past events with a "proximity of last event" covariate, assuming the dependence on the last event decreases at a constant rate (for instance, the proximity covariate goes from 1 to 0.5 in the first 10 years after an event, then from 0.5 to 0.25 in the next ten years, etc.).

With a well-chosen modelling of the dependence effect, the events become conditionally independent, I do not need a +cluster(id) term, and I can use fit$loglik to make a covariate selection based on BIC, right? (A sketch of what that selection would look like is at the end of this message.)

Thanks a lot again for your time.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

PS: I wrongly concluded from the R statement "(Note: the likelihood ratio and score tests assume independence of observations within a cluster, the Wald and robust score tests do not.)" that it meant independence between two consecutive observations (without any event). It made sense to me because when only one covariate changes for a given individual, and with a small change, there is a new observation with a risk very similar to the risk of the previous observation. But there is still independence with respect to the question of recurrent events. Maybe the warning should be reworded to say "assume *conditional* independence of *events* (given the covariates)".
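If that holds, the selection itself would reduce to something like the following sketch (model and variable names illustrative; the log(d) penalty is the Volinsky & Raftery event-count version from my first message, and the data must contain no missing values among the candidate covariates, otherwise step() stops with the "number of rows in use has changed" error):

  library(survival)
  library(MASS)   # stepAIC

  # No cluster(id): dependence on past events is modelled by the
  # proximity covariate, so the partial likelihood is treated as a
  # real likelihood and an event-based BIC is used for selection.
  fit <- coxph(Surv(start, stop, status) ~ proximity + x1 + x2 + x3,
               data = dataset)
  d <- sum(fit$y[, 3])                  # number of uncensored events
  fit.bic <- stepAIC(fit, k = log(d))   # BIC = -2*loglik + k*log(d)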
Mayeul KAUFFMANN
2004-Jul-28 17:08 UTC
[R] covariate selection in cox model (counting process)
> If you can get the conditional independence (martingaleness) then, yes,
> BIC is fine.
>
> One way to check might be to see how similar the standard errors are
> with and without the cluster(id) term.

(Thank you again, Thomas!)

At first look, the values seemed very similar (see below, case 2). However, to check this without being too subjective, and without a specific test, I needed other values against which to assess the size of the differences: what is similar, what is not?

==================================================================
CASE 1
I first estimated the model without modelling dependence:

Call:
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
    pop + pib + pib2 + crois + instab.x1 + instab.autres, data = xstep)

                 coef exp(coef) se(coef) robust se     z       p
pop            0.3606     1.434   0.0978    0.1182  3.05 2.3e-03
pib           -0.5947     0.552   0.1952    0.1828 -3.25 1.1e-03
pib2          -0.4104     0.663   0.1452    0.1270 -3.23 1.2e-03
crois         -0.0592     0.943   0.0245    0.0240 -2.46 1.4e-02
instab.x1      2.2059     9.079   0.4692    0.4097  5.38 7.3e-08
instab.autres  0.9550     2.599   0.4700    0.4936  1.93 5.3e-02

Likelihood ratio test=74 on 6 df, p=6.2e-14  n= 7286

There seems to be a strong linear relationship between the naive standard errors (se) and the robust se:

> summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.96103    0.04064   23.65 2.52e-06 ***
Multiple R-Squared: 0.9911,    Adjusted R-squared: 0.9894

==================================================================
CASE 2
Then I added a variable (pxcw) measuring the proximity of the previous event (1 > pxcw > 0):

n= 7286
                 coef exp(coef) se(coef) robust se     z       p
pxcw           0.9063     2.475   0.4267    0.4349  2.08 3.7e-02
pop            0.3001     1.350   0.1041    0.1295  2.32 2.0e-02
pib           -0.5485     0.578   0.2014    0.1799 -3.05 2.3e-03
pib2          -0.4033     0.668   0.1450    0.1152 -3.50 4.6e-04
crois         -0.0541     0.947   0.0236    0.0227 -2.38 1.7e-02
instab.x1      1.9649     7.134   0.4839    0.4753  4.13 3.6e-05
instab.autres  0.8498     2.339   0.4693    0.4594  1.85 6.4e-02

Likelihood ratio test=78.3 on 7 df, p=3.04e-14  n= 7286

                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.98397    0.02199   44.74 8.35e-09 ***
Multiple R-Squared: 0.997,     Adjusted R-squared: 0.9965

The naive standard errors (se) seem closer to the robust se than when not modelling the dependence: 0.98397 is very close to one, R^2 grew, etc. The dependence is strong (the risk is multiplied by 2.475 the day after an event), but conditional independence (given the covariates) seems hard to reject.

==================================================================
CASE 3
Finally, I compared these results with those without repeated events (which gives a smaller dataset): a country is removed as soon as we observe its first event. (robust se is still computed, even though the naive se should in fact be used here to compute the p-value)

coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
    pop + pib + pib2 + crois + instab.x1 + instab.autres,
    data = xstep[no.previous.event, ])

                 coef exp(coef) se(coef) robust se     z       p
pop            0.4236     1.528   0.1030    0.1157  3.66 2.5e-04
pib           -0.7821     0.457   0.2072    0.1931 -4.05 5.1e-05
pib2          -0.3069     0.736   0.1477    0.1254 -2.45 1.4e-02
crois         -0.0432     0.958   0.0281    0.0258 -1.67 9.5e-02
instab.x1      1.9925     7.334   0.5321    0.3578  5.57 2.6e-08
instab.autres  1.3571     3.885   0.5428    0.5623  2.41 1.6e-02

Likelihood ratio test=66.7 on 6 df, p=1.99e-12  n=5971 (2466 observations deleted due to missing)

> summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.86682    0.07826   11.08 0.000104 ***
Residual standard error: 0.06328 on 5 degrees of freedom
Multiple R-Squared: 0.9608,    Adjusted R-squared: 0.953

There seems to be no evidence that the robust se differ more from the naive se in case 2 than in case 3 (or case 1); if anything, they are closer. I conclude that conditional independence (martingaleness) cannot be rejected in CASE 2, when the dependence between events is modelled with a covariate.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France
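PS: For case 2, the "with cluster(id)" side of the comparison boils down to a small table (a sketch, using the object names above; a ratio close to 1 suggests little residual dependence):

  # Naive vs robust standard errors for the model that includes the
  # proximity covariate pxcw.
  cox2 <- coxph(Surv(start, stop, status) ~ pxcw + pop + pib + pib2 +
                  crois + instab.x1 + instab.autres + cluster(ccode),
                data = xstep)
  cbind(naive  = sqrt(diag(cox2$naive.var)),
        robust = sqrt(diag(cox2$var)),
        ratio  = sqrt(diag(cox2$var) / diag(cox2$naive.var)))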
Thomas Lumley
2004-Jul-28 17:42 UTC
[R] covariate selection in cox model (counting process)
On Wed, 28 Jul 2004, Mayeul KAUFFMANN wrote:

> At first look, the values seemed very similar (see below, case 2).
> However, to check this without being too subjective, and without a
> specific test, I needed other values against which to assess the size
> of the differences: what is similar, what is not?

I think the econometricians have theory for this (comparing the whole covariance matrices).

    -thomas

Thomas Lumley                 Assoc. Professor, Biostatistics
tlumley at u.washington.edu   University of Washington, Seattle
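For concreteness, one informal way to act on this suggestion (a sketch only, not the formal econometric test alluded to above) is to compare the whole robust and naive covariance matrices of a fit, rather than just their diagonals:

  # Largest entry-wise discrepancy between the sandwich (robust) and
  # inverse-information (naive) covariance matrices, scaled like a
  # correlation; fit is a coxph fit that includes a cluster() term.
  cov.rel.diff <- function(fit) {
    V <- fit$var          # robust (sandwich) covariance
    W <- fit$naive.var    # naive covariance
    max(abs(V - W) / sqrt(diag(W) %o% diag(W)))
  }

A smaller value for the model that includes the proximity covariate would point the same way as the diagonal comparisons in the previous message.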