Mayeul KAUFFMANN
2004-Jul-04 16:46 UTC
[R] smooth non cumulative baseline hazard in Cox model
Hi everyone.

There have been several threads on the baseline hazard in the Cox model, but I think they were all about the cumulative baseline hazard, for instance
http://tolstoy.newcastle.edu.au/R/help/01a/0464.html
http://tolstoy.newcastle.edu.au/R/help/01a/0436.html

"basehaz" in package survival seems to compute a cumulative hazard. Extract from the basehaz function:

sfit <- survfit(fit)
H <- -log(sfit$surv)

Since sfit$surv is monotonic, H will be monotonic too, which makes me think it is a cumulative function. I think H(t) is the "sum" ("integration") of the lambdas from 0 to t (am I right?).

What I need might be lambda(t) or lambda(t)dt (I do not know for sure), something involving the instantaneous baseline risk. But, for sure, I've seen what I need elsewhere. Specifically, here:
http://econ.worldbank.org/files/13214_CivilPeace.pdf
that is, page 41 of HEGRE, Håvard; ELLINGSEN, Tanja; GATES, Scott; GLEDITSCH, Nils Petter (2001). "Toward a Democratic Civil Peace? Democracy, Political Change, and Civil War, 1816-1992". American Political Science Review, vol. 95, no. 1.

I am doing the same job as Hegre et al. (studying civil wars) but with the counting process formulation of the Cox model (I use intervals; my formula looks like Surv(start,stop,status) ~ etc.). Like Hegre et al. (who use the Stata software), I would like to have a curve showing the (instantaneous) overall risk of civil war at a given time, with the effect of the covariates taken away.

For those who also use the SAS software (which I do not, unfortunately), the job I need done seems to be done by the SMOOTH macro described in "Survival Analysis Using the SAS System: A Practical Guide" by Paul D. Allison (see below). There is a graph (output 5.22, "smoothed hazard functions for two financial aid groups"), p. 170 of his book, which shows another example (except that I only need it for one group, at mean values).

I hope I am clear enough. Thank you a lot for any help.
(Sorry for mistakes in English, as I am a non-native English speaker.)

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

------------ SMOOTH Macro (SAS) ------------

%macro smooth (data=_last_, time=, width=, survival=survival);
/*********************************************************************
MACRO SMOOTH produces graphs of smoothed hazard functions using output
from either PROC LIFETEST or PROC PHREG. With PROC LIFETEST, it uses
the data set produced by the OUTSURV option in the PROC statement.
With PROC PHREG, it uses the data set produced by the BASELINE
statement. SMOOTH employs a kernel smoothing method described by
H. Ramlau-Hansen (1983), "Smoothing Counting Process Intensities by
Means of Kernel Functions," The Annals of Statistics 11, 453-466.
If there is more than one survival curve in the input data set, SMOOTH
will produce multiple smoothed hazard curves on the same axes.

There are four parameters:
  DATA     is the name of the data set containing survivor function
           estimates. The default is the most recently created data set.
  TIME     is the name of the variable containing event times.
  SURVIVAL is the name of a variable containing survivor function
           estimates (the default is SURVIVAL, which is the automatic
           name in PROC LIFETEST).
  WIDTH    is the bandwidth of the smoothing function. The default is
           1/5 of the range of event times.

Example of usage:  %smooth(data=my.data,time=duration,width=8,survival=s)

Author: Paul D. Allison, University of Pennsylvania
        allison at ssc.upenn.edu
*************************************************************************/
data _inset_;
 set &data end=final;
 retain _grp_ _censor_ 0;
 t=&time;
 survival=&survival;
 if t=0 and survival=1 then _grp_=_grp_+1;
 keep _grp_ t survival;
 if final and _grp_ > 1 then call symput('nset','yes');
   else if final then call symput('nset','no');
 if _censor_ = 1 then delete;
 if survival in (0,1) then delete;
run;
proc iml;
 use _inset_;
 read all var {t _grp_};
 %if &width ne %then %let w2=&width;
 %else %let w2=(max(t)-min(t))/5;
 w=&w2;
 z=char(w,8,2);
 call symput('width',z);
 numset=max(_grp_);
 create _plt_ var{lambda s group};
 setin _inset_;
 do m=1 to numset;
   read all var {t survival _grp_} where (_grp_=m);
   n=nrow(survival);
   lo=t[1] + w;
   hi=t[n] - w;
   npt=50;
   inc=(hi-lo)/npt;
   s=lo+(1:npt)`*inc;
   group=j(npt,1,m);
   slag=1//survival[1:n-1];
   h=1-survival/slag;
   x = (j(npt,1,1)*t` - s*j(1,n,1))/w;
   k=.75*(1-x#x)#(abs(x)<=1);
   lambda=k*h/w;
   append;
 end;
quit;
%if &nset = yes %then %let c==group;
%else %let c=;
proc gplot data=_plt_;
 plot lambda*s &c / vaxis=axis1 vzero haxis=axis2;
 axis1 label=(angle=90 f=titalic 'Hazard Function') minor=none;
 axis2 label=(f=titalic "Time (bandwidth=&width)") minor=none;
 symbol1 i=join color=black line=1;
 symbol2 i=join color=red line=2;
 symbol3 i=join color=green line=3;
 symbol4 i=join color=blue line=4;
run;
quit;
%mend smooth;
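The kernel-smoothing idea behind Allison's SMOOTH macro can be reproduced in R directly from a survfit object. The sketch below is only an illustration under assumptions, not Allison's code or an established function: it assumes a fitted coxph object named fit and a hand-picked bandwidth width, and applies the same Epanechnikov kernel to the discrete hazard contributions at each event time.

library(survival)

smooth.hazard <- function(fit, width, npt = 50) {
    sfit <- survfit(fit)                    # baseline survivor curve (at mean covariates)
    t <- sfit$time
    S <- sfit$surv
    h <- 1 - S / c(1, S[-length(S)])        # discrete hazard contribution at each time
    s <- seq(min(t) + width, max(t) - width, length.out = npt)
    lambda <- sapply(s, function(si) {
        x <- (t - si) / width
        k <- 0.75 * (1 - x^2) * (abs(x) <= 1)   # Epanechnikov kernel, as in the macro
        sum(k * h) / width
    })
    data.frame(time = s, hazard = lambda)
}

## e.g.  plot(smooth.hazard(fit, width = 8), type = "l")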
Prof Brian Ripley
2004-Jul-04 20:23 UTC
[R] smooth non cumulative baseline hazard in Cox model
If you have a smooth cumulative hazard you can differentiate it to get a differentiable hazard (using smooth in its technical sense). Try http://www.stats.ox.ac.uk/pub/MASS4/VR4stat.pdf for possible approaches in S/R. There are several, e.g. in packages muhaz, polspline, sm, locfit, ....

What you cannot have is

- The Cox proportional hazards approach
- A smooth baseline hazard

since the derivation of the partial likelihood assumes a non-smooth baseline hazard. You can have proportional hazards and e.g. penalized log-likelihood, though.

As an approximation you can smooth the fitted baseline cumulative hazard (e.g. by package pspline) and ask for its derivative.

On Sun, 4 Jul 2004, Mayeul KAUFFMANN wrote:

> Hi everyone.
> There's been several threads on baseline hazard in Cox model but I think
> they were all on cumulative baseline hazard [...]
>
> I'm doing the same job as Hegre et al. (studying civil wars) but with the
> counting process formulation of the Cox model. (I use intervals, my formula
> looks like Surv(start,stop,status)~ etc.).

Careful, that is left- and right-censored, not intervals. Surv has a type= argument.

> Like Hegre and alii (who use the stata software) I would like to have a
> curve showing what is the (instantaneous) overall risk of civil war at a
> given time, taking away the effect of the covariates.
> [...]

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
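A minimal sketch of that approximation, assuming a hypothetical fitted coxph object named cox.fit and using base R's smooth.spline() rather than package pspline:

library(survival)

H  <- basehaz(cox.fit)                  # cumulative baseline hazard: columns time, hazard
sp <- smooth.spline(H$time, H$hazard)   # smooth H(t)
d  <- predict(sp, H$time, deriv = 1)    # dH/dt approximates the instantaneous baseline hazard
plot(d$x, pmax(d$y, 0), type = "l",     # the derivative can dip below zero; clamp at 0
     xlab = "time", ylab = "smoothed baseline hazard")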
Mayeul KAUFFMANN
2004-Jul-26 10:07 UTC
[R] Re: smooth non cumulative baseline hazard in Cox model
Hello everyone,

I am searching for a covariate selection procedure for a Cox model formulated as a counting process. I use intervals; my formula looks like

coxph(Surv(start,stop,status) ~ x1+x2+...+cluster(id), robust=T)

where id is a country code (I study the occurrence of civil wars from 1962 to 1997).

I'd like something not based on p-values, since they have several flaws for this purpose. I turned to other criteria, but all the articles I read seem to apply to the classical formulation of the Cox model, not the counting-process one (or they apply to both but I am not aware of this).

I've tried AIC with

> step(cox.fit)

or

> stepAIC(cox.fit)

and BIC using

> step(cox.fit, k = log(n))

but there seem to be two theoretical problems to address:

(1) These values are based on the partial log-likelihood ("loglik"). I wonder whether this is correct for the Cox model formulated as a *counting process*, with many (consecutive) observations for a given individual, and hence some observations not being independent. Since "the likelihood ratio and score tests assume independence of observations within a cluster, the Wald and robust score tests do not" (R warning), and the likelihood ratio is based on loglik, can I use loglik in BIC with some dependent observations?

[I have 170 individuals (namely, countries) over 36 years, some single countries having up to 140 very short observation intervals, others having (at the other extreme) only one long interval per year. That is because I created artificial covariates measuring the proximity since some events: exp(-days.since.event/a.chosen.parameter). I split every yearly interval for which these covariates change rapidly (i.e. when the events are recent), yielding up to 11 intervals a year.]

(2) What penalty term should be used? It seems natural to include the number of covariates, k. What about the number of observations? I found several definitions:

AIC = -2 loglik(b) + 2k
Schwarz Bayesian information criterion: SBIC = -2 loglik(b) + k ln(n)

Frédérique Letué (author of the PhD thesis "COX MODEL: ESTIMATION VIA MODEL SELECTION AND BIVARIATE SHOCK MODEL", http://www-lmc.imag.fr/lmc-sms/Frederique.Letue/These3.ps) suggested to me

AIC = -loglik(b) + 2k/n
BIC = -loglik(b) + 2 ln(n) k/n,

with other possible values for the parameter "2" in this case (see her thesis p. 100, but this section is in French).

All these do not tell *what to take for n*. There are three main possibilities:

a) Taking the number of observations (including censored ones) will give a huge n (around 6000 to 8000), which may seem meaningless since some observations are only a few days long. With n in the denominator (Letué's criteria), the penalty term would be so low that it is like not having one:

> log(7000)/7000
[1] 0.001264809

(where loglik from summary(cox.fit) ranges from -155 to -175, depending on the model)

b) Volinsky & Raftery "propose a revision of the penalty term in BIC so that the penalty is defined in terms of the number of uncensored events instead of the number of observations" (Volinsky & Raftery, Bayesian Information Criterion for Censored Survival Models, June 16, 1999, http://www.research.att.com/~volinsky/papers/biocs.ps). This could be computed with

> sum(coxph.detail(cox.interac.dsi6mois)$nevent)

Letué's BIC penalty term with 50 events would then be

> 2*log(50)/50
[1] 0.1564809

which would have more effect. However, adding or removing a country which has data for the 36 years but no event (and is thus censored) will not change this BIC.
Thus, it is not suitable for accounting for missing data that do not reduce the number of events. I would like the criterion to take this into account, because the covariates do not all have the same missing data. The question is: when I have the choice between adding a covariate x10 or x11, which have different (not nested) sets of missing values, which one is best? Estimating all subsets of the full model (full model = all covariates) with a dataset containing no missing data for the full model would be a solution, but it would more than halve the dataset for many subsets of the covariates.

I should mention that step(cox.fit) gives a warning and stops:

"Error in step(cox.fit) : number of rows in use has changed: remove missing values?"

which makes me ask whether the whole procedure is OK with models of different sample sizes.

c) "For discrete time event history analysis, the same choice has been made, while the total number of exposure time units has also been used, for consistency with logistic regression (Raftery, Lewis, Aghajanian and Kahn, 1993; Raftery, Lewis and Aghajanian, 1994)" (Raftery, Bayesian Model Selection in Social Research, 1994, http://www.stat.washington.edu/tech.reports/bic.ps).

I am not sure what "exposure time units" means. But since I could have used a logit model with yearly observations [but with many flaws...], I suggest I could use the number of years (the sum of the lengths of the intervals, in years):

> sum((fit$y)[,2]-(fit$y)[,1])/365.25
[1] 3759.537

This may still be too high. Since I have data from 1962 to 1997 (36 years), I have the following number of "complete cases"-equivalents:

> sum((fit$y)[,2]-(fit$y)[,1])/(365.25*36)
[1] 104.4316

This seems more reasonable and would account for NAs in different models. However, it might be too high, because some countries are not at risk during the whole period: some did not exist because they gained independence near the end of the period (e.g. ex-USSR countries in the early 1990s), or because they were experiencing an event (new wars in countries already experiencing a war are not taken into account). I may take the *proportion* of available data relative to time at risk to adjust for this: a country at risk during one year and for which data are available for that entire year will increase n by 1, not by 1/36. If the data frame "dataset" contains all countries at risk (including some with NAs), and assuming id == id[complete.cases(dataset)] (all countries have at least one complete observation), this can be computed by

> attach(dataset)
> sum(tapply(stop[complete.cases(dataset)]-start[complete.cases(dataset)],CCODE,sum)
    /tapply(stop-start,CCODE,sum))

But this would be a rather empirical adjustment, maybe with no theoretical basis. And I do not think I can enter this as argument k to step()....

Thank you for having read this. I hope I was not too long. And thank you a lot for any help, comments, etc.
(Sorry for mistakes in English, as I am a non-native English speaker.)

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France
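For what it is worth, the mechanics of option (b) can be tried directly with step(); the sketch below only shows how to plug a given choice of n into the penalty, it does not settle which n is right. It assumes the cox.fit object used above, with no missing-value problems (otherwise step() stops with the "number of rows in use has changed" error mentioned earlier).

library(survival)

## number of uncensored events in the fitted model
d.events <- sum(coxph.detail(cox.fit)$nevent)

## BIC-style selection with the penalty based on events (Volinsky & Raftery)
step(cox.fit, k = log(d.events))

## for comparison, the "years at risk" candidate for n discussed above
## (cox.fit$y holds the (start, stop, status) matrix)
n.years <- sum(cox.fit$y[, 2] - cox.fit$y[, 1]) / 365.25
step(cox.fit, k = log(n.years))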