Omar De la Cruz C.
2012-Oct-06 03:48 UTC
[R] Expected number of events, Andersen-Gill model fit via coxph in package survival
Hello, I am interested in producing the expected number of events, in a recurring events setting. I am using the Andersen-Gill model, as fit by the function "coxph" in the package "survival." I need to produce expected numbers of events for a cohort, cumulatively, at several fixed times. My ultimate goal is: To fit an AG model to a reference sample, then use that fitted model to generate expected numbers of events for a new cohort; then, comparing the expected vs. the observed numbers of events would give us some idea of whether the new cohort differs from the reference one.>From my reading of the documentation and the text by Therneau andGrambsch, it seems that the function "survexp" is what I need. But using it I am not able to obtain expected numbers of events that match reasonably well the observed numbers *even for the same reference population.* So, I think I am misunderstanding something quite badly. Below is an example that illustrates the situation. At the end I include the sessionInfo(). Thank you! Omar. ############################ # Example of unexpected behavior in computing estimated number of events # in using package "survival" for fitting the Andersen-Gill model require(survival) head(bladder2) # this is the data, in interval format # Fit Andersen-Gill model cphfit = coxph(Surv(start,stop,event)~rx+number+size+cluster(id),data=bladder2) # Choose some arbitrary time horizons t.horiz = seq(min(bladder2$start),max(bladder2$stop),length=6) # Compute the cohort expected survival s = survexp(~1,data=bladder2,ratetable=cphfit,times=t.horiz) # This are the expected survival values: s$surv # We are interested in the rate of events e.r = as.vector( 1 - s$surv ) # How does this compare to the actual number of events, cumulative at # each time horizon? observed = numeric(length(t.horiz)) for (i in 1:length(t.horiz)){ observed[i] = sum(bladder2$event[bladder2$stop <= t.horiz[i]]) } print(observed) # We would like to compute expected numbers of events that approximately # match these observed values. # We should multiply the expected survival rates by the number of individuals. # Now, one would think that this is the number of at-risk individuals: s$n.risk # But that is actually the total number of rows in the data. In any case, # these numbers do not match: rbind(expected = s$n.risk*e.r,observed=observed) # What if we multiply by the number of individuals? rbind(expected = length(unique(bladder2$id))*e.r,observed=observed) # This does not work either! The required factor seems to be about 133, but # I don't see an explanation for that. # In this example, multiplying by 133.182 gives a good match between observed # and expected values, but in other examples even the shape of the curves # are different. # Multiplying by a number of individuals at risk at each time point # (number of individuals # for which there is a time interval containing the time horizon) does # not work either. #####################> sessionInfo()R version 2.15.1 (2012-06-22) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] splines stats graphics grDevices utils datasets methods base other attached packages: [1] survival_2.36-14 loaded via a namespace (and not attached): [1] tools_2.15.1
David Winsemius
2012-Oct-06 15:56 UTC
[R] Expected number of events, Andersen-Gill model fit via coxph in package survival
On Oct 5, 2012, at 8:48 PM, Omar De la Cruz C. wrote:> Hello, > > I am interested in producing the expected number of events, in a > recurring events setting. I am using the Andersen-Gill model, as fit > by the function "coxph" in the package "survival." > > I need to produce expected numbers of events for a cohort, > cumulatively, at several fixed times. My ultimate goal is: To fit an > AG model to a reference sample, then use that fitted model to generate > expected numbers of events for a new cohort; then, comparing the > expected vs. the observed numbers of events would give us some idea of > whether the new cohort differs from the reference one. > >> From my reading of the documentation and the text by Therneau and > Grambsch, it seems that the function "survexp" is what I need. But > using it I am not able to obtain expected numbers of events that match > reasonably well the observed numbers *even for the same reference > population.* So, I think I am misunderstanding something quite badly. > > Below is an example that illustrates the situation. At the end I > include the sessionInfo(). > > Thank you! > > Omar. > > > ############################ > > # Example of unexpected behavior in computing estimated number of events > # in using package "survival" for fitting the Andersen-Gill model > > require(survival) > > head(bladder2) # this is the data, in interval format > > # Fit Andersen-Gill model > cphfit = coxph(Surv(start,stop,event)~rx+number+size+cluster(id),data=bladder2) > > # Choose some arbitrary time horizons > t.horiz = seq(min(bladder2$start),max(bladder2$stop),length=6) > > # Compute the cohort expected survival > s = survexp(~1,data=bladder2,ratetable=cphfit,times=t.horiz) > # This are the expected survival values: > s$surv > > # We are interested in the rate of events > e.r = as.vector( 1 - s$surv ) >Rates are events/n-exposed/time, so those are not rates as I understand the term. And I do not see any accounting for the length of intervals at risk in the rest of your code. That vector does not even calculate interval event expectations as I read it. -- David> # How does this compare to the actual number of events, cumulative at > # each time horizon? > > observed = numeric(length(t.horiz)) > > for (i in 1:length(t.horiz)){ > > observed[i] = sum(bladder2$event[bladder2$stop <= t.horiz[i]]) > > } > > print(observed) > > # We would like to compute expected numbers of events that approximately > # match these observed values. > > # We should multiply the expected survival rates by the number of individuals. > > # Now, one would think that this is the number of at-risk individuals: > s$n.risk > > # But that is actually the total number of rows in the data. In any case, > # these numbers do not match: > > rbind(expected = s$n.risk*e.r,observed=observed) > > # What if we multiply by the number of individuals? > > rbind(expected = length(unique(bladder2$id))*e.r,observed=observed) > > # This does not work either! The required factor seems to be about 133, but > # I don't see an explanation for that. > > # In this example, multiplying by 133.182 gives a good match between observed > # and expected values, but in other examples even the shape of the curves > # are different. > > # Multiplying by a number of individuals at risk at each time point > # (number of individuals > # for which there is a time interval containing the time horizon) does > # not work either. > > ##################### > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] splines stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] survival_2.36-14 > > loaded via a namespace (and not attached): > [1] tools_2.15.1 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD Alameda, CA, USA
Terry Therneau
2012-Oct-08 13:58 UTC
[R] Expected number of events, Andersen-Gill model fit via coxph in package survival
> I am interested in producing the expected number of events, in a > recurring events setting. I am using the Andersen-Gill model, as fit > by the function "coxph" in the package "survival." > > I need to produce expected numbers of events for a cohort, > cumulatively, at several fixed times. My ultimate goal is: To fit an > AG model to a reference sample, then use that fitted model to generate > expected numbers of events for a new cohort; then, comparing the > expected vs. the observed numbers of events would give us some idea of > whether the new cohort differs from the reference one. > >> From my reading of the documentation and the text by Therneau and > Grambsch, it seems that the function "survexp" is what I need. But > using it I am not able to obtain expected numbers of events that match > reasonably well the observed numbers *even for the same reference > population.* So, I think I am misunderstanding something quite badly. >You've hit a common confusion. Observed versus expected events computations are done on a cumulative hazard scale H, not the surivival scale S; S = exp(-H). Relating this back to simple Poisson models H(t) would be the expected number of events by time t and S(t) the probability of "no events before time t". G. Berry (Biometrics 1983) has a classic ane readable article on this (especially if you ignore the proofs). Using your example: > cphfit <- coxph(Surv(start,stop,event)~rx+number+size+cluster(id),data=bladder2) > zz <- predict(cphfit, type='expected') > c(sum(zz), sum(bladder2$event)) [1] 112 112 > tdata <- bladder2[1:10] #new data set (lazy way) > predict(cphfit, type='expected', newdata=tdata) [1] 0.0324089 0.3226540 0.4213402 1.0560768 0.6702130 0.2163531 0.6490665 [8] 0.8864808 0.2932915 0.5190647 You can also do this using survexp and the cohort=FALSE argument, which would return S(t) for each subject and we would then use -log(result) to get H. This is how it was done when I wrote the book, but the newer predict function is easier. Terry Therneau