Ferenci Tamas
2019-Aug-23 14:39 UTC
[R] results of a survival analysis change when converting the data to counting process format
Thanks for all those comments, and thanks Terry for the repair! Tamas 2019. augusztus 23., 14:40:05, ?rtad: I've spent some time chasing this down, and another error of similar type. It's repaired in version 3.0-9 (soon to go to CRAN, or at least that is the plan --- I'm working through one last fail in the checks.) For those who want to know more--- When there is start, stop data such as this: (0, 101] (101, 123] (123, 400], .... the algorithm used by coxph is to work from largest to smallest time. Every time it hits the start of an interval (400, 123, 101) that observation is added into the risk set, and whenever it hits the start of an interval the observation is removed (123, 101, 0). "Add" and "remove" also means "add to the current estimate of the running mean and variance" of the covariates. At each event time that current mean and variance are are used to update the loglik and its derivatives. When a subject is diced into little peices of (1,2] (2,3] (3,4], .... like the last fit below, the sheer number of updates to the running mean/variance can lead to accumulated round off error. The code works hard to avoid this, and some of that is subtle --- round off error is a tricky business --- and has gone through several refinements. Nevertheless, the result for this data set should not have been that far off. Another user example from about the same time was worse so I've been focusing on that one: one of the two splits led to an exp() overflow and one didn't, giving results that were completely different. This led to a more careful review and some changes that addressed the example below as well. Terry T. On 8/23/19 5:00 AM, r-help-request at r-project.org wrote: On 2019-08-18 19:10, Ferenci Tamas wrote: Dear All, Consider the following simple example: library( survival ) data( veteran ) coef( coxph(Surv(time, status) ~ trt + prior + karno, data = veteran) ) trt prior karno 0.180197194 -0.005550919 -0.033771018 Note that we have neither time-dependent covariates, nor time-varying coefficients, so the results should be the same if we change to counting process format, no matter where we cut the times. That's true if we cut at event times: veteran2 <- survSplit( Surv(time, status) ~ trt + prior + karno, data = veteran, cut = unique( veteran$time ) ) coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran2 ) ) trt prior karno 0.180197194 -0.005550919 -0.033771018 But quite interestingly not true, if we cut at every day: veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno, data = veteran, cut = 1:max(veteran$time) ) coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran3 ) ) trt prior karno 0.180197215 -0.005550913 -0.033771016 The difference is not large, but definitely more than just a rounding error, or something like that. What's going on? How can the results get wrong, especially by including more cutpoints?