Ferenci Tamas
2019-Aug-23 14:39 UTC
[R] results of a survival analysis change when converting the data to counting process format
Thanks for all those comments, and thanks Terry for the repair!
Tamas
2019. augusztus 23., 14:40:05, ?rtad:
I've spent some time chasing this down, and another error of similar type.
It's repaired in version 3.0-9 (soon to go to CRAN, or at least that is the
plan --- I'm working through one last fail in the checks.)
For those who want to know more---
When there is start, stop data such as this: (0, 101] (101, 123] (123, 400],
.... the algorithm used by coxph is to work from largest to smallest time.
Every time it hits the start of an interval (400, 123, 101) that observation is
added into the risk set, and whenever it hits the start of an interval the
observation is removed (123, 101, 0). "Add" and "remove"
also means "add to the current estimate of the running mean and
variance" of the covariates. At each event time that current mean and
variance are are used to update the loglik and its derivatives.
When a subject is diced into little peices of (1,2] (2,3] (3,4], .... like the
last fit below, the sheer number of updates to the running mean/variance can
lead to accumulated round off error. The code works hard to avoid this, and
some of that is subtle --- round off error is a tricky business --- and has gone
through several refinements. Nevertheless, the result for this data set
should not have been that far off. Another user example from about the same
time was worse so I've been focusing on that one: one of the two splits led
to an exp() overflow and one didn't, giving results that were completely
different. This led to a more careful review and some changes that addressed
the example below as well.
Terry T.
On 8/23/19 5:00 AM, r-help-request at r-project.org wrote:
On 2019-08-18 19:10, Ferenci Tamas wrote:
Dear All,
Consider the following simple example:
library( survival )
data( veteran )
coef( coxph(Surv(time, status) ~ trt + prior + karno, data = veteran) )
trt prior karno
0.180197194 -0.005550919 -0.033771018
Note that we have neither time-dependent covariates, nor time-varying
coefficients, so the results should be the same if we change to
counting process format, no matter where we cut the times.
That's true if we cut at event times:
veteran2 <- survSplit( Surv(time, status) ~ trt + prior + karno,
data = veteran, cut = unique( veteran$time ) )
coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran2 ) )
trt prior karno
0.180197194 -0.005550919 -0.033771018
But quite interestingly not true, if we cut at every day:
veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno,
data = veteran, cut = 1:max(veteran$time) )
coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran3 ) )
trt prior karno
0.180197215 -0.005550913 -0.033771016
The difference is not large, but definitely more than just a rounding
error, or something like that.
What's going on? How can the results get wrong, especially by
including more cutpoints?