Alex F. Bokov
2010-Feb-05 00:31 UTC
[R] Using coxph with Gompertz-distributed survival data.
Dear list:
I am attempting to use what I thought would be a pretty straightforward
practical application of Cox regression. I figure users of the survival package
must have come across this problem before, so I would like to ask you how you
dealt with it. I have set up an illustrative example and included it at the end
of this post.
I took a sample of 100 data points from each of two populations ('a' and
'b') known to have Gompertz distributions that differ in both their
lambda (initial rate, which is higher in population 'a') and gamma
(acceleration rate, which is higher in population 'b') parameters. Each
data point represents the age at which an experimental animal died.
First question: given the way the parameters differ between the Gompertz
distributions, the mortality rate will be higher for 'a' younger ages,
higher for 'b' at older ages, and the assumption of the Cox Proportional
Hazards model is violated a priori, isn't it? cox.zph confirms this. Yet I
found plenty of Gompertz parameter values that differ, and lead to differences
in survival times detectable by coxph, yet pass the cox.zph test. Should I
assume that cox.zph is insufficiently sensitive and take measures to account for
non-proportionality of hazards anytime I know I'm looking at longevity data
(which is believed to have a Gompertz distribution for mammals dying from
'old age')?
Second question: supposedly the way to handle such data with coxph is to either
stratify it by age or include an interaction term with age in the formula. The
first option is not optimal, because the hazard for a Gompertz distribution
changes continuously and there are no natural intervals within which the hazard
ratio is relatively constant. Therefore, I would like to use the second option,
but including the age term in the formula produces an error that says
"Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
NA/NaN/Inf in foreign function call (arg 6)". My question is, what would be
the correct way to include age in the formula?
If you have read this far, thank you kindly for your time. Below is code for
reproducing the example I explained above.
# create the data
a<-c(1048, 773, 1072, 753, 888, 1038, 852, 950, 1335, 1227, 904, 634, 980,
1075, 1308, 787, 1079, 1151, 677, 1110, 363, 936, 774, 644, 1080, 1055, 1119,
975, 941, 1148, 1014, 541, 1297, 584, 847, 1136, 793, 1171, 985, 934, 852, 550,
944, 1165, 1190, 99, 1227, 1052, 1106, 888, 884, 555, 862, 1241, 841, 987, 1162,
1028, 1009, 1102, 1082, 1009, 1200, 902, 1215, 1121, 1177, 882, 766, 1366, 1190,
755, 958, 38, 925, 1131, 1031, 639, 704, 1097, 820, 570, 1029, 1205, 1284, 1139,
522, 1197, 1268, 1376, 842, 1310, 959, 1160, 824, 652, 1121, 1262, 1006, 1021);
b<-c(797, 897, 993, 1174, 998, 1117, 583, 698, 1125, 796, 1055, 953, 847,
892, 875, 1108, 1090, 1101, 1104, 991, 882, 1036, 889, 843, 1129, 1018, 936,
1035, 906, 1192, 1028, 1109, 790, 897, 544, 978, 1108, 1004, 730, 1093, 845,
1014, 1040, 1080, 921, 958, 971, 950, 1026, 882, 1055, 703, 901, 991, 863, 1043,
999, 784, 908, 987, 1040, 760, 887, 1028, 808, 1077, 812, 843, 1002, 639, 905,
808, 850, 1112, 736, 851, 1008, 990, 516, 1015, 942, 993, 1127, 959, 963, 1069,
940, 815, 926, 1005, 983, 1093, 898, 901, 1132, 1011, 808, 905, 1129, 840);
# format the data for coxph
age<-c(a,b); group<-as.factor(c('a','b'),c(100,100)));
# run coxph; for simplicity, the data is uncensored
ab.cox<-coxph(Surv(age)~group);
# the results indicate a singificant difference between the groups
ab.cox;
# however, it appears that the hazards are not proportional and the result is
not interpretable
cox.zph(ab.cox);
# including age in an interaction term leads to an error
ab.int.cox<-coxph(Surv(age)~group+group:age);
# the following is like above; 'c' is from a population differs from
'a' in both parameters
# and coxph indicates a highly significant difference, but this time the data
seems to not
# violate the proportional hazards assumption
c<-c(526, 858, 36, 814, 402, 902, 653, 779, 776, 971, 789, 824, 680, 1017,
957, 607, 1142, 1144, 888, 861, 868, 1052, 825, 654, 443, 1023, 727, 612, 1017,
888, 811, 820, 780, 846, 844, 272, 1083, 855, 903, 783, 1036, 1026, 462, 1007,
777, 1124, 1077, 1041, 900, 963, 332, 740, 1012, 1017, 705, 864, 1053, 643,
1195, 795, 1040, 792, 714, 971, 832, 558, 1107, 1038, 1149, 448, 605, 921, 772,
994, 1034, 949, 788, 1013, 788, 1038, 956, 850, 1045, 747, 967, 814, 661, 657,
730, 966, 676, 344, 743, 643, 1004, 766, 942, 1040, 1067, 409);
age2<-c(a,c);
group2<-as.factor(rep(c('a','c'),c(100,100)));
ac.cox<-coxph(Surv(age2)~group2);
ac.cox;
cox.zph(ac.cox);
Terry Therneau
2010-Feb-05 16:48 UTC
[R] Using coxph with Gompertz-distributed survival data.
Before being helpful let me raise a couple of questions:
1. "I know I'm looking at longevity data (which is believed to have a
Gompertz distribution for mammals dying from 'old age')".
I'm not as convinced. The Gompertz is a nice story, but is
confounded by individual risk or 'frailty'. But continue
2. "the mortality rate will be higher for 'a' younger ages,
higher for
'b' at older ages, and the assumption of the Cox Proportional Hazards
model is violated a priori, isn't it?"
That is correct. So why exactly are you using coxph to fit the data.
3. "Yet I found plenty of Gompertz parameter values that differ, and
lead to differences in survival times detectable by coxph, yet pass the
cox.zph test. Should I assume that cox.zph is insufficiently
sensitive..."
The Cox model fit a model with an average hazard ratio over time. If
the data satisfies the proportional hazards model, then this is all you
need -- this single number tells you everything. If the data does not,
this does not mean that such an average hazard is invalid, it tells you
that this average is not the whole story and coxph is an
oversimplification. I view this as similar to the fact that if a
distribution is Gaussian then then (mean, var) is sufficient, everything
that you ever wanted to know about the data (percentiles, outliers, ...)
is summed up in those two values. If it's not Gaussian it does not
follow that the mean is worthless, but it isn't a complete story.
If you pick your parameters so that the change in hazard ratio is "not
very large", of course cox.zph will not see it. That's also the case
where an overall average is probably a pretty good summary.
4: "coxph(Surv(age) ~ group + group:age)"
This is not how a change in hazard ratio over time is approached. The
program should give an error. For one, why do you assume the change is
linear in time? This is rather rare. You might look at the timedep
package.
5. Some actual advice -- if you think it is Gompertzian why not fit a
Gompertz distribution?
I don't see anything in CRAN to directly fit Gompertz, but the note
below talks about how to do so approximately with survreg. It's a note
to myself of something to add to the survival package documentation, not
yet done, and to my embarassment the file has a time stamp in 1996. Ah
well.
Terry Therneau
-------------- next part --------------
My document "A Package for Survival Analysis in S" contains
statements
about how to fit Gompertz and Rayleigh distributions with the survreg
routine. Nicholas Brouard, in a recent query to this group, quite correctly
states that "Therneau's documentation is a little elliptic for people
not
so familiar with extreme value theory".
I've spent the last day trying to work out concrete examples of the fits.
Let me start by saying that I now think my paper's remarks were overly
optimistic. This note will try to indicate why. I will use some
"TeX"
notation below: \alpha, \beta, etc for Greek letters.
Hazard functions:
Weibull: p*(\lambda)^p * t^(p-1)
Extreme value: (1/ \sigma) * exp( (t- \eta)/ \sigma)
Rayleigh: a + bt
Gompertz: b * c^t
Makeham: a + b* c^t
The Makeham hazard seems to fit human mortality experience beyond
infancy quite well, where "a" is a constant mortality which is
independent of the health of the subject (accidents, homicide, etc)
and the second term models the Gompertz assumption that "the average
exhaustion of a man's power to avoid death to is such that at the end
of equal infinitely small itervals of time he lost equal portions of
his remaining power to oppose destruction which he had at the
commencement of these intervals". For older ages "a" is a
neglible
portion of the death rate and the Gompertz model holds.
The fitting routine depends on the decomposition Y = \eta + \sigma W, where
\eta = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + ... is the fitted linear
predictor and W is a distribution in "standard" form. For instance,
if
the response time t is Weibull, then y = log(t) follows this with
\eta = log(\lambda)
\sigma = 1/p
Clearly
1. The Wiebull distribution with p=2 (sigma=.5) is the same as a Rayleigh
distribution with a=0. It is not, however, the most general form of a
Rayleigh.
2. The (least) extreme value and Gompertz distributions have the same
hazard function, with
\sigma = 1/ log(c), and exp(-\eta/ \sigma) = b.
It would appear that the Gompertz can be fit with an identity link function
combined with the extreme value distribution. However, this ignores a
boundary restriction. If f(x; \eta, \sigma) is the extreme value distribution
with paramters \eta and \sigma, then the definition of the Gompertz densitiy
is
g(x; \eta, \sigma) = 0 x< 0
g(x; \eta, \sigma) = c f(x; \eta, \sigma) x>=0
where c= exp(exp(-\eta / \sigma)) is the necessary constant so that g integrates
to 1. If \eta / \sigma is far from 1, then the correction term will be
minimal and survreg should give a reasonable answer. If not, the distribution
can't be fit, nor can it be made to easily conform to the general fitting
scheme of the program.
The Makeham distribution falls into the gamma family (equation 2.3 of
Kalbfleisch and Prentice, Survival Analysis), but with the same range
restriction problem.
In summary, the Gompertz is a truncated form of the extreme value
distribution (Johnson, Kotz and Blakrishnan, Contiuous Univariate Distri-
butions, section 22.8). If one ignores the truncation, i.e., assume that
negative time values are possible, then it can be fit with survreg. My
original note seems to have been compounded of 3 errors: the -1 arises
from confusing the maximal extreme distribution (most common in theory books)
with the minimal extreme distribution (used in survreg), the log() term
was a typing mistake, and I never noticed the range restriction.
This is one of the few topics in the report without a worked example
as part of my test library (the Examples section of the package). The
replacement document, currently in early draft, is intended to have a worked
example for every claim and the code for that example in the appendix.
This will, hopefully, cure any other mistakes of this sort.
Seemingly Similar Threads
- survreg & gompertz
- significant difference between Gompertz hazard parameters?
- coxph diagnostics plot for shape of hazard function?
- Parametric survival models with left truncated, right censored data
- Gompertz-Makeham hazard models---test for significant difference