Alex F. Bokov
2010-Feb-05 00:31 UTC
[R] Using coxph with Gompertz-distributed survival data.
Dear list: I am attempting to use what I thought would be a pretty straightforward practical application of Cox regression. I figure users of the survival package must have come across this problem before, so I would like to ask you how you dealt with it. I have set up an illustrative example and included it at the end of this post. I took a sample of 100 data points from each of two populations ('a' and 'b') known to have Gompertz distributions that differ in both their lambda (initial rate, which is higher in population 'a') and gamma (acceleration rate, which is higher in population 'b') parameters. Each data point represents the age at which an experimental animal died. First question: given the way the parameters differ between the Gompertz distributions, the mortality rate will be higher for 'a' younger ages, higher for 'b' at older ages, and the assumption of the Cox Proportional Hazards model is violated a priori, isn't it? cox.zph confirms this. Yet I found plenty of Gompertz parameter values that differ, and lead to differences in survival times detectable by coxph, yet pass the cox.zph test. Should I assume that cox.zph is insufficiently sensitive and take measures to account for non-proportionality of hazards anytime I know I'm looking at longevity data (which is believed to have a Gompertz distribution for mammals dying from 'old age')? Second question: supposedly the way to handle such data with coxph is to either stratify it by age or include an interaction term with age in the formula. The first option is not optimal, because the hazard for a Gompertz distribution changes continuously and there are no natural intervals within which the hazard ratio is relatively constant. Therefore, I would like to use the second option, but including the age term in the formula produces an error that says "Error in fitter(X, Y, strats, offset, init, control, weights = weights, : NA/NaN/Inf in foreign function call (arg 6)". My question is, what would be the correct way to include age in the formula? If you have read this far, thank you kindly for your time. Below is code for reproducing the example I explained above. # create the data a<-c(1048, 773, 1072, 753, 888, 1038, 852, 950, 1335, 1227, 904, 634, 980, 1075, 1308, 787, 1079, 1151, 677, 1110, 363, 936, 774, 644, 1080, 1055, 1119, 975, 941, 1148, 1014, 541, 1297, 584, 847, 1136, 793, 1171, 985, 934, 852, 550, 944, 1165, 1190, 99, 1227, 1052, 1106, 888, 884, 555, 862, 1241, 841, 987, 1162, 1028, 1009, 1102, 1082, 1009, 1200, 902, 1215, 1121, 1177, 882, 766, 1366, 1190, 755, 958, 38, 925, 1131, 1031, 639, 704, 1097, 820, 570, 1029, 1205, 1284, 1139, 522, 1197, 1268, 1376, 842, 1310, 959, 1160, 824, 652, 1121, 1262, 1006, 1021); b<-c(797, 897, 993, 1174, 998, 1117, 583, 698, 1125, 796, 1055, 953, 847, 892, 875, 1108, 1090, 1101, 1104, 991, 882, 1036, 889, 843, 1129, 1018, 936, 1035, 906, 1192, 1028, 1109, 790, 897, 544, 978, 1108, 1004, 730, 1093, 845, 1014, 1040, 1080, 921, 958, 971, 950, 1026, 882, 1055, 703, 901, 991, 863, 1043, 999, 784, 908, 987, 1040, 760, 887, 1028, 808, 1077, 812, 843, 1002, 639, 905, 808, 850, 1112, 736, 851, 1008, 990, 516, 1015, 942, 993, 1127, 959, 963, 1069, 940, 815, 926, 1005, 983, 1093, 898, 901, 1132, 1011, 808, 905, 1129, 840); # format the data for coxph age<-c(a,b); group<-as.factor(c('a','b'),c(100,100))); # run coxph; for simplicity, the data is uncensored ab.cox<-coxph(Surv(age)~group); # the results indicate a singificant difference between the groups ab.cox; # however, it appears that the hazards are not proportional and the result is not interpretable cox.zph(ab.cox); # including age in an interaction term leads to an error ab.int.cox<-coxph(Surv(age)~group+group:age); # the following is like above; 'c' is from a population differs from 'a' in both parameters # and coxph indicates a highly significant difference, but this time the data seems to not # violate the proportional hazards assumption c<-c(526, 858, 36, 814, 402, 902, 653, 779, 776, 971, 789, 824, 680, 1017, 957, 607, 1142, 1144, 888, 861, 868, 1052, 825, 654, 443, 1023, 727, 612, 1017, 888, 811, 820, 780, 846, 844, 272, 1083, 855, 903, 783, 1036, 1026, 462, 1007, 777, 1124, 1077, 1041, 900, 963, 332, 740, 1012, 1017, 705, 864, 1053, 643, 1195, 795, 1040, 792, 714, 971, 832, 558, 1107, 1038, 1149, 448, 605, 921, 772, 994, 1034, 949, 788, 1013, 788, 1038, 956, 850, 1045, 747, 967, 814, 661, 657, 730, 966, 676, 344, 743, 643, 1004, 766, 942, 1040, 1067, 409); age2<-c(a,c); group2<-as.factor(rep(c('a','c'),c(100,100))); ac.cox<-coxph(Surv(age2)~group2); ac.cox; cox.zph(ac.cox);
Terry Therneau
2010-Feb-05 16:48 UTC
[R] Using coxph with Gompertz-distributed survival data.
Before being helpful let me raise a couple of questions: 1. "I know I'm looking at longevity data (which is believed to have a Gompertz distribution for mammals dying from 'old age')". I'm not as convinced. The Gompertz is a nice story, but is confounded by individual risk or 'frailty'. But continue 2. "the mortality rate will be higher for 'a' younger ages, higher for 'b' at older ages, and the assumption of the Cox Proportional Hazards model is violated a priori, isn't it?" That is correct. So why exactly are you using coxph to fit the data. 3. "Yet I found plenty of Gompertz parameter values that differ, and lead to differences in survival times detectable by coxph, yet pass the cox.zph test. Should I assume that cox.zph is insufficiently sensitive..." The Cox model fit a model with an average hazard ratio over time. If the data satisfies the proportional hazards model, then this is all you need -- this single number tells you everything. If the data does not, this does not mean that such an average hazard is invalid, it tells you that this average is not the whole story and coxph is an oversimplification. I view this as similar to the fact that if a distribution is Gaussian then then (mean, var) is sufficient, everything that you ever wanted to know about the data (percentiles, outliers, ...) is summed up in those two values. If it's not Gaussian it does not follow that the mean is worthless, but it isn't a complete story. If you pick your parameters so that the change in hazard ratio is "not very large", of course cox.zph will not see it. That's also the case where an overall average is probably a pretty good summary. 4: "coxph(Surv(age) ~ group + group:age)" This is not how a change in hazard ratio over time is approached. The program should give an error. For one, why do you assume the change is linear in time? This is rather rare. You might look at the timedep package. 5. Some actual advice -- if you think it is Gompertzian why not fit a Gompertz distribution? I don't see anything in CRAN to directly fit Gompertz, but the note below talks about how to do so approximately with survreg. It's a note to myself of something to add to the survival package documentation, not yet done, and to my embarassment the file has a time stamp in 1996. Ah well. Terry Therneau -------------- next part -------------- My document "A Package for Survival Analysis in S" contains statements about how to fit Gompertz and Rayleigh distributions with the survreg routine. Nicholas Brouard, in a recent query to this group, quite correctly states that "Therneau's documentation is a little elliptic for people not so familiar with extreme value theory". I've spent the last day trying to work out concrete examples of the fits. Let me start by saying that I now think my paper's remarks were overly optimistic. This note will try to indicate why. I will use some "TeX" notation below: \alpha, \beta, etc for Greek letters. Hazard functions: Weibull: p*(\lambda)^p * t^(p-1) Extreme value: (1/ \sigma) * exp( (t- \eta)/ \sigma) Rayleigh: a + bt Gompertz: b * c^t Makeham: a + b* c^t The Makeham hazard seems to fit human mortality experience beyond infancy quite well, where "a" is a constant mortality which is independent of the health of the subject (accidents, homicide, etc) and the second term models the Gompertz assumption that "the average exhaustion of a man's power to avoid death to is such that at the end of equal infinitely small itervals of time he lost equal portions of his remaining power to oppose destruction which he had at the commencement of these intervals". For older ages "a" is a neglible portion of the death rate and the Gompertz model holds. The fitting routine depends on the decomposition Y = \eta + \sigma W, where \eta = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + ... is the fitted linear predictor and W is a distribution in "standard" form. For instance, if the response time t is Weibull, then y = log(t) follows this with \eta = log(\lambda) \sigma = 1/p Clearly 1. The Wiebull distribution with p=2 (sigma=.5) is the same as a Rayleigh distribution with a=0. It is not, however, the most general form of a Rayleigh. 2. The (least) extreme value and Gompertz distributions have the same hazard function, with \sigma = 1/ log(c), and exp(-\eta/ \sigma) = b. It would appear that the Gompertz can be fit with an identity link function combined with the extreme value distribution. However, this ignores a boundary restriction. If f(x; \eta, \sigma) is the extreme value distribution with paramters \eta and \sigma, then the definition of the Gompertz densitiy is g(x; \eta, \sigma) = 0 x< 0 g(x; \eta, \sigma) = c f(x; \eta, \sigma) x>=0 where c= exp(exp(-\eta / \sigma)) is the necessary constant so that g integrates to 1. If \eta / \sigma is far from 1, then the correction term will be minimal and survreg should give a reasonable answer. If not, the distribution can't be fit, nor can it be made to easily conform to the general fitting scheme of the program. The Makeham distribution falls into the gamma family (equation 2.3 of Kalbfleisch and Prentice, Survival Analysis), but with the same range restriction problem. In summary, the Gompertz is a truncated form of the extreme value distribution (Johnson, Kotz and Blakrishnan, Contiuous Univariate Distri- butions, section 22.8). If one ignores the truncation, i.e., assume that negative time values are possible, then it can be fit with survreg. My original note seems to have been compounded of 3 errors: the -1 arises from confusing the maximal extreme distribution (most common in theory books) with the minimal extreme distribution (used in survreg), the log() term was a typing mistake, and I never noticed the range restriction. This is one of the few topics in the report without a worked example as part of my test library (the Examples section of the package). The replacement document, currently in early draft, is intended to have a worked example for every claim and the code for that example in the appendix. This will, hopefully, cure any other mistakes of this sort.
Possibly Parallel Threads
- survreg & gompertz
- significant difference between Gompertz hazard parameters?
- coxph diagnostics plot for shape of hazard function?
- Parametric survival models with left truncated, right censored data
- Gompertz-Makeham hazard models---test for significant difference