Dear R-help members,

Apologies - I am posting on behalf of a colleague, who is a little
puzzled as STATA and R seem to be yielding different survival estimates
for the same dataset when treating a variable as ordinal (ordered() is
used to represent an ordinal variable). I understand that R's coxph (by
default) uses the Efron approximation, whereas STATA uses (by default)
the Breslow, but we did compare using the same approximations. I am
wondering if this is a result of how coxph manages an ordered factor?

Essentially, this is a survival dataset using tumor grade (1, 2, 3 and
4) as the risk factor. This is more of an 'ordinal' variable than a
continuous variable. For the same dataset of 399 patients, when
treating the vector of tumor grade as a continuous variable (range of 1
to 4), testing the Efron and the Breslow approximations yields the same
result in both R and STATA.

However, when Hist_Grade_4grp is converted into an ordered factor using
ordered(), and the same scripts are applied, rather different results
are obtained, relative to the STATA output. This was tested across the
different approximations, with consistent results. The comparison using
the Efron approximation and ordinal data is below.

Your advice is very much appreciated!

Min-Han

Apologies below for the slightly misaligned output.

STATA output

. xi: stcox i.Hist_Grade_4grp, efr
i.Hist_Grade_~p   _IHist_Grad_1-4   (naturally coded; _IHist_Grad_1 omitted)

         failure _d:  FFR_censor
   analysis time _t:  FFR_month

Iteration 0:   log likelihood =  -1133.369
Iteration 1:   log likelihood = -1129.4686
Iteration 2:   log likelihood = -1129.3196
Iteration 3:   log likelihood = -1129.3191
Refining estimates:
Iteration 0:   log likelihood = -1129.3191

Cox regression -- Efron method for ties

No. of subjects =          399                  Number of obs   =       399
No. of failures =          218
Time at risk    =  9004.484606
                                                LR chi2(3)      =      8.10
Log likelihood  =   -1129.3191                  Prob > chi2     =    0.0440

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_IHist_Gra~2 |   1.408166   .3166876     1.52   0.128     .9062001    2.188183
_IHist_Gra~3 |    1.69506   .3886792     2.30   0.021     1.081443    2.656847
_IHist_Gra~4 |   2.540278   .9997843     2.37   0.018      1.17455     5.49403


R output using

> summary( coxph( Surv(FFR_month, FFR_censor) ~ Hist_Grade_4grp, method = "breslow") )
> summary( coxph( Surv(FFR_month, FFR_censor) ~ Hist_Grade_4grp, method = "exact") )
> summary( coxph( Surv(FFR_month, FFR_censor) ~ Hist_Grade_4grp, method = "efron") )

  n=399 (21 observations deleted due to missingness)

                     coef exp(coef) se(coef)     z Pr(>|z|)
Hist_Grade_4grp.L 0.66685   1.94809  0.26644 2.503   0.0123 *
Hist_Grade_4grp.Q 0.03113   1.03162  0.20842 0.149   0.8813
Hist_Grade_4grp.C 0.08407   1.08771  0.13233 0.635   0.5252
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                  exp(coef) exp(-coef) lower .95 upper .95
Hist_Grade_4grp.L     1.948     0.5133    1.1556     3.284
Hist_Grade_4grp.Q     1.032     0.9693    0.6857     1.552
Hist_Grade_4grp.C     1.088     0.9194    0.8392     1.410

Rsquare= 0.02   (max possible= 0.997 )
Likelihood ratio test= 8.1  on 3 df,   p=0.044
Wald test            = 8.02 on 3 df,   p=0.0455
Score (logrank) test = 8.2  on 3 df,   p=0.04202
On Sep 8, 2010, at 6:43 PM, Min-Han Tan wrote:

> Apologies - I am posting on behalf of a colleague, who is a little
> puzzled as STATA and R seem to be yielding different survival
> estimates for the same dataset when treating a variable as ordinal.
> [...]
> However, when Hist_Grade_4grp is converted into an ordered factor
> using ordered(), and the same scripts are applied, rather different
> results are obtained, relative to the STATA output.
> [remainder of quoted message and output trimmed]

Are you sure you want an ordered factor? In R this means you will be
creating linear, quadratic and cubic contrasts. Notice the .L, .Q and
.C designations on the coefficients. That certainly does not look to be
comparable to what you are getting from Stata, whose xi: prefix built
treatment (dummy) contrasts with grade 1 omitted as the reference.

My suggestion would be to create an un-ordered factor in R and see
whether you get results more in line with Stata's output when applied
to your data.
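For example (a minimal sketch, untested on your data; the data-frame
name `dat` is an assumption on my part, the other names are taken from
your post):

library(survival)

## Unordered factor: treatment contrasts with grade 1 as the reference,
## the same parameterization as Stata's "xi: stcox i.Hist_Grade_4grp"
fit <- coxph(Surv(FFR_month, FFR_censor) ~ factor(Hist_Grade_4grp),
             data = dat, method = "efron")   # dat is assumed
summary(fit)  # one hazard ratio each for grades 2, 3 and 4 vs grade 1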
--
David Winsemius, MD
West Hartford, CT
Run it with factor() instead of ordered(). You don't want the
"orthogonal polynomial" contrasts that result from ordered() if you
need to compare against Stata. To see the heart of the difference at a
glance, compare the two contrast matrices in the sketch below.

I attach an R program that I wrote to explore ordered factors a while
ago. I believe this will clear everything up if you study the examples.

pj

On Wed, Sep 8, 2010 at 5:43 PM, Min-Han Tan <minhan.science at gmail.com> wrote:
> Apologies - I am posting on behalf of a colleague, who is a little
> puzzled as STATA and R seem to be yielding different survival
> estimates for the same dataset when treating a variable as ordinal.
> [remainder of quoted message and output trimmed]
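(A minimal illustration, separate from the attached program; g here is
a hypothetical stand-in for the Hist_Grade_4grp vector.)

g <- 1:4

## Treatment contrasts: dummy variables with level 1 as the reference,
## which is what Stata's "xi: i.varname" produces
contrasts(factor(g))

## Orthogonal polynomial contrasts: the .L, .Q and .C columns that show
## up in the coxph coefficient table for an ordered factor
contrasts(ordered(g))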
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
I look at this question from a different angle. My understanding is:

1. If you treat tumor_grade as a numerical variable, you assume the
hazard ratio is constant between any two adjacent levels of tumor grade
(holding the covariate patterns of the other risks fixed);

2. If you treat tumor_grade as a factor, you do not assume a constant
hazard ratio between adjacent levels (again holding the covariate
patterns of the other risks fixed).

Used as a predictor, I don't see a difference between a continuous risk
and a discrete ordinal risk in this case. If tumor grade were the
response, the variable type (ordinal, continuous or nominal) would
matter, since that information is used to select an appropriate model.
The constant-hazard-ratio assumption in point 1 can also be tested
directly; see the sketch below.
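(A minimal sketch of that test, using the thread's variable names and
an assumed data frame called `dat`; the numeric coding is nested in the
factor coding, so the two fits can be compared by a likelihood ratio
test.)

library(survival)

## Numeric coding: a single coefficient, so exp(coef) is the hazard
## ratio between *any* two adjacent grades
fit_num <- coxph(Surv(FFR_month, FFR_censor) ~ Hist_Grade_4grp,
                 data = dat)   # dat is assumed

## Factor coding: a separate hazard ratio for each grade vs grade 1
fit_fac <- coxph(Surv(FFR_month, FFR_censor) ~ factor(Hist_Grade_4grp),
                 data = dat)

## LRT on 2 df: is the extra flexibility of the factor coding needed?
anova(fit_num, fit_fac)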
Hi, everybody

David said my R code text attachment got rejected by the mailing list.
Pooh. I don't think that's nice. I don't see anything in the posting
guide about a limit on text attachments.

Well, if you are still trying to understand what 'orthogonal
polynomial' means, I suggest you run the following through. I thought
it was an enlightening experience.

# Paul Johnson <pauljohn at ku.edu> Nov. 16, 2005
# Ordinal predictors with a small number of possible values.
# Here is R code and commentary about ordinal predictors.
# Please let me know if you have insights to explain this more clearly.

set.seed(199999)
sampleSize <- 100
subgroupSize <- sampleSize/5

# One of those "5 point opinion" questions: Strongly Disagree ...
# Strongly Agree, with values assigned 1, 3, 5, 7, 9
surveyQuestion <- c(rep(1, subgroupSize), rep(3, subgroupSize),
                    rep(5, subgroupSize), rep(7, subgroupSize),
                    rep(9, subgroupSize))

### Make this an unordered factor
myufac <- factor(surveyQuestion)
### Study the contrasts:
contrasts(myufac)

### Make an ordered factor
myofac <- ordered(surveyQuestion)
contrasts(myofac)

# another independent variable
x <- rnorm(sampleSize)
# a random error
e <- rnorm(sampleSize)

# First, suppose the output data is really just a linear reflection of
# surveyQuestion. It is created by multiplying against the evenly
# spaced values observed in the survey.
y1 <- -5 + 4*surveyQuestion + 6*x + 10*e

mod0 <- lm(y1 ~ x + surveyQuestion)
summary(mod0)

# Question: are you making a mistake by just "throwing" surveyQuestion
# into the analysis? You should not be making a mistake, because you
# (luckily) guessed the right model.

# You might check by running the model with the unordered factor,
# which amounts to running "dummy variables":
mod1 <- lm(y1 ~ x + myufac)
summary(mod1)

# or with the ordered factor, which estimates the orthogonal
# polynomials:
mod2 <- lm(y1 ~ x + myofac)
summary(mod2)

# Does either of those models appear to be "better" than the original
# mod0?

# If I did not know for sure the factor was ordered, I'd be stuck with
# the treatment contrasts in mod1. And what I would really like to know
# is this: are the coefficients in mod1 "stepping up" in a clear,
# ordinal, evenly spaced pattern? Since we know the data is created
# with a coefficient of 4, we would expect the coefficients to "step
# up" like this:
#   myufac3    8
#   myufac5   16
#   myufac7   24
#   myufac9   32
# See why we expect this pattern? The intercept "gobbles up" myufac1,
# so each "impact" of surveyQuestion has to be reduced by 4 units. The
# impact of myufac3, which you might expect to be 3*4 = 12, is reduced
# to 3*4 - 4 = 8, and so forth.

# But that does not happen with a smallish sample. You can run this
# code a few times. It appears to me that a sample of more than 10,000
# is necessary to get any faithful reproduction of the "steps" between
# values of surveyQuestion.

# Question: would we mistakenly think that the unevenness observed in
# summary(mod1) is evidence of the need to treat surveyQuestion as a
# nominal factor, even though we know (since we created the data) that
# we might as well just throw surveyQuestion in as a single variable?
# How to decide?
# We need a hypothesis test of the conjecture that the coefficient
# estimates in mod1 fall "along a line", i.e. myufac-i = (b2 * i) - b2.
# I believe that the correct test results from this command:
anova(mod0, mod1, test = "Chisq")

# If that is significant, it means you are losing predictive power by
# throwing in surveyQuestion as if it were a numerical variable.

# Now, what if we are sure that the data gathered in surveyQuestion is
# really ordinal, and so we estimate mod2? What do those parameters
# mean? Here I'm just reasoning/guessing, because I cannot find any
# complete idiot's guide to orthogonal polynomials and their use in
# regression and/or R.

# Take a look at the contrasts themselves:
# > ord1 <- contrasts(myofac)
# > ord1
#              .L         .Q            .C         ^4
# 1 -6.324555e-01  0.5345225 -3.162278e-01  0.1195229
# 3 -3.162278e-01 -0.2672612  6.324555e-01 -0.4780914
# 5 -3.287978e-17 -0.5345225  1.595204e-16  0.7171372
# 7  3.162278e-01 -0.2672612 -6.324555e-01 -0.4780914
# 9  6.324555e-01  0.5345225  3.162278e-01  0.1195229

# What does this mean? I believe: we act "as though" there are 4
# independent variables in the regression, L, Q, C, and ^4, and for
# each respondent in the dataset we choose a row of these numbers to
# act as independent variables. A person who answered 1 on the survey
# would have the input values (-0.63, 0.53, -0.32, 0.12) for those four
# variables.

# This begs the question: what are L, Q, C, and ^4?

### This is tough to explain.
# Background: recall from calculus that any function can be
# approximated by a polynomial. Since surveyQuestion has only 5
# possible values, a polynomial of degree 4 is needed to "replace it"
# in a regression model. It can replace it EXACTLY, not just
# approximately. So we might fit
#
#   y1 <- a + b*x + c*surveyQuestion
#                 + d*surveyQuestion^2
#                 + e*surveyQuestion^3
#                 + f*surveyQuestion^4
#
# That would give a "direct test" of whether you need to worry about
# the coding of surveyQuestion. If d, e, and f are zero, then the
# inclusion of surveyQuestion as a linear term is the right way to go.
# If you get significant estimates on ^2, ^3, and/or ^4, then you would
# know it was a mistake to throw in surveyQuestion by itself.

# Here is the big problem: there would be *severe multicollinearity* in
# those estimates. The standard errors would be huge, and the variance
# of the estimated coefficients would be massive. That happens because
# the ^2, ^3 and ^4 variables are so nearly proportional to each other
# in many datasets.

# But there is a way to re-scale those terms so they are not
# multicollinear. So, instead of estimating the polynomial directly, we
# use the "rescaled" orthogonal polynomial values.

# Note that there is no correlation between these 4 columns of the
# orthogonal contrasts. They are "orthogonal polynomials", so there is
# absolutely no multicollinearity problem among them. Observe:
# > ord1 <- contrasts(myofac)
# > t(ord1[, 2]) %*% ord1[, 3]
#               [,1]
# [1,] -3.579222e-17
# That's very, very close to 0.0.

# We can do all of those multiplications at once; check the diagonal:
# > t(ord1) %*% ord1
#               .L            .Q            .C            ^4
# .L  1.000000e+00  4.710858e-17 -7.123208e-17  2.143332e-17
# .Q  4.710858e-17  1.000000e+00 -3.579222e-17 -3.916680e-17
# .C -7.123208e-17 -3.579222e-17  1.000000e+00  3.582678e-16
# ^4  2.143332e-17 -3.916680e-17  3.582678e-16  1.000000e+00

# That is as much ill-informed guessing as I can do right now about
# orthogonal polynomials.
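## (Addendum, not part of the original 2005 script: a quick check that
## mod1 (unordered factor) and mod2 (ordered factor) are the SAME model
## in two different parameterizations. The treatment and polynomial
## contrasts span the same column space, so the fits are identical even
## though the coefficient tables look completely different.)
all.equal(fitted(mod1), fitted(mod2))   # TRUE: identical fitted values
logLik(mod1)                            # identical log-likelihoods ...
logLik(mod2)                            # ... so identical overall tests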
# Anyway, if you run the regression mod2 and the higher-order terms are
# not significant, then it is a hint that your coding is OK. That is,
# just "throw in" surveyQuestion. And I believe a rigorous hypothesis
# test is obtained by:
anova(mod0, mod2, test = "Chisq")

# What's the good news? The hypothesis test result is EXACTLY THE SAME
# whether we test the ordinal or the nominal coding. Whew! Not such a
# big issue, then, whether we do factor or ordered when deciding if we
# can just "throw" surveyQuestion into the model.

# Now change the problem so that the "real data" is not created by
# multiplying against the pure, unadulterated surveyQuestion. Now the
# ordinal impact of the "real effect" is not reflected perfectly by the
# data of surveyQuestion.
surveyImpact <- NA
surveyImpact[surveyQuestion == 1] <- 0
surveyImpact[surveyQuestion == 3] <- 5
surveyImpact[surveyQuestion == 5] <- 6
surveyImpact[surveyQuestion == 7] <- 9
surveyImpact[surveyQuestion == 9] <- 14

y2 <- -5 + 4*surveyImpact + 6*x + 10*e

# Proceed as if you were not wrong and "throw in" surveyQuestion.
mod3 <- lm(y2 ~ x + surveyQuestion)
summary(mod3)

# Question: are you making a mistake? If you are one of the people who
# says silly things like "my p values are good, so I must have the
# correct model," the results should be very, very sobering.
mod4 <- lm(y2 ~ x + myufac)
summary(mod4)

# or with the ordered factor, which estimates the orthogonal
# polynomials:
mod5 <- lm(y2 ~ x + myofac)
summary(mod5)

anova(mod3, mod4, test = "Chisq")
anova(mod3, mod5, test = "Chisq")

# Again, the happy result that the 2 significance tests are the same.
# Both tell you that you were just flat-out wrong to put surveyQuestion
# into the regression model.

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas