I have a variable which is roughly age categories in decades. In the original data, it came in coded:> str(xxx)'data.frame': 58271 obs. of 29 variables: $ issuecat : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1 1 1... snip I then defined issuecat as ordered:> xxx$issuecat<-as.ordered(xxx$issuecat)When I include issuecat in a glm model, the result makes me think I have asked R for a linear+quadratic+cubic+quartic polynomial fit. The results are not terribly surprising under that interpretation, but I was hoping for only a linear term (which I was taught to called a "test of trend"), at least as a starting point.> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson") > summary(age.mdl)Call: glm(formula = actual ~ issuecat, family = "poisson", data = xxx) Deviance Residuals: Min 1Q Median 3Q Max -0.3190 -0.2262 -0.1649 -0.1221 5.4776 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.31321 0.04865 -88.665 <2e-16 *** issuecat.L 2.12717 0.13328 15.960 <2e-16 *** issuecat.Q -0.06568 0.11842 -0.555 0.579 issuecat.C 0.08838 0.09737 0.908 0.364 issuecat^4 -0.02701 0.07786 -0.347 0.729 This also means my advice to a another poster this morning may have been misleading. I have tried puzzling out what I don't understand by looking at indices or searching in MASSv2, the Blue Book, Thompson's application of R to Agresti's text, and the FAQ, so far without success. What I would like to achieve is having the lowest age category be a reference category (with the intercept being the log-rate) and each succeeding age category be incremented by 1. The linear estimate would be the log(risk-ratio) for increasing ages. I don't want the higher order polynomial estimates. Am I hoping for too much? -- David Winsemius using R 2.6.1 in WinXP
David Winsemius <dwinsemius at comcast.net> wrote in news:Xns9A1CC05755274dNOTwinscomcast at 80.91.229.13:> I have a variable which is roughly age categories in decades. In the > original data, it came in coded: >> str(xxx) > 'data.frame': 58271 obs. of 29 variables: > $ issuecat : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1 1 > 1... > snip > > I then defined issuecat as ordered: >> xxx$issuecat<-as.ordered(xxx$issuecat) > > When I include issuecat in a glm model, the result makes me think I > have asked R for a linear+quadratic+cubic+quartic polynomial fit. > The results are not terribly surprising under that interpretation, > but I was hoping for only a linear term (which I was taught to > call a "test of trend"), at least as a starting point. > >> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson") >> summary(age.mdl) > > Call: > glm(formula = actual ~ issuecat, family = "poisson", data = xxx) > > Deviance Residuals: > Min 1Q Median 3Q Max > -0.3190 -0.2262 -0.1649 -0.1221 5.4776 > > Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) -4.31321 0.04865 -88.665 <2e-16 *** > issuecat.L 2.12717 0.13328 15.960 <2e-16 *** > issuecat.Q -0.06568 0.11842 -0.555 0.579 > issuecat.C 0.08838 0.09737 0.908 0.364 > issuecat^4 -0.02701 0.07786 -0.347 0.729 > > This also means my advice to a another poster this morning may have > been misleading. I have tried puzzling out what I don't understand > by looking at indices or searching in MASSv2, the Blue Book, > Thompson's application of R to Agresti's text, and the FAQ, so far > without success. What I would like to achieve is having the lowest > age category be a reference category (with the intercept being the > log-rate) and each succeeding age category be incremented by 1. The > linear estimate would be the log(risk-ratio) for increasing ages. I > don't want the higher order polynomial estimates. Am I hoping for > too much? >I acheived what I needed by:> xxx$agecat<-as.numeric(xxx$issuecat) > xxx$agecat<-xxx$agecat-1The results look quite sensible:> exp.mdl<-glm(actual~gendercat+agecat+smokecat, data=xxx,family="poisson", offset=expected)> summary(exp.mdl)Call: glm(formula = actual ~ gendercat + agecat + smokecat, family = "poisson", data = xxx, offset = expected) Deviance Residuals: Min 1Q Median 3Q Max -0.5596 -0.2327 -0.1671 -0.1199 5.2386 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.89410 0.11009 -53.539 < 2e-16 *** gendercatMale 0.29660 0.06426 4.615 3.92e-06 *** agecat 0.66143 0.02958 22.360 < 2e-16 *** smokecatSmoker 0.22178 0.07870 2.818 0.00483 ** smokecatUnknown 0.02378 0.08607 0.276 0.78233 I remain curious about how to correctly control ordered factors, or I should just simply avoid them. -- David Winsemius
David Winsemius wrote:> I have a variable which is roughly age categories in decades. In the > original data, it came in coded: >> str(xxx) > 'data.frame': 58271 obs. of 29 variables: > $ issuecat : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1 1 1... > snip > > I then defined issuecat as ordered: >> xxx$issuecat<-as.ordered(xxx$issuecat) > > When I include issuecat in a glm model, the result makes me think I > have asked R for a linear+quadratic+cubic+quartic polynomial fit. The > results are not terribly surprising under that interpretation, but I > was hoping for only a linear term (which I was taught to called a "test > of trend"), at least as a starting point. > >> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson") >> summary(age.mdl) > > Call: > glm(formula = actual ~ issuecat, family = "poisson", data = xxx) > > Deviance Residuals: > Min 1Q Median 3Q Max > -0.3190 -0.2262 -0.1649 -0.1221 5.4776 > > Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) -4.31321 0.04865 -88.665 <2e-16 *** > issuecat.L 2.12717 0.13328 15.960 <2e-16 *** > issuecat.Q -0.06568 0.11842 -0.555 0.579 > issuecat.C 0.08838 0.09737 0.908 0.364 > issuecat^4 -0.02701 0.07786 -0.347 0.729 > > This also means my advice to a another poster this morning may have > been misleading. I have tried puzzling out what I don't understand by > looking at indices or searching in MASSv2, the Blue Book, Thompson's > application of R to Agresti's text, and the FAQ, so far without > success. What I would like to achieve is having the lowest age category > be a reference category (with the intercept being the log-rate) and > each succeeding age category be incremented by 1. The linear estimate > would be the log(risk-ratio) for increasing ages. I don't want the > higher order polynomial estimates. Am I hoping for too much?David, What you are seeing is the impact of using ordered factors versus unordered factors. Reading ?options, you will note: contrasts: the default contrasts used in model fitting such as with aov or lm. A character vector of length two, the first giving the function to be used with unordered factors and the second the function to be used with ordered factors. By default the elements are named c("unordered", "ordered"), but the names are unused. The default in R (which is not the same as S-PLUS) is: > options("contrasts") $contrasts unordered ordered "contr.treatment" "contr.poly" Thus, note that when using ordered factors, the default handling of factors is contr.poly. Reading ?contrast, you will note: contr.poly returns contrasts based on orthogonal polynomials. To show a quick and dirty example from ?glm: counts <- c(18,17,15,20,10,20,25,13,12) outcome <- gl(3,1,9) # First, the default with outcome as an unordered factor: > summary(glm(counts ~ outcome, family=poisson())) Call: glm(formula = counts ~ outcome, family = poisson()) Deviance Residuals: Min 1Q Median 3Q Max -0.9666 -0.6712 -0.1696 0.8472 1.0494 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.0445 0.1260 24.165 <2e-16 *** outcome2 -0.4543 0.2022 -2.247 0.0246 * outcome3 -0.2930 0.1927 -1.520 0.1285 ... # Now using outcome as an ordered factor: > summary(glm(counts ~ as.ordered(outcome), family=poisson())) Call: glm(formula = counts ~ as.ordered(outcome), family = poisson()) Deviance Residuals: Min 1Q Median 3Q Max -0.9666 -0.6712 -0.1696 0.8472 1.0494 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.7954 0.0831 33.640 <2e-16 *** as.ordered(outcome).L -0.2072 0.1363 -1.520 0.1285 as.ordered(outcome).Q 0.2513 0.1512 1.662 0.0965 . ... Unfortunately, MASSv2 is the only one of the four editions that I do not have for some reason. In MASSv4, this is covered starting on page 146. This is also covered in an Intro to R, in section 11.1.1 on contrasts. For typical clinical applications, the default treatment contrasts are sufficient, whereby the first level of the factor is considered the reference level and all others are compared against it. Thus, using unordered factors is more common, at least in my experience and likely the etiology of the difference between S-PLUS and R in this regard. HTH, Marc Schwartz