Pamela Foggia
2015-Jun-19 21:32 UTC
[R] Help with categorical predictors in regression models
Hello,

In my regression models (linear and logistic models) I have two predictor variables, both are categorical variables: DEGREE and REGION.

DEGREE is for educational level, that is an ordinal variable with five levels (0-LT HIGH SCHOOL, 1-HIGH SCHOOL, 2-JUNIOR COLLEGE, 3-BACHELOR, 4-GRADUATE).

REGION is for the region of the respondent, that is a nominal variable with 9 levels (1-NEW ENGLAND, 2-MIDDLE ATLANTIC, 3-E. NOR. CENTRAL, 4-W. NOR. CENTRAL, 5-SOUTH ATLANTIC, 6-E. SOU. CENTRAL, 7-W. SOU. CENTRAL, 8-MOUNTAIN, 9-PACIFIC).

In many examples I read that, in order to use these predictors correctly as categorical variables, I first have to apply the FACTOR function, for example in this way

    fit1 <- lm(Z ~ factor(X) + factor(Y))
    fit2 <- glm(W ~ factor(X) + factor(Y), family=binomial(link="logit"))

obtaining the following output for the logistic regression

                      coef.est coef.se
    (Intercept)          1.027   0.263
    factor(DEGREE)1      0.301   0.134
    factor(DEGREE)2      0.340   0.211
    factor(DEGREE)3      0.748   0.168
    factor(DEGREE)4      1.267   0.237
    ...

where clearly Z is a continuous variable and W is a binary variable.

My question is: as far as the ordinal variable X is concerned, would it be more correct to use the ORDERED function rather than FACTOR? I mean an operation like this

    fit1 <- lm(Z ~ ordered(X) + factor(Y))
    fit2 <- glm(W ~ ordered(X) + factor(Y), family=binomial(link="logit"))

where I obtain a different output like this

                        coef.est coef.se
    (Intercept)            1.558   0.241
    ordered(DEGREE).L      0.942   0.157
    ordered(DEGREE).Q      0.215   0.160
    ordered(DEGREE).C      0.118   0.111
    ordered(DEGREE)^4     -0.106   0.143
    ...

What do the letters L, Q, C and the power ^4 (which I find in the output) mean?

Thanks in advance

[[alternative HTML version deleted]]
David Winsemius
2015-Jun-20 05:05 UTC
[R] Help with categorical predictors in regression models
On Jun 19, 2015, at 2:32 PM, Pamela Foggia wrote:

> Hello,
> In my regression models (linear and logistic models) I have two predictor
> variables, both are categorical variables: DEGREE and REGION.
>
> DEGREE is for educational level, that is an ordinal variable with five
> levels (0-LT HIGH SCHOOL, 1-HIGH SCHOOL, 2-JUNIOR COLLEGE, 3-BACHELOR,
> 4-GRADUATE).
>
> REGION is for the region of the respondent, that is a nominal variable with
> 9 levels (1-NEW ENGLAND, 2-MIDDLE ATLANTIC, 3-E. NOR. CENTRAL, 4-W. NOR.
> CENTRAL, 5-SOUTH ATLANTIC, 6-E. SOU. CENTRAL, 7-W. SOU. CENTRAL,
> 8-MOUNTAIN, 9-PACIFIC).
>
> In many examples I read that, in order to use these predictors correctly as
> categorical variables, I first have to apply the FACTOR function,

Please do _not_ capitalize the `factor` function name. R is _not_ SAS.

> for example in this way
>
> fit1 <- lm(Z ~ factor(X) + factor(Y))
> fit2 <- glm(W ~ factor(X) + factor(Y), family=binomial(link="logit"))
>
> obtaining the following output for the logistic regression
>
>                   coef.est coef.se
> (Intercept)          1.027   0.263
> factor(DEGREE)1      0.301   0.134
> factor(DEGREE)2      0.340   0.211
> factor(DEGREE)3      0.748   0.168
> factor(DEGREE)4      1.267   0.237
> ...
>
> where clearly Z is a continuous variable and W is a binary variable.
>
> My question is: as far as the ordinal variable X is concerned, would it be
> more correct to use the ORDERED function rather than FACTOR?

That really would depend on the hypotheses under consideration, wouldn't it?

> I mean an operation like this
>
> fit1 <- lm(Z ~ ordered(X) + factor(Y))
> fit2 <- glm(W ~ ordered(X) + factor(Y), family=binomial(link="logit"))
>
> where I obtain a different output like this
>
>                     coef.est coef.se
> (Intercept)            1.558   0.241
> ordered(DEGREE).L      0.942   0.157
> ordered(DEGREE).Q      0.215   0.160
> ordered(DEGREE).C      0.118   0.111
> ordered(DEGREE)^4     -0.106   0.143
> ...

Clearly that output does not match the regression call.

> What do the letters L, Q, C and the power ^4 (which I find in the output)
> mean?

The default contrasts for an ordered factor are the orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of the factor ("degree" here has nothing to do with your factor name). The suffixes .L, .Q, .C and ^4 label the linear, quadratic, cubic and fourth-degree terms. If this doesn't make sense, then you need to do further research to improve your understanding of polynomial contrasts. (They are messy.) You can limit the contrasts to only a linear "degree". You can find further information regarding polynomial contrasts at ?contr.poly and ?C

It's possible that this will be helpful:

    DEGREE <- C( DEGREE, poly, 1)   # Only linear contrast
    fit2 <- glm(W ~ DEGREE + factor(REGION), family=binomial(link="logit"))

And refrain from then using `ordered` in the formula.

-- 
David.

> Thanks in advance
>
> [[alternative HTML version deleted]]

This is a mailing list that requests plain text. It's not hard to do in gmail. Please read ....

> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA
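
For readers following the thread, here is a minimal sketch of what the reply describes. It uses simulated data, since the original data set is not part of the posts. The seed, the simulated values, and the names cp, fit_f, fit_o, fit_l and DEGREE_lin are illustrative assumptions. The sketch shows where the .L, .Q, .C and ^4 labels come from, that factor() and ordered() give the same fitted model under different coefficient codings, and how C() keeps only the linear contrast.

```r
# Simulated stand-in data: values and effect sizes are hypothetical.
set.seed(1)
n <- 500
DEGREE <- sample(0:4, n, replace = TRUE)   # 5 ordered education levels
REGION <- sample(1:9, n, replace = TRUE)   # 9 nominal regions
W <- rbinom(n, 1, plogis(-0.5 + 0.3 * DEGREE))

# Polynomial contrasts for a 5-level ordered factor.
# The column names are the suffixes seen in the model output.
cp <- contr.poly(5)
colnames(cp)                 # ".L" ".Q" ".C" "^4"
round(crossprod(cp), 10)     # identity matrix: the columns are orthonormal

# Same model, two codings of DEGREE: treatment vs polynomial contrasts.
fit_f <- glm(W ~ factor(DEGREE)  + factor(REGION),
             family = binomial(link = "logit"))
fit_o <- glm(W ~ ordered(DEGREE) + factor(REGION),
             family = binomial(link = "logit"))

# The coefficients differ, but the fit itself does not.
all.equal(deviance(fit_f), deviance(fit_o))
all.equal(fitted(fit_f), fitted(fit_o))

# Keep only the linear contrast, as suggested in the reply.
DEGREE_lin <- C(DEGREE, poly, 1)
fit_l <- glm(W ~ DEGREE_lin + factor(REGION),
             family = binomial(link = "logit"))
names(coef(fit_l))

# One way to check whether the higher-order terms are needed:
# a likelihood-ratio test of the linear-only model against the full one.
anova(fit_l, fit_o, test = "Chisq")
```

Because the two codings span the same model space, choosing between factor() and ordered() changes which hypotheses the individual coefficients test, not the overall fit. That is the sense in which the choice depends on the hypotheses under consideration.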