Paul Johnson
2005-Nov-08 19:41 UTC
[R] Need advice about models with ordinal input variables
Dear colleagues:

I've been storing up this question for a long time and apologize for the length and verbosity of it. I am having trouble in consulting with graduate students on their research projects. They are using surveys to investigate the sources of voter behavior or attitudes. They have predictors that are factors, some ordered, but I am never confident in telling them what they ought to do. Usually, they come in with a regression model fitted as though these variables are numerical, and if one looks about in the social science literature, one finds that many people have published doing the same.

I want to ask your advice about some cases.

1. An ordered factor that masquerades as a numerical "interval level" score.

In the research journals, these are the ones most often treated as numerical variables in regressions. For example: "Thermometer scores" for Presidential candidates range from 0 to 100 in integer units.

What's a better idea? In an OLS model with just one input variable, a plot will reveal if there is a significant "nonlinearity". One can recode the assigned values to linearize the final model or take the given values and make a nonlinear model.

In the R package "acepack" I found avas, which works like a "rubber ruler" and recodes variables in order to make relationships as linear and homoskedastic as possible. I've never seen this used in the social science literature. It works like magic. Take an ugly scatterplot and shazam, out come transformed variables that have a beautiful plot. But what do you make of these things? There is so much going on in these transformations that interpretation of the results is very difficult. You can't say "a one unit increase in x causes a b increase in y". Furthermore, if the model is a survival model, a logistic regression, or other non-OLS model, I don't see how the avas approach will help.

I've tried fiddling about with smoothers, treating the input scores as if they were numerical. I got this idea from Prof. Harrell's Regression Modeling Strategies. In his Design package for R, one can include a cubic spline for a variable in a model by replacing x with rcs(x). Very convenient. If the results say the relationship is mostly linear, then we might as well treat the input variable as a numerical score and save some degrees of freedom.

But if the higher order terms are statistically significant, it is difficult to know what to do. The best strategy I have found so far is to calculate fitted values for particular inputs and then try to tell a story about them.

2. Ordinal variables with fewer than 10 values.

Consider variables like self-reported ideology, where respondents are asked to place themselves on a 7 point scale ranging from "very conservative" to "very liberal". Or Party Identification on a 7 point scale, ranging (in the US) from "Strong Democrat" to "Strong Republican".

It has been quite common to see these thrown into regression models as if they were numerical.

I've sometimes found it useful to run a regression treating them as unordered factors, and then I attempt to glean a pattern in the coefficients. If the parameter estimates step up by a fixed proportion, then one might think there's no damage from treating them as numerical variables.

Yesterday, it occurred to us that there should be a significance test to determine if one loses predictive power by replacing the factor-treatment of x with x itself. Is there a non-nested model test that is most appropriate?

3. Truly numerical variables that are reported as "grouped" ordinal scales. These variables are awful in many ways.

Income is often reported in a form like this:

1) Less than 20000
2) 20000 to 35000
3) 35001 to 50000
4) 50001 to 100000
5) above 100000

Education often appears in a form that has

1) 8 years or less
2) 9 years
3) 10 years
4) 11 years
5) 12 years
6) some college completed
7) undergraduate degree completed
8) graduate degree completed

These predictors pose many problems. We have dissimilar people grouped together, so there are "errors in variables" and it seems obvious that the scores should be recoded somehow to reflect the substance of the differences among groups. But how?

4. Ordered variables with a small number of scores.

For example, "has your economic situation been
1) worse
2) same
3) better"

or "how do you feel when you see the American flag?"
1) no effect
2) OK
3) great
4) ecstatic

Anyway, in an R model, I think the right thing to do is to enter them into a regression with as.ordered(x).

But I don't know what to say about the results. Has anybody written an "idiot's guide to orthogonal polynomials"? Aside from calculating fitted values, how do you interpret these things? Is there ever a point when you would say "we should treat that as a numerical variable with scores 1-2-3-4" rather than as an ordered factor?

If you have advice, I would be delighted to hear it.

--
Paul E. Johnson                       email: pauljohn at ku.edu
Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas                  Office: (785) 864-9086
Lawrence, Kansas 66044-3177           FAX: (785) 864-5700
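[Editorial sketch, not part of the original post.] For readers wondering what as.ordered(x) does inside a model in case 4, the following is a minimal illustration on simulated data. The variable names (score, flag, y) and the data-generating numbers are assumptions made up for this example. It shows that R codes an ordered factor with orthogonal polynomial contrasts (contr.poly), so the coefficients are labelled .L, .Q and .C, and that the fitted values are the same as for an unordered factor; only the parameterization differs.

```r
# Simulated 4-point item; all names and numbers here are hypothetical.
set.seed(2)
n     <- 200
score <- sample(1:4, n, replace = TRUE)
y     <- 1 + 0.8 * score + rnorm(n)

# Ordered factor: R uses contr.poly() for these by default.
flag <- factor(score, levels = 1:4,
               labels = c("no effect", "OK", "great", "ecstatic"),
               ordered = TRUE)

# The contrast matrix: columns are linear, quadratic, cubic trends.
C <- contr.poly(4)
print(round(C, 3))

# Columns are orthonormal, so C'C is the identity matrix.
print(round(crossprod(C), 10))

fit_ord <- lm(y ~ flag)            # coefficients flag.L, flag.Q, flag.C
fit_fac <- lm(y ~ factor(score))   # ordinary dummy (treatment) coding

print(coef(summary(fit_ord)))

# Same model space, different parameterization: fitted values agree.
print(all.equal(unname(fitted(fit_ord)), unname(fitted(fit_fac))))
```

If only the .L coefficient is clearly different from zero, the group means lie roughly on a straight line in the scores 1-2-3-4. Otherwise the terms have to be read together, which in practice means looking at the fitted group means.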
Dear Paul,

I'll try to answer these questions briefly.

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Paul Johnson
> Sent: Tuesday, November 08, 2005 2:42 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Need advice about models with ordinal input variables
>
> Dear colleagues:
>
> I've been storing up this question for a long time and
> apologize for the length and verbosity of it. I am having
> trouble in consulting with graduate students on their
> research projects. They are using surveys to investigate the
> sources of voter behavior or attitudes. They have predictors
> that are factors, some ordered, but I am never confident in
> telling them what they ought to do. Usually, they come in
> with a regression model fitted as though these variables are
> numerical, and if one looks about in the social science
> literature, one finds that many people have published doing the same.
>
> I want to ask your advice about some cases.
>
> 1. An ordered factor that masquerades as a numerical
> "interval level" score.
>
> In the research journals, these are the ones most often
> treated as numerical variables in regressions. For example:
> "Thermometer scores"
> for Presidential candidates range from 0 to 100 in integer units.
>

In my opinion, such scales aren't even really measurements, but they have a prima-facie reasonableness, and I wouldn't hesitate to use them in the absence of something better.

> What's a better idea? In an OLS model with just one input
> variable, a plot will reveal if there is a significant
> "nonlinearity". One can recode the assigned values to
> linearize the final model or take the given values and make a
> nonlinear model.
>
> In the R package "acepack" I found avas, which works like a
> "rubber ruler" and recodes variables in order to make
> relationships as linear and homoskedastic as possible. I've
> never seen this used in the social science literature. It
> works like magic. Take an ugly scatterplot and shazam, out
> come transformed variables that have a beautiful plot. But
> what do you make of these things? There is so much going on
> in these transformations that interpretation of the results
> is very difficult.
> You can't say "a one unit increase in x causes a b increase in y".
> Furthermore, if the model is a survival model, a logistic
> regression, or other non-OLS model, I don't see how the avas
> approach will help.
>

As a general matter, the central issue seems to me to be the functional form of the relationship between the variables, whether or not X really has a unit of measurement. If you use a nonparametric fit (such as avas) or even if you use a complex parametric fit, such as a regression (rather than smoothing) spline, then why not think of the graph of the fit as the best description?

> I've tried fiddling about with smoothers, treating the input
> scores as if they were numerical. I got this idea from Prof.
> Harrell's Regression Modeling Strategies. In his Design
> package for R, one can include a cubic spline for a variable
> in a model by replacing x with rcs(x). Very convenient. If
> the results say the relationship is mostly linear, then we
> might as well treat the input variable as a numerical score
> and save some degrees of freedom.
>
> But if the higher order terms are statistically significant,
> it is difficult to know what to do. The best strategy I have
> found so far is to calculate fitted values for particular
> inputs and then try to tell a story about them.
>

This seems reasonable, though I'd usually prefer a graph to a table.

>
> 2. Ordinal variables with fewer than 10 values.
>
> Consider variables like self-reported ideology, where
> respondents are asked to place themselves on a 7 point scale
> ranging from "very conservative" to "very liberal". Or Party
> Identification on a 7 point scale, ranging (in the US) from
> "Strong Democrat" to "Strong Republican".
>
> It has been quite common to see these thrown into regression
> models as if they were numerical.
>
> I've sometimes found it useful to run a regression treating
> them as unordered factors, and then I attempt to glean a
> pattern in the coefficients. If the parameter estimates step
> up by a fixed proportion, then one might think there's no
> damage from treating them as numerical variables.
>

Since linear (and other similar) models don't make distributional assumptions about the X's (other than independence from the errors), nothing in principle changes.

> Yesterday, it occurred to us that there should be a
> significance test to determine if one loses predictive power
> by replacing the factor-treatment of x with x itself. Is
> there a non-nested model test that is most appropriate?
>

Actually, the models are nested, so, e.g., for a linear model, you could do an incremental F-test. That is, a linear relationship is a special case of any relationship at all, which is what you get by treating X as a factor.

> 3. Truly numerical variables that are reported as "grouped"
> ordinal scales. These variables are awful in many ways.
>
> Income is often reported in a form like this:
>
> 1) Less than 20000
> 2) 20000 to 35000
> 3) 35001 to 50000
> 4) 50001 to 100000
> 5) above 100000
>
> Education often appears in a form that has
> 1) 8 years or less
> 2) 9 years
> 3) 10 years
> 4) 11 years
> 5) 12 years
> 6) some college completed
> 7) undergraduate degree completed
> 8) graduate degree completed
>
> These predictors pose many problems. We have dissimilar
> people grouped together, so there are "errors in variables"
> and it seems obvious that the scores should be recoded
> somehow to reflect the substance of the differences among
> groups. But how?
>

I don't see a difference in principle here, although if the intervals are very wide, there is, as you suggest, a measurement-error problem. Since the information is irretrievably lost at the point of data collection, there isn't much to be done.

>
> 4. Ordered variables with a small number of scores.
>
> For example, "has your economic situation been
> 1) worse
> 2) same
> 3) better"
>
> or "how do you feel when you see the American flag?"
> 1) no effect
> 2) OK
> 3) great
> 4) ecstatic
>
> Anyway, in an R model, I think the right thing to do is to
> enter them into a regression with as.ordered(x).
>

That seems reasonable. Again, I don't see this as special. Treating the variable as ordered would allow you to look at linear, quadratic, etc., terms.

> But I don't know what to say about the results. Has anybody
> written an "idiot's guide to orthogonal polynomials"? Aside
> from calculating fitted values, how do you interpret these
> things? Is there ever a point when you would say "we should
> treat that as a numerical variable with scores 1-2-3-4"
> rather than as an ordered factor?
>

If the linear term were the only important one, then that would be equivalent to saying that you could treat the variable as a numerical score. Otherwise you should combine the terms to interpret them, as is generally the case when regressors are related by marginality.

I hope this helps,
 John
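[Editorial sketch, not part of the original reply.] The incremental F-test mentioned above for case 2 can be illustrated in base R with anova() on two nested lm() fits. The data are simulated and the names (x, y, fit_linear, fit_factor) are assumptions for this example only. The numeric-score model is the restricted one; the factor model places no constraint on the seven group means.

```r
# Simulated 7-point scale; the true relationship here is linear by design.
set.seed(1)
n <- 300
x <- sample(1:7, n, replace = TRUE)   # hypothetical ideology-type score
y <- 0.5 * x + rnorm(n)

fit_linear <- lm(y ~ x)           # restricted: one slope for the scores
fit_factor <- lm(y ~ factor(x))   # general: a separate mean per category

# Incremental F-test of the linear restriction against the factor model.
# With 7 categories the factor model spends 5 extra degrees of freedom.
cmp <- anova(fit_linear, fit_factor)
print(cmp)
```

A large p-value in the second row means the data give little evidence against treating x as a numerical score. A small one says the category means depart from a straight line, and the factor coefficients (or a plot of the fitted means) are the better description.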