Tirthadeep
2007-Jun-12 17:45 UTC
[R] Appropriate regression model for categorical variables
Dear users,

In my psychometric test I have applied logistic regression to my data. My data consist of 50 predictors (22 continuous and 28 categorical) plus a binary response.

Using glm() and stepAIC() I didn't get a satisfactory result, as the misclassification rate is too high. I think the categorical variables are responsible for this debacle. Some of them have more than 6 levels (one has 10 levels).

Please suggest a better regression model for this situation. If possible, please also suggest some articles.

Thanking you,
Tirtha
Robert A LaBudde
2007-Jun-12 18:08 UTC
[R] Appropriate regression model for categorical variables
At 01:45 PM 6/12/2007, Tirtha wrote:
>Dear users,
>In my psychometric test i have applied logistic regression on my data. My
>data consists of 50 predictors (22 continuous and 28 categorical) plus a
>binary response.
>
>Using glm(), stepAIC() i didn't get satisfactory result as misclassification
>rate is too high. I think categorical variables are responsible for this
>debacle. Some of them have more than 6 level (one has 10 level).
>
>Please suggest some better regression model for this situation. If possible
>you can suggest some article.

1. Usually, if a factor has many levels, there is a natural order to the levels. If so, consider fitting the factor as an ordered factor.

2. Break the factor levels into 2 or 3 groups that have some rational connection. Then fit the factor with a smaller number of levels. E.g., "race" might have levels "white", "black", "asian", "pacific", "Spanish surname", "other". Consider a change to "white", "nonwhite".

===============================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947
"Vere scire est per causas scire"
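[Editor's sketch] The two suggestions above might look roughly like this in R. All data here are simulated, and the variable names (satisfaction, race, y) are hypothetical and not taken from the original poster's data set. It is a minimal illustration under those assumptions, not a recommended analysis.

```r
# Simulated, purely illustrative data (names are hypothetical).
set.seed(1)
n <- 200
satisfaction <- sample(c("low", "medium", "high"), n, replace = TRUE)
race <- sample(c("white", "black", "asian", "pacific",
                 "Spanish surname", "other"), n, replace = TRUE)
y <- rbinom(n, 1, 0.5)

# Suggestion 1: declare the natural order of the levels.
sat_ord <- factor(satisfaction,
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)

# Suggestion 2: collapse six levels into two by assigning a named list
# to levels(); each list element maps old levels to one new level.
race2 <- factor(race)
levels(race2) <- list(
  white    = "white",
  nonwhite = c("black", "asian", "pacific", "Spanish surname", "other")
)

# Refit the logistic regression with the recoded factors.
fit <- glm(y ~ sat_ord + race2, family = binomial)
summary(fit)
```

Note that an ordered factor still uses one coefficient per level minus one by default (R switches to polynomial contrasts), so ordering alone does not shrink the model. Collapsing levels is what reduces the number of coefficients here.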
(Ted Harding)
2007-Jun-12 23:24 UTC
[R] Appropriate regression model for categorical variables
On 12-Jun-07 17:45:44, Tirthadeep wrote:
> Dear users,
> In my psychometric test i have applied logistic regression
> on my data. My data consists of 50 predictors (22 continuous
> and 28 categorical) plus a binary response.
>
> Using glm(), stepAIC() i didn't get satisfactory result as
> misclassification rate is too high. I think categorical
> variables are responsible for this debacle. Some of them have
> more than 6 level (one has 10 level).
>
> Please suggest some better regression model for this situation.
> If possible you can suggest some article.

I hope you have a very large number of cases in your data!

The minimal complexity of the 28 categorical variables compatible with your description is

   1 factor  at 10 levels
   2 factors at  7 levels
  25 factors at  2 levels

which corresponds to (2^25)*(7^2)*10 = 16441671680 ~= 1.6e10 distinct possible combinations of levels of the factors. Your true factors may have far more than this.

Unless you have more cases than this in your data, you are likely to fall into what is called "linear separation", in which the logistic regression will find a perfect predictor for your binary outcome. This perfect predictor may well not be unique (indeed, if you have only a few hundred cases there will probably be millions of them).

Therefore your logistic regression is likely to be meaningless.

I can only suggest that you consider very closely how to

a) reduce the numbers of levels in some of your factors, by coalescing levels together;

b) define new factors in terms of the old so as to reduce the total number of factors (which may include ignoring some factors altogether)

so that you end up with new categorical variables whose total number of possible combinations is much smaller than the number of cases in your data (say at most 1/5 of it).

In summary: you have too many explanatory variables.

Best wishes,
Ted.
-------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 13-Jun-07 Time: 00:23:49 ------------------------------ XFMail ------------------------------
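[Editor's sketch] The count in Ted's message can be reproduced in a few lines of R. The helper count_cells() is a hypothetical name introduced here for illustration. It multiplies the numbers of levels of all factor columns in a data frame, so the same check could be run on real data before fitting.

```r
# Minimal configuration consistent with the original post:
# 1 factor at 10 levels, 2 factors at 7 levels, 25 factors at 2 levels.
n_levels <- c(10, 7, 7, rep(2, 25))
n_cells  <- prod(n_levels)   # (2^25) * (7^2) * 10
format(n_cells, big.mark = ",")

# Hypothetical helper: number of possible level combinations
# over all factor columns of a data frame.
count_cells <- function(df) {
  lv <- vapply(Filter(is.factor, df), nlevels, integer(1))
  prod(lv)
}

# Tiny made-up example: a 2-level and a 3-level factor, one numeric column.
toy <- data.frame(a = factor(c("x", "y", "x")),
                  b = factor(c("u", "v", "w")),
                  z = 1:3)
count_cells(toy)
```

Comparing count_cells(mydata) with nrow(mydata), where mydata stands for the user's own data frame, gives a quick sense of how sparsely the cases cover the factor combinations.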
Moshe Olshansky
2007-Jun-14 01:36 UTC
[R] Appropriate regression model for categorical variables
Tirtha wrote:
>Dear users,
>In my psychometric test i have applied logistic regression on my data. My
>data consists of 50 predictors (22 continuous and 28 categorical) plus a
>binary response.
>
>Using glm(), stepAIC() i didn't get satisfactory result as misclassification
>rate is too high. I think categorical variables are responsible for this
>debacle. Some of them have more than 6 level (one has 10 level).
>
>Please suggest some better regression model for this situation. If possible
>you can suggest some article.
>
>thanking you.
>
>Tirtha

Hi Tirtha,

Are your categorical variables really categorical? What I mean is this: if your variable is a user's satisfaction level (0 for very unsatisfied, 1 for moderately unsatisfied, 2 for slightly unsatisfied, 4 for neutral, etc., finally 7 for very satisfied), then your variable is not really categorical (since 1 is closer to 3 than to 6). In that case, try what the other people have suggested.

However, if your variable is, say, the 50-th amino acid in a certain gene (with values of 1 for the first amino acid, 2 for the second one, ..., 20 for the 20-th one), then your variable is really categorical (you generally cannot say that amino acid 2 is much closer to amino acid 3 than to amino acid 17). In such a case I would try a classification method which can treat categorical variables or, alternatively, maybe regression trees (i.e. split on the values of the categorical variables and at each "node" find regression coefficients of the continuous variables).

Regards,
Moshe Olshansky
m_olshansky at yahoo.com
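[Editor's sketch] As one possible reading of the tree suggestion, the example below fits a classification tree with the rpart package (a recommended package that normally ships with R, assumed to be installed). The data are simulated and the names aa, x and y are hypothetical. Note that rpart fits a constant class in each leaf. It does not fit separate regressions on the continuous variables within each node, which is what the message describes, so this is a simpler stand-in rather than that exact method.

```r
library(rpart)

# Simulated data: a 10-level unordered factor and one continuous predictor.
set.seed(42)
n  <- 500
aa <- factor(sample(LETTERS[1:10], n, replace = TRUE))
x  <- rnorm(n)

# Made-up rule: the outcome depends on an arbitrary subset of levels,
# which a tree can isolate by splitting the levels into two groups.
y <- factor(ifelse(aa %in% c("B", "E", "H") | x > 1, "yes", "no"))
d <- data.frame(y, aa, x)

# Classification tree; splits on 'aa' partition its levels into two sets.
fit  <- rpart(y ~ aa + x, data = d, method = "class")
pred <- predict(fit, type = "class")

# In-sample misclassification rate (optimistic; use held-out data in practice).
mean(pred != d$y)
print(fit)
```

Because a tree split on an unordered factor groups its levels into two sets, no ordering of the levels needs to be assumed, which matches the amino-acid type of variable described above.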