Michael Haenlein
2012-May-08 10:29 UTC
[R] Regression with very high number of categorical variables
Dear all,

I would like to run a simple regression model y~x1+x2+x3+...

The problem is that I have a lot of independent variables (xi) -- around one hundred -- and that some of them are categorical with a lot of categories (like, for example, ZIP code). One straightforward way would be to (a) transform all categorical variables into 1/0 dummies and (b) enter all the variables into an lm model. But I'm not sure whether this is very efficient, especially since the analysis is exploratory in nature and I expect that many of the xi will have no significant impact on y.

Is there an R library that can handle such a setting? I have read about "Hierarchical Bayesian variance components models" that have been used with ZIP data (www.jstor.org/stable/10.2307/4129723), but I'm not sure to what extent there is a function in R to do that in a straightforward manner.

Thanks,

Michael
Bert Gunter
2012-May-08 14:24 UTC
[R] Regression with very high number of categorical variables
You have received no answer yet. I think this is largely because there is no simple answer.

1. You don't need to mess with dummy variables. R takes care of this itself. Please read up on how to do regression in R.

2. However, it may not work anyway: too many variables/categories for your data. Or it may work but produce nothing useful/sensible.

3. This sort of situation is subject-matter specific. I strongly recommend you seek local statistical help if you can.

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
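
To make points 1 and 2 concrete, here is a minimal sketch on simulated data. It is not the original poster's data or model: the names d, zip, x1, y, tiny and id are all hypothetical, and the sizes are chosen only for illustration. It shows lm() expanding a factor into dummy columns by itself, and then what can happen when a factor has too many levels for the number of observations.

```r
# Simulated data; every name here (d, zip, x1, y, tiny, id) is hypothetical.
set.seed(1)
n <- 200
d <- data.frame(
  zip = factor(sample(sprintf("Z%02d", 1:20), n, replace = TRUE)),
  x1  = rnorm(n)
)
d$y <- 1 + 0.5 * d$x1 + rnorm(n)

# Point 1: lm() builds the dummy (treatment-contrast) columns for a factor itself.
fit <- lm(y ~ x1 + zip, data = d)

# The design matrix holds an intercept, x1, and one column per factor level
# except the reference level.
X <- model.matrix(y ~ x1 + zip, data = d)
ncol(X)
nlevels(d$zip) + 1

# Point 2: with one factor level per observation the model is saturated.
tiny <- d[1:15, ]
tiny$id <- factor(seq_len(nrow(tiny)))
fit_sat <- lm(y ~ x1 + id, data = tiny)

df.residual(fit_sat)   # residual degrees of freedom left after fitting
anyNA(coef(fit_sat))   # TRUE if some coefficient could not be estimated
```

The saturated fit still runs without an error, which is why it is worth checking the residual degrees of freedom and looking for NA coefficients before reading anything into the output.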