Dear all,

I am trying to estimate an lm model with one continuous dependent variable and 11 independent variables that are all categorical, some of which have many categories (several dozen in some cases).

I am not interested in statistical inference to a larger population. The objective of my model is to find a way to best predict my continuous variable within the sample.

When I run the lm model I evidently get many regression coefficients that are not significant. Is there some way to automatically combine levels of a categorical variable if the regression coefficients for the individual levels are not significant?

My idea is to find some form of grouping of the different categories that allows me to work with fewer levels while keeping or even improving the quality of predictions.

Thanks,

Michael
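For concreteness, a minimal sketch of the kind of model described above. The data frame `dat` and its variables are hypothetical and deliberately scaled down (far fewer levels than "several dozen") so the example runs quickly:

set.seed(1)
n  <- 500
f1 <- factor(sample(letters[1:12], n, replace = TRUE))           # 12 levels
f2 <- factor(sample(LETTERS[1:8],  n, replace = TRUE))           #  8 levels
f3 <- factor(sample(c("low", "mid", "high"), n, replace = TRUE)) #  3 levels
## y depends only on coarse groupings of f1 and f2, plus noise, so most
## individual dummy coefficients carry little information on their own
y  <- ifelse(f1 %in% letters[1:4], 2, 0) +
      ifelse(f2 %in% LETTERS[1:2], 1, 0) + rnorm(n)
dat <- data.frame(y, f1, f2, f3)

fit <- lm(y ~ f1 + f2 + f3, data = dat)
summary(fit)  # one dummy coefficient per non-reference level; many not significant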
> On 20 Sep 2016, at 11:34, Michael Haenlein <haenlein at escpeurope.eu> wrote:
>
> Dear all,
>
> I am trying to estimate an lm model with one continuous dependent variable
> and 11 independent variables that are all categorical, some of which have
> many categories (several dozen in some cases).

If I'm not wrong (I assume the categorical variables are stored as factors), lm will pick the most crowded categories and try to fit a linear model over them. (This might be wrong, please correct me somebody.)

> I am not interested in statistical inference to a larger population. The
> objective of my model is to find a way to best predict my continuous
> variable within the sample.

The best pick would be a CART (Classification and Regression Tree, rpart) or a CIT (Conditional Inference Tree, ctree) model to predict a continuous response variable from categorical variables. Please see the new partykit package (the old party) for CIT.

> When I run the lm model I evidently get many regression coefficients that
> are not significant. Is there some way to automatically combine levels of a
> categorical variable if the regression coefficients for the individual
> levels are not significant?
>
> My idea is to find some form of grouping of the different categories that
> allows me to work with fewer levels while keeping or even improving the
> quality of predictions.

I also want to mention cforest here; with it you can measure the importance of your predictor variables. I would recommend the partykit package for categorical predictors, but you can also give rpart a try.

> Thanks,
>
> Michael
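A minimal sketch of the tree-based approaches suggested above (not code from the reply itself), reusing the hypothetical `dat` from the earlier example; ntree is kept small only so the example runs quickly:

library(rpart)     # CART
library(partykit)  # ctree() and cforest()

## CART: for a numeric response, rpart's binary splits send subsets of a
## factor's levels left or right, which effectively groups levels that
## behave alike with respect to y
cart_fit <- rpart(y ~ f1 + f2 + f3, data = dat)
print(cart_fit)

## Conditional inference tree
cit_fit <- ctree(y ~ f1 + f2 + f3, data = dat)
plot(cit_fit)

## Conditional random forest with permutation variable importance
cf_fit <- cforest(y ~ f1 + f2 + f3, data = dat, ntree = 100)
varimp(cf_fit)

## In-sample predictions, comparable to fitted(fit) from the lm above
pred_tree <- predict(cart_fit, newdata = dat)

Either tree can then be inspected to see which levels end up in the same node, which is one data-driven way of merging categories.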
You need statistical help, which is generally off topic here. I suggest you post to a statistical site like stats.stackexchange.com instead. Better yet, find a local statistical expert with whom you can consult.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)