Dear all,

I am trying to estimate an lm model with one continuous dependent variable and 11 independent variables that are all categorical, some of which have many categories (several dozen in some cases).

I am not interested in statistical inference to a larger population. The objective of my model is to find a way to best predict my continuous variable within the sample.

When I run the lm model I evidently get many regression coefficients that are not significant. Is there some way to automatically combine levels of a categorical variable if the regression coefficients for the individual levels are not significant?

My idea is to find some form of grouping of the different categories that allows me to work with fewer levels while keeping or even improving the quality of predictions.

Thanks,

Michael
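For concreteness, a minimal sketch of the kind of model described above. The data frame `dat` and its variables are hypothetical and deliberately scaled down (far fewer levels than "several dozen") so the example runs quickly:

set.seed(1)
n  <- 500
f1 <- factor(sample(letters[1:12], n, replace = TRUE))           # 12 levels
f2 <- factor(sample(LETTERS[1:8],  n, replace = TRUE))           #  8 levels
f3 <- factor(sample(c("low", "mid", "high"), n, replace = TRUE)) #  3 levels
## y depends only on coarse groupings of f1 and f2, plus noise, so most
## individual dummy coefficients carry little information on their own
y  <- ifelse(f1 %in% letters[1:4], 2, 0) +
      ifelse(f2 %in% LETTERS[1:2], 1, 0) + rnorm(n)
dat <- data.frame(y, f1, f2, f3)

fit <- lm(y ~ f1 + f2 + f3, data = dat)
summary(fit)  # one dummy coefficient per non-reference level; many not significant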
> On 20 Sep 2016, at 11:34, Michael Haenlein <haenlein at escpeurope.eu> wrote:
>
> Dear all,
>
> I am trying to estimate an lm model with one continuous dependent variable
> and 11 independent variables that are all categorical, some of which have
> many categories (several dozen in some cases).

If I'm not wrong (I assume the categorical variables are stored as factors), lm will pick the most crowded categories and try to fit a linear model over them. (This might be wrong, please correct me somebody.)

> I am not interested in statistical inference to a larger population. The
> objective of my model is to find a way to best predict my continuous
> variable within the sample.

The best pick would be a CART (Classification and Regression Tree, rpart) or a CIT (Conditional Inference Tree, ctree) model to predict a continuous response variable from categorical variables. Please see the new partykit package (the old party) for CIT.

> When I run the lm model I evidently get many regression coefficients that
> are not significant. Is there some way to automatically combine levels of a
> categorical variable if the regression coefficients for the individual
> levels are not significant?
>
> My idea is to find some form of grouping of the different categories that
> allows me to work with fewer levels while keeping or even improving the
> quality of predictions.

I also want to mention cforest here; with it you can measure the importance of your predictor variables. I would recommend the partykit package for categorical predictors, but you can also give rpart a try.

> Thanks,
>
> Michael
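A minimal sketch of the tree-based approaches suggested above (not code from the reply itself), reusing the hypothetical `dat` from the earlier example; ntree is kept small only so the example runs quickly:

library(rpart)     # CART
library(partykit)  # ctree() and cforest()

## CART: for a numeric response, rpart's binary splits send subsets of a
## factor's levels left or right, which effectively groups levels that
## behave alike with respect to y
cart_fit <- rpart(y ~ f1 + f2 + f3, data = dat)
print(cart_fit)

## Conditional inference tree
cit_fit <- ctree(y ~ f1 + f2 + f3, data = dat)
plot(cit_fit)

## Conditional random forest with permutation variable importance
cf_fit <- cforest(y ~ f1 + f2 + f3, data = dat, ntree = 100)
varimp(cf_fit)

## In-sample predictions, comparable to fitted(fit) from the lm above
pred_tree <- predict(cart_fit, newdata = dat)

Either tree can then be inspected to see which levels end up in the same node, which is one data-driven way of merging categories.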
You need statistical help, which is generally off topic here. I suggest you post to a statistical site like stats.stackexchange.com instead. Better yet, find a local statistical expert with whom you can consult.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)