Hi,

I have a dataset with approx. 400K rows and 900 columns, with a single 0/1 flag as the dependent variable. The independent variables are both categorical and numerical. I have looked at SO/Cross Validated posts but couldn't find an answer for this.

Since I cannot try all possible combinations of variables, or even attempt a single model with all 900 columns, I am planning to fit an independent model for each variable using something like the below:

library(plyr)  # for rbind.fill

out <- NULL
xnames <- colnames(train)[!colnames(train) %in% ignoredcols]
for (f in xnames) {
    fit <- glm(reformulate(f, "conversion_flag"), data = train,
               family = binomial)
    out <- rbind.fill(out, data.frame(variable = f,
                                      R2  = fmsb::NagelkerkeR2(fit)$R2,
                                      AIC = AIC(fit)))
}

This gives me the individual AIC and pseudo R2 for each variable. After that I plan to select the variables with the best scores on both AIC and pseudo R2. Does this approach make sense?

I will of course use n-fold cross-validation in the final model to ensure accuracy and avoid overfitting. However, before I reach that point I plan to use the above to select which variables to use.

Thanks,
Manish

CONFIDENTIAL NOTE:
The information contained in this email is intended only...{{dropped:11}}
Hello Manish,

In my view, the "marginal" selection of variables can be misleading, since it ignores the correlation between the independent variables. I would suggest using a modern variable selection technique such as a Lasso-type method. One method that directly fits your need is the L1-regularized logistic regression implemented in the glmnet package. Here is a vignette that should be helpful for you:

http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#log

Best,
Yixuan

2015-12-17 4:36 GMT-05:00 Manish MAHESHWARI <manishm at dbs.com>:
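For the archives, the glmnet workflow described above might be sketched as follows. This is only an illustration on simulated data: the `train` data frame below is made up, standing in for the real 400K x 900 dataset, and the column names (`x1`, `x2`, `g`, `conversion_flag`) are hypothetical.

```r
library(glmnet)

# Simulated stand-in for the real `train` data frame: one informative
# numeric column, one noise column, one categorical column.
set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n),
                    x2 = rnorm(n),
                    g  = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
train$conversion_flag <- rbinom(n, 1, plogis(2 * train$x1))

# glmnet needs a numeric matrix, so expand factors into dummy columns;
# a sparse matrix keeps memory manageable with many predictors.
x <- sparse.model.matrix(conversion_flag ~ . - 1, data = train)
y <- train$conversion_flag

# alpha = 1 gives the L1 (lasso) penalty; cv.glmnet chooses the penalty
# strength lambda by 10-fold cross-validation.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Variables with non-zero coefficients at the chosen lambda are the
# "selected" ones.
coef(cvfit, s = "lambda.min")
```

Using `s = "lambda.1se"` instead of `"lambda.min"` gives a sparser, more conservative selection.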
Lasso is an obvious choice, but it may also be interesting to look at the variable importance measures from a random forest model.

On 17 Dec 2015 17:28, "Manish MAHESHWARI" <manishm at dbs.com> wrote:
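A random forest importance ranking along those lines could be sketched as below. Again the `train` data frame is simulated purely for illustration; with 400K real rows one would likely want to cap `sampsize` or fit on a subsample to keep memory and runtime in check.

```r
library(randomForest)

# Toy stand-in for the real data; the response must be a factor for
# randomForest to run in classification mode.
set.seed(1)
n <- 400
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$conversion_flag <- factor(rbinom(n, 1, plogis(2 * train$x1)))

# importance = TRUE records the permutation-based MeanDecreaseAccuracy
# measure alongside the default MeanDecreaseGini.
rf <- randomForest(conversion_flag ~ ., data = train,
                   ntree = 200, importance = TRUE)

importance(rf)   # per-variable importance matrix
varImpPlot(rf)   # quick visual ranking
```

Unlike the marginal per-variable GLMs, the forest's importance scores reflect each variable's contribution in the presence of the others.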