Hi,

I have a dataset with approx. 400K rows and 900 columns, with a single 0/1 flag as the dependent variable. The independent variables are both categorical and numerical. I have looked at SO/Cross Validated posts but couldn't find an answer for this.

Since I cannot try all possible combinations of variables, or even attempt a single model with all 900 columns, I am planning to fit an independent model for each variable using something like the below:

library(plyr)  # for rbind.fill

out <- NULL
xnames <- colnames(train)[!colnames(train) %in% ignoredcols]
for (f in xnames) {
    fit <- glm(reformulate(f, "conversion_flag"), data = train,
               family = binomial)
    out <- rbind.fill(out, data.frame(variable = f,
                                      R2  = fmsb::NagelkerkeR2(fit)$R2,
                                      AIC = AIC(fit)))
}

This gives me the individual AIC and pseudo R2 for each variable. After that I plan to select the variables with the best scores on both AIC and pseudo R2. Does this approach make sense?

I will of course use n-fold cross-validation in the final model to ensure accuracy and avoid overfitting. However, before I reach that point I plan to use the above to select which variables to use.

Thanks,
Manish

CONFIDENTIAL NOTE:
The information contained in this email is intended only...{{dropped:11}}
Hello Manish,

In my view, the "marginal" selection of variables can be misleading, since it ignores the correlation between the independent variables. I would suggest using a modern variable selection technique such as a Lasso-type method. One method that directly fits your need is the L1-regularized logistic regression implemented in the glmnet package. Here is a vignette that should be helpful for you:

http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#log

Best,
Yixuan

2015-12-17 4:36 GMT-05:00 Manish MAHESHWARI <manishm at dbs.com>:
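For the archives, the glmnet workflow described above might be sketched as follows. This is only an illustration on simulated data: the `train` data frame below is made up, standing in for the real 400K x 900 dataset, and the column names (`x1`, `x2`, `g`, `conversion_flag`) are hypothetical.

```r
library(glmnet)

# Simulated stand-in for the real `train` data frame: one informative
# numeric column, one noise column, one categorical column.
set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n),
                    x2 = rnorm(n),
                    g  = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
train$conversion_flag <- rbinom(n, 1, plogis(2 * train$x1))

# glmnet needs a numeric matrix, so expand factors into dummy columns;
# a sparse matrix keeps memory manageable with many predictors.
x <- sparse.model.matrix(conversion_flag ~ . - 1, data = train)
y <- train$conversion_flag

# alpha = 1 gives the L1 (lasso) penalty; cv.glmnet chooses the penalty
# strength lambda by 10-fold cross-validation.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Variables with non-zero coefficients at the chosen lambda are the
# "selected" ones.
coef(cvfit, s = "lambda.min")
```

Using `s = "lambda.1se"` instead of `"lambda.min"` gives a sparser, more conservative selection.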
Lasso is an obvious choice, but it may also be interesting to look at the variable importance measures from a random forest model.

On 17 Dec 2015 17:28, "Manish MAHESHWARI" <manishm at dbs.com> wrote:
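A random forest importance ranking along those lines could be sketched as below. Again the `train` data frame is simulated purely for illustration; with 400K real rows one would likely want to cap `sampsize` or fit on a subsample to keep memory and runtime in check.

```r
library(randomForest)

# Toy stand-in for the real data; the response must be a factor for
# randomForest to run in classification mode.
set.seed(1)
n <- 400
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$conversion_flag <- factor(rbinom(n, 1, plogis(2 * train$x1)))

# importance = TRUE records the permutation-based MeanDecreaseAccuracy
# measure alongside the default MeanDecreaseGini.
rf <- randomForest(conversion_flag ~ ., data = train,
                   ntree = 200, importance = TRUE)

importance(rf)   # per-variable importance matrix
varImpPlot(rf)   # quick visual ranking
```

Unlike the marginal per-variable GLMs, the forest's importance scores reflect each variable's contribution in the presence of the others.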