Hi all,

I am using the glmnet R package to run LASSO with binary logistic regression. I have over 290 samples with outcome data (0 for alive, 1 for dead) and over 230 predictor variables. I am currently using LASSO to reduce the number of predictor variables.

I am using the cv.glmnet function to do 10-fold cross-validation on a sequence of lambda values which I let glmnet determine. I then take the optimal lambda value (lambda.1se), which I use to predict on an independent cohort. What I am finding is that this optimal lambda value fluctuates every time I run glmnet with LASSO. It deviates enough that each time I generate an ROC curve for my validation cohort, the AUC values also vary. Does anyone know why there is such a fluctuation in the selection of an optimal lambda? I am wondering whether it is due to the 10-fold cross-validation step: perhaps the training set is not being split so that each fold has enough alive and dead cases? Thoughts?
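[For reference, a minimal sketch of the workflow described above, using glmnet's standard interface; x_train, y_train, x_valid are placeholder objects, not from the original post.]

library(glmnet)

set.seed(1)                               # makes the random fold assignment reproducible for this run
cvfit <- cv.glmnet(x_train, y_train,      # y coded 0 = alive, 1 = dead
                   family = "binomial",
                   nfolds = 10)

cvfit$lambda.1se                          # the "optimal" lambda that fluctuates between runs
coef(cvfit, s = "lambda.1se")             # which predictors survive the LASSO penalty

## predicted probabilities on the independent cohort
p_valid <- predict(cvfit, newx = x_valid,
                   s = "lambda.1se", type = "response")

Without a fixed seed (or fixed folds), the 10 folds are re-drawn at random on every call, which is why lambda.1se moves around.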
On 07/22/2011 07:51 PM, fongchun wrote:
> I am using the glmnet R package to run LASSO with binary logistic
> regression.
> ...
> What I am finding is that this optimal lambda value fluctuates
> every time I run glmnet with LASSO.
> ...
> Does anyone know why there is such a fluctuation in the
> selection of an optimal lambda?

Cross-validation is a random procedure, and the results will vary every time. This reflects the underlying uncertainty regarding the optimal lambda.

Or are you saying that you've used glmnet many times, but this time the fluctuations in lambda are much larger than usual? If so, and you suspect a problem with the way that glmnet is partitioning the data set into cross-validation folds, you can specify the folds yourself with the 'foldid' option.

--
Patrick Breheny
Assistant Professor
Department of Biostatistics
Department of Statistics
University of Kentucky
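[A rough sketch of the 'foldid' suggestion; x and y are placeholders, and the class-stratified fold assignment is an assumption added here to address the concern about alive/dead balance, not something specified in the thread.]

library(glmnet)

set.seed(42)
nfolds <- 10
foldid <- integer(length(y))
## assign folds separately within each outcome class so every fold
## contains both alive and dead cases
for (cls in unique(y)) {
  idx <- which(y == cls)
  foldid[idx] <- sample(rep(seq_len(nfolds), length.out = length(idx)))
}

## with foldid supplied, every call to cv.glmnet uses the same partitions
cvfit <- cv.glmnet(x, y, family = "binomial", foldid = foldid)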
Hi Patrick,

Thanks for the reply. I am referring to using the cv.glmnet() function with 10-fold cross-validation and letting glmnet determine the lambda sequence. The optimal lambda that it returns fluctuates between different runs of cv.glmnet. Sometimes the model that is returned varies from including anywhere from 3 to 25 predictor variables (I am doing LASSO and I originally had 235 predictor variables). I will try the foldid option.

I was also thinking of a bootstrapping-style approach where I would run cv.glmnet, say, 100 times and then take the mean/median lambda across all the cv.glmnet runs. This way I could also generate a confidence interval for the optimal lambda I would use in the end.

Another question: I am currently using glmnet to fit a two-class predictor (binary logistic regression). The cv.glmnet() function has a type.measure parameter which can be set to "auc". If I understand correctly, for each lambda it does 10-fold cross-validation and at each fold it calculates an AUC. Is the cross-validation score for that lambda therefore the average AUC across all folds? Or does it pool the predicted response values from each fold and then generate one ROC curve on all the predicted values?

Thanks,

Fong
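[For what it's worth, a sketch of the repeated-cv.glmnet idea described above; x and y are placeholders, and whether averaging lambda across runs is statistically justified is a separate question.]

library(glmnet)

set.seed(1)
lams <- replicate(100, {
  cv.glmnet(x, y, family = "binomial", nfolds = 10)$lambda.1se
})
summary(lams)                      # spread of the selected lambda across runs
median(lams)                       # one candidate "consensus" lambda
quantile(lams, c(0.025, 0.975))    # rough interval for the selected lambda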
10-fold CV has high variation compared to other methods. Use repeated CV or the bootstrap instead (both of which can be used with glmnet by way of the train() function in the caret package).

Max

On Jul 23, 2011, at 11:43 AM, fongchun <fongchunchan at gmail.com> wrote:
> Thanks for the reply. I am referring to using the cv.glmnet() function with
> 10-fold cross-validation and letting glmnet determine the lambda sequence.
> The optimal lambda that it returns fluctuates between different runs of
> cv.glmnet.
> ...
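[A sketch of what Max suggests, assuming caret's standard interface; x is a predictor matrix/data frame, y the 0/1 outcome, and the tuning grid values are illustrative only.]

library(caret)
library(glmnet)

## y must be a factor with valid R level names when classProbs = TRUE
y_fac <- factor(ifelse(y == 1, "dead", "alive"), levels = c("alive", "dead"))

ctrl <- trainControl(method = "repeatedcv",      # or method = "boot" for the bootstrap
                     number = 10, repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(1)
fit <- train(x, y_fac,
             method = "glmnet",
             metric = "ROC",                     # tune on cross-validated AUC
             tuneGrid = expand.grid(alpha = 1,   # alpha = 1 is the LASSO
                                    lambda = 10^seq(-4, 0, length.out = 50)),
             trControl = ctrl)

fit$bestTune                                     # lambda chosen over the repeated resamples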
On 07/23/2011 11:43 AM, fongchun wrote:
> I was also thinking of a bootstrapping-style approach where I would run
> cv.glmnet, say, 100 times and then take the mean/median lambda across all
> the cv.glmnet runs. This way I could also generate a confidence interval
> for the optimal lambda I would use in the end.

A simpler approach is to increase the number of folds. If you set the number of folds equal to n ("leave-one-out" cross-validation), the outcome will no longer be random, as there is only one way of choosing the fold partitions. The main reason people settle for 10-fold CV is computational convenience when n is large, which is not an issue in your case.

--
Patrick Breheny
Assistant Professor
Department of Biostatistics
Department of Statistics
University of Kentucky
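[A minimal sketch of this leave-one-out suggestion; x and y are placeholders. Note that an AUC cannot be computed within a single-observation fold, so the default deviance measure is used here rather than type.measure = "auc".]

library(glmnet)

## leave-one-out CV: one fold per observation, so the fold assignment
## (and hence lambda.1se) is no longer random
cvfit_loo <- cv.glmnet(x, y, family = "binomial",
                       nfolds = nrow(x),
                       type.measure = "deviance")   # AUC would need larger folds
cvfit_loo$lambda.1se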