thr3ads.net - R help - [R] glmpath in R [Apr 2010]

Steve Lianoglou

2010-Apr-06 15:21 UTC

[R] glmpath in R

Hi Claire,

I'm replying and CC-ing to the R-help list to get more eyes on your
question since others will likely have more/better advice, and perhaps
someone else in the future will have a similar question, and might
find this thread handy.

I've removed your specific research aim since that might be private
information, but you can include that later if others find it
necessary to know in order to help.

On Apr 5, 2010, at 5:44 PM, Claire Wooton wrote:
> Dear Steve,
>
> I came across your posting on the R-help mailing list concerning finding
> the best lambda in a LASSO-model, and I was wondering whether you would be
> able to offer any advice based on your experience.
>
> I'm attempting to build a logistic regression model to explore
> [REDACTED] and recently decided to build a LASSO-model, having learned of the
> problems with stepwise variable selection. While I've done a fair amount of
> reading on the topic, I'm still a bit uncertain when it comes to selecting
> an appropriate value for lambda when using the glmpath package.
>
> Any advice you could offer would be much appreciated.

In general, what I've done is to use cross validation to find this
"best" value for lambda, which I'm defining as the value of lambda
that gives me the model with the best "objective score" on my
held-out data.

The "objective score" is in quotes, because it can change given the
problem. For instance, for ordinary (linear) regression, the best
objective score could be the "lowest mean squared error" (or highest
Spearman rank correlation) on my held-out examples. In your case, for
logistic regression, this could just be the accuracy of the predicted
class labels.

So, I do the CV and, for each fold, get the one value of lambda whose
model generalizes best to that fold's held-out data. After doing the
10-fold CV (once, or many times), you could take the average of those
lambda values and use it for your 'downstream analysis' by building a
model on all of your data with that value of lambda.
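
As a rough sketch of the cross-validation step, assuming the glmnet
package rather than glmpath and simulated data (the names x and y and the
simulated effect sizes are made up for illustration): note that cv.glmnet
averages the error over folds at each lambda and then picks the minimum,
which is closely related to, but not the same as, averaging the per-fold
best lambdas described above.

```r
library(glmnet)  # assumption: glmnet is used here instead of glmpath

set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
# Simulated presence/absence response: only v1 and v2 carry signal
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))

# 10-fold CV over the lasso path, scored by misclassification error
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                   type.measure = "class", nfolds = 10)

cvfit$lambda.min  # lambda with the lowest mean CV error
cvfit$lambda.1se  # largest lambda whose error is within 1 SE of that minimum

# cv.glmnet also fits the path on all of the data, so the coefficients
# at the chosen lambda can be read off directly
b <- coef(cvfit, s = "lambda.min")
print(b)
```

Using type.measure = "class" matches the "accuracy of the class labels"
score; the default for family = "binomial" is the deviance.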

I'd also do some smoke tests to see how sensitive your model is w.r.t.
the data it is given to train on. Do your best lambdas vary a lot across
folds? How different is the model between folds -- are the same
predictor vars non-zero? How much do their coefficients vary? Etc.
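
One way such a smoke test could look, again assuming glmnet and simulated
data (the variable names, the fold assignment and the 0.5 classification
threshold are illustrative choices, not anything prescribed by the
package): pick the best lambda separately in each fold and record which
predictors are non-zero there.

```r
library(glmnet)  # assumption: glmnet is used here instead of glmpath

set.seed(2)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))

# Shared lambda grid (decreasing), taken from a fit on all of the data
lambdas <- glmnet(x, y, family = "binomial")$lambda
k <- 10
fold <- sample(rep(1:k, length.out = n))

best_lambda <- numeric(k)
selected <- matrix(FALSE, k, p, dimnames = list(NULL, colnames(x)))

for (i in 1:k) {
  train <- fold != i
  fit <- glmnet(x[train, ], y[train], family = "binomial", lambda = lambdas)
  # Held-out class probabilities, one column per lambda
  prob <- predict(fit, newx = x[!train, , drop = FALSE],
                  s = lambdas, type = "response")
  err <- colMeans((prob > 0.5) != y[!train])  # misclassification rate
  # which.min returns the first minimum, i.e. the largest lambda among ties
  best_lambda[i] <- lambdas[which.min(err)]
  beta <- as.matrix(coef(fit, s = best_lambda[i]))[-1, 1]  # drop intercept
  selected[i, ] <- beta != 0
}

summary(best_lambda)  # how much does the best lambda move between folds?
colMeans(selected)    # fraction of folds in which each predictor is non-zero
```

If the same few predictors are selected in nearly every fold and the best
lambdas sit in a narrow range, the model is probably not too sensitive to
the particular training sample.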

Also, what's your objective in building the model? Do you just want
something with high predictive accuracy? Are you trying to draw
conclusions on the model that you build -- like infer meaning from its
coefs?

This should probably go in the beginning of the email, but it's better
late than never:

I should add the disclaimer that I'm not a "real statistician," and
I'm "calling uncle" in advance to the card carrying statisticians on
this list that might argue that (i) this approach isn't principled
enough, (ii) you shouldn't really take any statistical advice on a
mailing list; and (iii) you'd be best off consulting a local
statistician.

Does that answer your question? If not, could you elaborate more about
what you're after?

Please don't forget to CC the R-help list on any further communication.

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Claire Wooton

2010-Apr-06 16:31 UTC

[R] glmpath in R

Hi Steve,

Thanks very much for your reply. My main objective in building the model is
to determine the relative strength of the variables in predicting my
presence/absence data. It's really an exploratory method; I'm interested in
whether the associations that have been observed out in the field come out
in the model. I'm also using rpart to build a classification tree to get a
sense of any interactions.

I was planning to use cross-validation to identify the value of lambda that
gives the minimum mean CV error, as well as the largest value of lambda such
that the error is within 1 SE of the minimum. I'm not entirely sure how to
proceed in building the full model using this value of lambda. At this point
do I simply use predict.glmpath (or predict.glmnet), setting the value of "s"
to lambda, and return the coefficients? I plan to validate the chosen
coefficient estimates through a bootstrap analysis.
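
For illustration, the steps described above might look like the following
with glmnet (a sketch under assumptions: simulated data, made-up variable
names, B = 100 resamples, and a bootstrap that simply refits the lasso at
the fixed 1-SE lambda and counts how often each predictor is selected;
other bootstrap schemes are possible).

```r
library(glmnet)  # assumption: glmnet rather than glmpath

set.seed(3)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10)
lam <- cvfit$lambda.1se  # largest lambda within 1 SE of the CV minimum

# Coefficients of the full-data fit at the chosen lambda
beta_hat <- predict(cvfit$glmnet.fit, s = lam, type = "coefficients")
print(beta_hat)

# Bootstrap: resample rows, refit the path, read coefficients at lam
B <- 100
boot_sel <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  fit <- glmnet(x[idx, ], y[idx], family = "binomial")
  as.matrix(coef(fit, s = lam))[-1, 1] != 0  # drop intercept
})

rowMeans(boot_sel)  # selection frequency of each predictor over resamples
```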

Beyond conducting this "smoke test", I'm wondering how I should assess the
resulting model. Can I assess the fit and predictive accuracy of a glmnet
object?
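
One possible way to check predictive accuracy, sketched with glmnet on
simulated data (the train/test split, the 0.5 threshold and the variable
names are assumptions made for the example): hold out part of the data,
choose lambda by CV on the rest, and score the held-out predictions.

```r
library(glmnet)  # assumption: glmnet rather than glmpath

set.seed(4)
n <- 300; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))

test <- sample(n, 100)  # indices of the held-out test set
cvfit <- cv.glmnet(x[-test, ], y[-test], family = "binomial")

# Predicted probabilities on the test set at the 1-SE lambda
prob <- as.numeric(predict(cvfit, newx = x[test, ],
                           s = "lambda.1se", type = "response"))

# Classification accuracy at a 0.5 threshold
acc <- mean((prob > 0.5) == y[test])

# AUC via the rank (Mann-Whitney) formula
r  <- rank(prob)
n1 <- sum(y[test] == 1)
n0 <- sum(y[test] == 0)
auc <- (sum(r[y[test] == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(accuracy = acc, auc = auc)
```

The dev.ratio component of the underlying glmnet fit
(cvfit$glmnet.fit$dev.ratio) gives the fraction of deviance explained on
the training data at each lambda, which speaks to fit rather than to
out-of-sample accuracy.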

Thanks again for your help. I am also planning on discussing my work with a
professor in statistics. I appreciate the insight though as I attempt to wrap my
head around these methods. 

Cheers,

Claire
