My aim is to select a good subset of predictors for a final logit model built with glm(). What is the best way to cross-validate the results so that they are reliable?

Let's say I have a large dataset of thousands of observations. I split the data into two groups, one for training and one for validation. First I use the training set to build a model, and then run stepAIC() with a forward-backward search. BUT, if I base my variable selection purely on this result, I suspect it will be somewhat skewed by the one-time data split (I use only a single training dataset).

What is the correct way to perform this variable selection, and are there readily available packages for it? Similarly, once I have my final set of predictors, how should I make the final assessment of the model's predictive performance? Cross-validation? Which package?

Thank you in advance,
Jay
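For concreteness, a minimal sketch of the single-split workflow described above (the data frame 'dat', the binary outcome 'y', the predictors x1..x3, and the 70/30 split are placeholder assumptions, not from any particular package):

library(MASS)   # provides stepAIC()

set.seed(1)
train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

## start from the intercept-only model on the training set
null_fit <- glm(y ~ 1, data = train, family = binomial)

## forward-backward ("both") stepwise search by AIC over the candidate predictors
sel <- stepAIC(null_fit,
               scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
               direction = "both", trace = FALSE)

## predicted probabilities on the held-out validation set
p_hat <- predict(sel, newdata = valid, type = "response")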
JLucke at ria.buffalo.edu
2010-Apr-02 15:18 UTC
[R] Cross-validation for parameter selection (glm/logit)
Jay,

Unless I have misunderstood some statistical subtleties, you can use the AIC in place of actual cross-validation, as the AIC is asymptotically equivalent to leave-one-out cross-validation under maximum likelihood estimation.

Joe

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B (Methodological), 39, 44-47.

Abstract: A logarithmic assessment of the performance of a predicting density is found to lead to asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, when maximum likelihood estimation is used within each model.
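If that equivalence is taken at face value, candidate models can be compared by AIC directly on the full data, without a hold-out split. A minimal sketch of such a comparison (the data frame 'dat', outcome 'y', and predictors x1..x3 are placeholders):

## fit a few nested candidate logit models on the full data
m1 <- glm(y ~ x1,           data = dat, family = binomial)
m2 <- glm(y ~ x1 + x2,      data = dat, family = binomial)
m3 <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

## lower AIC indicates better expected out-of-sample log-predictive performance
AIC(m1, m2, m3)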
Steve Lianoglou
2010-Apr-02 21:33 UTC
[R] Cross-validation for parameter selection (glm/logit)
Hi,

On Fri, Apr 2, 2010 at 9:14 AM, Jay <josip.2000 at gmail.com> wrote:
> My aim is to select a good subset of predictors for a final logit
> model built with glm(). What is the best way to cross-validate the
> results so that they are reliable? [...]

Another approach would be to use penalized regression models. The glmnet package has lasso and elastic-net models for both logistic and "normal" (Gaussian) regression. Intuitively: in addition to minimizing (say) the squared loss, the model has to pay a penalty, controlled by lambda, on its non-zero coefficients, which in turn yields sparse models. You can use cross-validation to fine-tune the value of lambda; a short sketch follows below this post.

If you're not familiar with these penalized models, the glmnet package has a few references to get you started.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
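A minimal sketch of the glmnet approach described above (the data frame 'dat' and binary outcome 'y' are placeholder assumptions; this is not code from the original post):

library(glmnet)

## build a numeric predictor matrix and drop the intercept column
x <- model.matrix(y ~ ., data = dat)[, -1]
y_vec <- dat$y

## 10-fold CV over the lambda path; alpha = 1 gives the lasso penalty
cvfit <- cv.glmnet(x, y_vec, family = "binomial", alpha = 1, nfolds = 10)

plot(cvfit)                     # CV deviance versus log(lambda)
coef(cvfit, s = "lambda.1se")   # sparse coefficient vector at the "one SE" rule

The coefficients that remain non-zero at the chosen lambda constitute the selected variable subset, so the selection and the tuning are handled by the same cross-validation step.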