Jin Minming
2012-Jan-30 13:14 UTC
[R] Variable selection based on both training and testing data
Dear all,

Variable selection in regression is usually determined from the training data using AIC or an F value, for example with stepAIC. Is there an R package that can consider both the training and test datasets? For example, I have separate training and test data. First, a regression model is fitted to the training data, and then this model is evaluated on the test data. This process is repeated in order to find possible optimal models in terms of RMSE or R2 for both training and test data.

Thanks,
Jim
Liaw, Andy
2012-Jan-30 13:39 UTC
[R] Variable selection based on both training and testing data
Variable selection is part of the training process: it chooses the model. By definition, test data is used only for testing (evaluating the chosen model). If you find a package or function that does variable selection on test data, run from it!

Best,
Andy
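A minimal sketch of the separation described in this reply: selection runs on the training data only, and the test data is touched once, to evaluate the model that was chosen. The simulated data, the 150/50 split, and the names `dat`, `train`, `test`, and `rmse` are assumptions made up for illustration; `stepAIC` comes from the MASS package and is used here only because the original question names it.

```r
library(MASS)  # provides stepAIC()

set.seed(1)

# Simulated data (an assumption for illustration): only x1 and x2 affect y.
n <- 200
dat <- data.frame(matrix(rnorm(n * 6), n, 6))
names(dat) <- paste0("x", 1:6)
dat$y <- 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Split once; the test rows play no part in selection.
train_idx <- sample(n, 150)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Variable selection uses the training data only.
full <- lm(y ~ ., data = train)
sel  <- stepAIC(full, direction = "both", trace = FALSE)

# The test data is used only to evaluate the chosen model.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_train <- rmse(train$y, fitted(sel))
rmse_test  <- rmse(test$y, predict(sel, newdata = test))

print(formula(sel))
print(c(train = rmse_train, test = rmse_test))
```

If the test RMSE were fed back into the choice of variables, the test set would become part of training and would no longer give an honest estimate of predictive error.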
SR Millis
2012-Jan-30 14:57 UTC
[R] Fw: Variable selection based on both training and testing data
(Forwarded to the list; originally sent to Jin Minming on Monday, January 30, 2012 9:25 AM.)

Jim,

First, stepwise methods for variable selection should be avoided. Frank Harrell (in Regression Modeling Strategies) discusses this at length.

Second, splitting a dataset into training and validation sets is generally not a good idea unless you have a really large sample, e.g., > 20,000. As Harrell has discussed, split-sample validation does not provide external validation, is terribly inefficient, and is arbitrary. It's better to specify your model a priori and use the bootstrap to obtain an estimate of your model's over-optimism. Bootstrapping can be implemented with Harrell's rms package in R.

Scott

Scott R Millis, PhD, ABPP, CStat, PStat
Professor, Wayne State University School of Medicine
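A rough sketch of the bootstrap optimism estimate mentioned in this reply, written in base R so that each step is visible. It is not the rms implementation; the rms package offers `validate()` for this purpose. The simulated data, the pre-specified formula `form`, the helper `r2`, and the choice of `B = 200` resamples are assumptions for illustration.

```r
set.seed(42)

# Simulated data (an assumption for illustration).
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# The model is specified a priori; no selection step is involved.
form <- y ~ x1 + x2 + x3

r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Apparent performance: fit and evaluate on the same data.
fit <- lm(form, data = dat)
apparent <- r2(dat$y, fitted(fit))

# For each bootstrap resample, refit the model and compare its R2 on the
# resample with its R2 on the original data; the difference is the optimism.
B <- 200
optimism <- replicate(B, {
  boot  <- dat[sample(n, replace = TRUE), ]
  fit_b <- lm(form, data = boot)
  r2(boot$y, fitted(fit_b)) - r2(dat$y, predict(fit_b, newdata = dat))
})

# Optimism-corrected estimate of R2.
corrected <- apparent - mean(optimism)
print(c(apparent = apparent, optimism = mean(optimism), corrected = corrected))
```

Every observation contributes to both fitting and validation here, which is the efficiency argument made against a single train/test split.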