Dear all,

I am wondering why the step() procedure in R has the description 'Select a formula-based model by AIC'.

I have been using Stata and SPSS, and neither package makes any reference to AIC in its stepwise procedure. I also read in an earlier R-Help post that step() is really the 'usual' way of doing stepwise selection (R-Help post from Prof Ripley, Fri, 2 Apr 1999 05:06:03 +0100 (BST)).

My understanding of the 'usual' way of doing, say, forward regression is that variables whose p-value falls below a criterion (commonly 0.05) become candidates for inclusion in the model, the one with the lowest p-value among these is chosen, and the step is repeated until the p-values of all variables not in the model are above 0.05; cf. Hosmer and Lemeshow (1989), Applied Logistic Regression. The procedure does not require examination of the AIC.

I am not well enough acquainted with R to understand the code used in step(), so can somebody tell me how step() works?

Thanks very much,

Tim
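As a rough illustration of the question above, the following sketch runs a forward search with step() and then uses add1() with an F test to show the p-values of the same candidate terms at the first step. The mtcars data set and the five candidate predictors are assumptions chosen only for illustration; nothing here comes from the original post. For candidate terms that each cost one degree of freedom, ranking by AIC and ranking by p-value pick the same term at a given step; the two approaches differ mainly in the stopping rule, since the default penalty k = 2 corresponds only roughly to a p-value cutoff near 0.157 rather than 0.05.

```r
# Forward selection on the built-in mtcars data (chosen only for illustration).
null_model <- lm(mpg ~ 1, data = mtcars)
candidates <- ~ wt + hp + qsec + disp + drat  # hypothetical candidate terms

# step() adds, at each pass, the term that lowers AIC the most and stops
# when no addition lowers it further. trace = 0 suppresses the printed log.
aic_fit <- step(null_model, scope = candidates,
                direction = "forward", trace = 0)

# The p-value view of a single forward step: add1() with an F test reports,
# for each candidate term, both the AIC and the p-value of adding it alone.
first_step <- add1(null_model, scope = candidates, test = "F")

print(formula(aic_fit))  # the model step() settled on
print(aic_fit$anova)     # the terms added, in order, with the AIC after each
print(first_step)        # AIC and Pr(>F) side by side for the first step
```

Comparing the first row added in aic_fit$anova with the smallest Pr(>F) in first_step shows where the two criteria agree; the later rows show where an AIC-based search keeps going after a 0.05 rule would have stopped.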
On Thu, 2006-12-14 at 14:37 +0000, Timothy.Mak at iop.kcl.ac.uk wrote:
> [...]
> I am not well aquainted with R enough to understand the codes used in
> step(), so can somebody tell me how step() works?

> library(fortunes)
> fortune("stepwise")

Frank Harrell: Here is an easy approach that will yield results only slightly less valid than one actually using the response variable:

  x <- data.frame(x1, x2, x3, x4, ..., other potential predictors)
  x[ , sample(ncol(x))]

Andy Liaw: Hmm... Shouldn't that be something like:

  x[, sample(ncol(x), ceiling(ncol(x) * runif(1)))]

   -- Frank Harrell and Andy Liaw (about alternative strategies for stepwise regression and `random parsimony')
      R-help (May 2005)

But seriously, using:

  RSiteSearch("stepwise")

will provide links to prior discussions on why the use of stepwise based model building is to be avoided.

A copy of Frank's book (more info here):

  http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

will also provide insight.

HTH,

Marc Schwartz
You may want to look at a book published more recently than 17 years ago (computing has changed a lot since then). Doing stepwise regression using p-values is one approach (and when p-values were the easiest, or only, thing to compute, it was reasonable to use them). But think about how many p-values you would be computing and comparing to 0.05 in a stepwise regression. Now think about how many you would have computed if your data had come from a different sample. What is your type I error rate? Is the usual p-value theory even meaningful in this situation?

There are several criteria that can be used in stepwise regression to decide which term to add or drop. The p-value (or F-statistic) is only one; others include AIC, BIC, adjusted R-squared, PRESS, gut feeling, prior knowledge, cost, ... Some of these have better properties than p-values, but most still suffer from the fact that a small change in the data can result in a very different model.

Look at the lars, lasso2, and BMA packages for some more modern alternatives to stepwise regression.

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Timothy.Mak at iop.kcl.ac.uk
Sent: Thursday, December 14, 2006 9:28 AM
To: r-help at stat.math.ethz.ch
Subject: [R] Stepwise regression

[...]
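Two of the points in the reply above can be tried out with base R alone. The sketch below is only an illustration under stated assumptions: the mtcars data set, the five predictors, the seed, and the 50 bootstrap resamples are arbitrary choices, and selected_terms() is a hypothetical helper, not part of any package. It first runs the same backward search under two penalties (k = 2 for AIC, k = log(n) for BIC), then repeats the AIC search on resampled data to see how often the selected model changes.

```r
set.seed(1)  # arbitrary seed so the resampling below is repeatable

full_model <- lm(mpg ~ wt + hp + qsec + disp + drat, data = mtcars)
n <- nrow(mtcars)

# Same search, two penalties: k = 2 is AIC (the default), k = log(n) is BIC.
by_aic <- step(full_model, direction = "backward", trace = 0)
by_bic <- step(full_model, direction = "backward", k = log(n), trace = 0)

# Hypothetical helper: rerun the backward AIC search on a data set and
# return the surviving predictors as a single label.
selected_terms <- function(data) {
  fit <- step(lm(mpg ~ wt + hp + qsec + disp + drat, data = data),
              direction = "backward", trace = 0)
  paste(sort(attr(terms(fit), "term.labels")), collapse = " + ")
}

# Apply the helper to 50 bootstrap resamples of the rows.
boot_models <- replicate(
  50,
  selected_terms(mtcars[sample(n, replace = TRUE), ])
)

print(formula(by_aic))
print(formula(by_bic))
# How often each distinct model was selected across the resamples.
print(sort(table(boot_models), decreasing = TRUE))
```

If the table shows several different models, each chosen in only a fraction of the resamples, that is the instability described above: the selected model depends heavily on which sample happened to be drawn.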