Dear all,

I am a newcomer to R. I intend to use R to do stepwise regression and PLS with a data set (a 55x20 matrix, with one dependent and 19 independent variables). I have done the same analysis on the same data set using SPSS and SAS. However, the results obtained with R differ considerably from those of SPSS and SAS.

In the case of stepwise regression, SPSS produced a model with 4 independent variables, but R's step() produced a model with 10 variables and a much higher R2. Furthermore, regsubsets() also indicates the 10-variable model is one of the best regression subsets. How can this difference be explained? And for my data set, how many variables would it be reasonable to enter into the model?

In the case of PLS, the results of the mvr function in the pls.pcr package also differ from those of SAS. Although the optimal number of latent variables is the same, the difference in R2 is large. Why?

Any comments and suggestions are greatly appreciated. Thanks in advance!

Best wishes,
Jinsong Zhao

====
(Mr.) Jinsong Zhao
Ph.D. Candidate
School of the Environment
Nanjing University
22 Hankou Road, Nanjing 210093
P.R. China
E-mail: jinsong_zh at yahoo.com
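[For reference, a minimal sketch of the stepwise call being described. The data frame `dat`, with response y and predictors x1..x19, is a hypothetical stand-in for the poster's 55x20 data set:]

  ## Hypothetical stand-in for the poster's data: a data frame `dat`
  ## with response y and 19 predictors x1..x19 (55 rows).
  full <- lm(y ~ ., data = dat)           # fit with all 19 predictors
  sel  <- step(full, direction = "both")  # stepwise selection by AIC
  summary(sel)$r.squared                  # R2 of the selected model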
On Sun, 1 Feb 2004 20:03:36 -0800 (PST) Jinsong Zhao <jinsong_zh at yahoo.com> wrote:

> --- Frank E Harrell Jr <feh3k at spamcop.net> wrote:
>
> > > For the case of stepwise regression, I have found that
> > > the subsets I got using regsubsets() are collinear.
> > > However, the variables in SPSS's result are not
> > > collinear. I wonder what I should do to get the same
> > > or a better linear model.
> >
> > I think you missed the point. None of the variable
> > selection procedures will provide results that have a
> > fair probability of replicating in another sample.
> >
> > FH
> > ---
> > Frank E Harrell Jr
> > Professor and Chair
> > School of Medicine
> > Department of Biostatistics
> > Vanderbilt University
>
> Do you mean different procedures will provide
> different results? Maybe I don't understand your email
> correctly. Now, I just hope I could get a reasonable
> linear model using the stepwise method in R, but I don't
> know how to deal with the collinearity problem.
>
> ====
> (Mr.) Jinsong Zhao

No, I mean the SAME procedure will provide different results. Use the bootstrap, or use simulation to repeatedly sample from the same population and the same true regression model. You will see dramatically different "final models" selected by the same algorithm. The algorithm is inherently unstable unless perhaps you have a sample an order of magnitude larger than the one you have. See http://www.pitt.edu/~wpilib/statfaq/regrfaq.html, which contains some good references.

---
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
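[A minimal simulation sketch of the instability Frank describes. The data are made up here (only x1 truly predicts y), so everything below is illustrative rather than an analysis of the poster's data:]

  ## Bootstrap a stepwise fit and watch the "final model" change.
  set.seed(1)
  n <- 55; p <- 19
  X <- as.data.frame(matrix(rnorm(n * p), n, p))
  names(X) <- paste("x", 1:p, sep = "")
  X$y <- 2 * X$x1 + rnorm(n)               # only x1 truly matters

  picked <- replicate(50, {
    b   <- X[sample(n, replace = TRUE), ]  # bootstrap resample
    fit <- step(lm(y ~ ., data = b), trace = 0)
    paste(sort(names(coef(fit))[-1]), collapse = "+")
  })
  sort(table(picked), decreasing = TRUE)   # many different "final models"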
Frank Harrell wrote:

> I think you missed the point. None of the variable
> selection procedures will provide results that have a
> fair probability of replicating in another sample.
>
> FH

And Jinsong Zhao answered:

> Do you mean different procedures will provide different
> results? Maybe I don't understand your email correctly.
> Now, I just hope I could get a reasonable linear model
> using the stepwise method in R, but I don't know how to
> deal with the collinearity problem.

The problem is not with R, SAS, or SPSS, but with your desire to produce "a reasonable linear model using stepwise". Stepwise does not, in general, produce reasonable linear models, nor does it produce models that are generally replicable. This issue has been discussed here in the past, and there have been more extensive discussions on SAS-L and in numerous statistics books, including Dr. Harrell's excellent one.

HTH

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)
www.peterflom.com
Just a few more comments to add to what Chris said. Collinearity usually arises in two situations:

1. Insufficient sampling; i.e., data points that would make the variables _not_ as collinear are not included in the sample.
2. The variables are `naturally' correlated.

If it's the first, then #2 from the list Chris cited is a possible option. Otherwise, I'd say shrinkage makes more sense than regressing on principal components (a sketch of both follows below). Both are in the same class of biased estimators, but one needs to be lucky to have the first few PCs correlate well with the response in the case of PCR. In any case, interpretation of model coefficients from such data will likely be difficult.

Just my $0.02...

Andy

> From: Chris Lawrence
>
> Peter Kennedy, in "A Guide to Econometrics" (pp. 187-89), suggests the
> following options for dealing with collinearity:
>
> 1. "Do nothing." The main problem in OLS when variables are collinear
>    is that the estimated variances of the parameters are often inflated.
> 2. Obtain more data.
> 3. Formalize relationships among regressors (for example, in a
>    simultaneous equation model).
> 4. Specify a relationship among the *parameters*.
> 5. Drop one or more variables. (In essence, a subset of #4 where
>    coefficients are set to zero.)
> 6. Incorporate estimates from other studies. (A Bayesian might consider
>    using a strong prior.)
> 7. Form a principal component from the variables, and use that instead.
> 8. Shrink the OLS estimates using the ridge or Stein estimators.
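[To make options 7 and 8 concrete, a minimal R sketch of ridge shrinkage versus principal-components regression. The data frame `dat` (response y, predictors x1..x19) is hypothetical, and the lambda grid and three-component cutoff are arbitrary choices for illustration:]

  ## Option 8: ridge shrinkage (MASS::lm.ridge) over a grid of lambdas
  library(MASS)
  rr <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 10, by = 0.1))
  select(rr)                        # GCV / HKB / L-W suggestions for lambda

  ## Option 7: regress on the first few principal components
  pcs <- prcomp(dat[, names(dat) != "y"], scale. = TRUE)
  pcr <- lm(dat$y ~ pcs$x[, 1:3])   # first 3 PCs; the cutoff is ad hoc
  summary(pcr)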
On Sun, 1 Feb 2004, Jinsong Zhao wrote:

> In the case of stepwise regression, SPSS produced a model with 4
> independent variables, but R's step() produced a model with 10
> variables and a much higher R2. Furthermore, regsubsets() also
> indicates the 10-variable model is one of the best regression
> subsets. How can this difference be explained? And for my data set,
> how many variables would it be reasonable to enter into the model?

Most likely because step() uses AIC and SPSS uses a p-value criterion, so the models are `best' in different ways.

regsubsets() gives the best models of each size, so it doesn't address the 4 vs 10 issue; that isn't what regsubsets() is intended for. If you want a single model for prediction, you need a method based on an honest estimate of prediction error, and if you want a single model to explain relationships, you need to think about relationships.

While people seem to want to use it for finding a single model, the purpose of regsubsets() is to give you many models, precisely as a way around the problem of instability everyone else has pointed out. Given a large number of models you can see which features are common to them, or you can do a crude but reasonably effective approximation to model averaging.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
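[A minimal sketch of the "many models" use of regsubsets() that Thomas describes, again with the hypothetical data frame `dat` standing in for the poster's data:]

  ## Keep the 10 best models of every size, then look for variables
  ## that are common to the good models.
  library(leaps)
  subs <- regsubsets(y ~ ., data = dat, nvmax = 19, nbest = 10)
  s    <- summary(subs)

  ## how often each variable appears across all retained models
  sort(colMeans(s$which[, -1]), decreasing = TRUE)

  ## compare the candidate models on BIC rather than R2
  plot(subs, scale = "bic")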