Dear all,

I am a newcomer to R. I intend to use R to do stepwise regression and PLS with a data set (a 55x20 matrix, with one dependent and 19 independent variables). I have done the same analysis on the same data set using SPSS and SAS. However, the results obtained with R differ considerably from those of SPSS and SAS.

In the case of stepwise regression, SPSS produced a model with 4 independent variables, but R's step() produced a model with 10 variables and a much higher R2. Furthermore, regsubsets() also indicates the 10-variable model is one of the best regression subsets. How can this difference be explained? And for my data set, how many variables would it be reasonable to enter into the model?

In the case of PLS, the results of the mvr function in the pls.pcr package also differ from those of SAS. Although the optimal number of latent variables is the same, the difference in R2 is large. Why?

Any comments and suggestions are greatly appreciated. Thanks in advance!

Best wishes,
Jinsong Zhao

====
(Mr.) Jinsong Zhao
Ph.D. Candidate
School of the Environment
Nanjing University
22 Hankou Road, Nanjing 210093
P.R. China
E-mail: jinsong_zh at yahoo.com
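[For reference, a minimal sketch of the stepwise call being described. The data frame `dat`, with response y and predictors x1..x19, is a hypothetical stand-in for the poster's 55x20 data set:]

  ## Hypothetical stand-in for the poster's data: a data frame `dat`
  ## with response y and 19 predictors x1..x19 (55 rows).
  full <- lm(y ~ ., data = dat)           # fit with all 19 predictors
  sel  <- step(full, direction = "both")  # stepwise selection by AIC
  summary(sel)$r.squared                  # R2 of the selected model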
On Sun, 1 Feb 2004 20:03:36 -0800 (PST) Jinsong Zhao <jinsong_zh at yahoo.com> wrote:

> --- Frank E Harrell Jr <feh3k at spamcop.net> wrote:
>
> > > For the case of stepwise regression, I have found that
> > > the subsets I got using regsubsets() are collinear.
> > > However, the variables in SPSS's result are not
> > > collinear. I wonder what I should do to get the same
> > > or a better linear model.
> >
> > I think you missed the point. None of the variable
> > selection procedures will provide results that have a
> > fair probability of replicating in another sample.
> >
> > FH
> > ---
> > Frank E Harrell Jr
> > Professor and Chair
> > School of Medicine
> > Department of Biostatistics
> > Vanderbilt University
>
> Do you mean different procedures will provide
> different results? Maybe I don't understand your email
> correctly. Now, I just hope I could get a reasonable
> linear model using the stepwise method in R, but I don't
> know how to deal with the collinearity problem.
>
> ====
> (Mr.) Jinsong Zhao

No, I mean the SAME procedure will provide different results. Use the bootstrap, or use simulation to repeatedly sample from the same population and the same true regression model. You will see dramatically different "final models" selected by the same algorithm. The algorithm is inherently unstable unless perhaps you have a sample an order of magnitude larger than the one you have. See http://www.pitt.edu/~wpilib/statfaq/regrfaq.html, which contains some good references.

---
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
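[A minimal simulation sketch of the instability Frank describes. The data are made up here (only x1 truly predicts y), so everything below is illustrative rather than an analysis of the poster's data:]

  ## Bootstrap a stepwise fit and watch the "final model" change.
  set.seed(1)
  n <- 55; p <- 19
  X <- as.data.frame(matrix(rnorm(n * p), n, p))
  names(X) <- paste("x", 1:p, sep = "")
  X$y <- 2 * X$x1 + rnorm(n)               # only x1 truly matters

  picked <- replicate(50, {
    b   <- X[sample(n, replace = TRUE), ]  # bootstrap resample
    fit <- step(lm(y ~ ., data = b), trace = 0)
    paste(sort(names(coef(fit))[-1]), collapse = "+")
  })
  sort(table(picked), decreasing = TRUE)   # many different "final models"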
Frank Harrell wrote:

> I think you missed the point. None of the variable
> selection procedures will provide results that have a
> fair probability of replicating in another sample.
>
> FH

And Jinsong Zhao answered:

> Do you mean different procedures will provide different
> results? Maybe I don't understand your email correctly.
> Now, I just hope I could get a reasonable linear model
> using the stepwise method in R, but I don't know how to
> deal with the collinearity problem.

The problem is not with R, SAS, or SPSS, but with your desire to produce "a reasonable linear model using stepwise". Stepwise does not, in general, produce reasonable linear models, nor does it produce models that are generally replicable. This issue has been discussed here in the past, and there have been more extensive discussions on SAS-L and in numerous statistics books, including Dr. Harrell's excellent one.

HTH

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)
www.peterflom.com
Just a few more comments to add to what Chris said. Collinearity usually arises in two situations:

1. Insufficient sampling; i.e., data points that would make the variables _not_ as collinear are not included in the sample.
2. The variables are `naturally' correlated.

If it's the first, then #2 from the list Chris cited is a possible option. Otherwise, I'd say shrinkage makes more sense than regressing on principal components (a sketch of both follows below). Both are in the same class of biased estimators, but one needs to be lucky to have the first few PCs correlate well with the response in the case of PCR. In any case, interpretation of model coefficients from such data will likely be difficult.

Just my $0.02...

Andy

> From: Chris Lawrence
>
> Peter Kennedy, in "A Guide to Econometrics" (pp. 187-89), suggests the
> following options for dealing with collinearity:
>
> 1. "Do nothing." The main problem in OLS when variables are collinear
>    is that the estimated variances of the parameters are often inflated.
> 2. Obtain more data.
> 3. Formalize relationships among regressors (for example, in a
>    simultaneous equation model).
> 4. Specify a relationship among the *parameters*.
> 5. Drop one or more variables. (In essence, a subset of #4 where
>    coefficients are set to zero.)
> 6. Incorporate estimates from other studies. (A Bayesian might consider
>    using a strong prior.)
> 7. Form a principal component from the variables, and use that instead.
> 8. Shrink the OLS estimates using the ridge or Stein estimators.
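[To make options 7 and 8 concrete, a minimal R sketch of ridge shrinkage versus principal-components regression. The data frame `dat` (response y, predictors x1..x19) is hypothetical, and the lambda grid and three-component cutoff are arbitrary choices for illustration:]

  ## Option 8: ridge shrinkage (MASS::lm.ridge) over a grid of lambdas
  library(MASS)
  rr <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 10, by = 0.1))
  select(rr)                        # GCV / HKB / L-W suggestions for lambda

  ## Option 7: regress on the first few principal components
  pcs <- prcomp(dat[, names(dat) != "y"], scale. = TRUE)
  pcr <- lm(dat$y ~ pcs$x[, 1:3])   # first 3 PCs; the cutoff is ad hoc
  summary(pcr)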
On Sun, 1 Feb 2004, Jinsong Zhao wrote:

> In the case of stepwise regression, SPSS produced a model with 4
> independent variables, but R's step() produced a model with 10
> variables and a much higher R2. Furthermore, regsubsets() also
> indicates the 10-variable model is one of the best regression
> subsets. How can this difference be explained? And for my data set,
> how many variables would it be reasonable to enter into the model?

Most likely because step() uses AIC and SPSS uses a p-value criterion, so the models are `best' in different ways.

regsubsets() gives the best models of each size, so it doesn't address the 4 vs 10 issue; that isn't what regsubsets() is intended for. If you want a single model for prediction, you need a method based on an honest estimate of prediction error, and if you want a single model to explain relationships, you need to think about relationships.

While people seem to want to use it for finding a single model, the purpose of regsubsets() is to give you many models, precisely as a way around the problem of instability everyone else has pointed out. Given a large number of models you can see which features are common to them, or you can do a crude but reasonably effective approximation to model averaging.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
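[A minimal sketch of the "many models" use of regsubsets() that Thomas describes, again with the hypothetical data frame `dat` standing in for the poster's data:]

  ## Keep the 10 best models of every size, then look for variables
  ## that are common to the good models.
  library(leaps)
  subs <- regsubsets(y ~ ., data = dat, nvmax = 19, nbest = 10)
  s    <- summary(subs)

  ## how often each variable appears across all retained models
  sort(colMeans(s$which[, -1]), decreasing = TRUE)

  ## compare the candidate models on BIC rather than R2
  plot(subs, scale = "bic")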