thr3ads.net - R help - [R] Did I use "step" function correctly? (Is R's step() function reliable?) [Mar 2006]

If this information is useful, please help other people find it:
Share via:

Michael

2006-Mar-16 09:30 UTC

[R] Did I use "step" function correctly? (Is R's step() function reliable?)

Hi all,

I put up an exhaustive model to use R's "step" function:

------------------------

mygam=gam(col1 ~ 1
+ col2     + col3     + col4
+ col2 ^ 2 + col3 ^ 2 + col4 ^ 2
+ col2 ^ 3 + col3 ^ 3 + col4 ^ 3
+ s(col2, 1) + s(col3, 1) + s(col4, 1)
+ s(col2, 2) + s(col3, 2) + s(col4, 2)
+ s(col2, 3) + s(col3, 3) + s(col4, 3)
+ s(col2, 4) + s(col3, 4) + s(col4, 4)
+ s(col2, 5) + s(col3, 5) + s(col4, 5)
+ s(col2, 6) + s(col3, 6) + s(col4, 6)
+ s(col2, 7) + s(col3, 7) + s(col4, 7)
+ s(col2, 8) + s(col3, 8) + s(col4, 8)
+ s(col2, 9) + s(col3, 9) + s(col4, 9),
data=X);

 mystep=step(mygam);

---------------------
After a long list, the following are two lowest AIC:

Step:  AIC= 152.1
 col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)


Step:  AIC= 153.45
 col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
-----------------------------------------------

However, the lowest AIC model,  " col1 ~ col2 + col3 + col4 + s(col2, 3) +
s(col3, 3) + s(col4, 3)" does not give the best Residual Deviance.

Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6),
data=X)" is the best, in fact,

I found that as I increase the "degree-of-freedom", it always give
better
residual deviance, lower than that of the "best" model returned by
"step"
function... Please see below.

I am wondering if I need to increase "degree-of-freedom" all the way
up...
Perhaps to avoid overfitting, I should do a cross validation. Is there an
automatic Cross Validation inside "step" or "gam"?

Is "step" function result reliable? Or perhaps I used it incorrectly?

Thanks a lot,

Michael.

--------------------------
>
> mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4,
3), data=X);>
> mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
>
> mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
>
> mygam1Call:
gam(formula = col1 ~ col2 + col3 + col4 +
    s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)

Degrees of Freedom: 110 total; 100.9999 Residual
Residual Deviance: 20.98365> mygam2Call:
gam(formula = col1 ~ col2 + col3 + col4, data = X)

Degrees of Freedom: 110 total; 107 Residual
Residual Deviance: 27.84808> mygam3Call:
gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
    s(col4, 6), data = X)

Degrees of Freedom: 110 total; 91.99957 Residual
Residual Deviance: 18.45776>
> anova(mygam1, mygam2, mygam3);Analysis of Deviance Table

Model 1: col1 ~ col2 + col3 + col4 + s(col2,
    3) + s(col3, 3) + s(col4, 3)
Model 2: col1 ~ col2 + col3 + col4
Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
  Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
1  100.9999    20.9836
2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
3   91.9996    18.4578  15.0004   9.3903 3.958e-05

	[[alternative HTML version deleted]]

Berton Gunter

2006-Mar-16 17:14 UTC

head link

[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

The questions you ask lead to far more complex issues than you are aware of.
As a result, there are no "good" -- nor certainly any simple --
answers to
them. The deeper issue is how to choose an appropriate model, balancing
complexity (overfitting) with parsimony (predictive/scientific validity). A
few miscellaneous comments are:

1) Increasing model complexity (more parameters, more df for the model) for
nested models must **always** improve (or cannot harm, anyway) the fit; this
is just a mathematical identity, and so your comment to this effect is
basically meaningless.

2) The hard question is: does increased complexity improve the fit
**enough** to believe that it is "meaningful"?

3) AIC, BIC, cross-validation, extra sums of squares principles, anovas via
likelihood ratio tests, etc. etc. are all addressed at this question. All
are sometimes useful and sometimes produce junk.

4) There are now many statisticians and statistical learners who believe
that the the basic question -- what is the right model? -- is fundamentally
misleading. Their view is (approximately, I'm no expert) that there are
always several (many?) different models that are essentially equally good.
So they cast the issue as one of prediction and eschew a single model
altogether. Instead, they have developed various approaches to creating
"ensembles" of models that are basically just prediction engines
largely
incapable of interpretation. Random forests, boosting, and bagging are some
of the buzzwords here, and R has packages for all.

Brian Ripley has at least one excellent presentation on these issues on his
website (his presentation on John Nelder's 80th birthday. Unfortunately,
while I have the paper, I no longer have the link. Perhaps he might re-post
it). You might also wish to have a look at some of Leo Breiman's writings on
these issues.

I repeat: I am not an expert on these matters, and my brief comments above
no doubt already contain distortions and inaccuracies. I would greatly
appreciate it if those with real expertise would correct any of the more
egregious errors.

Finally, if I may hazard a personal opinion regarding use of stepAIC or
other stepwise fitting methods: Beware! -- there lie dragons! They can be an
excellent way to generate complete spurious models.

Cheers,
Bert

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Michael
> Sent: Thursday, March 16, 2006 1:31 AM
> To: R-help at stat.math.ethz.ch
> Subject: [R] Did I use "step" function correctly? (Is R's 
> step() functionreliable?)
> 
> Hi all,
> 
> I put up an exhaustive model to use R's "step" function:
> 
> ------------------------
> 
> mygam=gam(col1 ~ 1
> + col2     + col3     + col4
> + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> + s(col2, 1) + s(col3, 1) + s(col4, 1)
> + s(col2, 2) + s(col3, 2) + s(col4, 2)
> + s(col2, 3) + s(col3, 3) + s(col4, 3)
> + s(col2, 4) + s(col3, 4) + s(col4, 4)
> + s(col2, 5) + s(col3, 5) + s(col4, 5)
> + s(col2, 6) + s(col3, 6) + s(col4, 6)
> + s(col2, 7) + s(col3, 7) + s(col4, 7)
> + s(col2, 8) + s(col3, 8) + s(col4, 8)
> + s(col2, 9) + s(col3, 9) + s(col4, 9),
> data=X);
> 
>  mystep=step(mygam);
> 
> ---------------------
> After a long list, the following are two lowest AIC:
> 
> Step:  AIC= 152.1
>  col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> 
> 
> Step:  AIC= 153.45
>  col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> -----------------------------------------------
> 
> However, the lowest AIC model,  " col1 ~ col2 + col3 + col4 + 
> s(col2, 3) +
> s(col3, 3) + s(col4, 3)" does not give the best Residual Deviance.
> 
> Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) 
> + s(col4, 6),
> data=X)" is the best, in fact,
> 
> I found that as I increase the "degree-of-freedom", it always 
> give better
> residual deviance, lower than that of the "best" model 
> returned by "step"
> function... Please see below.
> 
> I am wondering if I need to increase "degree-of-freedom" all 
> the way up...
> Perhaps to avoid overfitting, I should do a cross validation. 
> Is there an
> automatic Cross Validation inside "step" or "gam"?
> 
> Is "step" function result reliable? Or perhaps I used it
incorrectly?
> 
> Thanks a lot,
> 
> Michael.
> 
> --------------------------
> 
> >
> > mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 
> 3) + s(col4,
> 3), data=X);
> >
> > mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> >
> > mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
> >
> > mygam1
> Call:
> gam(formula = col1 ~ col2 + col3 + col4 +
>     s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> 
> Degrees of Freedom: 110 total; 100.9999 Residual
> Residual Deviance: 20.98365
> > mygam2
> Call:
> gam(formula = col1 ~ col2 + col3 + col4, data = X)
> 
> Degrees of Freedom: 110 total; 107 Residual
> Residual Deviance: 27.84808
> > mygam3
> Call:
> gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
>     s(col4, 6), data = X)
> 
> Degrees of Freedom: 110 total; 91.99957 Residual
> Residual Deviance: 18.45776
> >
> > anova(mygam1, mygam2, mygam3);
> Analysis of Deviance Table
> 
> Model 1: col1 ~ col2 + col3 + col4 + s(col2,
>     3) + s(col3, 3) + s(col4, 3)
> Model 2: col1 ~ col2 + col3 + col4
> Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
>   Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> 1  100.9999    20.9836
> 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> 3   91.9996    18.4578  15.0004   9.3903 3.958e-05
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
>

Michael

2006-Mar-17 07:17 UTC

head link

[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

Hi L.Y,

Thank you for your advice.

Are you talking about Trevor Hastie's gam()?

I did not see anywhere from the result that it has an automatic Cross
Validation?

I also could not verify that the gam() function will automatically find the
degree-of-freedom if I don't specify the df, and just use
tems such as

s(col1) + s(col2) ...

Does the "step()" function also include the gam() with CV and
auto-tweaking
for df?

I wondered if I have called "step()" correctly, because it looks to me
that
it only run at a very short time(1second), and immediately returned two
models, in fact has even larger residual deviance than the model I have
provided to it initially... (obviously I've included every possibilities in
the initial model, and rely on the step() function to cut off some terms for
me...)

Thanks a lot!



On 3/16/06, Dr L. Y Hin <lyhin@netvigator.com>
wrote:>
> The engine of gam() lies in a function called smooth.spline() that is
> found
> in the
> library splines. If you leave out specifying the degree of freedom in the
> formulary determination,
> it will automatically specify it for you via cross-validation. The results
> of model fit obtainable via
> summary(mygam) will show you the "degree of freedom as choosen by the
> cross-validation method".
> On a more philosophical plane, Buja et al. (Ann Stat. 1989;17(2):453-510)
> pointed out that the fact
> that linear smoothers such as cubic splines and smoothing splines are
> linear
> lies in the fact that
> they are x-dependent and not y-dependent. By using cross-validation, you
> will invariably involve the
> use of y, which renders the determination of degree of freedom
> y-dependent,
> hence the smoothing
> parameter \lambda y-dependent, and for such a case, the smoothing matrix,
> strictly speaking,
> non-linear becasue S= (I + \lambda * K)^-1 in the non weighted form with
> unique x-points.
>
> If you increase the degree of freedom, the \lambda decreases, to a point
> where you will efffectively
> have a straightforward interpolation of points on the graph. Conversely,
> if
> \lambda is increased,
> the smoothing line reduces to a linear regression line through all the
> points.
>
> In my opinion, AIC and Residual sum of squares are competing tools looking
> for the best fit.
> The minimum of AIC and that of RSS may not concur. If you believe in AIC,
> then I would assume
> you also believe that it is a better tool than RSS in that the former uses
> an information theoretic
> approach, which is not sensitive to offset in accuracy due to penalization
> of outliers. Following that,
> I would disregard RSS and go according to what AIC tells me.
>
> I don't think you have used step.gam incorrectly, but I think you have
> been
> observant enough to
> realize not all statistical tools agree all the times :)
>
> Lin
>
> ----- Original Message -----
> From: "Michael" <comtech.usa@gmail.com>
> To: <R-help@stat.math.ethz.ch>
> Sent: Thursday, March 16, 2006 5:30 PM
> Subject: [R] Did I use "step" function correctly? (Is R's
step()
> functionreliable?)
>
>
> > Hi all,
> >
> > I put up an exhaustive model to use R's "step" function:
> >
> > ------------------------
> >
> > mygam=gam(col1 ~ 1
> > + col2     + col3     + col4
> > + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> > + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> > + s(col2, 1) + s(col3, 1) + s(col4, 1)
> > + s(col2, 2) + s(col3, 2) + s(col4, 2)
> > + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > + s(col2, 4) + s(col3, 4) + s(col4, 4)
> > + s(col2, 5) + s(col3, 5) + s(col4, 5)
> > + s(col2, 6) + s(col3, 6) + s(col4, 6)
> > + s(col2, 7) + s(col3, 7) + s(col4, 7)
> > + s(col2, 8) + s(col3, 8) + s(col4, 8)
> > + s(col2, 9) + s(col3, 9) + s(col4, 9),
> > data=X);
> >
> > mystep=step(mygam);
> >
> > ---------------------
> > After a long list, the following are two lowest AIC:
> >
> > Step:  AIC= 152.1
> > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> >
> >
> > Step:  AIC= 153.45
> > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> > -----------------------------------------------
> >
> > However, the lowest AIC model,  " col1 ~ col2 + col3 + col4 +
s(col2, 3)
> +
> > s(col3, 3) + s(col4, 3)" does not give the best Residual
Deviance.
> >
> > Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) +
s(col4,
> > 6),
> > data=X)" is the best, in fact,
> >
> > I found that as I increase the "degree-of-freedom", it
always give
> better
> > residual deviance, lower than that of the "best" model
returned by
> "step"
> > function... Please see below.
> >
> > I am wondering if I need to increase "degree-of-freedom" all
the way
> up...
> > Perhaps to avoid overfitting, I should do a cross validation. Is there
> an
> > automatic Cross Validation inside "step" or "gam"?
> >
> > Is "step" function result reliable? Or perhaps I used it
incorrectly?
> >
> > Thanks a lot,
> >
> > Michael.
> >
> > --------------------------
> >
> >>
> >> mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) +
> s(col4,
> > 3), data=X);
> >>
> >> mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> >>
> >> mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
> >>
> >> mygam1
> > Call:
> > gam(formula = col1 ~ col2 + col3 + col4 +
> >    s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> >
> > Degrees of Freedom: 110 total; 100.9999 Residual
> > Residual Deviance: 20.98365
> >> mygam2
> > Call:
> > gam(formula = col1 ~ col2 + col3 + col4, data = X)
> >
> > Degrees of Freedom: 110 total; 107 Residual
> > Residual Deviance: 27.84808
> >> mygam3
> > Call:
> > gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
> >    s(col4, 6), data = X)
> >
> > Degrees of Freedom: 110 total; 91.99957 Residual
> > Residual Deviance: 18.45776
> >>
> >> anova(mygam1, mygam2, mygam3);
> > Analysis of Deviance Table
> >
> > Model 1: col1 ~ col2 + col3 + col4 + s(col2,
> >    3) + s(col3, 3) + s(col4, 3)
> > Model 2: col1 ~ col2 + col3 + col4
> > Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
> >  Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> > 1  100.9999    20.9836
> > 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> > 3   91.9996    18.4578  15.0004   9.3903 3.958e-05
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>
>
>
	[[alternative HTML version deleted]]

Michael

2006-Mar-17 08:54 UTC

head link

[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

Dear Dr. L. Y Hin,

Thank you very much for your help once again!

Would it possible for the gam() just pick a default df to use? For example,
its default df is 4?

?s
s(x, df=4, spar=1)

---------------------

s(col3, 9)

is constructing the smoothing matrix using df=9,

s() is a function constructing smoothing matrix...


>  As far as I am aware, once you specify a smoothing term, you need a
> smoothing parameter to perform
> the smoothing, and without finding a smoothing parameter using the
> automatic cross validation measure,
> gam() cannot come up with one which it can use to construct the linear
> smoother, and hence the calculation
> cannot proceed. Therefore, even if you have not specified a degree of
> dreedom, gam has actually done so
> for you. You can try fit a model, say,
> try<- gam(y~s(col1)+s(col2),...)
> and then look at the results by
> summary(try)
> and you will see a column called degree of freedom. Without automatic
> cross-validation, you cannot have
> results to this column.
>
>  Can I ask what does, for example, s(col3, 9) refers to?
> Clarifying that may be helpful for me to give you more tangible and
> workable suggestion.
>
>
>
>
> ----- Original Message -----
> *From:* Michael <comtech.usa@gmail.com>
> *To:* Dr L. Y Hin <lyhin@netvigator.com>
> *Cc:* R-help@stat.math.ethz.ch
> *Sent:* Friday, March 17, 2006 3:17 PM
> *Subject:* Re: [R] Did I use "step" function correctly? (Is
R's step()
> functionreliable?)
>
>
> Hi L.Y,
>
> Thank you for your advice.
>
> Are you talking about Trevor Hastie's gam()?
>
> I did not see anywhere from the result that it has an automatic Cross
> Validation?
>
> I also could not verify that the gam() function will automatically find
> the degree-of-freedom if I don't specify the df, and just use
> tems such as
>
> s(col1) + s(col2) ...
>
> Does the "step()" function also include the gam() with CV and
> auto-tweaking for df?
>
> I wondered if I have called "step()" correctly, because it looks
to me
> that it only run at a very short time(1second), and immediately returned
two
> models, in fact has even larger residual deviance than the model I have
> provided to it initially... (obviously I've included every
possibilities in
> the initial model, and rely on the step() function to cut off some terms
for
> me...)
>
> Thanks a lot!
>
>
>
> On 3/16/06, Dr L. Y Hin <lyhin@netvigator.com> wrote:
> >
> > The engine of gam() lies in a function called smooth.spline() that is
> > found
> > in the
> > library splines. If you leave out specifying the degree of freedom in
> > the
> > formulary determination,
> > it will automatically specify it for you via cross-validation. The
> > results
> > of model fit obtainable via
> > summary(mygam) will show you the "degree of freedom as choosen by
the
> > cross-validation method".
> > On a more philosophical plane, Buja et al. (Ann Stat.
> > 1989;17(2):453-510)
> > pointed out that the fact
> > that linear smoothers such as cubic splines and smoothing splines are
> > linear
> > lies in the fact that
> > they are x-dependent and not y-dependent. By using cross-validation,
you
> >
> > will invariably involve the
> > use of y, which renders the determination of degree of freedom
> > y-dependent,
> > hence the smoothing
> > parameter \lambda y-dependent, and for such a case, the smoothing
> > matrix,
> > strictly speaking,
> > non-linear becasue S= (I + \lambda * K)^-1 in the non weighted form
with
> > unique x-points.
> >
> > If you increase the degree of freedom, the \lambda decreases, to a
point
> > where you will efffectively
> > have a straightforward interpolation of points on the graph.
Conversely,
> > if
> > \lambda is increased,
> > the smoothing line reduces to a linear regression line through all the
> > points.
> >
> > In my opinion, AIC and Residual sum of squares are competing tools
> > looking
> > for the best fit.
> > The minimum of AIC and that of RSS may not concur. If you believe in
> > AIC,
> > then I would assume
> > you also believe that it is a better tool than RSS in that the former
> > uses
> > an information theoretic
> > approach, which is not sensitive to offset in accuracy due to
> > penalization
> > of outliers. Following that,
> > I would disregard RSS and go according to what AIC tells me.
> >
> > I don't think you have used step.gam incorrectly, but I think you
have
> > been
> > observant enough to
> > realize not all statistical tools agree all the times :)
> >
> > Lin
> >
> > ----- Original Message -----
> > From: "Michael" <comtech.usa@gmail.com>
> > To: <R-help@stat.math.ethz.ch >
> > Sent: Thursday, March 16, 2006 5:30 PM
> > Subject: [R] Did I use "step" function correctly? (Is
R's step()
> > functionreliable?)
> >
> >
> > > Hi all,
> > >
> > > I put up an exhaustive model to use R's "step"
function:
> > >
> > > ------------------------
> > >
> > > mygam=gam(col1 ~ 1
> > > + col2     + col3     + col4
> > > + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> > > + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> > > + s(col2, 1) + s(col3, 1) + s(col4, 1)
> > > + s(col2, 2) + s(col3, 2) + s(col4, 2)
> > > + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > > + s(col2, 4) + s(col3, 4) + s(col4, 4)
> > > + s(col2, 5) + s(col3, 5) + s(col4, 5)
> > > + s(col2, 6) + s(col3, 6) + s(col4, 6)
> > > + s(col2, 7) + s(col3, 7) + s(col4, 7)
> > > + s(col2, 8) + s(col3, 8) + s(col4, 8)
> > > + s(col2, 9) + s(col3, 9) + s(col4, 9),
> > > data=X);
> > >
> > > mystep=step(mygam);
> > >
> > > ---------------------
> > > After a long list, the following are two lowest AIC:
> > >
> > > Step:  AIC= 152.1
> > > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > >
> > >
> > > Step:  AIC= 153.45
> > > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> > > -----------------------------------------------
> > >
> > > However, the lowest AIC model,  " col1 ~ col2 + col3 + col4
+ s(col2,
> > 3) +
> > > s(col3, 3) + s(col4, 3)" does not give the best Residual
Deviance.
> > >
> > > Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3,
6) +
> > s(col4,
> > > 6),
> > > data=X)" is the best, in fact,
> > >
> > > I found that as I increase the "degree-of-freedom", it
always give
> > better
> > > residual deviance, lower than that of the "best" model
returned by
> > "step"
> > > function... Please see below.
> > >
> > > I am wondering if I need to increase
"degree-of-freedom" all the way
> > up...
> > > Perhaps to avoid overfitting, I should do a cross validation. Is
there
> > an
> > > automatic Cross Validation inside "step" or
"gam"?
> > >
> > > Is "step" function result reliable? Or perhaps I used
it incorrectly?
> > >
> > > Thanks a lot,
> > >
> > > Michael.
> > >
> > > --------------------------
> > >
> > >>
> > >> mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3,
3) +
> > s(col4,
> > > 3), data=X);
> > >>
> > >> mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> > >>
> > >> mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6),
data=X);
> > >>
> > >> mygam1
> > > Call:
> > > gam(formula = col1 ~ col2 + col3 + col4 +
> > >    s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> > >
> > > Degrees of Freedom: 110 total; 100.9999 Residual
> > > Residual Deviance: 20.98365
> > >> mygam2
> > > Call:
> > > gam(formula = col1 ~ col2 + col3 + col4, data = X)
> > >
> > > Degrees of Freedom: 110 total; 107 Residual
> > > Residual Deviance: 27.84808
> > >> mygam3
> > > Call:
> > > gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
> > >    s(col4, 6), data = X)
> > >
> > > Degrees of Freedom: 110 total; 91.99957 Residual
> > > Residual Deviance: 18.45776
> > >>
> > >> anova(mygam1, mygam2, mygam3);
> > > Analysis of Deviance Table
> > >
> > > Model 1: col1 ~ col2 + col3 + col4 + s(col2,
> > >    3) + s(col3, 3) + s(col4, 3)
> > > Model 2: col1 ~ col2 + col3 + col4
> > > Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
> > >  Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> > > 1  100.9999     20.9836
> > > 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> > > 3   91.9996    18.4578  15.0004   9.3903 3.958e-05
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > >
http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
> > >
> >
> >
> >
>
	[[alternative HTML version deleted]]

Seemingly Similar Threads

Search for more apparently analagous threads

R help - Mar 2006 - Did I use "step" function correctly? (Is R's step() function reliable?)

[R] Did I use "step" function correctly? (Is R's step() function reliable?)

[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

Seemingly Similar Threads