OK, I think I understand better, but I still have two points to clarify.

The first one is about the number of df. I think those who replied
objected to the way I chose the df, not to running a model with 7.4 df
per se. If I read you correctly, I artificially reduce my p-value by
taking the df estimated by an mgcv gam model and fixing it in a second
model. That is fine; I am quite willing not to run a second model with
fixed df, and instead to tell my readers that my model is "marginally
significant" with a p-value of 0.03.

This being said, do you know of guidelines for choosing the df? A
colleague told me he does not go above 10% of the number of points.
Should I be concerned when mgcv estimates 7.4 df for 34 points? Note
that for this particular model P < 1e-16, and P is also very small if
I fix the df at either 4 or 7.
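
For what it's worth, mgcv can enforce such a rule of thumb by capping
the basis dimension k (a smooth of one variable can use at most k-1 df).
A minimal sketch on simulated data, not my real data:

## Cap the maximum df of the smooth at roughly 10% of n via k; mgcv
## still estimates the effective df, but below that ceiling.
library(mgcv)
set.seed(42)
n <- 34                                  # same n as my data
x <- runif(n)
y <- sin(3 * x) + rnorm(n, sd = 0.2)
k.max <- max(3, ceiling(0.1 * n))        # 10% rule of thumb -> k = 4
fit <- gam(y ~ s(x, k = k.max))
summary(fit)                             # edf cannot exceed k.max - 1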
My second point is about the difference between models fitted by the
gam and mgcv packages. Sure, some of you have said "different
algorithms". But when I specify the df myself, shouldn't the p-values
be very similar for the two packages? If not, what does that say about
the confidence we can have in the models?

I draw your attention to this example, where I obtained p-values of
0.50 and 0.03 with packages gam and mgcv respectively:
> library(gam)
Loading required package: splines
> data(kyphosis)
> kyp1 <- gam(Kyphosis ~ s(Number, 3), family=binomial, data=kyphosis)
> summary.gam(kyp1)
Call: gam(formula = Kyphosis ~ s(Number, 3), family = binomial,
    data = kyphosis)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.3646 -0.6233 -0.4853 -0.3133  2.0965
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 83.2345 on 80 degrees of freedom
Residual Deviance: 71.9973 on 76.9999 degrees of freedom
AIC: 79.9976
Number of Local Scoring Iterations: 7
DF for Terms and Chi-squares for Nonparametric Effects

             Df Npar Df Npar Chisq  P(Chi)
(Intercept)   1
s(Number, 3)  1       2    1.37149 0.50375
> detach(package:gam)
> library(mgcv)
This is mgcv 1.3-7
> kyp2 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
+     data=kyphosis)
> summary.gam(kyp2)
Family: binomial
Link function: logit
Formula:
Kyphosis ~ s(Number, k = 4, fx = T)
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.5504     0.3342   -4.64 3.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
          edf Est.rank Chi.sq p-value
s(Number)   3        3  8.898  0.0307 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.101 Deviance explained = 12.5%
UBRE score = 0.075202 Scale est. = 1 n = 81
> kyp2$deviance
[1] 72.85893
> kyp2$null.deviance
[1] 83.23447
> kyp2$df.null
[1] 80
> kyp2$df.residual
[1] 77
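
To see the difference directly, one can overlay the two fitted curves;
a minimal sketch (re-fitting both models as above, attaching the
packages in turn because each masks the other's gam):

## Overlay the gam and mgcv fits on the probability scale.
library(gam)
data(kyphosis)
kyp1 <- gam(Kyphosis ~ s(Number, 3), family = binomial, data = kyphosis)
p.gam <- predict(kyp1, type = "response")   # predict before detaching
detach(package:gam)
library(mgcv)
kyp2 <- gam(Kyphosis ~ s(Number, k = 4, fx = T), family = binomial,
            data = kyphosis)
p.mgcv <- predict(kyp2, type = "response")
ord <- order(kyphosis$Number)
plot(kyphosis$Number[ord], p.gam[ord], type = "l", lty = 2,
     xlab = "Number", ylab = "P(Kyphosis)")
lines(kyphosis$Number[ord], p.mgcv[ord], lty = 1)
legend("topleft", legend = c("package gam", "package mgcv"), lty = 2:1)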
How can we explain this huge difference?
Denis
> On 2005-09-29 06:00, r-help-request at stat.math.ethz.ch wrote:
>
> From: Henric Nilsson <henric.nilsson at statisticon.se>
> Date: 29 September 2005 03:55:19 EDT
> To: ym at climpact.com
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] p-level in packages mgcv and gam
> Reply-To: Henric Nilsson <henric.nilsson at statisticon.se>
>
>
> Yves Magliulo said the following on 2005-09-28 17:05:
>
>> Hi,
>> I'll try to help you; I sent a mail about this subject last week and
>> did not get any response...
>> I'm using gam from package mgcv.
>> 1) How to interpret the significance of smooth terms is hard for me
>> to understand perfectly. Using UBRE, you fix the df, and p-values are
>> estimated from a chi-squared distribution. Using GCV, the best df are
>> estimated by GAM (that's what I want) and p-values
>
>
> This is not correct. The df are estimated in both cases (i.e. UBRE
> and GCV), but the scale parameter is fixed in the UBRE case. Hence,
> UBRE is used by default for family = binomial or poisson, since the
> scale parameter is assumed to be 1. Similarly, GCV is the default
> for family = gaussian, since we most often want the scale (usually
> denoted sigma^2) to be estimated.
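>
> For illustration, a minimal sketch (simulated data, not from the
> original message) showing which score each default produces:
>
>   library(mgcv)
>   set.seed(1)
>   x <- runif(200)
>   eta <- sin(2 * pi * x)
>   yb <- rbinom(200, 1, binomial()$linkinv(eta))  # scale known to be 1
>   yg <- eta + rnorm(200, sd = 0.3)               # scale to be estimated
>   summary(gam(yb ~ s(x), family = binomial))  # reports a UBRE score
>   summary(gam(yg ~ s(x)))                     # reports a GCV score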
>
>
>> are estimated from an F distribution. But in that case they say "use
>> at your own risk" in ?summary.gam
>
>
> The warning applies in both cases. The p-values are conditional on
> the smoothing parameters, and the uncertainty of the smooths is not
> taken into account when computing the p-values.
>
>
>> so you can also look at the Chi.sq, but I don't know how to choose a
>
>
> No...
>
>
>> criterion for it as for p-values... for me, Chi.sq shows the best
>> predictor in a model, but it's hard to reject one with it.
>
>
> Which version of mgcv do you use? The confusion probably stems from
> earlier versions of mgcv (< 1.3-5): the summary and anova methods
> used to have a column denoted Chi.sq even when the displayed
> statistic was computed as F. Recent versions of mgcv give
>
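> (A setup that could have produced the output below -- an assumption
> on my part, based on mgcv's standard gamSim() example, so the exact
> numbers need not match -- would be:
>
>   library(mgcv)
>   set.seed(0)
>   dat <- gamSim(1, n = 400)   # four-smooth example data
>   b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
> )
>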
> > summary(b)
>
> Family: gaussian
> Link function: identity
>
> Formula:
> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>
> Parametric coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Approximate significance of smooth terms:
>         edf Est.rank      F  p-value
> s(x0) 5.173    9.000  3.785 0.000137 ***
> s(x1) 2.357    9.000 34.631  < 2e-16 ***
> s(x2) 8.517    9.000 84.694  < 2e-16 ***
> s(x3) 1.000    1.000  0.444 0.505797
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> R-sq.(adj) = 0.726 Deviance explained = 73.7%
> GCV score = 4.611 Scale est. = 4.4029 n = 400
>
>
> If we assume that the scale is known and fixed at 4.4029, we get
>
> > summary(b, dispersion = 4.4029)
>
> Family: gaussian
> Link function: identity
>
> Formula:
> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>
> Parametric coefficients:
>              Estimate Std. Error z value Pr(>|z|)
> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Approximate significance of smooth terms:
>         edf Est.rank  Chi.sq p-value
> s(x0) 5.173    9.000  34.067 8.7e-05 ***
> s(x1) 2.357    9.000 311.679 < 2e-16 ***
> s(x2) 8.517    9.000 762.255 < 2e-16 ***
> s(x3) 1.000    1.000   0.444   0.505
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> R-sq.(adj) = 0.726 Deviance explained = 73.7%
> GCV score = 4.611 Scale est. = 4.4029 n = 400
>
> Note that t/F changed into z/Chi.sq.
>
>
>> As far as I'm concerned, I use the GCV method and fix a 5% threshold
>> on the null hypothesis (p-value) to select significant predictors.
>> Then I look at my smooths, and if the parametrization looks fine to
>> me, I validate. Generally, for p-values smaller than 0.001 you can
>> be confident; above 0.001, you have to check.
>> 2) For the difference between packages gam and mgcv, I sent a mail
>> about this
>
>
> The underlying algorithms are very different.
>
>
> HTH,
> Henric
>
>
> De : "Liaw, Andy" <andy_liaw at merck.com>
> Date : 28 septembre 2005 14:01:25 HAE
> : 'Denis Chabot' <chabotd at globetrotter.net>, Peter
Dalgaard
> <p.dalgaard at biostat.ku.dk>
> Cc : Thomas Lumley <tlumley at u.washington.edu>, R list <r-
> help at stat.math.ethz.ch>
> Objet : RE: [R] p-level in packages mgcv and gam
>
> Just change the df in what Thomas described to the degree of the
> polynomial, and everything he said still applies. Any good book on
> regression that covers polynomial regression ought to point this out.
>
> Andy
>
>
>> From: Denis Chabot
>>
>> But what about another analogy, that of polynomials? You may not be
>> sure what degree of polynomial to use, and you have not decided
>> before analysing your data. You fit different polynomials to your
>> data, checking whether added degrees increase r2 sufficiently by
>> doing F-tests.
>>
>> I thought it was the same thing with GAMs. You can fit a model with
>> 4 df, and in some cases it is of interest to see if this is a better
>> fit than a linear fit. But why can't you also check whether 7 df is
>> better than 4 df? And if you used mgcv first and it tells you that
>> 7 df is better than 4 df, why bother repeating the comparison of
>> 7 df against 4 df? Why not just take the p-value for the model with
>> 7 df (fixed)?
>>
>> Denis
>
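> (One way to run the comparison Denis describes, using mgcv with fixed
> df -- a sketch with illustrative df values; the kyphosis data ships
> with the gam package:
>
>   library(mgcv)
>   data(kyphosis, package = "gam")
>   f4 <- gam(Kyphosis ~ s(Number, k = 5, fx = TRUE),  # 4 fixed df
>             family = binomial, data = kyphosis)
>   f7 <- gam(Kyphosis ~ s(Number, k = 8, fx = TRUE),  # 7 fixed df
>             family = binomial, data = kyphosis)
>   anova(f4, f7, test = "Chisq")  # nested test, as with polynomials
> )
>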
> From: Peter Dalgaard <p.dalgaard at biostat.ku.dk>
> Date: 28 September 2005 12:04:58 EDT
> To: Thomas Lumley <tlumley at u.washington.edu>
> Cc: Denis Chabot <chabotd at globetrotter.net>, R list
>     <r-help at stat.math.ethz.ch>
> Subject: Re: [R] p-level in packages mgcv and gam
>
> Thomas Lumley <tlumley at u.washington.edu> writes:
>>
>> Bob, on the other hand, chooses the amount of smoothing depending on
>> the data. When a 4 df smooth fits best he ends up with the same model
>> as Alice and the same p-value. When some other df fits best he ends
>> up with a different model and a *smaller* p-value than Alice.
>
> This doesn't actually follow, unless the p-value (directly or
> indirectly) found its way into the definition of "best fit". It does
> show the danger, though.
>
> --
>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
>
> From: Thomas Lumley <tlumley at u.washington.edu>
> Date: 26 September 2005 12:54:43 EDT
> To: Denis Chabot <chabotd at globetrotter.net>
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] p-level in packages mgcv and gam
>
> On Mon, 26 Sep 2005, Denis Chabot wrote:
>
> But the mgcv manual warns that p-level for the smooth can be
> underestimated when df are estimated by the model. Most of the time
> my p-levels are so small that even doubling them would not result in
> a value close to the P=0.05 threshold, but I have one case with
> P=0.033.
>
> I thought, probably naively, that I could run a second model with
> fixed df, using the value of df found in the first model. I could not
> achieve this with mgcv: its gam function does not seem to accept
> fractional values of df (in my case 8.377).
>
> No, this won't work. The problem is the usual one with model
> selection: the p-value is calculated as if the df had been fixed,
> when really it was estimated.
>
> It is likely to be quite hard to get an honest p-value out of
> something that does adaptive smoothing.
>
> -thomas
>
> From: Thomas Lumley <tlumley at u.washington.edu>
> Date: 28 September 2005 11:33:27 EDT
> To: Denis Chabot <chabotd at globetrotter.net>
> Cc: R list <r-help at stat.math.ethz.ch>
> Subject: Re: [R] p-level in packages mgcv and gam
>
> On Wed, 28 Sep 2005, Denis Chabot wrote:
>
> I only got one reply to my message:
>
> No, this won't work. The problem is the usual one with model
> selection: the p-value is calculated as if the df had been fixed,
> when really it was estimated.
>
> It is likely to be quite hard to get an honest p-value out of
> something that does adaptive smoothing.
>
> -thomas
>
> I do not understand this: it seems that a lot of people choose df=4
> for no particular reason, and their p-levels are correct. If instead
> I choose df=8 because a previous model estimated this to be the
> optimal df, my p-levels are no good because the df were estimated?
>
> Yes. I know this sounds strange initially, but it really does make
> sense if you think about it carefully.
>
> Suppose that Alice and Bob are kyphosis researchers, and that Alice
> always chooses 4df for smoothing Age. We would all agree that her
> p-values are correct [in fact we wouldn't, but that is a separate
> issue]
>
> Bob, on the other hand, chooses the amount of smoothing depending
> on the data. When a 4 df smooth fits best he ends up with the same
> model as Alice and the same p-value. When some other df fits best
> he ends up with a different model and a *smaller* p-value than Alice.
>
> In particular, this is still true under the null hypothesis that
> Age has no effect [If Alice and Bob are interested in p-values, the
> null hypothesis must be plausible.]
>
> If Bob's p-values are always less than or equal to Alice's p-values
> under the null hypothesis, and Alice's p-values are less than 0.05
> 5% of the time, then Bob's p-values are less than 0.05 more than 5%
> of the time.
>
>
> -thomas
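>
> (An empirical check of this point -- my sketch, not from Thomas's
> message -- simulates under the null hypothesis and lets mgcv pick
> the df each time:
>
>   library(mgcv)
>   set.seed(1)
>   once <- function() {
>     y <- rbinom(81, 1, 0.2)  # null: the response ignores x
>     x <- runif(81)
>     summary(gam(y ~ s(x), family = binomial))$s.table[1, "p-value"]
>   }
>   p <- replicate(500, once())
>   mean(p < 0.05)  # the argument predicts this exceeds the nominal 0.05
> )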
>
>
> Furthermore, shouldn't packages gam and mgcv give similar results
> when the same data and df are used? I tried this:
>
> library(gam)
> data(kyphosis)
> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
> anova.gam(kyp1)
> anova.gam(kyp2)
> anova.gam(kyp3)
>
> detach(package:gam)
> library(mgcv)
> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial,
>             data=kyphosis)
> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
>             data=kyphosis)
> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial,
>             data=kyphosis)
> anova.gam(kyp4)
> anova.gam(kyp5)
> anova.gam(kyp6)
>
>
> p-levels for these models, by pair:
>
> kyp1 vs kyp4: p = 0.083 and 0.068 respectively (not too bad)
> kyp2 vs kyp5: p = 0.445 and 0.03  (wow!)
> kyp3 vs kyp6: p = 0.053 and 0.008 (wow again)
>
> Also, if you plot all of these, you find that the mgcv plots are
> smoother than the gam plots, even though the same df are used
> throughout.
>
> I am really confused now!
>
> Denis
>
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> tlumley at u.washington.edu    University of Washington, Seattle
>
> From: Thomas Lumley <tlumley at u.washington.edu>
> Date: 28 September 2005 14:35:26 EDT
> To: Denis Chabot <chabotd at globetrotter.net>
> Cc: Peter Dalgaard <p.dalgaard at biostat.ku.dk>, R list
>     <r-help at stat.math.ethz.ch>
> Subject: Re: [R] p-level in packages mgcv and gam
>
> On Wed, 28 Sep 2005, Denis Chabot wrote:
>
> But what about another analogy, that of polynomials? You may not be
> sure what degree of polynomial to use, and you have not decided
> before analysing your data. You fit different polynomials to your
> data, checking whether added degrees increase r2 sufficiently by
> doing F-tests.
>
> Yes, you can. And this procedure gives you incorrect p-values.
>
> They may not be very incorrect -- it depends on how much model
> selection you do, and how strongly the feature you are selecting on
> is related to the one you are testing.
>
> For example, using step() to choose a polynomial in x even when x
> is unrelated to y and z inflates the Type I error rate by giving a
> biased estimate of the residual mean squared error:
>
> once <- function() {
>   ## null data: y is unrelated to both x and z
>   y <- rnorm(50); x <- runif(50); z <- rep(0:1, 25)
>   ## select a polynomial in x by AIC, then extract the p-value for z
>   summary(step(lm(y ~ z),
>                scope = list(lower = ~z,
>                             upper = ~z + x + I(x^2) + I(x^3) + I(x^4)),
>                trace = 0))$coef["z", 4]
> }
> p <- replicate(1000, once())
> mean(p < 0.05)
> [1] 0.072
>
> which is significantly higher than you would expect for an honest
> level 0.05 test.
>
> -thomas