Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv)> This is mgcv 1.3-7Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed <- c(0.000000000, 0.016129032, 0.000000000, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.000000000, 0.000000000, 0.014808824, 0.000000000, 0.222000000, 0.167000000, 0.000000000, 0.000000000, 0.000000000) sed.gam <- gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam)> Family: gaussian > Link function: identity > > Formula: > wm.sed ~ s(Temp) > > Parametric coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 0.08403 0.01347 6.241 3.73e-07 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > Approximate significance of smooth terms: > edf Est.rank F p-value > s(Temp) 1 1 13.95 0.000666 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > GCV score = 0.09904 Scale est. = 0.093686 n = 37# testing non-linear contribution sed.lin <- gam(wm.sed~Temp,weight=N.sets) summary.gam(sed.lin)> Family: gaussian > Link function: identity > > Formula: > wm.sed ~ Temp > > Parametric coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** > Temp -0.023792 0.006369 -3.736 0.000666 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > GCV score = 0.09904 Scale est. = 0.093686 n = 37anova.gam(sed.lin, sed.gam, test="F")> Analysis of Deviance Table > > Model 1: wm.sed ~ Temp > Model 2: wm.sed ~ s(Temp) > Resid. Df Resid. Dev Df Deviance F Pr(>F) > 1 3.5000e+01 3.279 > 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 ***Thanks in advance, Denis Chabot
I think you probably should state more clearly the goal of your analysis. In such situation, estimation and hypothesis testing are quite different. The procedure that gives you the `best' estimate is not necessarily the `best' for testing linearity of components. If your goal is estimation/prediction, why test linearity of components? If the goal is testing linearity, then you probably need to find the procedure that gives you a good test, rather than relying on what gam() gives you. Just my $0.02... Andy> From: Denis Chabot > > Hi, > > I need further help with my GAMs. Most models I test are very > obviously non-linear. Yet, to be on the safe side, I report the > significance of the smooth (default output of mgcv's > summary.gam) and > confirm it deviates significantly from linearity. > > I do the latter by fitting a second model where the same > predictor is > entered without the s(), and then use anova.gam to compare > the two. I > thought this was the equivalent of the default output of anova.gam > using package gam instead of mgcv. > > I wonder if this procedure is correct because one of my models > appears to be linear. In fact mgcv estimates df to be exactly 1.0 so > I could have stopped there. However I inadvertently repeated the > procedure outlined above. I would have thought in this case the > anova.gam comparing the smooth and the linear fit would for > sure have > been not significant. To my surprise, P was 6.18e-09! > > Am I doing something wrong when I attempt to confirm the non- > parametric part a smoother is significant? Here is my example case > where the relationship does appear to be linear: > > library(mgcv) > > This is mgcv 1.3-7 > Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, > 0.38, 0.62, > 0.88, 1.12, > 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, > 3.62, 3.88, > 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, > 6.38, 6.62, 6.88, > 7.12, 8.38, 13.62) > N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, > 22, 26, 24, 23, > 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, > 1, 1, 1) > wm.sed <- c(0.000000000, 0.016129032, 0.000000000, 0.062046512, > 0.396459596, 0.189082949, > 0.054757925, 0.142810440, 0.168005168, 0.180804428, > 0.111439628, 0.128799505, > 0.193707937, 0.105921610, 0.103497845, 0.028591837, > 0.217894389, 0.020535469, > 0.080389068, 0.105234450, 0.070213450, 0.050771363, > 0.042074434, 0.102348837, > 0.049748344, 0.019100478, 0.005203125, 0.101711864, > 0.000000000, 0.000000000, > 0.014808824, 0.000000000, 0.222000000, 0.167000000, > 0.000000000, 0.000000000, > 0.000000000) > > sed.gam <- gam(wm.sed~s(Temp),weight=N.sets) > summary.gam(sed.gam) > > Family: gaussian > > Link function: identity > > > > Formula: > > wm.sed ~ s(Temp) > > > > Parametric coefficients: > > Estimate Std. Error t value Pr(>|t|) > > (Intercept) 0.08403 0.01347 6.241 3.73e-07 *** > > --- > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > > > Approximate significance of smooth terms: > > edf Est.rank F p-value > > s(Temp) 1 1 13.95 0.000666 *** > > --- > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > > GCV score = 0.09904 Scale est. = 0.093686 n = 37 > > # testing non-linear contribution > sed.lin <- gam(wm.sed~Temp,weight=N.sets) > summary.gam(sed.lin) > > Family: gaussian > > Link function: identity > > > > Formula: > > wm.sed ~ Temp > > > > Parametric coefficients: > > Estimate Std. Error t value Pr(>|t|) > > (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** > > Temp -0.023792 0.006369 -3.736 0.000666 *** > > --- > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > > > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > > GCV score = 0.09904 Scale est. = 0.093686 n = 37 > anova.gam(sed.lin, sed.gam, test="F") > > Analysis of Deviance Table > > > > Model 1: wm.sed ~ Temp > > Model 2: wm.sed ~ s(Temp) > > Resid. Df Resid. Dev Df Deviance F Pr(>F) > > 1 3.5000e+01 3.279 > > 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** > > > Thanks in advance, > > > Denis Chabot > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > >
Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John -------------------------------- John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox --------------------------------> -----Original Message----- > From: r-help-bounces at stat.math.ethz.ch > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Denis Chabot > Sent: Wednesday, October 05, 2005 8:22 AM > To: r-help at stat.math.ethz.ch > Subject: [R] testing non-linear component in mgcv:gam > > Hi, > > I need further help with my GAMs. Most models I test are very > obviously non-linear. Yet, to be on the safe side, I report > the significance of the smooth (default output of mgcv's > summary.gam) and confirm it deviates significantly from linearity. > > I do the latter by fitting a second model where the same > predictor is entered without the s(), and then use anova.gam > to compare the two. I thought this was the equivalent of the > default output of anova.gam using package gam instead of mgcv. > > I wonder if this procedure is correct because one of my > models appears to be linear. In fact mgcv estimates df to be > exactly 1.0 so I could have stopped there. However I > inadvertently repeated the procedure outlined above. I would > have thought in this case the anova.gam comparing the smooth > and the linear fit would for sure have been not significant. > To my surprise, P was 6.18e-09! > > Am I doing something wrong when I attempt to confirm the non- > parametric part a smoother is significant? Here is my example > case where the relationship does appear to be linear: > > library(mgcv) > > This is mgcv 1.3-7 > Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, > 0.38, 0.62, 0.88, 1.12, > 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, > 3.38, 3.62, 3.88, > 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, > 6.12, 6.38, 6.62, 6.88, > 7.12, 8.38, 13.62) > N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, > 29, 31, 22, 26, 24, 23, > 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, > 1, 1, 1, 1, 1) wm.sed <- c(0.000000000, 0.016129032, > 0.000000000, 0.062046512, 0.396459596, 0.189082949, > 0.054757925, 0.142810440, 0.168005168, > 0.180804428, 0.111439628, 0.128799505, > 0.193707937, 0.105921610, 0.103497845, > 0.028591837, 0.217894389, 0.020535469, > 0.080389068, 0.105234450, 0.070213450, > 0.050771363, 0.042074434, 0.102348837, > 0.049748344, 0.019100478, 0.005203125, > 0.101711864, 0.000000000, 0.000000000, > 0.014808824, 0.000000000, 0.222000000, > 0.167000000, 0.000000000, 0.000000000, > 0.000000000) > > sed.gam <- gam(wm.sed~s(Temp),weight=N.sets) > summary.gam(sed.gam) > > Family: gaussian > > Link function: identity > > > > Formula: > > wm.sed ~ s(Temp) > > > > Parametric coefficients: > > Estimate Std. Error t value Pr(>|t|) > > (Intercept) 0.08403 0.01347 6.241 3.73e-07 *** > > --- > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > > > Approximate significance of smooth terms: > > edf Est.rank F p-value > > s(Temp) 1 1 13.95 0.000666 *** > > --- > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > > GCV score = 0.09904 Scale est. = 0.093686 n = 37 > > # testing non-linear contribution > sed.lin <- gam(wm.sed~Temp,weight=N.sets) > summary.gam(sed.lin) > > Family: gaussian > > Link function: identity > > > > Formula: > > wm.sed ~ Temp > > > > Parametric coefficients: > > Estimate Std. Error t value Pr(>|t|) > > (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** > > Temp -0.023792 0.006369 -3.736 0.000666 *** > > --- > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > > > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > > GCV score = 0.09904 Scale est. = 0.093686 n = 37 > anova.gam(sed.lin, sed.gam, test="F") > > Analysis of Deviance Table > > > > Model 1: wm.sed ~ Temp > > Model 2: wm.sed ~ s(Temp) > > Resid. Df Resid. Dev Df Deviance F Pr(>F) > > 1 3.5000e+01 3.279 > > 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** > > > Thanks in advance, > > > Denis Chabot > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html
In this case the models being compared are really identical, and the P-value is meaningless numerical noise. If your main focus is hypothesis testing and you really need near exact p-values, then do this sort of testing using unpenalized models. i.e. don't have mgcv::gam estimate the EDF of the smooth, just use `s(...,fx=TRUE)' to estimate it unpenalized. This gives horrible point estimates and excellent p-values. best, Simon> Hi, > > I need further help with my GAMs. Most models I test are very > obviously non-linear. Yet, to be on the safe side, I report the > significance of the smooth (default output of mgcv's summary.gam) and > confirm it deviates significantly from linearity. > > I do the latter by fitting a second model where the same predictor is > entered without the s(), and then use anova.gam to compare the two. I > thought this was the equivalent of the default output of anova.gam > using package gam instead of mgcv. > > I wonder if this procedure is correct because one of my models > appears to be linear. In fact mgcv estimates df to be exactly 1.0 so > I could have stopped there. However I inadvertently repeated the > procedure outlined above. I would have thought in this case the > anova.gam comparing the smooth and the linear fit would for sure have > been not significant. To my surprise, P was 6.18e-09! > > Am I doing something wrong when I attempt to confirm the non- > parametric part a smoother is significant? Here is my example case > where the relationship does appear to be linear: > > library(mgcv) > > This is mgcv 1.3-7 > Temp <- c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, > 0.88, 1.12, > 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, > 3.62, 3.88, > 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, > 6.38, 6.62, 6.88, > 7.12, 8.38, 13.62) > N.sets <- c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, > 22, 26, 24, 23, > 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, > 1, 1, 1) > wm.sed <- c(0.000000000, 0.016129032, 0.000000000, 0.062046512, > 0.396459596, 0.189082949, > 0.054757925, 0.142810440, 0.168005168, 0.180804428, > 0.111439628, 0.128799505, > 0.193707937, 0.105921610, 0.103497845, 0.028591837, > 0.217894389, 0.020535469, > 0.080389068, 0.105234450, 0.070213450, 0.050771363, > 0.042074434, 0.102348837, > 0.049748344, 0.019100478, 0.005203125, 0.101711864, > 0.000000000, 0.000000000, > 0.014808824, 0.000000000, 0.222000000, 0.167000000, > 0.000000000, 0.000000000, > 0.000000000) > > sed.gam <- gam(wm.sed~s(Temp),weight=N.sets) > summary.gam(sed.gam) > > Family: gaussian > > Link function: identity > > > > Formula: > > wm.sed ~ s(Temp) > > > > Parametric coefficients: > > Estimate Std. Error t value Pr(>|t|) > > (Intercept) 0.08403 0.01347 6.241 3.73e-07 *** > > --- > > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > > > Approximate significance of smooth terms: > > edf Est.rank F p-value > > s(Temp) 1 1 13.95 0.000666 *** > > --- > > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > > GCV score = 0.09904 Scale est. = 0.093686 n = 37 > > # testing non-linear contribution > sed.lin <- gam(wm.sed~Temp,weight=N.sets) > summary.gam(sed.lin) > > Family: gaussian > > Link function: identity > > > > Formula: > > wm.sed ~ Temp > > > > Parametric coefficients: > > Estimate Std. Error t value Pr(>|t|) > > (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** > > Temp -0.023792 0.006369 -3.736 0.000666 *** > > --- > > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > > > > > R-sq.(adj) = 0.554 Deviance explained = 28.5% > > GCV score = 0.09904 Scale est. = 0.093686 n = 37 > anova.gam(sed.lin, sed.gam, test="F") > > Analysis of Deviance Table > > > > Model 1: wm.sed ~ Temp > > Model 2: wm.sed ~ s(Temp) > > Resid. Df Resid. Dev Df Deviance F Pr(>F) > > 1 3.5000e+01 3.279 > > 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** > > > Thanks in advance, > > > Denis Chabot > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >