aditya gangadharan
2006-Dec-04  12:30 UTC
[R] GAM model selection and dropping terms based on GCV
Hello,

I have a question regarding model selection and dropping of terms for GAMs fitted with package mgcv. I am following the approach suggested in Wood (2001), Wood and Augustin (2002).

I fitted a saturated model, and I find from the plots that for two of the covariates:

1. The confidence interval includes 0 almost everywhere
2. The degrees of freedom are NOT close to 1
3. The partial residuals from plot.gam don't show much pattern visually (to me)
4. When I drop either or both of the terms, the GCV score increases

This is my main problem: how much of an increase in GCV is "acceptable" when terms are dropped? In the above case, the delta GCV scores are .03, .06 and .11 when I drop covariate A, covariate B and both respectively from the full model.

I would be very grateful for any advice on this.

Thank you
Best Wishes
Aditya
On Monday 04 December 2006 12:30, aditya gangadharan wrote:
> Hello,
> I have a question regarding model selection and dropping of terms for GAMs
> fitted with package mgcv. I am following the approach suggested in Wood
> (2001), Wood and Augustin (2002).
>
> I fitted a saturated model, and I find from the plots that for two of the
> covariates,
> 1. The confidence interval includes 0 almost everywhere
> 2. The degrees of freedom are NOT close to 1
> 3. The partial residuals from plot.gam don't show much pattern visually (to me)
> 4. When I drop either or both of the terms, the GCV score increases;
>
> This is my main problem: how much of an increase in GCV is "acceptable"
> when terms are dropped? In the above case, the delta GCV scores are .03,
> .06 and .11 when I drop covariate A, covariate B and both respectively from
> the full model. I would be very grateful for any advice on this.

- I'm not sure that there is really an answer to this. GCV is based on minimizing some approximation to the expected prediction error of the model. So to answer the question you'd need to decide how much of an increase over the `optimal' prediction error you would be prepared to tolerate. I think that it's not all that easy to come up with a nice way of blending prediction-error-based approaches to model selection with approaches based on finding the simplest model that is somehow consistent with the data (but perhaps other people will comment on this).

- That said, there is certainly an issue relating to the fact that the GCV score (or AIC, in fact) is rather asymmetric, so that random variability in the score tends to lead more readily to overfitting than to underfitting. This suggests that prediction error performance at finite sample sizes may in fact be improved by shrinking the smoothing parameters themselves. With `mgcv::gam' you can do this by increasing the `gamma' parameter above its default value, which favours smoother models by making each model degree of freedom count as gamma degrees of freedom in the GCV score (or AIC/UBRE). It is possible to choose `gamma' by e.g. 10-fold cross-validation, but that requires some coding.

- There are more discussions of GAM model selection in various mgcv help files and my book. See help("mgcv-package") for details of which pages, and the reference.

My bottom line on model selection is to use things like GCV, AIC, confidence interval coverage and approximate p-values for guidance, but not as the basis for rules... modelling context has to play a part as well.

Sorry if that's all a bit vague.

Simon

--
> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
> +44 1225 386603 www.maths.bath.ac.uk/~sw283
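
For readers who want to try the suggestions in the reply, here is a minimal sketch rather than a recipe from the thread. It assumes simulated data from `mgcv::gamSim` as a stand-in for the poster's data set; the variable names `x0` to `x3`, the grid of `gamma` values, and the fold assignment are all illustrative choices. It compares GCV scores after dropping terms, then picks `gamma` by a simple 10-fold cross-validation on mean squared prediction error.

```r
library(mgcv)

set.seed(1)
# Stand-in data (NOT the poster's): y depends smoothly on x0, x1, x2;
# x3 has no effect in this simulation.
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

# Helper: extract the GCV score from a fitted gam as a plain number.
gcv_score <- function(m) as.numeric(m$gcv.ubre)

# Full model and two reduced models, smoothing parameters chosen by GCV.
full    <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "GCV.Cp")
drop_x3 <- gam(y ~ s(x0) + s(x1) + s(x2),         data = dat, method = "GCV.Cp")
drop_x2 <- gam(y ~ s(x0) + s(x1) + s(x3),         data = dat, method = "GCV.Cp")

gcv <- c(full = gcv_score(full),
         drop_x3 = gcv_score(drop_x3),
         drop_x2 = gcv_score(drop_x2))
delta_gcv <- gcv - gcv["full"]   # change in GCV relative to the full model
print(delta_gcv)

# Choosing gamma by 10-fold cross-validation (illustrative grid).
gammas <- c(1, 1.2, 1.4, 1.6, 2)
k <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

cv_mse <- sapply(gammas, function(g) {
  mean(sapply(seq_len(k), function(i) {
    train <- dat[fold != i, ]
    test  <- dat[fold == i, ]
    fit   <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
                 data = train, method = "GCV.Cp", gamma = g)
    pred  <- predict(fit, newdata = test)
    mean((test$y - pred)^2)      # held-out mean squared prediction error
  }))
})
names(cv_mse) <- gammas
best_gamma <- gammas[which.min(cv_mse)]
print(cv_mse)
print(best_gamma)
```

The differences in `delta_gcv` are on the scale of the response variance, so, in line with the reply, they are better read alongside the plots and approximate p-values from `summary(full)` than compared against a fixed threshold.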