scarrizo at cs.usyd.edu.au
2006-Jun-18 13:12 UTC
[R] GAM selection error msgs (mgcv & gam packages)
Hi all,

My question concerns two error messages, one in the gam package and one in the mgcv package (see below). I have read the help files and the Chambers and Hastie book but am failing to understand how I can solve this problem. Could you please tell me what I must adjust so that the commands do not generate the error messages?

I am trying to achieve model selection for a GAM which is required for prediction purposes, thus my focus is on AIC. My data set has 3038 records, 116 predictor variables and a binary response variable [0 or 1]. There is no current understanding of the predictors' relationship to the response, so I am relying on the GAM for selection of appropriate predictors.

Thanks
Savrina

*mgcv package 1.3-12:

# I start with specifying the full model with 116 predictors, including an isotropic smooth of the 3D location variables (when I specify only the first 14 predictors I get no error message)
> m0<-gam(label~s(x,y,z,k=50)+s(feature4)+s(feature5)+s(feature6)+...+s(feature116),data=k.data, family=binomial)

Error in smooth.construct.tp.smooth.spec(object, data, knots):
A term has fewer unique covariate combinations than specified maximum degrees of freedom

# I was going to follow this with backwards selection by hypothesis testing (remove the highest p-value term one at a time) and also AIC comparison of all the models

From the help file entitled 'Generalised additive models with integrated smoothness estimation' I calculated the following; where do I go from here?

A) "k is the basis dimension of a given term...if k is not specified k=10*3^(d-1) where 'd' is the number of covariates for this term"
My calculations: for all my terms but the first d=1 thus k=10*3^0=10.
B) "You must have more unique combinations of covariates than the model has total parameters"
My calculations: total parameters = sum of basis dimensions (50+10*113) + sum of non-spline terms (0) - number of spline terms (114) = 1066

*gam package:

I think the stepwise selection provided by the gam package would be useful in finding the best predictive model. I follow the example on pg 283 of 'Statistical models in S', Chambers and Hastie 1993.

# I start with a full model where all predictors enter linearly
> k.start<-gam(label~., data=k.data, family=binomial)

# set up scope list with possibilities for each term eg .~1 + x + s(x)
# ignore the first column of the data set
> k.scope<-gam.scope(k.data[,-1])

# start stepwise selection
> k.step<-step(k.start,k.scope)

# condensed output
Start:  AIC=1549.48
label~s+y+z+feature4+feature5+...+feature116

             Df Deviance    AIC
<none>            1319.5 1549.5
- feature54  -1   1319.2 1551.2
- feature26  -1   1319.2 1551.2
...
- feature12  -1   1357.4 1589.4
There were 50 or more warnings (use warnings() to see the first 50)

# all 50 warnings are the same
> warnings()
Warning messages:
1: fitted probabilities numerically 0 or 1 occurred in: glm.fit(x[, jj, drop = FALSE], y, wt, offset = object$offset, ...

# it seems not to get past the original linear model. It should show all the steps taken to the final model
> k.step$anova
  Step Df Deviance Resid. Df Resid. Dev      AIC
1      NA       NA      2922   1317.599 1549.599
> My question concerns 2 error messages; one in the gam package and one in
> the mgcv package (see below). I have read help files and Chambers and
> Hastie book but am failing to understand how I can solve this problem.
> Could you please tell me what I must adjust so that the command does not
> generate error message?
>
> I am trying to achieve model selection for a GAM which is required for
> prediction purposes, thus my focus is on AIC. My data set has 3038 records
> and 116 predictor variables and a binary response variable [0 or 1]. There
> is no current understanding of the predictors' relationship to response so
> I am relying on GAM for selection of appropriate predictors.

- I have some worries about using a GAM in this sort of situation. It seems like an odd model to start from to me: you don't know the relationship to the covariates, but do know that it should be additive? Is that really true? If it is, then it may still be a lot to ask of the model selection methods to find a good model. (I'd certainly consider upping the `gamma' parameter in mgcv::gam.)

- General uneasiness apart, the specific error message relates to the number of distinct covariate values that you have (or the number of distinct x,y,z triplets). Do any of the covariates for single smooths have fewer than 10 distinct values? There are more than 50 distinct x,y,z triplets, I suppose? If you have fewer distinct covariate points for a smooth than the default k (10), then you need to reduce k to the number of distinct points, or fewer.

- Finally, for speed reasons, I'd use the "cr" basis (see ?s) if doing this.
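The check described above could be sketched in R roughly as follows. This is only a sketch under assumptions: the toy data frame stands in for k.data, the helper names n_distinct and smooth_term are hypothetical, falling back to a linear term below 3 distinct values is an illustrative choice, and gamma = 1.4 is just one example of a value above the default of 1.

```r
# Toy stand-in for k.data (assumption: the real data are not available here).
set.seed(1)
toy <- data.frame(
  label    = rbinom(200, 1, 0.5),
  feature4 = seq(0, 1, length.out = 200),   # 200 distinct values
  feature5 = rep(1:6, length.out = 200),    # only 6 distinct values
  feature6 = rep(0:1, length.out = 200)     # only 2 distinct values
)

# Hypothetical helper: number of distinct values of each covariate.
n_distinct <- function(data, vars) {
  vapply(vars, function(v) length(unique(data[[v]])), integer(1))
}

# Hypothetical helper: write one model term per covariate, capping k at the
# number of distinct values; covariates with very few distinct values
# enter linearly instead of as a smooth.
smooth_term <- function(v, n_unique, k_default = 10, bs = "cr") {
  if (n_unique < 3) return(v)
  k <- as.integer(min(k_default, n_unique))
  sprintf('s(%s, k = %d, bs = "%s")', v, k, bs)
}

vars      <- c("feature4", "feature5", "feature6")
counts    <- n_distinct(toy, vars)
rhs_terms <- mapply(smooth_term, vars, counts)
form      <- as.formula(paste("label ~", paste(rhs_terms, collapse = " + ")))
print(counts)
print(form)

# Fit only if mgcv is installed; gamma > 1 penalises model complexity more.
if (requireNamespace("mgcv", quietly = TRUE)) {
  fit <- mgcv::gam(form, data = toy, family = binomial, gamma = 1.4)
  print(summary(fit))
}
```

Here feature5 has only 6 distinct values, so its smooth is written with k = 6 rather than the default 10, and the binary feature6 enters as a plain linear term. The same count, applied to the real predictors, would show which terms trigger the "fewer unique covariate combinations" error.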
best,
Simon

Simon Wood, Mathematical Sciences, University of Bath, Bath BA2 7AY
+44 (0)1225 386603  www.maths.bath.ac.uk/~sw283/
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html