Robert Michael Inman
2009-Jan-26 12:49 UTC
[R] glm StepAIC with all interactions and update to remove a term vs. glm specifying all but a few terms and stepAIC
Problem: I am sorting through model selection process for first time and want to make sure that I have used glm, stepAIC, and update correctly. Something is strange because I get a different result between: 1) a glm of 12 predictor variables followed by a stepAIC where all interactions are considered and then an update to remove one specific interaction. vs. 2) entering all the terms individually in a glm (exept the one that I removed with update and 4 others like it but which did not make it to final model anyway), and then running stepAIC. Question: Why do these processes not yield same model? Here are all the details if helpful: I start with 12 potential predictor variables, 7 "primary" terms and 5 additional that are I(primary_terms^2). I run a glm for these 12 and then do stepAIC (BIC actually) both directions. The scope argument is scope=list(upper=~.^2,lower=NULL). This means there are 78 predictor terms considered, the 12 primary terms and 66 interactions [n(n+1)/2]. I see this with trace=T also. Here is the code used:>glm1<-glm(formula = PRESENCE == "1" ~ SNOW + I(SNOW^2) + POP_DEN + ROAD_DE+ ADJELEV + I(ADJELEV^2) + TRI + I(TRI^2) + EDGE + I(EDGE^2) + TREECOV + I(TREECOV^2),family = binomial, data = wolv) summary(glm1)>library(MASS) >stepglm2<-stepAIC(glm1,scope=list(upper=~.^2,lower=NULL),trace=T,k=log(4828),direction="both")> summary(stepglm2) > extractAIC(stepglm2,k=log(4828))This results in a 15 term model with a BIC of 3758.659> Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) -4.983e+01 9.263e+00 -5.379 7.50e-08 *** > SNOW 6.085e-02 8.641e-03 7.041 1.90e-12 *** > ROAD_DE -5.637e-01 1.192e-01 -4.730 2.24e-06 *** > ADJELEV 2.880e-02 7.457e-03 3.863 0.000112 *** > I(ADJELEV^2) -4.038e-06 1.487e-06 -2.715 0.006618 ** > TRI 5.675e-02 1.081e-02 5.248 1.54e-07 *** > I(TRI^2) -1.713e-03 4.243e-04 -4.036 5.43e-05 *** > EDGE 6.418e-03 1.697e-03 3.782 0.000156 *** > TREECOV 1.680e-01 2.929e-02 5.735 9.76e-09 *** > SNOW:ADJELEV -4.313e-05 6.935e-06 -6.219 5.00e-10 *** > ADJELEV:TREECOV -6.628e-05 1.161e-05 -5.711 1.13e-08 *** > SNOW:I(ADJELEV^2) 7.437e-09 1.384e-09 5.373 7.74e-08 *** > TRI:I(TRI^2) 1.321e-06 3.419e-07 3.863 0.000112 *** > I(ADJELEV^2):I(TRI^2) -2.127e-10 5.745e-11 -3.702 0.000214 *** > ADJELEV:I(TRI^2) 1.029e-06 3.004e-07 3.424 0.000617 *** > SNOW:TRI 1.057e-05 3.372e-06 3.135 0.001721 **The final model included a the TRI:I(TRI^2) term, which is effectively a cubic function. So this was removed because cubic's were not considered for all variables. I used update to remove TRI:I(TRI^2). Code:>stepglm3<-update(stepglm2,~.-TRI:I(TRI^2),trace=T) > summary(stepglm3) > extractAIC(stepglm3,k=log(4828))This results in a 14 term model with a BIC of 3770.172. The BIC is a little higher, but the cubic term improved fit and is no longer in, so expected.>Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) -5.329e+01 9.267e+00 -5.750 8.92e-09 *** > SNOW 6.241e-02 8.695e-03 7.178 7.06e-13 *** > ROAD_DE -5.756e-01 1.184e-01 -4.863 1.16e-06 *** > ADJELEV 3.233e-02 7.452e-03 4.338 1.44e-05 *** > I(ADJELEV^2) -4.724e-06 1.487e-06 -3.177 0.001489 ** > TRI 1.834e-02 5.402e-03 3.395 0.000687 *** > I(TRI^2) -1.122e-03 3.920e-04 -2.863 0.004190 ** > EDGE 6.344e-03 1.690e-03 3.754 0.000174 *** > TREECOV 1.745e-01 2.923e-02 5.969 2.39e-09 *** > SNOW:ADJELEV -4.444e-05 6.984e-06 -6.363 1.98e-10 *** > ADJELEV:TREECOV -6.885e-05 1.160e-05 -5.937 2.90e-09 *** > SNOW:I(ADJELEV^2) 7.681e-09 1.395e-09 5.506 3.67e-08 *** > I(ADJELEV^2):I(TRI^2) -1.839e-10 5.692e-11 -3.232 0.001231 ** > ADJELEV:I(TRI^2) 8.860e-07 2.974e-07 2.979 0.002892 ** > SNOW:TRI 1.219e-05 3.260e-06 3.740 0.000184 ***This all seems to be as it should. I then decided to try and confim this result by running a glm without any of the 5 potential cubic terms ( note - TRI:I(TRI^2) was the only one that made it into the final model but there were 5 potential). After entering the 73 potential terms (12 primary vaiables and now 66 minus 5 interactions = 73 total), the glm and stepAIC produces a completely different final model. It has 8 variables that were not in the model that was chosen with scope statement and manually removing TRI:TRI^2, and it is missing 7 variables that were in the model chosen with the scope statement. It has 8 variables that were in both. Code and Result:>glmalt1b<-glm(formula = PRESENCE =="1" ~SNOW+SNOW:POP_DEN+SNOW:ROAD_DE+SNOW:ADJELEV+SNOW:I(ADJELEV^2)+SNOW:TRI+SNOW: I(TRI^2)+SNOW:EDGE+SNOW:I(EDGE^2)+SNOW:TREECOV+SNOW:I(TREECOV^2)+I(SNOW^2)+I (SNOW^2):POP_DEN+>I(SNOW2):ROAD_DE+I(SNOW^2):ADJELEV+I(SNOW^2):I(ADJELEV^2)+I(SNOW^2):TRI+I(SN OW^2):I(TRI^2)+I(SNOW^2):EDGE+I(SNOW^2):I(EDGE^2)+I(SNOW^2):TREECOV+I(SNOW^2 ):I(TREECOV^2)+POP_DEN+POP_DEN:ROAD_DE+>POP_DEN:ADJELEV+POP_DEN:I(ADJELEV^2)+POP_DEN:TRI+POP_DEN:I(TRI^2)+POP_DEN:ED GE+POP_DEN:I(EDGE^2)+POP_DEN:TREECOV+POP_DEN:I(TREECOV^2)+ROAD_DE+ROAD_DE:AD JELEV+ROAD_DE:I(ADJELEV^2)+ROAD_DE:TRI+>ROAD_DE:I(TRI^2)+ROAD_DE:EDGE+ROAD_DE:I(EDGE^2)+ROAD_DE:TREECOV+ROAD_DE:I(TR EECOV^2)+ADJELEV+ADJELEV:TRI+ADJELEV:I(TRI^2)+ADJELEV:EDGE+ADJELEV:I(EDGE^2) +ADJELEV:TREECOV+ADJELEV:I(TREECOV^2)+I(ADJELEV^2)+>I(ADJELEV^2):TRI+I(ADJELEV^2):I(TRI^2)+I(ADJELEV^2):EDGE+I(ADJELEV^2):I(EDGE ^2)+I(ADJELEV^2):TREECOV+I(ADJELEV^2):I(TREECOV^2)+TRI+TRI:EDGE+TRI:I(EDGE^2 )+TRI:TREECOV+TRI:I(TREECOV^2)+I(TRI^2)+>I(TRI^2):EDGE+I(TRI^2):I(EDGE^2)+I(TRI^2):TREECOV+I(TRI^2):I(TREECOV^2)+EDGE +EDGE:TREECOV+EDGE:I(TREECOV^2)+I(EDGE^2)+I(EDGE^2):TREECOV+I(EDGE^2):I(TREE COV^2)+TREECOV+I(TREECOV^2), family=binomial, data=wolv)> summary(glmalt1b) > stepglmalt2b<-stepAIC(glmalt1b, trace=T, k=log(4828),direction="both") #BIC> summary(stepglmalt2b) > extractAIC(stepglmalt2b,k=log(4828)) > >Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) -1.995e+01 7.499e+00 -2.660 0.007819 ** > SNOW 1.641e-02 4.881e-03 3.363 0.000772 *** > I(SNOW^2) 2.238e-05 4.729e-06 4.732 2.22e-06 *** > ROAD_DE -5.619e-01 1.187e-01 -4.733 2.21e-06 *** > ADJELEV 4.361e-03 5.876e-03 0.742 0.457966 > I(ADJELEV^2) 1.001e-06 1.165e-06 0.859 0.390257 > TRI -1.982e-01 6.066e-02 -3.268 0.001083 ** > I(TRI^2) -6.842e-05 1.868e-05 -3.664 0.000249 *** > I(EDGE^2) 6.321e-05 2.119e-05 2.983 0.002857 ** > I(TREECOV^2) 2.947e-03 4.984e-04 5.912 3.38e-09 *** > SNOW:ADJELEV -6.244e-06 1.959e-06 -3.187 0.001438 ** > SNOW:TRI 1.018e-05 3.403e-06 2.991 0.002778 ** > I(SNOW^2):ADJELEV -1.852e-08 3.477e-09 -5.326 1.00e-07 *** > I(SNOW^2):I(ADJELEV^2) 3.726e-12 6.771e-13 5.503 3.73e-08 *** > ADJELEV:TRI 1.887e-04 4.895e-05 3.855 0.000116 *** > I(ADJELEV^2):TRI -4.010e-08 9.697e-09 -4.135 3.55e-05 *** > I(ADJELEV^2):I(TREECOV^2) -4.532e-10 7.727e-11 -5.865 4.48e-09 ***If anyone can tell me why this is different I would greatly appreciate it. Also, why does this last model include terms that are not significant? Thanks Bob
Michael Dewey
2009-Jan-27 13:57 UTC
[R] glm StepAIC with all interactions and update to remove a term vs. glm specifying all but a few terms and stepAIC
At 12:49 26/01/2009, Robert Michael Inman wrote:>Problem: >I am sorting through model selection process for first time and want to make >sure that I have used glm, stepAIC, and update correctly. Something is >strange because I get a different result between: > >1) a glm of 12 predictor variables followed by a stepAIC where all >interactions are considered and then an update to remove one specific >interaction. > >vs. > >2) entering all the terms individually in a glm (exept the one that I >removed with update and 4 others like it but which did not make it to final >model anyway), and then running stepAIC.I am not the world's leading expert on this area but nobody else seems to have replied yet so here goes: 1 - stepwise methods capitalise on chance features of your dataset and so starting from a slightly different place may give different results. For instance if you do all possible subsets then the 'best' subset of size k is not guaranteed to include the members of the 'best' subset of size j (j<k) and indeed may not include any of them. 2 - the lack of significance of some predictors in the last model is probably because stepAIC respects marginality, certainly MASS implies this. You might find @book{miller90, author = "Miller, A J", title = "Subset selection in regression", year = "1990", publisher = "Chapman and Hall", address = "London", keywords= {stepwise} } helpful. And MASS of course since the package is support for the book.>Question: >Why do these processes not yield same model? > > > >Here are all the details if helpful: >I start with 12 potential predictor variables, 7 "primary" terms and 5 >additional that are I(primary_terms^2). I run a glm for these 12 and then >do stepAIC (BIC actually) both directions. The scope argument is >scope=list(upper=~.^2,lower=NULL). This means there are 78 predictor terms >considered, the 12 primary terms and 66 interactions [n(n+1)/2]. I see this >with trace=T also. Here is the code used: > > >glm1<-glm(formula = PRESENCE == "1" ~ SNOW + I(SNOW^2) + POP_DEN + ROAD_DE >+ ADJELEV + I(ADJELEV^2) + TRI + I(TRI^2) + EDGE + I(EDGE^2) + TREECOV + >I(TREECOV^2),family = binomial, data = wolv) > summary(glm1) > >library(MASS) > >stepglm2<-stepAIC(glm1,scope=list(upper=~.^2,lower=NULL), >trace=T,k=log(4828),direction="both") > > summary(stepglm2) > > extractAIC(stepglm2,k=log(4828)) > >This results in a 15 term model with a BIC of 3758.659 > > > Coefficients: > > Estimate Std. Error z value Pr(>|z|) > > (Intercept) -4.983e+01 9.263e+00 -5.379 7.50e-08 *** > > SNOW 6.085e-02 8.641e-03 7.041 1.90e-12 *** > > ROAD_DE -5.637e-01 1.192e-01 -4.730 2.24e-06 *** > > ADJELEV 2.880e-02 7.457e-03 3.863 0.000112 *** > > I(ADJELEV^2) -4.038e-06 1.487e-06 -2.715 0.006618 ** > > TRI 5.675e-02 1.081e-02 5.248 1.54e-07 *** > > I(TRI^2) -1.713e-03 4.243e-04 -4.036 5.43e-05 *** > > EDGE 6.418e-03 1.697e-03 3.782 0.000156 *** > > TREECOV 1.680e-01 2.929e-02 5.735 9.76e-09 *** > > SNOW:ADJELEV -4.313e-05 6.935e-06 -6.219 5.00e-10 *** > > ADJELEV:TREECOV -6.628e-05 1.161e-05 -5.711 1.13e-08 *** > > SNOW:I(ADJELEV^2) 7.437e-09 1.384e-09 5.373 7.74e-08 *** > > TRI:I(TRI^2) 1.321e-06 3.419e-07 3.863 0.000112 *** > > I(ADJELEV^2):I(TRI^2) -2.127e-10 5.745e-11 -3.702 0.000214 *** > > ADJELEV:I(TRI^2) 1.029e-06 3.004e-07 3.424 0.000617 *** > > SNOW:TRI 1.057e-05 3.372e-06 3.135 0.001721 ** > > > >The final model included a the TRI:I(TRI^2) term, which is effectively a >cubic function. So this was removed because cubic's were not considered for >all variables. I used update to remove TRI:I(TRI^2). Code: > > >stepglm3<-update(stepglm2,~.-TRI:I(TRI^2),trace=T) > > summary(stepglm3) > > extractAIC(stepglm3,k=log(4828)) > >This results in a 14 term model with a BIC of 3770.172. The BIC is a little >higher, but the cubic term improved fit and is no longer in, so expected. > > >Coefficients: > > Estimate Std. Error z value Pr(>|z|) > > (Intercept) -5.329e+01 9.267e+00 -5.750 8.92e-09 *** > > SNOW 6.241e-02 8.695e-03 7.178 7.06e-13 *** > > ROAD_DE -5.756e-01 1.184e-01 -4.863 1.16e-06 *** > > ADJELEV 3.233e-02 7.452e-03 4.338 1.44e-05 *** > > I(ADJELEV^2) -4.724e-06 1.487e-06 -3.177 0.001489 ** > > TRI 1.834e-02 5.402e-03 3.395 0.000687 *** > > I(TRI^2) -1.122e-03 3.920e-04 -2.863 0.004190 ** > > EDGE 6.344e-03 1.690e-03 3.754 0.000174 *** > > TREECOV 1.745e-01 2.923e-02 5.969 2.39e-09 *** > > SNOW:ADJELEV -4.444e-05 6.984e-06 -6.363 1.98e-10 *** > > ADJELEV:TREECOV -6.885e-05 1.160e-05 -5.937 2.90e-09 *** > > SNOW:I(ADJELEV^2) 7.681e-09 1.395e-09 5.506 3.67e-08 *** > > I(ADJELEV^2):I(TRI^2) -1.839e-10 5.692e-11 -3.232 0.001231 ** > > ADJELEV:I(TRI^2) 8.860e-07 2.974e-07 2.979 0.002892 ** > > SNOW:TRI 1.219e-05 3.260e-06 3.740 0.000184 *** > >This all seems to be as it should. I then decided to try and confim this >result by running a glm without any of the 5 potential cubic terms ( note - >TRI:I(TRI^2) was the only one that made it into the final model but there >were 5 potential). After entering the 73 potential terms (12 primary >vaiables and now 66 minus 5 interactions = 73 total), the glm and stepAIC >produces a completely different final model. It has 8 variables that were >not in the model that was chosen with scope statement and manually removing >TRI:TRI^2, and it is missing 7 variables that were in the model chosen with >the scope statement. It has 8 variables that were in both. Code and >Result: > > >glmalt1b<-glm(formula = PRESENCE =="1" ~ >SNOW+SNOW:POP_DEN+SNOW:ROAD_DE+SNOW:ADJELEV+SNOW:I(ADJELEV^2)+SNOW:TRI+SNOW: >I(TRI^2)+SNOW:EDGE+SNOW:I(EDGE^2)+SNOW:TREECOV+SNOW:I(TREECOV^2)+I(SNOW^2)+I >(SNOW^2):POP_DEN+ > > >I(SNOW2):ROAD_DE+I(SNOW^2):ADJELEV+I(SNOW^2):I(ADJELEV^2)+I(SNOW^2):TRI+I(SN >OW^2):I(TRI^2)+I(SNOW^2):EDGE+I(SNOW^2):I(EDGE^2)+I(SNOW^2):TREECOV+I(SNOW^2 >):I(TREECOV^2)+POP_DEN+POP_DEN:ROAD_DE+ > > >POP_DEN:ADJELEV+POP_DEN:I(ADJELEV^2)+POP_DEN:TRI+POP_DEN:I(TRI^2)+POP_DEN:ED >GE+POP_DEN:I(EDGE^2)+POP_DEN:TREECOV+POP_DEN:I(TREECOV^2)+ROAD_DE+ROAD_DE:AD >JELEV+ROAD_DE:I(ADJELEV^2)+ROAD_DE:TRI+ > > >ROAD_DE:I(TRI^2)+ROAD_DE:EDGE+ROAD_DE:I(EDGE^2)+ROAD_DE:TREECOV+ROAD_DE:I(TR >EECOV^2)+ADJELEV+ADJELEV:TRI+ADJELEV:I(TRI^2)+ADJELEV:EDGE+ADJELEV:I(EDGE^2) >+ADJELEV:TREECOV+ADJELEV:I(TREECOV^2)+I(ADJELEV^2)+ > > >I(ADJELEV^2):TRI+I(ADJELEV^2):I(TRI^2)+I(ADJELEV^2):EDGE+I(ADJELEV^2):I(EDGE >^2)+I(ADJELEV^2):TREECOV+I(ADJELEV^2):I(TREECOV^2)+TRI+TRI:EDGE+TRI:I(EDGE^2 >)+TRI:TREECOV+TRI:I(TREECOV^2)+I(TRI^2)+ > > >I(TRI^2):EDGE+I(TRI^2):I(EDGE^2)+I(TRI^2):TREECOV+I(TRI^2):I(TREECOV^2)+EDGE >+EDGE:TREECOV+EDGE:I(TREECOV^2)+I(EDGE^2)+I(EDGE^2):TREECOV+I(EDGE^2):I(TREE >COV^2)+TREECOV+I(TREECOV^2), family=binomial, data=wolv) > > summary(glmalt1b) > > stepglmalt2b<-stepAIC(glmalt1b, trace=T, k=log(4828), >direction="both") #BIC > > summary(stepglmalt2b) > > extractAIC(stepglmalt2b,k=log(4828)) > > > >Coefficients: > > Estimate Std. Error z value Pr(>|z|) > > (Intercept) -1.995e+01 7.499e+00 -2.660 0.007819 ** > > SNOW 1.641e-02 4.881e-03 3.363 0.000772 *** > > I(SNOW^2) 2.238e-05 4.729e-06 4.732 2.22e-06 *** > > ROAD_DE -5.619e-01 1.187e-01 -4.733 2.21e-06 *** > > ADJELEV 4.361e-03 5.876e-03 0.742 0.457966 > > I(ADJELEV^2) 1.001e-06 1.165e-06 0.859 0.390257 > > TRI -1.982e-01 6.066e-02 -3.268 0.001083 ** > > I(TRI^2) -6.842e-05 1.868e-05 -3.664 0.000249 *** > > I(EDGE^2) 6.321e-05 2.119e-05 2.983 0.002857 ** > > I(TREECOV^2) 2.947e-03 4.984e-04 5.912 3.38e-09 *** > > SNOW:ADJELEV -6.244e-06 1.959e-06 -3.187 0.001438 ** > > SNOW:TRI 1.018e-05 3.403e-06 2.991 0.002778 ** > > I(SNOW^2):ADJELEV -1.852e-08 3.477e-09 -5.326 1.00e-07 *** > > I(SNOW^2):I(ADJELEV^2) 3.726e-12 6.771e-13 5.503 3.73e-08 *** > > ADJELEV:TRI 1.887e-04 4.895e-05 3.855 0.000116 *** > > I(ADJELEV^2):TRI -4.010e-08 9.697e-09 -4.135 3.55e-05 *** > > I(ADJELEV^2):I(TREECOV^2) -4.532e-10 7.727e-11 -5.865 4.48e-09 *** > > > >If anyone can tell me why this is different I would greatly appreciate it. >Also, why does this last model include terms that are not significant? > >Thanks > >BobMichael Dewey http://www.aghmed.fsnet.co.uk
Apparently Analagous Threads
- Vegan(ordistep) error: Error in if (aod[1, 5] <= Pin) { : missing value where TRUE/FALSE needed
- Help me get this function to work...
- Help get this simple function to work...
- Error message in vegan ordistep
- Fwd: "Make_passthrough: pid XXXX, dup2(6, 0) failed" with LPRNG 3.8.C and Samba 4.3.8 on FreeBSD