Maggie Wang
2009-Mar-18 03:52 UTC
[R] Extreme AIC or BIC values in glm(), logistic regression
Dear R-users,

I use glm() to do logistic regression and stepAIC() to do stepwise model selection.

The AIC value that comes out is commonly about 100, and a good fit is as low as around 70. But for some models, the AIC goes to extreme values like 1000. When I check the p-values, all the independent variables (about 30 of them) included in the equation are very significant, which is impossible, because we expect some would be dropped. This situation is not uncommon.

A summary output looks like this:

Coefficients:
               Estimate Std. Error   z value Pr(>|z|)
(Intercept)   4.883e+14  1.671e+07  29217415   <2e-16 ***
g761         -5.383e+14  9.897e+07  -5438529   <2e-16 ***
g2809        -1.945e+15  1.082e+08 -17977871   <2e-16 ***
g3106        -2.803e+15  9.351e+07 -29976674   <2e-16 ***
g4373        -9.272e+14  6.534e+07 -14190077   <2e-16 ***
g4583        -2.279e+15  1.223e+08 -18640563   <2e-16 ***
g761:g2809   -5.101e+14  4.693e+08  -1086931   <2e-16 ***
g761:g3106   -3.399e+16  6.923e+08 -49093218   <2e-16 ***
g2809:g3106   3.016e+15  6.860e+08   4397188   <2e-16 ***
g761:g4373    3.180e+15  4.595e+08   6920270   <2e-16 ***
g2809:g4373  -5.184e+15  4.436e+08 -11685382   <2e-16 ***
g3106:g4373   1.589e+16  2.572e+08  61788148   <2e-16 ***
g761:g4583   -1.419e+16  8.199e+08 -17303033   <2e-16 ***
g2809:g4583  -2.540e+16  8.151e+08 -31156781   <2e-16 ***
........
(omit)
........

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  120.32  on 86  degrees of freedom
Residual deviance: 1009.22  on 55  degrees of freedom
AIC: 1073.2

Number of Fisher Scoring iterations: 25

Could anyone suggest what this means? How can I perform a reliable logistic regression?

Thank you so much for the help!

Best Regards,
Maggie
Dieter Menne
2009-Mar-18 07:30 UTC
[R] Extreme AIC or BIC values in glm(), logistic regression
Maggie Wang <haitian <at> ust.hk> writes:

> I use glm() to do logistic regression and use stepAIC() to do stepwise model
> selection.
>
> The common AIC value comes out is about 100, a good fit is as low as around
> 70. But for some model, the AIC went to extreme values like 1000. When I
> check the P-values, All the independent variables (about 30 of them)
> included in the equation are very significant, which is impossible, because
> we expect some would be dropped. This situation is not uncommon.
>
> [...]

I suspect that you have specified your target variable incorrectly. Note that there are three ways to define the response of a binomial model; they are explained better in MASS, in the chapter on binomial data (the budworm example). Try to extract a small part of your data and post it here as a self-running example.

Dieter
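As a rough illustration of the three response specifications mentioned above, here is a minimal sketch on made-up grouped data. The data frame `d`, its columns, and the object names are hypothetical, and this is not the budworm data from MASS. For a binomial glm(), the response can be a two-column matrix of successes and failures, a proportion with the number of trials as weights, or a 0/1 vector (or two-level factor) with one row per trial. All three should lead to the same coefficient estimates.

```r
# Hypothetical grouped data: 'dead' successes out of 'n' trials per dose level.
d <- data.frame(dose = c(1, 2, 3, 4, 5),
                dead = c(2, 5, 9, 14, 18),
                n    = rep(20, 5))

# 1. Two-column matrix response: cbind(successes, failures)
fit_matrix <- glm(cbind(dead, n - dead) ~ dose, family = binomial, data = d)

# 2. Proportion of successes, with the number of trials passed as weights
fit_prop <- glm(dead / n ~ dose, family = binomial, weights = n, data = d)

# 3. One row per trial with a 0/1 response
#    (expand each dose level into 'dead' ones followed by 'n - dead' zeros)
long <- data.frame(
  dose = rep(d$dose, times = d$n),
  y    = unlist(Map(function(k, m) rep(c(1, 0), c(k, m - k)), d$dead, d$n))
)
fit_binary <- glm(y ~ dose, family = binomial, data = long)

# The coefficient estimates agree across the three specifications
print(rbind(matrix     = coef(fit_matrix),
            proportion = coef(fit_prop),
            binary     = coef(fit_binary)))
```

If the response is coded in some other way (for example, raw counts without the matching failures or weights), glm() may fit a different model from the one intended, so checking the response coding is a sensible first step.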
Thomas Lumley
2009-Mar-18 07:38 UTC
[R] Extreme AIC or BIC values in glm(), logistic regression
With 30 variables and only 55 residual degrees of freedom, you probably have perfect separation because there is not enough data. Look at the coefficients: they are effectively infinite, which implies perfect overfitting.

    -thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
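To make the idea of perfect separation concrete, here is a small simulated sketch. The data, seed, and variable names are all made up, and only the sample size of 87 is borrowed from the degrees of freedom in the post. When a predictor splits the outcome perfectly, the likelihood has no finite maximum, so glm() typically stops with very large coefficients and warnings. This shows the usual symptoms of separation and is not meant to reproduce the exact numbers in the original output.

```r
set.seed(1)
n <- 87                        # same number of observations as in the post
x <- rnorm(n)                  # simulated predictor (hypothetical)

# Outcome fully determined by the sign of x: complete separation
y_sep <- as.integer(x > 0)

# Collect the warnings that glm() raises instead of printing them
warn_msgs <- character(0)
fit_sep <- withCallingHandlers(
  glm(y_sep ~ x, family = binomial),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# For contrast: an outcome that depends on x only probabilistically
y_ok   <- rbinom(n, 1, plogis(x))
fit_ok <- glm(y_ok ~ x, family = binomial)

# Compare the estimates and show the collected warnings
print(cbind(separated = coef(fit_sep), ordinary = coef(fit_ok)))
print(warn_msgs)

# Share of fitted probabilities that are numerically 0 or 1
p_hat <- fitted(fit_sep)
print(mean(p_hat < 1e-8 | p_hat > 1 - 1e-8))
```

Warnings about fitted probabilities of 0 or 1, a slope that grows without settling, and a residual deviance near zero are the usual hints that a model has more terms than the data can support.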
Gad Abraham
2009-Mar-19 00:55 UTC
[R] Extreme AIC or BIC values in glm(), logistic regression
Maggie Wang wrote:
> Dear R-users,
>
> I use glm() to do logistic regression and use stepAIC() to do stepwise model
> selection.
>
> The common AIC value comes out is about 100, a good fit is as low as around
> 70. But for some model, the AIC went to extreme values like 1000.
>
> [...]

I don't have an answer (and you haven't supplied the full code), but one obvious thing is that the estimated coefficients are extremely large. They are on the linear predictor (log-odds) scale, so on the odds scale it is even worse, since you exponentiate them. Perhaps this is due to very high collinearity among your variables (however, the standard errors are low relative to the estimates, so maybe not), and/or issues of scaling (i.e., if your variables are very small, use scale() to standardise them).

--
Gad Abraham
MEng Student, Dept. CSSE and NICTA
The University of Melbourne
Parkville 3010, Victoria, Australia
email: gabraham at csse.unimelb.edu.au
web: http://www.csse.unimelb.edu.au/~gabraham
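The scaling suggestion can be sketched as follows, on simulated data with two hypothetical predictors (`g1`, `g2`) measured on very different scales. Nothing here comes from the poster's data. Standardising with scale() changes the size of the coefficients but not the fitted probabilities, so it helps with readability and numerical conditioning rather than with a shortage of data.

```r
set.seed(42)
n <- 87

# Hypothetical predictors on very different scales
dat <- data.frame(g1 = rnorm(n, mean = 0,   sd = 1e-4),
                  g2 = rnorm(n, mean = 500, sd = 100))
dat$y <- rbinom(n, 1, plogis(1e4 * dat$g1 + 0.01 * (dat$g2 - 500)))

# Fit on the raw scale: the coefficient of g1 is large only because g1 is tiny
fit_raw <- glm(y ~ g1 + g2, family = binomial, data = dat)

# Standardise each predictor to mean 0 and standard deviation 1, then refit
dat_std <- data.frame(scale(dat[c("g1", "g2")]), y = dat$y)
fit_std <- glm(y ~ g1 + g2, family = binomial, data = dat_std)

# Coefficients differ by the scaling factors; the fitted probabilities do not
print(cbind(raw = coef(fit_raw), standardised = coef(fit_std)))
print(max(abs(fitted(fit_raw) - fitted(fit_std))))

# A quick look at collinearity: pairwise correlation of the predictors
print(cor(dat$g1, dat$g2))
```

On the standardised scale each slope is the raw slope multiplied by that predictor's standard deviation, which makes coefficients of different predictors easier to compare.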