Jessica L Hite/hitejl/O/VCU
2009-Feb-16 19:23 UTC
[R] Overdispersion with binomial distribution
I am attempting to run a glm with a binomial model to analyze proportion data. I have been following Crawley's book closely and am wondering if there is an accepted standard for how much is too much overdispersion? (e.g. change in AIC has an accepted standard of 2). In the example, he fits several models, binomial and quasibinomial and then accepts the quasibinomial. The output for residual deviance and df does not change between these two models, however, he accepts the quasibinomial. Is there a different calculation that I missed that actually provides an overdispersion factor (as in SAS?) My code and output are below, given the example in the book, these data are WAY overdispersed .....do I mention this and go on or does this signal the need to try a different model? If so, any suggestions on the type of distribution (gamma or negative binomial ?)? attach(Clutch2) y<-cbind(Total,Size-Total) glm1<-glm(y~Pred,"binomial") summary(glm1) Call: glm(formula = y ~ Pred, family = "binomial") Deviance Residuals: Min 1Q Median 3Q Max -9.1022 -2.7899 -0.4781 2.6058 8.4852 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.35095 0.06612 20.433 < 2e-16 *** PredF -0.34811 0.11719 -2.970 0.00297 ** PredSN -3.29156 0.10691 -30.788 < 2e-16 *** PredW -1.46451 0.12787 -11.453 < 2e-16 *** PredWF -0.56412 0.13178 -4.281 1.86e-05 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 2815.5 on 108 degrees of freedom Residual deviance: 1323.5 on 104 degrees of freedom (3 observations deleted due to missingness) AIC: 1585.2 Number of Fisher Scoring iterations: 5 #### the output for residual deviance and df does not change even when I use quasibinomial, is this ok? ##### library(MASS)> glm2<-glm(y~Pred,"quasibinomial") > summary(glm2)Call: glm(formula = y ~ Pred, family = "quasibinomial") Deviance Residuals: Min 1Q Median 3Q Max -9.1022 -2.7899 -0.4781 2.6058 8.4852 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.3510 0.2398 5.633 1.52e-07 *** PredF -0.3481 0.4251 -0.819 0.41471 PredSN -3.2916 0.3878 -8.488 1.56e-13 *** PredW -1.4645 0.4638 -3.157 0.00208 ** PredWF -0.5641 0.4780 -1.180 0.24063 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 (Dispersion parameter for quasibinomial family taken to be 13.15786) Null deviance: 2815.5 on 108 degrees of freedom Residual deviance: 1323.5 on 104 degrees of freedom (3 observations deleted due to missingness) AIC: NA Number of Fisher Scoring iterations: 5> anova(glm2,test="F")Analysis of Deviance Table Model: quasibinomial, link: logit Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 108 2815.5 Pred 4 1492.0 104 1323.5 28.349 6.28e-16 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1> model1<-update(glm2,~.-Pred) > anova(glm2,model1,test="F")Analysis of Deviance Table Model 1: y ~ Pred Model 2: y ~ 1 Resid. Df Resid. Dev Df Deviance F Pr(>F) 1 104 1323.5 2 108 2815.5 -4 -1492.0 28.349 6.28e-16 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1> coef(glm2)(Intercept) PredF PredSN PredW PredWF 1.3509550 -0.3481096 -3.2915601 -1.4645097 -0.5641223 Thanks Jessica hitejl@vcu.edu [[alternative HTML version deleted]]
Jessica L Hite/hitejl/O/VCU <hitejl <at> vcu.edu> writes:> I am attempting to run a glm with a binomial model to analyze proportion > data. > I have been following Crawley's book closely and am wondering if there is > an accepted standard for how much is too much overdispersion? (e.g. change > in AIC has an accepted standard of 2).In principle, in the null case (i.e. data are really binomial) the deviance is chi-squared distributed with the df equal to the residual df. For example: example(glm) deviance(glm.D93) ## 5.13 summary(glm.D93)$dispersion ## 1 (by definition) dfr <- df.residual(glm.D93) deviance(glm.D93)/dfr ## 1.28 d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17 (disp2 <- d2/dfr) ## 1.293 gg2 <- update(glm.D93,family=quasipoisson) summary(gg2)$dispersion ## 1.293, same as above pchisq(d2,df=dfr,lower.tail=FALSE) all.equal(coef(glm.D93),coef(gg2)) ## TRUE se1 <- coef(summary(glm.D93))[,"Std. Error"] se2 <- coef(summary(gg2))[,"Std. Error"] se2/se1 # (Intercept) outcome2 outcome3 treatment2 treatment3 # 1.137234 1.137234 1.137234 1.137234 1.137234 sqrt(disp2) # [1] 1.137234> My code and output are below, given the example in the book, these data are > WAY overdispersed .....do I mention this and go on or does this signal the > need to try a different model? If so, any suggestions on the type of > distribution (gamma or negative binomial ?)?Way overdispersed may indicate model lack of fit. Have you examined residuals/data for outliers etc.? quasibinomial should be fine, or you can try beta-binomial (see the aod package) ...> attach(Clutch2) > y<-cbind(Total,Size-Total) > glm1<-glm(y~Pred,"binomial") > summary(glm1) > > Call: > glm(formula = y ~ Pred, family = "binomial") > > Deviance Residuals: > Min 1Q Median 3Q Max > -9.1022 -2.7899 -0.4781 2.6058 8.4852 > > Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) 1.35095 0.06612 20.433 < 2e-16 *** > PredF -0.34811 0.11719 -2.970 0.00297 ** > PredSN -3.29156 0.10691 -30.788 < 2e-16 *** > PredW -1.46451 0.12787 -11.453 < 2e-16 *** > PredWF -0.56412 0.13178 -4.281 1.86e-05 *** > --- > #### the output for residual deviance and df does not change even when I > use quasibinomial, is this ok? #####That's as expected.> library(MASS)you don't really need MASS for quasibinomial.> > glm2<-glm(y~Pred,"quasibinomial") > > summary(glm2) > > Call: > glm(formula = y ~ Pred, family = "quasibinomial") > > Deviance Residuals: > Min 1Q Median 3Q Max > -9.1022 -2.7899 -0.4781 2.6058 8.4852 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 1.3510 0.2398 5.633 1.52e-07 *** > PredF -0.3481 0.4251 -0.819 0.41471 > PredSN -3.2916 0.3878 -8.488 1.56e-13 *** > PredW -1.4645 0.4638 -3.157 0.00208 ** > PredWF -0.5641 0.4780 -1.180 0.24063 > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > (Dispersion parameter for quasibinomial family taken to be 13.15786) > > Null deviance: 2815.5 on 108 degrees of freedom > Residual deviance: 1323.5 on 104 degrees of freedom > (3 observations deleted due to missingness) > AIC: NA > > Number of Fisher Scoring iterations: 5 > > > anova(glm2,test="F") > Analysis of Deviance Table > > Model: quasibinomial, link: logit > > Response: y > > Terms added sequentially (first to last) > > Df Deviance Resid. Df Resid. Dev F Pr(>F) > NULL 108 2815.5 > Pred 4 1492.0 104 1323.5 28.349 6.28e-16 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > model1<-update(glm2,~.-Pred) > > anova(glm2,model1,test="F") > Analysis of Deviance Table > > Model 1: y ~ Pred > Model 2: y ~ 1 > Resid. Df Resid. Dev Df Deviance F Pr(>F) > 1 104 1323.5 > 2 108 2815.5 -4 -1492.0 28.349 6.28e-16 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > coef(glm2) > (Intercept) PredF PredSN PredW PredWF > 1.3509550 -0.3481096 -3.2915601 -1.4645097 -0.5641223 > > Thanks > Jessica > hitejl <at> vcu.edu