Chris Oosthuizen
2010-Feb-18 01:12 UTC
[R] Appropriate test for overdispersion in binomial data
Dear R users,

Overdispersion is often a problem in binomial data. I attempt to model a binary response (sex ratio) with three categorical explanatory variables, using a GLM, which could assume the form:

    y <- cbind(sexf, sample - sexf)
    model <- glm(y ~ age + month + year, binomial)
    summary(model)

Output:

    (Dispersion parameter for binomial family taken to be 1)
        Null deviance: 8956.7  on 582  degrees of freedom
    Residual deviance: 4111.9  on 555  degrees of freedom
    AIC: 6735.2

Following M.J. Crawley (The R Book, 2007), this model can be updated to:

    model2 <- glm(y ~ age + month + year, quasibinomial)
    summary(model2)

Output:

    (Dispersion parameter for quasibinomial family taken to be 7.080681)
        Null deviance: 8956.7  on 582  degrees of freedom
    Residual deviance: 4111.9  on 555  degrees of freedom
    AIC: NA

As far as I can tell, R users (from the Forum) and M.J. Crawley calculate the degree of overdispersion for binomial data from the residual deviance (the residual scaled deviance should be roughly equal to the residual degrees of freedom).

HOWEVER, please read the following comment, which I copied from the thread "Under dispersion; Was: [R] binomial glm warnings revisited", posted in 2003 by Peter Dalgaard:

"Don't trust deviances as measures of dispersion with binary data!"

and

"With binary data, the deviance is purely a function of the fitted parameters. It is the difference in -2 log L between a "perfect fit" and the observed fit. A perfect fit has a zero prob. where the obs is "0" and probability 1 where it is "1", and L == 1 identically in that case. Now consider the likelihood for the "complete toss-up" i.e. intercept and slope both equal to 0 so all probabilities are 0.5. The likelihood in that case is 0.5^269, i.e. a constant. Take logarithms and notice that the model deviance plus the change in deviance from the model to the "toss-up" model is constant (2*269*log(2) to be precise).
So what appears to be a measure of residual error is really just a measure of how far the fitted probabilities are from 0.5!"

My questions are:

1) Is residual deviance / df an appropriate measure of dispersion for binary data? (It seems to be widely used.)

2) If I understand P. Dalgaard's comment correctly, and it is not, what is the appropriate way?

Many thanks to all who have asked and answered questions in the past - it is of great assistance.

Chris

--
W.C.Oosthuizen
Mammal Research Institute
Department of Zoology & Entomology
University of Pretoria
Pretoria
South Africa
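
A note for readers comparing the two summaries in the post: 4111.9 / 555 is about 7.41, whereas the quasibinomial summary reports 7.080681. The two numbers differ because summary.glm() estimates the dispersion from the Pearson chi-square statistic divided by the residual degrees of freedom, not from the residual deviance. The sketch below shows both calculations side by side. It runs on simulated grouped data, so the data, the single factor `age`, the group size of 20 and all object names are assumptions made for illustration and are not the data set described in the post.

```r
# Simulated grouped binomial data (hypothetical; NOT the poster's data)
set.seed(1)
n_groups <- 200
size <- rep(20, n_groups)   # trials per group (assumed)
age  <- factor(sample(c("a", "b", "c"), n_groups, replace = TRUE))

# Beta-distributed success probabilities induce extra-binomial variation
p    <- rbeta(n_groups, shape1 = 2, shape2 = 2)
sexf <- rbinom(n_groups, size = size, prob = p)

y   <- cbind(sexf, size - sexf)
fit <- glm(y ~ age, family = binomial)

# Deviance-based estimate: residual deviance / residual df
disp_dev <- deviance(fit) / df.residual(fit)

# Pearson-based estimate: sum of squared Pearson residuals / residual df
disp_pearson <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Dispersion reported by the quasibinomial summary
fit_q      <- glm(y ~ age, family = quasibinomial)
disp_quasi <- summary(fit_q)$dispersion

print(c(deviance = disp_dev, pearson = disp_pearson, quasi = disp_quasi))
```

In this sketch the Pearson-based value and the quasibinomial value agree, while the deviance-based value is close but not identical, which mirrors the 7.41 versus 7.08 gap in the post.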
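
The reasoning in the quoted comment can be written out as a short derivation. This is a sketch under stated assumptions: ungrouped 0/1 responses y_i, a logit link, a maximum-likelihood fit, and a linear predictor that contains whatever terms the model has. The symbols (p-hat, eta-hat, X, n) are notation introduced here, not taken from the original thread.

```latex
% Binary data: the saturated model has log-likelihood 0, so the deviance is
\[
D \;=\; -2\sum_{i=1}^{n}\bigl[\,y_i\log\hat p_i + (1-y_i)\log(1-\hat p_i)\bigr]
  \;=\; -2\sum_{i=1}^{n}\bigl[\,y_i\hat\eta_i + \log(1-\hat p_i)\bigr],
\qquad \hat\eta_i = \log\frac{\hat p_i}{1-\hat p_i}.
\]
% At the MLE with the logit link, the score equations X^T (y - \hat p) = 0 hold.
% Since \hat\eta = X\hat\beta, this gives \sum_i y_i\hat\eta_i = \sum_i \hat p_i\hat\eta_i, hence
\[
D \;=\; -2\sum_{i=1}^{n}\bigl[\,\hat p_i\,\hat\eta_i + \log(1-\hat p_i)\bigr],
\]
% which depends on the data only through the fitted probabilities.
% For the "toss-up" model, \hat p_i = 1/2 for all i:
\[
D_{\text{toss-up}} \;=\; -2\,n\log\tfrac12 \;=\; 2\,n\log 2
\;\approx\; 372.9 \quad\text{for } n = 269 .
\]
```

The derivation covers only the ungrouped 0/1 case. It says nothing by itself about grouped counts such as cbind(successes, failures), where the saturated log-likelihood is not zero.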
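
The quoted statement that, with binary data, the deviance is purely a function of the fitted parameters can also be checked numerically. The following is a minimal sketch on simulated 0/1 data with a single covariate. The sample size of 269 is borrowed from the quote, and everything else (the covariate `x`, the true coefficients, the object names) is assumed for illustration.

```r
# Simulated 0/1 responses (hypothetical data), one continuous covariate
set.seed(2)
n    <- 269
x    <- rnorm(n)
ybin <- rbinom(n, size = 1, prob = plogis(-0.5 + x))

fit_bin <- glm(ybin ~ x, family = binomial)
p_hat   <- fitted(fit_bin)

# Deviance rebuilt from the fitted probabilities alone (ybin is not used)
dev_from_fit <- -2 * sum(p_hat * qlogis(p_hat) + log1p(-p_hat))

# Deviance of the "toss-up" model with every probability equal to 0.5
dev_tossup <- 2 * n * log(2)

print(c(glm = deviance(fit_bin), rebuilt = dev_from_fit, tossup = dev_tossup))
```

If the first two printed values agree up to the convergence tolerance of glm(), the residual deviance of the binary fit carries no information beyond the fitted probabilities, which is the point made in the quote.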