I've been poking around with GLMs (on which I am *not* an expert) on behalf of a student, particularly binomial (standard logit link) nested models with overdispersion. I have one possible bug to report (but I'm not confident enough to be *sure* it's a bug); one comment on the general inconsistency that seems to afflict the various functions for dealing with overdispersion in GLMs (anova.glm, drop1.glm, summary.glm); and one statistical query that maybe someone would answer if they're feeling generous. 1. possible bug: in drop1.glm() with scale != 0, R seems to divide by the dispersion parameter twice: first (if family != "gaussian") it sets loglik <- dev/dispersion then (if test == "Chisq") it calculates dev <- loglik - loglik[1] dev[nas] <- 1 - pchisq(dev[nas]/dispersion, aod$Df[nas]) (in addition, I suppose this could now be just pchisq(...,lower.tail=FALSE) for greater precision) Is this a bug or am I missing something? 2. There seems to be a fair amount of variability in how drop1.glm, anova.glm, stat.anova, summary.glm deal with dispersion parameters: summary.glm() has an optional parameter called "dispersion" stat.anova() has a parameter called "scale", which it doesn't even use if test=="Chisq" anova.glm calls stat.anova() with an automatically calculated scale parameter (sum(object$weights*object$residuals^2)/object$df.residual), not allowing a user-defined scale drop1.glm() has an optional parameter called "scale" I would mess around and try to fix these myself, but (1) there seem to be some design decisions to make here, (2) I'm not sufficiently sure of myself to want to break things. 3. There seems to be a certain variety of advice around in how to deal with overdispersion in GLMs. One could (1) calculate the scale parameter from the residuals (either residual deviance/residual df or Pearson chi-sq/residual df) and feed that back into the analysis; (2) use quasi-likelihood with logit link and variance mu(1-mu); (3) use a form of F-test, as suggested by Crawley (which I guess has something to do with the fact that the scale parameter is estimated, not really known). Are these equivalent, or approximately equivalent? Does anyone have a favorite reference on the subject? (I've been looking at V&R3, Crawley's "GLIM for Ecologists", and Lindsey's GLM book -- I haven't acquired Dobson or McCullagh and Nelder yet). -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Prof Brian Ripley
2000-May-11 12:53 UTC
[Rd] scale factors/overdispersion in GLM: possible bug?
> Date: Wed, 19 Apr 2000 10:46:45 -0400 (EDT) > From: Ben Bolker <ben@zoo.ufl.edu>A belated reply: I have been away and then busy (including on this).> I've been poking around with GLMs (on which I am *not* an expert) on > behalf of a student, particularly binomial (standard logit link) nested > models with overdispersion. > > I have one possible bug to report (but I'm not confident enough to be > *sure* it's a bug); one comment on the general inconsistency that seems to > afflict the various functions for dealing with overdispersion in GLMs > (anova.glm, drop1.glm, summary.glm); and one statistical query that maybe > someone would answer if they're feeling generous. > > 1. possible bug: > in drop1.glm() with scale != 0, R seems to divide by the dispersion > parameter twice: > > first (if family != "gaussian") it sets > > loglik <- dev/dispersion > > then (if test == "Chisq") it calculates > > dev <- loglik - loglik[1] > dev[nas] <- 1 - pchisq(dev[nas]/dispersion, aod$Df[nas]) > > (in addition, I suppose this could now be just > pchisq(...,lower.tail=FALSE) for greater precision) > > Is this a bug or am I missing something?A bug. Using the appropriate tail for accuracy is irrelevant, though.> 2. There seems to be a fair amount of variability in how drop1.glm, > anova.glm, stat.anova, summary.glm deal with dispersion parameters: > summary.glm() has an optional parameter called "dispersion" > stat.anova() has a parameter called "scale", which it doesn't even use > if test=="Chisq" > anova.glm calls stat.anova() with an automatically calculated scale > parameter (sum(object$weights*object$residuals^2)/object$df.residual), not > allowing a user-defined scale > drop1.glm() has an optional parameter called "scale" > > I would mess around and try to fix these myself, but (1) there seem to > be some design decisions to make here, (2) I'm not sufficiently sure of > myself to want to break things.I've worked over all these for 1.1.0. Be careful in how much sameness to expect, though. drop1 has an argument called 'scale', and is generic. So drop1.glm must have the same name even though `dispersion' might be more appropriate.> 3. There seems to be a certain variety of advice around in how to deal > with overdispersion in GLMs. One could (1) calculate the scale parameter > from the residuals (either residual deviance/residual df or Pearson > chi-sq/residual df) and feed that back into the analysis; (2) use > quasi-likelihood with logit link and variance mu(1-mu); (3) use a form of > F-test, as suggested by Crawley (which I guess has something to do with > the fact that the scale parameter is estimated, not really known). Are > these equivalent, or approximately equivalent? Does anyone have a > favorite reference on the subject? (I've been looking at V&R3, Crawley's > "GLIM for Ecologists", and Lindsey's GLM book -- I haven't acquired Dobson > or McCullagh and Nelder yet).Over-dispersion is discussed at length in Collett, D. (1991) Modelling Binary Data and somewhat less discursively in Cox & Snell (1989) Analysing Binary Data McCullagh & Nelder (1989, section 4.5) Aitkin, Anderson, Francis, Hinde (1983) Statistical Modelling in GLIM amongst others. Almost all of this is binomial models, but Aitkin et al do consider Poisson. Basically, all find some justification for a model with variance phi >1 times that given by a binomial or poisson family. For X_i ~ binomial(n, p_i), n > 1 not depending on i, this is known as the Williams model in some circles. So people since Williams (and I seem to remember before) have been fitting such models and estimating phi by residual deviance/residual df. That is a quasi-likelihood procedure, as even where real models exist, the glm is not giving the MLE. To make sense of this I have introduced two new families, quasibinomial and quasipoisson, that use the Williams model, and now binomial and Poisson never allow phi to be estimated. And for the Williams model there is an "F" test option to add1 and drop1 (and if you use "F" with a binomial or poisson it tells you it is really using the quasi-version). This applies to summary.glm, predict.glm, anova.glm, add1.glm, drop1.glm. (As quasi models have no likelihood, they have no AIC either, and so step will not work for them.) -- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._