Hi Folks,

I'm playing with glm.nb() in MASS. Reference: the negative binomial distribution

  P(y) = (Gamma(theta+y)/(Gamma(theta)*y!)) * p^theta * (1-p)^y,   y = 0,1,2,...

in the notation of the MASS book (Section 7.4), where

  p = theta/(mu + theta),  so  (1-p) = mu/(mu + theta),

and mu is the expected value of Y.

It seems from ?glm.nb that an initial value of theta is either supplied, or obtained by a method of moments after an initial Poisson fit to the data; the algorithm then casts back and forth: given theta, glm() is applied to fit the linear model, with log link, for mu; then theta is re-estimated; then the linear model again; and so on. I hope I've understood right!

However, this procedure is based on the assumption that the same value of theta applies across the board, regardless of the values of any covariates or factors.

The MASS book (Section 7.4) illustrates the use of glm.nb() on the quine data (numbers of days of absence from school in a school year by Australian children). There are four factors in addition to the "outcome" Days:

  Eth: Ethnicity (Aboriginal "A" / Non-Aboriginal "N")
  Lrn: Learning ability (Slow Learner "SL" / Average Learner "AL")
  Age: Age group (Primary "F0" / First/Second/Third Forms "F1"/"F2"/"F3")
  Sex: Gender ("M"/"F")

This dataset is analysed thoroughly earlier in the MASS book (Section 6.8), primarily with reference to transforming Days by a Box-Cox transformation; the negative binomial does not play a role in that part. When it comes to glm.nb(), however, the assumption of "constant theta" is implicit in Section 7.4. But, if you look at the data, this does not seem all that reasonable.

Preamble: get the variables out of 'quine' into convenient names.
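(As a check on my understanding of that alternation, here is a rough sketch of it in R. This is not glm.nb()'s actual code, just my reading of it, with MASS's theta.ml() standing in for whatever moment/ML step it uses for theta, and the negative.binomial(theta) family supplying the fit of mu given theta. The choice of Eth as the sole covariate is just for illustration.)

```r
library(MASS)  # quine data, theta.ml(), negative.binomial() family

y   <- quine$Days
fit <- glm(y ~ quine$Eth, family = poisson)   # initial Poisson fit
mu  <- fitted(fit)
theta <- theta.ml(y, mu)                      # initial theta given mu
for (i in 1:20) {
  # mu given theta, then theta given mu, alternately:
  fit   <- glm(y ~ quine$Eth, family = negative.binomial(theta))
  mu    <- fitted(fit)
  theta <- theta.ml(y, mu)
}
as.numeric(theta)   # should be close to glm.nb(y ~ quine$Eth)$theta
```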
library(MASS)
D <- quine$Days; E <- quine$Eth; L <- quine$Lrn; A <- quine$Age; S <- quine$Sex
D.A <- D[E=="A"]; L.A <- L[E=="A"]; A.A <- A[E=="A"]; S.A <- S[E=="A"]
D.N <- D[E=="N"]; L.N <- L[E=="N"]; A.N <- A[E=="N"]; S.N <- S[E=="N"]

Now, if you look at the histograms of D.A and D.N separately:

hist(D.A, breaks = -0.5 + 4*(0:22))
hist(D.N, breaks = -0.5 + 4*(0:22))

you can see that there is a major difference in shape, such as would arise if theta > 1 for Eth=="A" and theta < 1 for Eth=="N". This is confirmed by a separate covariate-free fit of the mean to each group, from which the fitted theta is returned:

summary(glm.nb(D.A ~ 1))  -->  Theta: 1.499, Std. Err.: 0.260
summary(glm.nb(D.N ~ 1))  -->  Theta: 0.919, Std. Err.: 0.159

(1.499 - 0.919)/sqrt(0.260^2 + 0.159^2)
## [1] 1.903113
1 - pnorm(1.903113)
## [1] 0.0285129

so we have a 1-sided P-value of 0.029, and a 2-sided one of 0.057. That's a fairly strong indication that theta is not the same for the two groups; and the shapes of the histograms show that the difference has an important effect on the distribution. Of course, this dataset is notoriously unbalanced, so maybe this is a reflection of the mixture of categories.

A simple additive model on each subgroup eases the above situation a bit:

summary(glm.nb(D.A ~ S.A + A.A + L.A))  -->  Theta: 1.767, Std. Err.: 0.318
summary(glm.nb(D.N ~ S.N + A.N + L.N))  -->  Theta: 1.107, Std. Err.: 0.201
1 - pnorm((1.767 - 1.107)/sqrt(0.318^2 + 0.201^2))  ## = 0.0397

and throwing everything into the pot eases it a bit more:

summary(glm.nb(D.A ~ S.A*A.A*L.A))  -->  Theta: 2.317, Std. Err.: 0.438
summary(glm.nb(D.N ~ S.N*A.N*L.N))  -->  Theta: 1.581, Std. Err.: 0.318
1 - pnorm((2.317 - 1.581)/sqrt(0.438^2 + 0.318^2))  ## = 0.0870

But, while the theta values have now moved well up into the ">1" range (where they have less effect on the shape of the distribution), the differences between the "A" and "N" estimates have not changed much:

  0.580 --> 0.660 --> 0.736

even though their P-values have eased off somewhat (and, with 69 and 77 data points for the "A" and "N" groups respectively, we don't have huge power here, I think).

I've gone through the above in detail since it's an illustration which you can all work through yourselves in R, to illustrate the issue of variation of theta in a negative binomial model when you want to get at the effects of covariates on the outcome Y (the counts of events). In real life I'm considering other data where the "theta" effect is more marked, and probably more important.

Which finally leads on to my QUERY: Is there anything implement{ed|able} in R which can address the issue that theta may vary according to factors or covariates, as well as the level of mu given theta? Or is one reduced to running glm.nb() (or the like) on subsets? (Not that this would be much use if theta depended on a continuous covariate.)

With thanks, and best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Aug-07   Time: 17:28:14
------------------------------ XFMail ------------------------------
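P.S. In case it clarifies what I'm after, here is a rough hand-rolled sketch of the kind of thing I mean: the joint negative binomial log-likelihood (in the parametrisation above) with log(mu) and log(theta) each a linear function of covariates, maximised by optim(). The log link for theta, and the use of Eth in both formulas, are my own choices for illustration, not anything glm.nb() does. Since both design matrices here depend only on Eth, the fitted group thetas should roughly reproduce the two separate intercept-only fits above.

```r
library(MASS)  # for the quine data

# Negative log-likelihood: log(mu) = X beta, log(theta) = Z gamma,
# with P(y) = Gamma(theta+y)/(Gamma(theta)*y!) * p^theta * (1-p)^y,
# p = theta/(mu+theta), as in the MASS book's notation.
nll <- function(par, y, X, Z) {
  beta  <- par[seq_len(ncol(X))]
  gamma <- par[ncol(X) + seq_len(ncol(Z))]
  mu    <- as.vector(exp(X %*% beta))
  theta <- as.vector(exp(Z %*% gamma))
  -sum(lgamma(theta + y) - lgamma(theta) - lfactorial(y) +
       theta * log(theta/(mu + theta)) + y * log(mu/(mu + theta)))
}

y <- quine$Days
X <- model.matrix(~ Eth, quine)   # model for the mean
Z <- model.matrix(~ Eth, quine)   # model for theta (could differ from X)

par0 <- c(log(mean(y)), rep(0, ncol(X) - 1), rep(0, ncol(Z)))
fit  <- optim(par0, nll, y = y, X = X, Z = Z,
              method = "BFGS", hessian = TRUE)

gam <- fit$par[ncol(X) + seq_len(ncol(Z))]
theta.hat <- exp(tapply(as.vector(Z %*% gam), quine$Eth, unique))
theta.hat   # per-group theta; asymptotic SEs via sqrt(diag(solve(fit$hessian)))
```

(I have the impression that the gamlss package can also model the dispersion with its own formula, but I haven't tried it, so take that with a pinch of salt.)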