Richard Friedman
2011-Apr-21 19:26 UTC
[R] Accounting for overdispersion in a mixed-effect model with a proportion response variable and categorical explanatory variables.
Dear R-help-list, I have a problem in which the explanatory variables are categorical, the response variable is a proportion, and experiment contains technical replicates (pseudoreplicates) as well as biological replicated. I am new to both generalized linear models and mixed- effects models and would greatly appreciate the advice of experienced analysts in this matter. I analyzed the data in 4 ways and want to know which is the best way. The 4 ways are: 1. A generalized linear model with binomial error in which the positive and negative counts for each biological replicate is summed over technical replicates. 2. Same as 1 with a quasibinomial error model. 3. A generalized linear mixed-effects model with binomial error in which technical replication is treated as a random effect. 4. A generalized linear mixed-effects model with binomial error in which technical replication is treated as a random effect and overdispersion is taken into account by individual level variability. Here are the relevant data for each model: For everything: > sessionInfo() R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] lme4_0.999375-39 Matrix_0.999375-50 lattice_0.19-23 loaded via a namespace (and not attached): [1] grid_2.13.0 nlme_3.1-100 stats4_2.13.0 tools_2.13.0 treatment Apositivesum Bnegativesum 1 A 208 439 2 A 215 395 3 A 235 411 4 A 304 450 5 A 215 395 6 B 224 353 7 B 279 405 8 B 265 418 9 B 278 392 10 B 249 383 11 C 196 385 12 C NA NA 13 C 266 397 14 C 216 460 15 C 264 419 16 D 283 401 17 D NA NA 18 D 270 410 19 D 248 316 20 D 302 386 1. A generalized linear model with binomial error in which the positive and negative counts for each biological replicate is summed over technical replicates. > y<-cbind(Apositivesum , Bnegativesum) > model<-glm(y ~ treatment, binomial) > summary(model) Call: glm(formula = y ~ treatment, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -2.3134 -0.5712 -0.3288 0.8616 2.4352 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.574195 0.036443 -15.756 < 2e-16 *** treatmentB 0.164364 0.051116 3.216 0.00130 ** treatmentC 0.007025 0.054696 0.128 0.89780 treatmentD 0.258135 0.053811 4.797 1.61e-06 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 60.467 on 17 degrees of freedom Residual deviance: 28.711 on 14 degrees of freedom (2 observations deleted due to missingness) AIC: 160.35 Number of Fisher Scoring iterations: 3 Since Residual deviance >> degrees of freedom I tried 2. Same as 1 with a quasibinomial error model. > model<-glm(y ~ treatment, quasibinomial) > summary(model) Call: glm(formula = y ~ treatment, family = quasibinomial) Deviance Residuals: Min 1Q Median 3Q Max -2.3134 -0.5712 -0.3288 0.8616 2.4352 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.574195 0.052173 -11.006 2.82e-08 *** treatmentB 0.164364 0.073180 2.246 0.04136 * treatmentC 0.007025 0.078306 0.090 0.92978 treatmentD 0.258135 0.077038 3.351 0.00476 ** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 (Dispersion parameter for quasibinomial family taken to be 2.049598) Null deviance: 60.467 on 17 degrees of freedom Residual deviance: 28.711 on 14 degrees of freedom (2 observations deleted due to missingness) AIC: NA Number of Fisher Scoring iterations: 3 > anova(model,test=F) Error in match.arg(test) : 'arg' must be NULL or a character vector > anova(model,test="F") Analysis of Deviance Table Model: quasibinomial, link: logit Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 17 60.467 treatment 3 31.756 14 28.711 5.1646 0.01303 * --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > I then tried to take the effect of pseudoreplication into account with 3: 3. A generalized linear mixed-effects model with binomial error in which technical replication is treated as a random effect. Here is the input file: treatment mouse observation positive negative A 1 1 73 149 A 1 2 50 129 A 1 3 85 161 A 2 1 73 139 A 2 2 89 144 A 2 3 53 112 A 3 1 69 97 A 3 2 82 128 A 3 3 84 186 A 4 1 111 145 A 4 2 78 146 A 4 3 115 159 A 5 1 74 133 A 5 2 82 153 A 5 3 59 109 B 1 1 90 146 B 1 2 58 108 B 1 3 76 99 B 2 1 105 136 B 2 2 99 139 B 2 3 75 130 B 3 1 95 160 B 3 2 95 135 B 3 3 75 123 B 4 1 95 129 B 4 2 101 130 B 4 3 82 133 B 5 1 63 109 B 5 2 86 132 B 5 3 100 142 C 1 1 72 128 C 1 2 57 137 C 1 3 67 120 C 2 1 86 110 C 2 2 79 121 C 2 3 101 166 C 3 1 82 231 C 3 2 60 125 C 3 3 74 104 C 4 1 90 155 C 4 2 84 141 C 4 3 90 123 D 1 1 91 107 D 1 2 101 183 D 1 3 91 111 D 2 1 79 146 D 2 2 97 155 D 2 3 94 109 D 3 1 69 88 D 3 2 84 107 D 3 3 95 121 D 4 1 92 127 D 4 2 112 140 D 4 3 98 119 y<-cbind(positive,negative) treatment<-factor(treatment) mouse<-factor(mouse) model<-lmer(y~treatment+(1|treatment/mouse),family=binomial) > data<-read.table(mixedinp.txt",header=T,sep="\t") > dim(data) [1] 54 5 > dim(data) [1] 54 5 > y<-cbind(positive,negative) > treatment<-factor(treatment) > mouse<-factor(mouse) > model<-lmer(y~treatment+(1|treatment/mouse),family=binomial) > summary(model) Generalized linear mixed model fit by the Laplace approximation Formula: y ~ treatment + (1 | treatment/mouse) AIC BIC logLik deviance 91.85 103.8 -39.93 79.85 Random effects: Groups Name Variance Std.Dev. mouse:treatment (Intercept) 0.0039316 0.062702 treatment (Intercept) 0.0000000 0.000000 Number of obs: 54, groups: mouse:treatment, 18; treatment, 4 Fixed effects: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.577592 0.046034 -12.547 < 2e-16 *** treatmentB 0.166954 0.064749 2.578 0.009923 ** treatmentC 0.008545 0.069065 0.124 0.901537 treatmentD 0.262350 0.068364 3.838 0.000124 *** Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Correlation of Fixed Effects: (Intr) tBPBS_ tCPBS_ trt -0.711 trt -0.667 0.474 trt -0.673 0.479 0.449 Since the non-mixed effects model showed evidence of overdispersion, and since there is no quasibinomial option in lme4, I tried to account for overdispesion by including individual level variability with.. 4. A generalized linear mixed-effects model with binomial error in which technical replication is treated as a random effect and overdispersion is taken into account by individual level variability. > data$obs<-1:nrow(data) > names(data) [1] "treatment" "mouse" "observation" "positive" "negative" [6] "obs" > detach(data) > attach(data) The following object(s) are masked _by_ '.GlobalEnv': mouse, treatment > model2<-lmer(y~treatment+(1|treatment/mouse)+(1| obs) ,family=binomial) Number of levels of a grouping factor for the random effects is *equal* to n, the number of observations > summary(model2) Generalized linear mixed model fit by the Laplace approximation Formula: y ~ treatment + (1 | treatment/mouse) + (1 | obs) AIC BIC logLik deviance 89.75 103.7 -37.88 75.75 Random effects: Groups Name Variance Std.Dev. obs (Intercept) 1.0529e-02 1.0261e-01 mouse:treatment (Intercept) 1.3158e-12 1.1471e-06 treatment (Intercept) 0.0000e+00 0.0000e+00 Number of obs: 54, groups: obs, 54; mouse:treatment, 18; treatment, 4 Fixed effects: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.57842 0.04522 -12.792 < 2e-16 *** treatmentB 0.16590 0.06356 2.610 0.00904 ** treatmentC 0.01642 0.06784 0.242 0.80878 treatmentD 0.26677 0.06710 3.976 7.01e-05 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Correlation of Fixed Effects: (Intr) tBPBS_ tCPBS_ trtB -0.711 trtC -0.666 0.474 trtD -0.674 0.479 0.449 > I then compared the 2 models. > anova(model,model2,test="F") Data: Models: model: y ~ treatment + (1 | treatment/mouse) model2: y ~ treatment + (1 | treatment/mouse) + (1 | obs) Df AIC BIC logLik Chisq Chi Df Pr(>Chisq) model 6 91.853 103.79 -39.926 model2 7 89.750 103.67 -37.875 4.1023 1 0.04282 * --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 My questions: 1. Am I correct that, of the 4 models, I should use the mixed effect model with individual variability? Although both models make the same effects to be significant, I would like to know which one I should report and use as input to a subsequent multiple comparisons analysis with multiicomp. 2. If I am incorrect what model should I use and why? 3. I would appreciate any further suggestions for analyzing this data. Thanks and best wishes, Rich ------------------------------------------------------------ Richard A. Friedman, PhD Associate Research Scientist, Biomedical Informatics Shared Resource Herbert Irving Comprehensive Cancer Center (HICCC) Lecturer, Department of Biomedical Informatics (DBMI) Educational Coordinator, Center for Computational Biology and Bioinformatics (C2B2)/ National Center for Multiscale Analysis of Genomic Networks (MAGNet) Room 824 Irving Cancer Research Center Columbia University 1130 St. Nicholas Ave New York, NY 10032 (212)851-4765 (voice) friedman at cancercenter.columbia.edu http://cancercenter.columbia.edu/~friedman/ I am a Bayesian. When I see a multiple-choice question on a test and I don't know the answer I say "eeney-meaney-miney-moe". Rose Friedman, Age 14
Ben Bolker
2011-Apr-22 00:49 UTC
[R] Accounting for overdispersion in a mixed-effect model with a proportion response variable and categorical explanatory variables.
Richard Friedman <friedman <at> cancercenter.columbia.edu> writes:> > Dear R-help-list, > > I have a problem in which the explanatory variables are categorical, > the response variable is a proportion, and experiment contains > technical replicates (pseudoreplicates) as well as biological > replicated. I am new to both generalized linear models and mixed- > effects models and would greatly appreciate the advice of experienced > analysts in this matter. > > I analyzed the data in 4 ways and want to know which is the best way. > The 4 ways are: > > 1. A generalized linear model with binomial error in which the > positive and negative counts for each biological replicate is summed > over technical replicates. > 2. Same as 1 with a quasibinomial error model. > 3. A generalized linear mixed-effects model with binomial error in > which technical replication is treated as a random effect. > 4. A generalized linear mixed-effects model with binomial error in > which technical replication is treated as a random effect and > overdispersion is taken into account by individual level variability.Off the top of my head, I would say that #4 is best.> > Here are the relevant data for each model: > > For everything: > > > sessionInfo() > R version 2.13.0 (2011-04-13) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] lme4_0.999375-39 Matrix_0.999375-50 lattice_0.19-23 > > loaded via a namespace (and not attached): > [1] grid_2.13.0 nlme_3.1-100 stats4_2.13.0 tools_2.13.0X <- read.table("mouse.dat",header=TRUE, colClasses=rep(c("factor","numeric"),c(3,2))) Xagg <- aggregate(cbind(positive,negative)~treatment+mouse,data=X,FUN=sum) ## gets your aggregated data> > 1. A generalized linear model with binomial error in which the > positive and negative counts for each biological replicate is summed > over technical replicates.model <- glm(cbind(positive,negative)~treatment,family=binomial,data=Xagg) If all the assumptions of the model were actually met this might be OK (since if all observations on all mice were really independent, the sum of binomials would be binomial) but the residual deviance suggests overdispersion (dev/df approx. 2)> Null deviance: 60.467 on 17 degrees of freedom > Residual deviance: 28.711 on 14 degrees of freedom > (2 observations deleted due to missingness) > AIC: 160.35(note that your 2 NA observations got dropped during my aggregation step)> > Since Residual deviance >> degrees of freedom I triedyes (I'm working through this before I see your comments, but you seem to be on the right track)> > 2. Same as 1 with a quasibinomial error model. > > > model<-glm(y ~ treatment, quasibinomial) > > summary(model) > > Call: > glm(formula = y ~ treatment, family = quasibinomial) > > Deviance Residuals: > Min 1Q Median 3Q Max > -2.3134 -0.5712 -0.3288 0.8616 2.4352 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) -0.574195 0.052173 -11.006 2.82e-08 *** > treatmentB 0.164364 0.073180 2.246 0.04136 * > treatmentC 0.007025 0.078306 0.090 0.92978 > treatmentD 0.258135 0.077038 3.351 0.00476 ** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > (Dispersion parameter for quasibinomial family taken to be 2.049598) > > Null deviance: 60.467 on 17 degrees of freedom > Residual deviance: 28.711 on 14 degrees of freedom > (2 observations deleted due to missingness) > AIC: NA > > Number of Fisher Scoring iterations: 3 >Seems reasonable.> Analysis of Deviance Table > > Model: quasibinomial, link: logit > > Response: y > > Terms added sequentially (first to last) > > Df Deviance Resid. Df Resid. Dev F Pr(>F) > NULL 17 60.467 > treatment 3 31.756 14 28.711 5.1646 0.01303 * > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > > > I then tried to take the effect of pseudoreplication into account with > 3: > > 3. A generalized linear mixed-effects model with binomial error in > which technical replication is treated as a random effect. > > Here is the input file: >[snip]> Since the non-mixed effects model showed evidence of overdispersion, > and since there is no quasibinomial option in lme4, I tried to account > for overdispesion by including individual level variability with.. > > 4. A generalized linear mixed-effects model with binomial error in > which technical replication is treated as a random effect and > overdispersion is taken into account by individual level variability. >X$obs<-1:nrow(X) library(lme4) ## it doesn't make sense to include treatment in the random effect ## since it is already present as a fixed effect in the model ## "mouse within treatment" (= mouse:treatment) is what you want model2<-lmer(cbind(positive,negative)~treatment+(1|mouse:treatment)+ (1|obs) ,data=X, family=binomial) ## note that mouse:treatment variance comes out to zero model2B<-lmer(cbind(positive,negative)~treatment+(1|mouse:treatment), data=X, family=binomial) anova(model2,model2B) ## and yet model with observation-level RE appears significantly better ## (even though this is a conservative test because the null hypothesis ## is on the boundary) I then compared the 2 models.> > > anova(model,model2,test="F")## note that test="F" is ignored> > 1. Am I correct that, of the 4 models, I should use the mixed effect > model with individual variability? > Although both models make the same effects to be significant, I would > like to know which one I should report and use as input to a > subsequent multiple comparisons analysis with multiicomp.I think so. It may however comfort you to see that the coefficients and their standard errors are practically the same under each model. For the gold standard you may want to do parametric bootstrapping (see ?refit in the most recent (r-forge) version of lme4) I would suggest that future questions along these lines go to the r-sig-mixed-models mailing list ... Ben Bolker
Apparently Analagous Threads
- Mixed effects model where nested factor is not the repeated across treatments lme???
- stupid lm() question
- Overdispersion with binomial distribution
- glm with binomial errors - problem with overdispersion
- Follow-up to Reply: Overdispersion with binomial distribution