Jürg Schulze
2011-Mar-02 10:01 UTC
[R] problem with glm(family=binomial) when some levels have only 0 proportion values
Hello everybody I want to compare the proportions of germinated seeds (seed batches of size 10) of three plant types (1,2,3) with a glm with binomial data (following the method in Crawley: Statistics,an introduction using R, p.247). The problem seems to be that in two plant types (2,3) all plants have proportions = 0. I give you my data and the model I'm running: success failure type [1,] 0 10 3 [2,] 0 10 2 [3,] 0 10 2 [4,] 0 10 2 [5,] 0 10 2 [6,] 0 10 2 [7,] 0 10 2 [8,] 4 6 1 [9,] 4 6 1 [10,] 3 7 1 [11,] 5 5 1 [12,] 7 3 1 [13,] 4 6 1 [14,] 0 10 3 [15,] 0 10 3 [16,] 0 10 3 [17,] 0 10 3 [18,] 0 10 3 [19,] 0 10 3 [20,] 0 10 2 [21,] 0 10 2 [22,] 0 10 2 [23,] 9 1 1 [24,] 6 4 1 [25,] 4 6 1 [26,] 0 10 3 [27,] 0 10 3 y<- cbind(success, failure) Call: glm(formula = y ~ type, family = binomial) Deviance Residuals: Min 1Q Median 3Q -1.3521849 -0.0000427 -0.0000427 -0.0000427 Max 2.6477556 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.04445 0.21087 0.211 0.833 typeFxC -23.16283 6696.13233 -0.003 0.997 typeFxD -23.16283 6696.13233 -0.003 0.997 (Dispersion parameter for binomial family taken to be 1) Null deviance: 134.395 on 26 degrees of freedom Residual deviance: 12.622 on 24 degrees of freedom AIC: 42.437 Number of Fisher Scoring iterations: 20 Huge standard errors are calculated and there is no difference between plant type 1 and 2 or between plant type 1 and 3. If I add 1 to all successes, so that all the 0 values disappear, the standard error becomes lower and I find highly significant differences between the plant types. suc<- success + 1 fail<- 11 - suc Y<- cbind(suc,fail) Call: glm(formula = Y ~ type, family = binomial) Deviance Residuals: Min 1Q Median 3Q -1.279e+00 -4.712e-08 -4.712e-08 0.000e+00 Max 2.584e+00 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.2231 0.2023 1.103 0.27 typeFxC -2.5257 0.4039 -6.253 4.02e-10 *** typeFxD -2.5257 0.4039 -6.253 4.02e-10 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 86.391 on 26 degrees of freedom Residual deviance: 11.793 on 24 degrees of freedom AIC: 76.77 Number of Fisher Scoring iterations: 4 So I think the 0 values of all plants of group 2 and 3 are the problem, do you agree? I don't know why this is a problem, or how I can explain to a reviewer why a data transformation (+ 1) is necessary with such a dataset. I would greatly appreciate any comments. Juerg ______________________________________ J?rg Schulze Department of Environmental Sciences Section of Conservation Biology University of Basel St. Johanns-Vorstadt 10 4056 Basel, Switzerland Tel.: ++41/61/267 08 47
csrabak
2011-Mar-02 15:01 UTC
[R] problem with glm(family=binomial) when some levels have only 0 proportion values
Em 2/3/2011 08:01, J?rg Schulze escreveu:> Hello everybodyThis is not a R related problem, but rather more theoretic one, anyway:> > I want to compare the proportions of germinated seeds (seed batches of > size 10) of three plant types (1,2,3) with a glm with binomial data > (following the method in Crawley: Statistics,an introduction using R, > p.247). > The problem seems to be that in two plant types (2,3) all plants have > proportions = 0. > I give you my data and the model I'm running: > > success failure type > [1,] 0 10 3[snipped]> [26,] 0 10 3 > [27,] 0 10 3 > > y<- cbind(success, failure) > > Call: > glm(formula = y ~ type, family = binomial) > > Deviance Residuals: > Min 1Q Median 3Q > -1.3521849 -0.0000427 -0.0000427 -0.0000427 > Max > 2.6477556 > > Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) 0.04445 0.21087 0.211 0.833 > typeFxC -23.16283 6696.13233 -0.003 0.997 > typeFxD -23.16283 6696.13233 -0.003 0.997 > > (Dispersion parameter for binomial family taken to be 1) > > Null deviance: 134.395 on 26 degrees of freedom > Residual deviance: 12.622 on 24 degrees of freedom > AIC: 42.437 > > Number of Fisher Scoring iterations: 20 > > > Huge standard errors are calculated and there is no difference between > plant type 1 and 2 or between plant type 1 and 3. > If I add 1 to all successes, so that all the 0 values disappear, the > standard error becomes lower and I find highly significant differences > between the plant types. > > suc<- success + 1 > fail<- 11 - suc > Y<- cbind(suc,fail) > > Call: > glm(formula = Y ~ type, family = binomial) > > Deviance Residuals: > Min 1Q Median 3Q > -1.279e+00 -4.712e-08 -4.712e-08 0.000e+00 > Max > 2.584e+00 > > Coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) 0.2231 0.2023 1.103 0.27 > typeFxC -2.5257 0.4039 -6.253 4.02e-10 *** > typeFxD -2.5257 0.4039 -6.253 4.02e-10 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > (Dispersion parameter for binomial family taken to be 1) > > Null deviance: 86.391 on 26 degrees of freedom > Residual deviance: 11.793 on 24 degrees of freedom > AIC: 76.77 > > Number of Fisher Scoring iterations: 4 > > > So I think the 0 values of all plants of group 2 and 3 are the problem, > do you agree?It depends on the definition of "problem" here, if the result of your experiment, maybe, for the difference in the two regressions, not.> I don't know why this is a problem, or how I can explain to a reviewer > why a data transformation (+ 1) is necessary with such a dataset.You need to ascertain the modeling of your statistic test against the epistemological analysis you're performing. Caveat: I'm not an expert in agriculture, so this is just a comment. If the success rates of your dataframe are the germinations of three types of plants in a certain period of time, then perhaps it could make sense to add one to all the values in the success column (and subtract ones from the failure?) because that would cope with the possibility that a certain time after the experiment has been stopped, it could have germinated. If in the other hand, the non germinated seeds are known to not germinate anymore, then the calculation device would put you on wrong path.> > I would greatly appreciate any comments.Get a look at the zero inflated (and perhaps hurdle as well) distributions and the regressions associated with them. Using sos I get more than 100 entries to look at, so I'll refrain to put specific links here. HTH -- Cesar Rabak DC Consulting LTDA
Robert A LaBudde
2011-Mar-02 15:08 UTC
[R] problem with glm(family=binomial) when some levels have only 0 proportion values
The algorithm is not converging. Your iterations are at the maximum. It won't do any good to add a fractional number to all data, as the result will depend on the number added (try 1.0, 0.5 and 0.1 to see this). The root problem is that your data are degenerate. Firstly, your types '2' and '3' are indistinguishable in your data. Secondly, consider the case without 'type'. If you have all zero data for 10 trials, you cannot discriminate among mu = 0, 0.00001, 0.0001, 0.001 or 0.01. This leads to numerical instability. Thirdly, the variance estimate in the IRLS will start at 0.0, which gives a singularity. Fundamentally, the algorithm is failing because you are at the boundary of possibilities for a parameter, so special techniques are needed to do maximum likelihood estimation. The simple solution is to deal with the data for your types separately. Another is to do more batches for '2' and '3' to get an observed failure. At 05:01 AM 3/2/2011, J?rg Schulze wrote:>Hello everybody > >I want to compare the proportions of germinated seeds (seed batches of >size 10) of three plant types (1,2,3) with a glm with binomial data >(following the method in Crawley: Statistics,an introduction using R, >p.247). >The problem seems to be that in two plant types (2,3) all plants have >proportions = 0. >I give you my data and the model I'm running: > > success failure type > [1,] 0 10 3 > [2,] 0 10 2 > [3,] 0 10 2 > [4,] 0 10 2 > [5,] 0 10 2 > [6,] 0 10 2 > [7,] 0 10 2 > [8,] 4 6 1 > [9,] 4 6 1 >[10,] 3 7 1 >[11,] 5 5 1 >[12,] 7 3 1 >[13,] 4 6 1 >[14,] 0 10 3 >[15,] 0 10 3 >[16,] 0 10 3 >[17,] 0 10 3 >[18,] 0 10 3 >[19,] 0 10 3 >[20,] 0 10 2 >[21,] 0 10 2 >[22,] 0 10 2 >[23,] 9 1 1 >[24,] 6 4 1 >[25,] 4 6 1 >[26,] 0 10 3 >[27,] 0 10 3 > > y<- cbind(success, failure) > > Call: >glm(formula = y ~ type, family = binomial) > >Deviance Residuals: > Min 1Q Median 3Q >-1.3521849 -0.0000427 -0.0000427 -0.0000427 > Max > 2.6477556 > >Coefficients: > Estimate Std. Error z value Pr(>|z|) >(Intercept) 0.04445 0.21087 0.211 0.833 >typeFxC -23.16283 6696.13233 -0.003 0.997 >typeFxD -23.16283 6696.13233 -0.003 0.997 > >(Dispersion parameter for binomial family taken to be 1) > > Null deviance: 134.395 on 26 degrees of freedom >Residual deviance: 12.622 on 24 degrees of freedom >AIC: 42.437 > >Number of Fisher Scoring iterations: 20 > > >Huge standard errors are calculated and there is no difference between >plant type 1 and 2 or between plant type 1 and 3. >If I add 1 to all successes, so that all the 0 values disappear, the >standard error becomes lower and I find highly significant differences >between the plant types. > >suc<- success + 1 >fail<- 11 - suc >Y<- cbind(suc,fail) > >Call: >glm(formula = Y ~ type, family = binomial) > >Deviance Residuals: > Min 1Q Median 3Q >-1.279e+00 -4.712e-08 -4.712e-08 0.000e+00 > Max > 2.584e+00 > >Coefficients: > Estimate Std. Error z value Pr(>|z|) >(Intercept) 0.2231 0.2023 1.103 0.27 >typeFxC -2.5257 0.4039 -6.253 4.02e-10 *** >typeFxD -2.5257 0.4039 -6.253 4.02e-10 *** >--- >Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > >(Dispersion parameter for binomial family taken to be 1) > > Null deviance: 86.391 on 26 degrees of freedom >Residual deviance: 11.793 on 24 degrees of freedom >AIC: 76.77 > >Number of Fisher Scoring iterations: 4 > > >So I think the 0 values of all plants of group 2 and 3 are the >problem, do you agree? >I don't know why this is a problem, or how I can explain to a reviewer >why a data transformation (+ 1) is necessary with such a dataset. > >I would greatly appreciate any comments. >Juerg >______________________________________ > >J?rg Schulze >Department of Environmental Sciences >Section of Conservation Biology >University of Basel >St. Johanns-Vorstadt 10 >4056 Basel, Switzerland >Tel.: ++41/61/267 08 47 > >______________________________________________ >R-help at r-project.org mailing list >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code.===============================================================Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: ral at lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 "Vere scire est per causas scire" ================================================================