Hi, I am running a logistic regression on a simple dataset (attached) using glm:

> dat <- read.table("dat.txt", sep='\t', header=T)

If I use summary() on a logistic model:

> summary(glm(y ~ x1*x2, dat, family='binomial'))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    19.57    5377.01   0.004    0.997
x1            -18.59    5377.01  -0.003    0.997
x2B           -19.57    5377.01  -0.004    0.997
x1:x2B         38.15    7604.24   0.005    0.996

As you can see, the interaction term is very insignificant (p = 0.996)! But if I use anova() to compare a full vs. reduced model to evaluate the interaction term:

> anova(glm(y ~ x1+x2, dat, family='binomial'),
+       glm(y ~ x1*x2, dat, family='binomial'))

Analysis of Deviance Table

Model 1: y ~ x1 + x2
Model 2: y ~ x1 * x2
  Resid. Df Resid. Dev Df Deviance
1        22     27.067
2        21     21.209  1   5.8579

This deviance difference follows a chi-square distribution with 1 df, so the corresponding p value is:

> 1-pchisq(5.8679,1)
[1] 0.01541944

So I get very different p values on the interaction term, can someone share what's going wrong here?

Thanks!

Yi

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dat.txt
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20170531/7d302080/attachment.txt>
Dear Yi,

On Wed, 31 May 2017 04:53:25 +0000 (UTC)
array chip via R-help <r-help at r-project.org> wrote:

> Hi, I am running a logistic regression on a simple dataset (attached)
> using glm: [...] As you can see, the interaction term is very
> insignificant (p = 0.996)!

Well, none of the terms is significant (actually, AFAIK, the phrase "very insignificant" does not exist; and if it does, then it ought not). But look at the estimates and the standard errors too (might be easier if you had formatted your e-mail in a way that was readable)! What does an estimate for the intercept of 19.57 on the linear predictor scale mean? What is the estimated probability if you transform back?

Perhaps the following command

R> xtabs(~ x1 + x2 + y, dat)

will shed some more light on what is going on in your data set, and why the interaction term is highly significant.

> But if I use anova() to compare a full
> vs reduced model to evaluate the interaction term:
>> anova(glm(y~x1+x2,dat,family='binomial'),
>>       glm(y~x1*x2,dat,family='binomial'))

Do you know about the (optional) argument 'test="Chisq"' for the anova() command if you use it to compare models fitted by glm()? (See help(anova.glm).)

> So I get very different p values on the interaction term, can someone
> share what's going wrong here?

Data separation, a.k.a. the Hauck-Donner phenomenon, discussed in any good book on logistic regression.

Best wishes,

        Berwin

========================== Full address ===========================
A/Prof Berwin A Turlach, Director      Tel.: +61 (8) 6488 3338 (secr)
Centre for Applied Statistics                +61 (8) 6488 3383 (self)
School of Maths and Stats (M019)       FAX : +61 (8) 6488 1028
The University of Western Australia
35 Stirling Highway                e-mail: Berwin.Turlach at gmail.com
Crawley WA 6009          http://staffhome.ecm.uwa.edu.au/~00043886/
Australia                http://www.researcherid.com/rid/A-4995-2008
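[The two checks suggested above can be sketched as follows. The data frame here is invented for illustration, since the thread's dat.txt is only available as an attachment; only the variable names (x1, x2, y) match the thread.]

```r
## Invented stand-in for the thread's dat.txt (same structure, made-up values).
set.seed(1)
dat <- data.frame(
  x1 = rep(0:1, each = 12),
  x2 = rep(rep(c("A", "B"), each = 6), 2),
  y  = rbinom(24, 1, 0.5)
)

## Cross-tabulate to see whether any treatment cell contains only 0s or
## only 1s -- the hallmark of separation:
xtabs(~ x1 + x2 + y, dat)

## Likelihood-ratio test of the interaction; with test="Chisq" the anova()
## method computes the chi-square p value for you:
anova(glm(y ~ x1 + x2, dat, family = binomial),
      glm(y ~ x1 * x2, dat, family = binomial),
      test = "Chisq")
```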
On 31/05/17 16:53, array chip via R-help wrote:
> Hi, I am running a logistic regression on a simple dataset (attached) using glm:
>> dat<-read.table("dat.txt",sep='\t',header=T)
> If I use summary() on a logistic model:
>> summary(glm(y~x1*x2,dat,family='binomial'))
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)    19.57    5377.01   0.004    0.997
> x1            -18.59    5377.01  -0.003    0.997
> x2B           -19.57    5377.01  -0.004    0.997
> x1:x2B         38.15    7604.24   0.005    0.996
> As you can see, the interaction term is very insignificant (p = 0.996)!
> But if I use anova() to compare a full vs reduced model to evaluate the interaction term:
>> anova(glm(y~x1+x2,dat,family='binomial'), glm(y~x1*x2,dat,family='binomial'))
> Analysis of Deviance Table
> Model 1: y ~ x1 + x2
> Model 2: y ~ x1 * x2
>   Resid. Df Resid. Dev Df Deviance
> 1        22     27.067
> 2        21     21.209  1   5.8579
> This follows a chi-square distribution with 1 df, so the corresponding p value is:
>> 1-pchisq(5.8679,1)
> [1] 0.01541944
> So I get very different p values on the interaction term, can someone share what's going wrong here?

(1) To re-emphasize Berwin Turlach's exhortation, ***PLEASE*** do not post in HTML; your results are nearly impossible to read.

(2) To add a tiny bit to Berwin's comments: there is something a bit weird about your data. Note that if you look at the fitted values from your fit, you get only *three* distinct values --- whereas there should be four, since there are four distinct treatment combinations: (0,A), (0,B), (1,A) and (1,B). And all possible treatment combinations do indeed occur. The deficiency occurs because the last three of your coefficients (the non-intercept terms) sum to (effectively) 0. Thus you get the same fitted value from the (1,B) combination as from the (0,A) combination.

It is not clear to me what the implications of all this are --- particularly since it's getting on for my bed time :-) --- but it seems clear that this sort of "degeneracy" is going to mess things up.
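[The fitted-values check described above can be sketched like this. The data frame is invented for illustration; in the thread, 'dat' comes from the attached dat.txt, and it is that fit which shows only three distinct fitted values.]

```r
## Invented stand-in data (same structure as the thread's dat.txt).
set.seed(42)
dat <- data.frame(x1 = rep(0:1, each = 12),
                  x2 = rep(c("A", "B"), 12),
                  y  = rbinom(24, 1, 0.5))
fit <- glm(y ~ x1 * x2, dat, family = binomial)

## Four treatment combinations should normally give four distinct fitted
## probabilities; the degenerate fit in the thread gives only three:
length(unique(round(fitted(fit), 8)))

## The degeneracy arises because the non-intercept coefficients sum to
## (effectively) zero, so the (1,B) linear predictor
## (Intercept + x1 + x2B + x1:x2B) equals that of (0,A) (Intercept alone):
sum(coef(fit)[c("x1", "x2B", "x1:x2B")])
```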
And perhaps contribute to the Hauck-Donner effect that Berwin told you about.

I would also advise you to try experimenting with simulated data. (When in doubt, simulate. Even when not in doubt, simulate. That is my guiding principle.) I.e. try to simulate y-values from a coefficient vector that does not suffer from the "degeneracy", fit a model to the simulated y-values, and compare the two tests. They still won't agree, but they will (surely?) be less wildly discrepant.

cheers,

Rolf Turner

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
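[The simulation suggested above might be sketched as follows. All coefficient values and the sample size are invented for illustration; the point is only that with a well-behaved coefficient vector and no separation, the Wald and likelihood-ratio p values for the interaction come out close to each other.]

```r
## Simulate y from moderate, non-degenerate coefficients, then compare
## the Wald test (from summary) with the likelihood-ratio test (from anova).
set.seed(123)
n   <- 200
x1  <- rep(0:1, each = n/2)
x2  <- factor(rep(c("A", "B"), n/2))
eta <- -0.5 + 1.0*x1 + 0.5*(x2 == "B") + 1.0*x1*(x2 == "B")
y   <- rbinom(n, 1, plogis(eta))
dat <- data.frame(x1, x2, y)

full    <- glm(y ~ x1 * x2, dat, family = binomial)
reduced <- glm(y ~ x1 + x2, dat, family = binomial)

## Wald p value for the interaction term:
summary(full)$coefficients["x1:x2B", "Pr(>|z|)"]

## Likelihood-ratio p value for the same term:
anova(reduced, full, test = "Chisq")[2, "Pr(>Chi)"]

## With separation, by contrast, the Wald test collapses (the standard
## errors blow up, as in the thread's output) while the LRT stays usable.
```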