claus orourke
2010-Mar-31 11:38 UTC
[R] interpretation of p values for highly correlated logistic analysis
Dear list,

I want to perform a logistic regression analysis with multiple categorical predictors (i.e., a logit model) on some data where there is a very definite relationship between one predictor and the response (dependent) variable. The problem I have is that in such a case the p value goes very high (while I, as a naive newbie, would expect it to crash towards 0).

I'll illustrate my problem with some toy data. Say I have the following data as an input frame:

   roman animal colour
1  alpha    dog  black
2   beta    cat  white
3  alpha    dog  black
4  alpha    cat  black
5   beta    dog  white
6  alpha    cat  black
7  gamma    dog  white
8  alpha    cat  black
9  gamma    dog  white
10  beta    cat  white
11 alpha    dog  black
12 alpha    cat  black
13 gamma    dog  white
14 alpha    cat  black
15  beta    dog  white
16  beta    cat  black
17 alpha    cat  black
18  beta    dog  white

In this toy data you can see that roman:alpha and roman:beta are pretty good predictors of colour.

Let's say I perform logistic analysis directly on the raw data with colour as a response variable:

> options(contrasts=c("contr.treatment","contr.poly"))
> anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)

Then I find that my p values for each individual level coefficient approach 1:

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       -41.65   19609.49  -0.002    0.998
data$romanbeta     42.35   19609.49   0.002    0.998
data$romangamma    43.74   31089.48   0.001    0.999
data$animaldog     20.48   13866.00   0.001    0.999

while I expect the p value for roman:beta to be quite low because it is a good predictor of colour:white.

On the other hand, if I then run an anova with a Chi-sq test on the resulting model, I find, as I would expect, that 'roman' is a good predictor of colour.

> anova(anal1,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: data$colour

Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                           17    24.7306
data$roman   2  19.3239        15     5.4067 6.366e-05 ***
data$animal  1   1.5876        14     3.8191    0.2077
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can anyone please explain why my p value is so high for the individual levels?

Sorry for what is likely a stupid question.

Claus

p.s., when I run logistic analysis on data that is more 'randomised' everything comes out as I expect.
Ben Bolker
2010-Mar-31 12:03 UTC
[R] interpretation of p values for highly correlated logistic analysis
claus orourke <claus.orourke <at> gmail.com> writes:

Look up "Hauck-Donner effect" ...
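For readers following the pointer above: the Hauck-Donner effect is the behaviour of the Wald z statistic in logistic regression when an effect is very large. The standard error grows faster than the estimate, so z shrinks towards 0 and its p value drifts towards 1, while a likelihood ratio test on the same term is not affected in this way. The sketch below is not from the thread. It re-types the 18 rows of Claus's toy data into a data frame and compares the two kinds of test. The names `toy`, `fit_full` and `fit_no_roman` are made up for illustration, and the likelihood ratio test here drops `roman` from the full model, so it differs from the sequential table in the original post.

```r
# Toy data re-typed from Claus's post (18 rows); object names are illustrative.
toy <- data.frame(
  roman  = factor(c("alpha", "beta", "alpha", "alpha", "beta", "alpha",
                    "gamma", "alpha", "gamma", "beta", "alpha", "alpha",
                    "gamma", "alpha", "beta", "beta", "alpha", "beta")),
  animal = factor(c("dog", "cat", "dog", "cat", "dog", "cat",
                    "dog", "cat", "dog", "cat", "dog", "cat",
                    "dog", "cat", "dog", "cat", "cat", "dog")),
  colour = factor(c("black", "white", "black", "black", "white", "black",
                    "white", "black", "white", "white", "black", "black",
                    "white", "black", "white", "black", "black", "white"))
)

# Full model. glm() is expected to warn about fitted probabilities of 0 or 1
# on data like this, so the warnings are silenced for the sketch.
fit_full <- suppressWarnings(
  glm(colour ~ roman + animal, family = binomial, data = toy)
)

# Wald p values, one per coefficient (the column shown by summary()).
wald_p <- coef(summary(fit_full))[, "Pr(>|z|)"]

# Likelihood ratio test for 'roman': compare against the model without it.
fit_no_roman <- glm(colour ~ animal, family = binomial, data = toy)
lr_stat <- deviance(fit_no_roman) - deviance(fit_full)
lr_df   <- df.residual(fit_no_roman) - df.residual(fit_full)
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(wald_p, 3))
print(lr_p)
```

On this data the Wald p values should come out close to 1, as in Claus's output, while the likelihood ratio p value for `roman` is small. That contrast is the pattern the term refers to.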
Stephan Kolassa
2010-Mar-31 14:02 UTC
[R] interpretation of p values for highly correlated logistic analysis
Hi Claus,

welcome to the wonderful world of collinearity (or multicollinearity, as some call it)! You have a near-linear relationship between some of your predictors, which can (and in your case does) lead to extreme parameter estimates. A coefficient of +/-40 on a categorical variable in logistic regression is a lot, and the intercept and two of the roman parameter estimates almost cancel out. Such estimates are rather unstable, hence your high p-values.

Belsley, Kuh and Welsch did some work on condition indices and variance decomposition proportions, and variance inflation factors are quite popular for diagnosing multicollinearity. Google these terms for a bit, and enlightenment will surely follow.

What can you do? You should definitely think long and hard about your data. Should you be doing separate regressions for some factor levels? Should you drop a factor from the analysis? Should you do a categorical analogue of Principal Components Analysis on your data before the regression? I personally have never done this, but correspondence analysis has been recommended as a "discrete alternative" to PCA on this list; see a couple of books by M. J. Greenacre.

Best of luck!
Stephan
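The diagnostics mentioned in this reply can be tried out in base R. The following is only a sketch under stated assumptions, not Stephan's code. It re-types Claus's toy data, computes condition indices from the singular values of the design matrix after scaling its columns to unit length (my reading of the Belsley, Kuh and Welsch recipe), and cross-tabulates the factors so that empty or near-empty cells are visible. The names `toy`, `X_scaled` and `cond_idx` are made up for illustration.

```r
# Toy data re-typed from Claus's post (18 rows); object names are illustrative.
toy <- data.frame(
  roman  = factor(c("alpha", "beta", "alpha", "alpha", "beta", "alpha",
                    "gamma", "alpha", "gamma", "beta", "alpha", "alpha",
                    "gamma", "alpha", "beta", "beta", "alpha", "beta")),
  animal = factor(c("dog", "cat", "dog", "cat", "dog", "cat",
                    "dog", "cat", "dog", "cat", "dog", "cat",
                    "dog", "cat", "dog", "cat", "cat", "dog")),
  colour = factor(c("black", "white", "black", "black", "white", "black",
                    "white", "black", "white", "white", "black", "black",
                    "white", "black", "white", "black", "black", "white"))
)

# Design matrix of the predictors only (intercept plus treatment dummies).
X <- model.matrix(~ roman + animal, data = toy)

# Scale each column to unit length, then take singular values.
# Assumption: this mirrors the usual condition-index recipe.
X_scaled <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sing_vals <- svd(X_scaled)$d

# One condition index per singular value: largest value divided by each.
cond_idx <- max(sing_vals) / sing_vals

# Cross-tabulations: predictor against predictor, and predictor against response.
tab_predictors <- table(toy$roman, toy$animal)
tab_roman      <- table(toy$roman, toy$colour)
tab_animal     <- table(toy$animal, toy$colour)

print(round(cond_idx, 2))
print(tab_predictors)
print(tab_roman)
print(tab_animal)
```

In the posted data every alpha row is black and every gamma row is white, so two cells of the roman-by-colour table are empty. It is worth looking at that table alongside the condition indices before deciding which of the remedies above to try.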