thr3ads.net - R help - [R] interpretation of p values for highly correlated logistic analysis [Mar 2010]

claus orourke

2010-Mar-31 11:38 UTC

[R] interpretation of p values for highly correlated logistic analysis

Dear list,

I want to perform a logistic regression analysis with multiple
categorical predictors (i.e., a logit) on some data where there is a
very definite relationship between one predictor and the
response (dependent) variable. The problem I have is that in such a
case the p value goes very high (while I as a naive newbie would
expect it to crash towards 0).

I'll illustrate my problem with some toy data. Say I have the
following data as an input frame:

   roman animal colour
1  alpha    dog black
2   beta    cat white
3  alpha    dog black
4  alpha    cat black
5   beta    dog white
6  alpha    cat black
7  gamma    dog white
8  alpha    cat black
9  gamma    dog white
10  beta    cat white
11 alpha    dog black
12 alpha    cat black
13 gamma    dog white
14 alpha    cat black
15  beta    dog white
16  beta    cat black
17 alpha    cat black
18  beta    dog white

In this toy data you can see that roman:alpha and roman:beta are
pretty good predictors of colour
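
For anyone who wants to reproduce the example, the toy data can be rebuilt and cross-tabulated as sketched below. The data frame name `dat` is an assumption (the post calls it `data`, which masks a base R function), and `stringsAsFactors = TRUE` is set so that `glm` sees factors in current R versions.

```r
# Rebuild the 18-row toy data frame from the listing above.
dat <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  animal = c("dog","cat","dog","cat","dog","cat","dog","cat","dog",
             "cat","dog","cat","dog","cat","dog","cat","cat","dog"),
  colour = c("black","white","black","black","white","black","white","black","white",
             "white","black","black","white","black","white","black","black","white"),
  stringsAsFactors = TRUE
)

# Cross-tabulate predictor level against response.
tab <- table(roman = dat$roman, colour = dat$colour)
print(tab)
```

In the table, every alpha row is black and every gamma row is white; only beta has both colours.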

Let's say I perform logistic analysis directly on the raw data with
colour as a response variable:

> options(contrasts=c("contr.treatment","contr.poly"))
> anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)

then I find that my p values for each individual level coefficient approach 1:

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       -41.65   19609.49  -0.002    0.998
data$romanbeta     42.35   19609.49   0.002    0.998
data$romangamma    43.74   31089.48   0.001    0.999
data$animaldog     20.48   13866.00   0.001    0.999

while I expect the p value for roman:beta to be quite low because it
is a good predictor of colour:white

On the other hand, if I then run an anova with a Chi-sq test on the
result model, I find as I would expect that 'roman' is a good
predictor of colour.

> anova(anal1,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: data$colour

Terms added sequentially (first to last)


            Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                           17    24.7306
data$roman   2  19.3239        15     5.4067 6.366e-05 ***
data$animal  1   1.5876        14     3.8191    0.2077
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can anyone please explain why my p value is so high for the individual levels?

Sorry for what is likely a stupid question.

Claus

p.s., when I run logistic analysis on data that is more 'randomised'
everything comes out as I expect.

Ben Bolker

2010-Mar-31 12:03 UTC

claus orourke <claus.orourke <at> gmail.com> writes:

  look up "Hauck-Donner effect" ...
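
A minimal sketch of what that pointer means for the toy data, under the assumption that the data frame is rebuilt as `dat` (a name not used in the original post). In the data, every alpha row is black and every gamma row is white, so the maximum likelihood estimates for those levels drift towards infinity, the standard errors blow up, and the Wald z tests in `summary()` become uninformative. A likelihood ratio test compares deviances instead and does not depend on those standard errors. For simplicity the sketch fits `roman` alone rather than the two-predictor model from the post.

```r
# Toy data from the original post; 'dat' is an assumed name.
dat <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  colour = c("black","white","black","black","white","black","white","black","white",
             "white","black","black","white","black","white","black","black","white"),
  stringsAsFactors = TRUE
)

# Null model and roman-only model. glm() may warn about fitted
# probabilities of 0 or 1 for the second fit, hence suppressWarnings().
fit0 <- glm(colour ~ 1, family = binomial, data = dat)
fit1 <- suppressWarnings(glm(colour ~ roman, family = binomial, data = dat))

# Wald tests: one z statistic per coefficient, based on the standard errors.
wald <- coef(summary(fit1))
print(wald)

# Likelihood ratio test for roman as a whole: difference in deviance.
lrt <- anova(fit0, fit1, test = "Chisq")
lr_stat <- lrt$Deviance[2]
lr_p    <- pchisq(lr_stat, df = lrt$Df[2], lower.tail = FALSE)
print(c(statistic = lr_stat, p.value = lr_p))
```

The deviance difference here corresponds to the `data$roman` row of the `anova` table in the original post, which is why that table and the coefficient table disagree so sharply.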

Stephan Kolassa

2010-Mar-31 14:02 UTC

Hi Claus,

welcome to the wonderful world of collinearity (or multicollinearity, as 
some call it)! You have a near linear relationship between some of your 
predictors, which can (and in your case does) lead to extreme parameter 
estimates, which in some cases almost cancel out (a coefficient of +/-40 
on a categorical variable in logistic regression is a lot, and the 
intercept and two of the roman parameter estimates almost cancel out) 
but which are rather unstable (hence your high p-values).

Belsley, Kuh and Welsch did some work on condition indices and variance 
decomposition proportions, and variance inflation factors are quite 
popular for diagnosing multicollinearity - google these terms for a bit, 
and enlightenment will surely follow.
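
As a rough sketch of such a diagnostic in base R, condition indices can be computed from the design matrix after scaling each column to unit length, which is the scaling Belsley, Kuh and Welsch use. The data frame name `dat` is an assumption, and only the two predictors from the original post are needed; no threshold or result is claimed here.

```r
# Predictors from the toy data in the original post; 'dat' is an assumed name.
dat <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  animal = c("dog","cat","dog","cat","dog","cat","dog","cat","dog",
             "cat","dog","cat","dog","cat","dog","cat","cat","dog"),
  stringsAsFactors = TRUE
)

# Design matrix with treatment contrasts (intercept + dummy columns).
X <- model.matrix(~ roman + animal, data = dat)

# Scale columns to unit Euclidean length, without centering.
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))

# Condition indices: largest singular value divided by each singular value.
d <- svd(Xs)$d
cond_idx <- max(d) / d
print(round(cond_idx, 2))

# How the two predictors are distributed against each other.
print(table(roman = dat$roman, animal = dat$animal))
```

Large condition indices would point to near-linear dependence among the predictor columns; the cross-table shows directly which combinations of levels are rare or missing.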

What can you do? You should definitely think long and hard about your 
data. Should you be doing separate regressions for some factor levels? 
Should you drop a factor from the analysis? Should you do a categorical 
analogue of Principal Components Analysis on your data before the 
regression? I personally have never done this, but correspondence
analysis has been recommended as a "discrete alternative" to PCA on
this list, see a couple of books by M. J. Greenacre.

Best of luck!
Stephan


