thr3ads.net - R help - [R] Logistic regression X^2 test with large sample size (fwd) [Jul 2012]

M Pomati

2012-Jul-31 15:35 UTC

[R] Logistic regression X^2 test with large sample size (fwd)

Does anyone know of any X^2 tests to compare the fit of logistic models 
which factor out the sample size? I'm dealing with a very large sample and 
I fear the significant X^2 test I get when adding a variable to the model 
is simply a result of the sample size (>200,000 cases).

I'd rather use the whole dataset instead of taking (small) random samples 
as it is highly skewed. I've seen things like Phi and Cramer's V for 
crosstabs but I'm not sure whether they have been used before on logistic 
regression, if there are better ones and if there are any packages.


Many thanks

Marco


Marc Schwartz

2012-Jul-31 16:50 UTC

On Jul 31, 2012, at 10:35 AM, M Pomati <Marco.Pomati at bristol.ac.uk> wrote:
> 
> Does anyone know of any X^2 tests to compare the fit of logistic models 
> which factor out the sample size? I'm dealing with a very large sample and 
> I fear the significant X^2 test I get when adding a variable to the model 
> is simply a result of the sample size (>200,000 cases).
> 
> I'd rather use the whole dataset instead of taking (small) random samples 
> as it is highly skewed. I've seen things like Phi and Cramer's V for 
> crosstabs but I'm not sure whether they have been used before on logistic 
> regression, if there are better ones and if there are any packages.
> 
> 
> Many thanks
> 
> Marco

Sounds like you are bordering on some type of stepwise approach to including or
not including covariates in the model. You can search the list archives for a
myriad of discussions as to why that is a poor approach.

You have the luxury of a large sample. You also have the challenge of
interpreting covariates that appear to be statistically significant, but may
have a rather small *effect size* in context. That is where subject matter
experts need to provide input as to interpretation of the contextual
significance of the variable, as opposed to the statistical significance of that
same variable.
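
To make the gap between statistical and contextual significance concrete, here is a minimal sketch in Python (standard library only). The 2x2 table is invented for illustration and is not Marco's data: two groups of 100,000 cases with event rates of 10.0% and 10.5%. It uses the identity X^2 = n * phi^2 for a 2x2 crosstab, which is the Phi measure mentioned in the question; it is a crosstab illustration, not a test on the logistic model itself.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson X^2 (1 df, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical counts: rows are groups, columns are (event, no event).
a, b = 10_000, 90_000   # group 1: 10.0% events
c, d = 10_500, 89_500   # group 2: 10.5% events
n = a + b + c + d

chi2 = pearson_chi2_2x2(a, b, c, d)
# Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
# Phi divides out the sample size: phi = sqrt(X^2 / n).
phi = math.sqrt(chi2 / n)

print(f"X^2 = {chi2:.2f}, p = {p_value:.5f}, phi = {phi:.4f}")
```

With these made-up counts the test is clearly significant at conventional levels, yet phi is below 0.01. Whether a half-point difference in event rate matters is the question for the subject matter experts.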

A general approach is to pre-specify your model based upon subject matter
considerations rather than on the results of significance tests. You also need
to determine whether your goal for the model is prediction or explanation.

What is the incidence of your 'event' in the sample? If it is say 10%,
then you should have around 20,000 events. The rule of thumb for logistic
regression is to have around 20 events per covariate degree of freedom (df) to
minimize the risk of over-fitting the model to your dataset. A continuous
covariate is 1 df, a k-level factor is k-1 df. So with 20,000 events, your model
could feasibly have 1,000 covariate df. I am guessing that you don't
have that much independent data to begin with.
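
The arithmetic above can be sketched as a small Python helper. The function name and the use of the smaller of events and non-events as the limiting count are assumptions added for illustration; at 10% incidence the limiting count is simply the number of events, so the result matches the figure above.

```python
def max_covariate_df(n_cases, incidence, events_per_df=20):
    """Rough covariate df budget under an events-per-df rule of thumb.

    Hypothetical helper: the limiting sample size is taken as the smaller
    of the event and non-event counts (an assumption, see lead-in).
    """
    events = round(n_cases * incidence)
    non_events = n_cases - events
    limiting = min(events, non_events)
    return limiting // events_per_df

# The example from the post: 200,000 cases at 10% incidence.
budget = max_covariate_df(200_000, 0.10)
print(budget)  # 20,000 events / 20 events per df

# A rarer event shrinks the budget quickly.
rare_budget = max_covariate_df(200_000, 0.01)
print(rare_budget)
```

The budget is an upper bound on model complexity, not a target, and it is only as good as the rule of thumb behind it.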

So, pre-specify your model on the full dataset and stick with it. Interact with
subject matter experts on the interpretation of the model.

BTW, this question is really about statistical modeling generally, not specific
to R. Such queries are best posed to general statistical lists/forums such
as Stack Exchange. I would also point you to Frank Harrell's book,
Regression Modeling Strategies.

Regards,

Marc Schwartz
