On Jun 4, 2012, at 3:47 PM, David Studer wrote:
> Hi everybody!
>
> I have a sample with n=2.000. This sample contains rare events (10, 20, 30
> individuals with a specific illness).
> Now I'd like to do a logistic regression in order to identify risk factors.
> I have several independent variables on an interval scale.
>
> Does anyone know whether the number of these rare events is sufficient in
> order to calculate a multivariate
> logistic regression? Or are there any alternative models I should use?
> (which are available in R)
>
> Thank you very much for any advice!
> David

The quick answer is yes, you can, but you will be very limited in how many
covariates you can include in each of the respective models.

You are looking at event rates of 0.5%, 1.0% and 1.5% (10, 20 and 30 events
out of 2,000), which in my experience are not truly "rare", per se. We had a
recent post with an event rate on the order of 0.006%, albeit with millions of
records. That is rare... :-)

Typical "rules of thumb" for avoiding over-fitting in logistic regression (LR)
models suggest that you should have between 10 and 20 "events" per covariate
degree of freedom (df). A continuous covariate entered as a single linear term
uses 1 df; an N-level factor uses N-1 df.

With your sample and the number of events, you would be limited to perhaps no
more than 2 or 3 covariate df (fewer still with only 10 events), and even then
you should give consideration to using penalization to avoid over-fitting.
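
To make the arithmetic behind that limit concrete, here is a rough sketch in
base R. It only applies the 10 and 20 events-per-df rules of thumb to the three
event counts from the question. The candidate model at the end (two continuous
covariates plus a 3-level factor) is a made-up example, not something from the
original post.

```r
# Events-per-df budget for the three scenarios in the question.
n      <- 2000              # total sample size
events <- c(10, 20, 30)     # individuals with the illness in each scenario

event_rate <- events / n    # 0.005, 0.010, 0.015

# Maximum covariate df under the lenient (10 events per df) and
# strict (20 events per df) rules of thumb
max_df_lenient <- floor(events / 10)   # 1, 2, 3
max_df_strict  <- floor(events / 20)   # 0, 1, 1

budget <- data.frame(events, event_rate, max_df_lenient, max_df_strict)
print(budget)

# Hypothetical candidate model: two continuous covariates (1 df each)
# plus one 3-level factor (3 - 1 = 2 df)
df_needed <- 1 + 1 + (3 - 1)           # 4 df
budget$fits_lenient <- df_needed <= budget$max_df_lenient
print(budget)
```

Even under the lenient rule, that 4 df model is too large for all three
scenarios, which is why penalization is worth considering.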

Two references that would be helpful to you are:

Frank Harrell's "Regression Modeling Strategies" book:
http://www.amazon.com/exec/obidos/ASIN/0387952322/
There is a helpful and updated PDF download here:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/rms.pdf
and I would focus, in your case, on the use of the lrm() function in Frank's
rms CRAN package, along with related tools for penalization and validation.
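
As a starting point, a minimal sketch of that workflow might look like the
following. It assumes the rms package is installed. The data are simulated
purely for illustration: x1, x2, the coefficients and the penalty grid are all
made up, so treat this as an outline rather than a recommended analysis.

```r
library(rms)   # assumes the rms package is installed from CRAN

set.seed(1)
n <- 2000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))    # made-up covariates
# Made-up outcome with an event rate of roughly 1%
d$y <- rbinom(n, 1, plogis(-4.8 + 0.6 * d$x1))

# Unpenalized fit; x = TRUE, y = TRUE store the design matrix and the
# response, which pentrace() and validate() need
fit <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)

# Compare a small, arbitrary grid of penalties
pt <- pentrace(fit, penalty = c(0, 0.5, 1, 2, 5, 10))

# Refit using the penalty that pentrace() selected
fit_pen <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE,
               penalty = pt$penalty)

# Bootstrap validation of the penalized fit (B = 100 resamples)
validate(fit_pen, B = 100)
```

With so few events the bootstrap estimates will be noisy, so look at how much
the indexes shrink rather than reading too much into any single number.
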
Also, Steyerberg's "Clinical Prediction Models" book:
http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X
which is an excellent reference and has relevant examples using R.

The rumor is that Frank is working on a new edition of his book, with a greater
focus on the use of R, and that it is due RSN (real soon now). Perhaps there
will be copies at useR in Nashville next week? One could hope... :-)

Regards,
Marc Schwartz