Thomas Mang
2009-Aug-03 05:06 UTC
[R] min frequencies of categorical predictor variables in GLM
Hi,

Suppose a binomial GLM with both continuous and categorical predictors (sometimes referred to as GLM-ANCOVA, if I remember correctly). For the categorical predictors (i.e. indicator variables), is there a suggested minimum frequency for each level? Would such a rule or recommendation also depend on the y-side?

Example: N is quite large, a bit > 100. However, only 0/1s are observed (so Bernoulli random variables, not Binomial, because the covariates come from observations and in general differ between observations). There are two categorical predictors, each with 2 levels. Structurally it would probably also make sense to allow an interaction between them, yielding de facto a single categorical predictor with 4 levels. Is there then a minimum number of observations that should fall in each of the 4 levels (either absolute or relative), or does that minimum also have to take the y-side into account?

Thanks!
Thomas
Marc Schwartz
2009-Aug-03 14:46 UTC
[R] min frequencies of categorical predictor variables in GLM
On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:

> Hi,
>
> Suppose a binomial GLM with both continuous as well as categorical
> predictors (sometimes referred to as GLM-ANCOVA, if I remember
> correctly). For the categorical predictors = indicator variables, is
> then there a suggested minimum frequency of each level ? Would such
> a rule/ recommendation be dependent on the y-side too ?
>
> Example: N is quite large, a bit > 100. Observed however are only
> 0/1s (so Bernoulli random variables, not Binomial, because the
> covariates are from observations and in general always different
> between observations). There are two categorical predictors, each
> with 2 levels. It would structurally probably also make sense to
> allow an interaction between those, yielding de facto a single
> categorical predictor with 4 levels. Is then there a minimum of
> observations falling in each of the 4 level category (either
> absolute or relative), or also that plus also considering the y-side ?

Must be the day for sample size questions for logistic regression. A similar query is on MedStats today.

The typical minimum sample size recommendation for logistic regression is based upon covariate degrees of freedom (that is, columns in the model matrix, not counting the intercept). The guidance is that there should be 10 to 20 *events* per covariate degree of freedom.

So if you have 2 factors, each with two levels, that gives you two covariate degrees of freedom in total (two columns in the model matrix). At the high end of the above range, you would need 40 events in your sample. If the event incidence in your sample is 10%, you would need 400 cases to observe 40 events to support the model with the two two-level covariates (Y ~ X1 + X2).

An interaction term in addition to the 2 main effect terms (Y ~ X1 * X2) would in this case add another column to the model matrix. Thus, you would need an additional 20 events, or another 200 cases in your sample. So you could include the two two-level factors and the interaction term if you have 60 events, or in my example, about 600 cases.

If you include the interaction term only, in the absence of the main effects (Y ~ X1:X2), that would yield 4 columns in the model matrix, requiring 80 events, or about 800 cases. Without more details (e.g. your underlying hypothesis), it is not clear to me that you gain anything here as compared to using the main effects and, potentially, the interaction term together. You certainly lose in terms of model interpretation, and you require a notably larger sample size.

Regarding a minimum sample size for each of the levels of the factor-based covariates, I am not aware of any specific guidance, short of dealing with empty cells at the extreme. However, there are methods to assess covariate complexity and to decide whether factor levels should be collapsed.

For more details on these issues, I would refer you to Frank Harrell's book, Regression Modeling Strategies, specifically to chapters 4 and 10-12. The former focuses on general multivariable strategies and the latter focus on logistic regression.

More information here:

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

HTH,

Marc Schwartz
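
The arithmetic in the reply above can be reproduced with a few lines of code. The following is only a minimal sketch, written in Python rather than R purely for illustration. The function names `required_events` and `required_cases` are hypothetical and not part of any package. The rule of thumb (10 to 20 events per covariate degree of freedom), the 10% incidence, and the column counts per formula are taken from the reply, not derived independently.

```python
import math
from fractions import Fraction


def required_events(model_df, events_per_df=20):
    """Events needed under the rule of thumb of `events_per_df` events
    per covariate degree of freedom (hypothetical helper)."""
    if model_df < 1 or events_per_df < 1:
        raise ValueError("model_df and events_per_df must be positive")
    return model_df * events_per_df


def required_cases(model_df, incidence, events_per_df=20):
    """Total observations needed so that the expected number of events
    reaches the rule-of-thumb target, given the event incidence."""
    if not 0 < incidence <= 1:
        raise ValueError("incidence must be in (0, 1]")
    # Use an exact fraction so that e.g. 0.10 does not pick up
    # floating-point error before rounding up.
    rate = Fraction(incidence).limit_denominator(1_000_000)
    return math.ceil(required_events(model_df, events_per_df) / rate)


# Covariate degrees of freedom as counted in the reply
# (model matrix columns, intercept excluded).
scenarios = {
    "Y ~ X1 + X2": 2,
    "Y ~ X1 * X2": 3,
    "Y ~ X1:X2": 4,
}

for formula, df in scenarios.items():
    events = required_events(df)
    cases = required_cases(df, incidence=0.10)
    print(f"{formula}: {events} events, about {cases} cases")
```

Passing `events_per_df=10` gives the low end of the recommended range instead of the high end used in the reply.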