vlagani at ics.forth.gr
2012-Oct-02 16:10 UTC
[R] count data as independent variable in logistic regression
Dear R users,

I would like to use count data as covariates when fitting a logistic regression model. My question is:

Do I violate any assumption of logistic regression (and, more generally, of generalized linear models) by using count, non-negative integer variables as independent variables?

I found many references in the literature on how to use count data as the outcome, but not as covariates; see for example the very clear paper: "N E Breslow (1996) Generalized Linear Models: Checking Assumptions and Strengthening Conclusions, Congresso Nazionale Societa Italiana di Biometria, Cortona June 1995", available at http://biostat.georgiahealth.edu/~dryu/course/stat9110spring12/land16_ref.pdf.

Loosely speaking, it seems that the GLM assumptions may be expressed as follows:

iid residuals;
the link function must correctly represent the relationship between the dependent and independent variables;
absence of outliers.

Does anybody know whether any other assumption or technical problem would suggest using a different type of model to deal with count covariates?

Finally, please note that my data contain relatively few samples (<100) and that the count variables' ranges can differ by 3-4 orders of magnitude (i.e., some variables have values in the range 0-10, while other variables may have values within 0-10000).
A simple example code follows:

###########################################################

# generating simulated data
var1 = sample(0:10, 100, replace = TRUE);
var2 = sample(0:1000, 100, replace = TRUE);
var3 = sample(0:100000, 100, replace = TRUE);
outcome = sample(0:1, 100, replace = TRUE);
dataset = data.frame(outcome, var1, var2, var3);

# fitting the model
model = glm(outcome ~ ., family=binomial, data = dataset)

# inspecting the model
print(model)

###########################################################

Regards,

--
Vincenzo Lagani
Research Fellow
BioInformatics Laboratory
Institute of Computer Science
Foundation for Research and Technology - Hellas
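As a rough sketch of the concern about covariates spanning several orders of magnitude: a GLM places no distributional assumption on the covariates themselves, so count predictors are permitted; what matters is whether each covariate enters the linear predictor on a suitable scale. The example below is only an illustration under stated assumptions. The log1p transform, the simulated coefficients, and the names fit_raw and fit_log are hypothetical choices and do not come from the thread.

```r
# Hypothetical illustration: the data-generating process and the
# coefficient values below are assumptions, not taken from the thread.
set.seed(1)  # make the simulation reproducible

n <- 100
var1 <- sample(0:10, n, replace = TRUE)
var3 <- sample(0:100000, n, replace = TRUE)

# Assumed truth: the log-odds depend on log(1 + var3), not on var3 itself
eta <- -5 + 0.1 * var1 + 0.5 * log1p(var3)
outcome <- rbinom(n, size = 1, prob = plogis(eta))
dataset <- data.frame(outcome, var1, var3)

# Same model family, two codings of the wide-range count covariate
fit_raw <- glm(outcome ~ var1 + var3, family = binomial, data = dataset)
fit_log <- glm(outcome ~ var1 + log1p(var3), family = binomial, data = dataset)

# Compare the two fits; a lower AIC suggests a better-supported scale
print(AIC(fit_raw, fit_log))
```

With fewer than 100 samples, such a comparison is noisy, so it should be read as a diagnostic hint rather than a decision rule.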
Bert Gunter
2012-Oct-02 16:28 UTC
[R] count data as independent variable in logistic regression
This is not primarily an R question, although I grant you that it might intersect packages in R that do what you want. Nevertheless, I think you would do better posting on a statistical list, like stats.stackexchange.com. Maybe once you've figured out there what you want, you can come back to R to find an implementation.

Cheers,
Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics