As suggested in my earlier message, I have a large population of independent variables and a binary dependent outcome. It is expected that only a few of the independent variables actually contribute to the outcome, and I'd like to find those.

If it wasn't already obvious, I am *not* a statistician. Not even close. :-) Statistician colleagues have suggested that I use logistic regression for this problem. My understanding is that logistic regression is available in R as glm(..., family=binomial).

When I use this solver on fictitious data, though, the answers I expect are not the answers I see. Consider the following fictitious data, where "z" is the dependent binary outcome, "y" is irrelevant noise, and "x" is actually relevant to predicting the outcome:

        x y z
      1 8 7 1
      2 8 3 1
      3 0 5 0
      4 0 9 0
      5 8 1 1

If I feed this data to glm(z ~ x + y) using the default gaussian family, the results make some sense to me. The estimated coefficient for x is positive and the corresponding "Pr(>|t|)" value is tiny (<2e-16), which I take to imply a high degree of confidence that larger values of x correlate with increased likelihood of z. Conversely, the estimated coefficient for y has a "Pr(>|t|)" value of 0.552, which I take to imply that there is no strong correlation between y and z. Good.

However, I've been told that I want to use family=binomial for a logistic regression problem with a binary dependent outcome like this. If I give this data to glm(z ~ x + y, family=binomial), the results become quite mysterious. I receive a warning that "Algorithm did not converge". The "Pr(>|t|)" values for x and y are 0.916 and 1.000 respectively, which would seem to indicate that neither one correlates with the outcome.

I realize that this is not a problem with R. It is a problem with my understanding of what R is doing. But you all have been so helpful thus far, perhaps I can impose on you to give me one more clue? What am I doing wrong here? What should I be looking at that I'm not?

Thank you, once again!

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
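
For readers who want to reproduce the two fits described in the message, a minimal sketch follows. The data frame name `d` and the object names `fit_gauss` and `fit_logit` are placeholders chosen here, not names from the original post, and the exact warnings and p-values may differ between R versions (the defaults of glm() have changed since 2002).

```r
# Fictitious data from the message: z is the binary outcome,
# x is the relevant predictor, y is irrelevant noise.
d <- data.frame(
  x = c(8, 8, 0, 0, 8),
  y = c(7, 3, 5, 9, 1),
  z = c(1, 1, 0, 0, 1)
)

# Default gaussian family: an ordinary least-squares fit of z on x and y.
fit_gauss <- glm(z ~ x + y, data = d)

# Logistic regression. On this data R emits warnings (non-convergence
# and/or fitted probabilities numerically 0 or 1); they are suppressed
# here only so the script runs quietly.
fit_logit <- suppressWarnings(glm(z ~ x + y, family = binomial, data = d))

# Coefficient tables: estimates, standard errors, test statistics, p-values.
print(coef(summary(fit_gauss)))
print(coef(summary(fit_logit)))
```

In the binomial table, the standard errors are the thing to look at: they are enormous relative to the estimates, which is why the reported p-values come out close to 1.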
ripley@stats.ox.ac.uk
2002-Nov-11 07:32 UTC
[R] binomial glm for relevant feature selection?
On Sun, 10 Nov 2002, Ben Liblit wrote:

> As suggested in my earlier message, I have a large population of
> independent variables and a binary dependent outcome. It is expected
> that only a few of the independent variables actually contribute to the
> outcome, and I'd like to find those.
>
> If it wasn't already obvious, I am *not* a statistician. Not even
> close. :-) Statistician colleagues have suggested that I use logistic
> regression for this problem. My understanding is that logistic
> regression is available in R as glm(..., family=binomial).
>
> When I use this solver on fictitious data, though, the answers I expect
> are not the answers I see. Consider the following fictitious data,
> where "z" is the dependent binary outcome, "y" is irrelevant noise, and
> "x" is actually relevant to predicting the outcome:
>
>         x y z
>       1 8 7 1
>       2 8 3 1
>       3 0 5 0
>       4 0 9 0
>       5 8 1 1
>
> If I feed this data to glm(z ~ x + y) using the default gaussian family,
> the results make some sense to me. The estimated coefficient for x is
> positive and the corresponding "Pr(>|t|)" value is tiny (<2e-16), which
> I take to imply a high degree of confidence that larger values of x
> correlate with increased likelihood of z. Conversely, the estimated
> coefficient for y has a "Pr(>|t|)" value of 0.552, which I take to imply
> that there is no strong correlation between y and z. Good.
>
> However, I've been told that I want to use family=binomial for a
> logistic regression problem with a binary dependent outcome like this.
> If I give this data to glm(z ~ x + y, family=binomial), the results
> become quite mysterious. I receive a warning that "Algorithm did not
> converge". The "Pr(>|t|)" values for x and y are 0.916 and 1.000
> respectively, which would seem to indicate that neither one correlates
> with the outcome.
>
> I realize that this is not a problem with R. It is a problem with my
> understanding of what R is doing. But you all have been so helpful thus
> far, perhaps I can impose on you to give me one more clue? What am I
> doing wrong here? What should I be looking at that I'm not?

Your problem is linearly separable, and you are seeing the Hauck-Donner effect. This is rare (but by no means unknown) in real problems, and means the Wald test as used by the t values is unreliable.

More details in Venables & Ripley (1999, 2002), look Hauck-Donner up in the index. It's a technical point and the explanation is technical, but there is also a practical summary there.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
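
The point about linear separability can be checked directly on the five rows of fictitious data. The sketch below is an illustration only: the reply says the Wald test is unreliable here but does not prescribe a replacement, so comparing nested models by their deviance (a likelihood-ratio test via anova()) is an assumption on my part about one common alternative. All object names (`d`, `separated`, `fit_x`, `fit_y`, `fit_xy`, `lrt_x`, `lrt_y`) are hypothetical.

```r
# Same fictitious data as in the question.
d <- data.frame(
  x = c(8, 8, 0, 0, 8),
  y = c(7, 3, 5, 9, 1),
  z = c(1, 1, 0, 0, 1)
)

# Separation check: every x with z == 1 exceeds every x with z == 0,
# so a threshold on x alone classifies all rows correctly.
separated <- min(d$x[d$z == 1]) > max(d$x[d$z == 0])
print(separated)

# Nested logistic fits. Fits that include x are perfectly separated and
# produce warnings, which are suppressed here to keep the output quiet.
fit_y  <- glm(z ~ y, family = binomial, data = d)
fit_x  <- suppressWarnings(glm(z ~ x, family = binomial, data = d))
fit_xy <- suppressWarnings(glm(z ~ x + y, family = binomial, data = d))

# Likelihood-ratio (deviance) comparisons instead of Wald tests:
# does adding x to a model with y reduce the deviance?
lrt_x <- anova(fit_y, fit_xy, test = "Chisq")
# does adding y to a model with x reduce the deviance?
lrt_y <- anova(fit_x, fit_xy, test = "Chisq")

print(lrt_x)
print(lrt_y)
```

With only five observations the chi-squared approximation behind these p-values is rough at best, so the tables are better read as a qualitative contrast with the Wald output than as a formal test.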