pufftissue pufftissue
2008-Dec-04  06:30 UTC
[R] Logistic Regression: variable selection based on p value?
Hi, When I use logistic regression, each variable has a p value associated with it. Do I only include the variables that have a statistically significant p value (<0.05), or are there situations when I should include variables when their p values are high? I had heard that if a variable has a high p value but it's not the terminal variable, keep it; otherwise, take it out. Not sure if it's right or even why this is the case. What about if my p values are terrible but this combo of variables yields the highest AUC and calibration? What prevails in this case? Thank you!
Erik Iverson
2008-Dec-04  13:42 UTC
[R] Logistic Regression: variable selection based on p value?
Puff - There are many strategies and ideas, and a large literature, on this topic. A great introduction that leads to many of the interesting references is Frank Harrell's book, "Regression Modeling Strategies". I would highly recommend it.
Frank E Harrell Jr
2008-Dec-04  13:53 UTC
[R] Logistic Regression: variable selection based on p value?
It depends on your goals, but in general, problems caused by stepwise regression arise from using P-value cutoffs that are too small rather than cutoffs that are too large. There are many reasons not to remove any variables if you want valid confidence intervals, P-values, and discrimination indexes. Note that AUC is not a great objective function; that's why we have the log likelihood.

Frank

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
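One way to illustrate the point about the log likelihood is to compare nested logistic models as a whole instead of screening variables one at a time by p value. The following is only a minimal sketch on simulated data, not a recommended selection procedure: the variable names (x1, x2, x3, y) and the objects full and reduced are hypothetical, and the data-generating coefficients are arbitrary assumptions.

```r
# Simulated data; every name and coefficient here is a made-up assumption.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                      # pure noise: no true effect on y
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.3 * x2))
d  <- data.frame(y, x1, x2, x3)

# Full model keeps every candidate; reduced model drops x2 and x3.
full    <- glm(y ~ x1 + x2 + x3, family = binomial, data = d)
reduced <- glm(y ~ x1,           family = binomial, data = d)

# Maximized log likelihoods; the nested (reduced) model cannot fit better.
ll_full    <- as.numeric(logLik(full))
ll_reduced <- as.numeric(logLik(reduced))

# Likelihood ratio test of the two dropped variables jointly (2 df).
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# AIC is the log likelihood penalized for the number of parameters.
print(AIC(reduced, full))
```

The likelihood ratio test asks whether the dropped variables matter jointly, which is a different question from whether each individual Wald p value falls below 0.05. Note that confidence intervals and p values reported after any data-driven selection step no longer have their nominal properties, which is the concern raised in the reply above.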