Daniel Weitzenfeld
2011-Feb-14 05:31 UTC
[R] Optimal Y>=q cutoff after logistic regression
Hi,

I understand that dichotomization of the predicted probabilities after logistic regression is philosophically questionable, throwing out information, etc.

But I want to do it anyway. I'd like to include as a measure of fit the % of observations correctly classified, because it's measured in units that non-statisticians can understand more easily than area under the ROC curve, Dxy, etc.

Am I right that there is an optimal Y>=q probability cutoff, at which the True Positive Rate is high and the False Positive Rate is low? Visually, it would be the elbow in the ROC curve, right? My reasoning is that even if you had a near-perfect model, you could set a stupidly low (high) cutoff and have a higher false positive (negative) rate than would be optimal.

I know the standard default or starting point is Y>=.5, but if my above reasoning is correct, there ought to be an optimal cutoff for a given model. Is there an easy way to determine that cutoff in R without writing my own script to iterate through possible breakpoints and calculate classification accuracy at each one?

Thanks in advance.
-Dan
On Feb 14, 2011, at 12:31 AM, Daniel Weitzenfeld wrote:

> Hi,
>
> I understand that dichotomization of the predicted probabilities after
> logistic regression is philosophically questionable, throwing out
> information, etc.
>
> But I want to do it anyway. I'd like to include as a measure of fit %
> of observations correctly classified because it's measured in units
> that non-statisticians can understand more easily than area under the
> ROC curve, Dxy, etc.
>
> Am I right that there is an optimal Y>=q probability cutoff, at which
> the True Positive Rate is high and the False Positive Rate is low?

Only if the data supports it.

> Visually, it would be the elbow in the ROC curve, right?

If there is an "elbow", perhaps. The real answer is that you should thoughtfully consider the consequences of a wrong answer that the test is negative (False -) and those of a wrong answer that a test is positive (False +), and then make a decision that properly balances both the costs and the probabilities.

> My reasoning is that even if you had a near-perfect model, you could
> set a stupidly low (high) cutoff and have a higher false positive
> (negative) rate than would be optimal.
>
> I know the standard default or starting point is Y>=.5,

Huh... what is Y?

> but if my
> above reasoning is correct, there ought to be an optimal cutoff for a
> given model. Is there an easy way to determine that cutoff in R
> without writing my own script to iterate through possible breakpoints
> and calculating classification accuracy at each one?

There are packages that handle ROC analyses.

> Thanks in advance.
> -Dan

David Winsemius, MD
West Hartford, CT
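For readers who want to see what such a scan involves, here is a minimal base-R sketch. It is not code from either poster. The simulated data, the helper name `scan_cutoffs`, the cutoff grid, and the cost weights `cost_fp` and `cost_fn` are all assumptions made for illustration. It computes accuracy, true and false positive rates, and an average misclassification cost at each candidate cutoff, so both the "maximum accuracy" cutoff asked about and the cost-balanced cutoff suggested in the reply can be read off.

```r
# Simulated data; variable names and coefficients are illustrative only.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)  # predicted probabilities on the training data

# Hypothetical helper: classification summaries at each candidate cutoff.
# cost_fp and cost_fn are assumed per-error costs chosen by the analyst.
scan_cutoffs <- function(p, y, cutoffs = seq(0.01, 0.99, by = 0.01),
                         cost_fp = 1, cost_fn = 1) {
  rows <- lapply(cutoffs, function(q) {
    pred <- as.integer(p >= q)
    tp <- sum(pred == 1 & y == 1)
    fp <- sum(pred == 1 & y == 0)
    tn <- sum(pred == 0 & y == 0)
    fn <- sum(pred == 0 & y == 1)
    data.frame(cutoff   = q,
               accuracy = (tp + tn) / length(y),
               tpr      = tp / (tp + fn),
               fpr      = fp / (fp + tn),
               cost     = (cost_fp * fp + cost_fn * fn) / length(y))
  })
  do.call(rbind, rows)
}

# Equal costs: the lowest-cost cutoff is the highest-accuracy cutoff.
res <- scan_cutoffs(p_hat, y)
res[which.max(res$accuracy), ]

# Assumed example: a false negative is five times as costly as a false positive.
res5 <- scan_cutoffs(p_hat, y, cost_fp = 1, cost_fn = 5)
res5[which.min(res5$cost), ]
```

With equal costs the average cost is simply one minus the accuracy, which is why the two criteria agree; raising the cost of false negatives can only move the chosen cutoff down or leave it unchanged. Accuracy measured on the same data used to fit the model is optimistic, so a held-out or cross-validated estimate would be a safer number to report. Packages such as ROCR offer similar scans without a hand-written loop.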