Guys,

I used Random Forest on a couple of data sets to predict a binary response. In every case, the AUC on the training set comes out as 1. Is this always the case with random forests? Can someone please clarify this?

I have given a simple example, first using logistic regression and then using random forests, to illustrate the problem. The AUC of the random forest comes out as 1.

# Keep two classes only, so the response is binary
data(iris)
iris <- iris[(iris$Species != "setosa"),]
iris$Species <- factor(iris$Species)

# Logistic regression, predicted on the training data itself
fit <- glm(Species~.,iris,family=binomial)
train.predict <- predict(fit,newdata = iris,type="response")
library(ROCR)
plot(performance(prediction(train.predict,iris$Species),"tpr","fpr"),col="red")
auc1 <- performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
legend("bottomright",legend=c(paste("Logistic Regression (AUC=",formatC(auc1,digits=4,format="f"),")",sep="")),
       col=c("red"), lty=1)

# Random forest, again predicted on the training data itself
library(randomForest)
fit <- randomForest(Species ~ ., data=iris, ntree=50)
train.predict <- predict(fit,iris,type="prob")[,2]
plot(performance(prediction(train.predict,iris$Species),"tpr","fpr"),col="red")
auc1 <- performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
legend("bottomright",legend=c(paste("Random Forests (AUC=",formatC(auc1,digits=4,format="f"),")",sep="")),
       col=c("red"), lty=1)

Thank you.

Regards,
Ravishankar R

--
View this message in context: http://r.789695.n4.nabble.com/Random-Forest-AUC-tp3006649p3006649.html
Sent from the R help mailing list archive at Nabble.com.
Ravishankar,

> I used Random Forest with a couple of data sets I had to predict for binary
> response. In all the cases, the AUC of the training set is coming to be 1.
> Is this always the case with random forests? Can someone please clarify
> this?

This is pretty typical for this model.

> I have given a simple example, first using logistic regression and then
> using random forests to explain the problem. AUC of the random forest is
> coming out to be 1.

Logistic regression isn't as flexible as RF and some other methods, so its training-set AUC is likely to be less than one, but still much higher than it really is (since you are re-predicting the same data).

For your example:

> performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
[1] 0.9972

but using simple 10-fold CV:

> library(caret)
> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = twoClassSummary)
>
> set.seed(1)
> cvEstimate <- train(Species ~ ., data = iris,
+                     method = "glm",
+                     metric = "ROC",
+                     trControl = ctrl)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
> cvEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "glm",
    metric = "ROC", trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC   Sens SD  Spec SD  ROC SD
  0.96  0.98  0.86  0.0843   0.0632   0.126

and for random forest:

> set.seed(1)
> rfEstimate <- train(Species ~ .,
+                     data = iris,
+                     method = "rf",
+                     metric = "ROC",
+                     tuneGrid = data.frame(.mtry = 2),
+                     trControl = ctrl)
Fitting: mtry=2
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> rfEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "rf",
    metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
  0.94  0.92  0.898  0.0966   0.14     0.00632

Tuning parameter 'mtry' was held constant at a value of 2

--
Max
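
For readers without caret installed, the same contrast between re-predicting the training data and cross-validating can be sketched in base R. This is only an illustration under assumptions: `auc_rank` is a helper name made up here (a rank-based Mann-Whitney AUC), the fold assignment is my own, and the out-of-fold predictions are pooled into a single AUC rather than averaged per fold, so the numbers will not match the caret output above.

```r
# Rank-based (Mann-Whitney) AUC. `auc_rank` is a hypothetical helper for this sketch.
auc_rank <- function(score, is_pos) {
  r <- rank(score)                       # average ranks take care of ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

data(iris)
dat <- iris[iris$Species != "setosa", ]
dat$Species <- factor(dat$Species)       # drop the unused "setosa" level
is_pos <- dat$Species == "virginica"     # glm models the second factor level

# Resubstitution: fit and predict on the same 100 rows.
# suppressWarnings() hides the glm convergence warnings seen in the thread.
fit_all <- suppressWarnings(glm(Species ~ ., data = dat, family = binomial))
train_auc <- auc_rank(predict(fit_all, type = "response"), is_pos)

# 10-fold CV: each row is predicted by a model that never saw it.
set.seed(1)
fold <- sample(rep(1:10, length.out = nrow(dat)))
cv_pred <- numeric(nrow(dat))
for (k in 1:10) {
  held_out <- fold == k
  fit_k <- suppressWarnings(
    glm(Species ~ ., data = dat[!held_out, ], family = binomial)
  )
  cv_pred[held_out] <- predict(fit_k, newdata = dat[held_out, ],
                               type = "response")
}
cv_auc <- auc_rank(cv_pred, is_pos)

cat("resubstitution AUC:", round(train_auc, 4), "\n")
cat("10-fold CV AUC:    ", round(cv_auc, 4), "\n")
```

The data are copied into `dat` instead of overwriting `iris`, so the sketch can be rerun without reloading the data set.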
Let me expand on what Max showed.

For the most part, performance on the training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to the maximum size, which means that quite likely there is only one data point in most terminal nodes, and the prediction at each terminal node is determined by the majority class in the node, or by the lone data point.

Suppose that is the case all the time; i.e., in all trees, all terminal nodes have only one data point. A particular data point would be "in-bag" in about 63% of the trees in the forest, and every one of those trees has the correct prediction for that data point. Even if all the trees where that data point is out-of-bag gave the wrong prediction, by majority vote of all trees you still get the right answer in the end. Thus the perfect prediction on the training set for RF is basically "by design".

Generally, good training-set prediction is just a self-fulfilling prophecy.

Andy

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of vioravis
> Sent: Friday, October 22, 2010 1:20 AM
> To: r-help at r-project.org
> Subject: [R] Random Forest AUC
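
The in-bag argument in Andy's reply can be checked with a small simulation. The sketch below is an illustration, not part of the original messages: it draws bootstrap samples in base R and counts how often one fixed row is included, which should land near 1 - (1 - 1/n)^n, roughly 63% for n = 100. The optional second part assumes the randomForest package is installed and relies on `predict()` returning out-of-bag vote fractions when `newdata` is omitted. `auc_rank` and the other names are made up for this example.

```r
set.seed(42)
n       <- 100    # training rows, as in the two-class iris example
n_trees <- 2000   # simulated bootstrap samples, one per hypothetical tree

# For each simulated tree, draw n rows with replacement and
# check whether row 1 ended up in the sample.
in_bag <- replicate(n_trees, 1L %in% sample.int(n, n, replace = TRUE))
in_bag_fraction <- mean(in_bag)

cat("simulated in-bag fraction:  ", round(in_bag_fraction, 3), "\n")
cat("theoretical 1 - (1 - 1/n)^n:", round(1 - (1 - 1 / n)^n, 3), "\n")

# Worst case from the argument: in-bag trees are always right and
# out-of-bag trees are always wrong. The majority vote is still right
# whenever the in-bag share exceeds one half.
worst_case_vote_correct <- in_bag_fraction > 0.5
cat("majority vote correct in the worst case:", worst_case_vote_correct, "\n")

# Optional: compare training-set AUC with out-of-bag AUC.
# Skipped if randomForest is not installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  # Rank-based (Mann-Whitney) AUC; hypothetical helper for this sketch.
  auc_rank <- function(score, is_pos) {
    r <- rank(score)
    n_pos <- sum(is_pos)
    n_neg <- sum(!is_pos)
    (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  }

  dat <- iris[iris$Species != "setosa", ]
  dat$Species <- factor(dat$Species)
  is_pos <- dat$Species == "virginica"

  fit <- randomForest::randomForest(Species ~ ., data = dat, ntree = 500)

  resub_prob <- predict(fit, newdata = dat, type = "prob")[, 2]  # re-predicts training rows
  oob_prob   <- predict(fit, type = "prob")[, 2]                 # no newdata: out-of-bag votes

  cat("training-set AUC:", round(auc_rank(resub_prob, is_pos), 4), "\n")
  cat("out-of-bag AUC:  ", round(auc_rank(oob_prob, is_pos), 4), "\n")
}
```

The out-of-bag figure is the one to report if no separate test set or cross-validation is available, since each row is scored only by trees that did not train on it.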