Hello,

I am just not sure what the predict.randomForest function is doing... I am confused.

I would expect these two function calls to return the same predictions:

```{r}
library(randomForest)
library(dplyr)  # provides %>% and select()

diachp.rf <- randomForest(quality~.,data=data,ntree=50, importance=TRUE)

ypred_oob <- predict(diachp.rf)
dataX <- data %>% select(-quality) # remove response.
ypred <- predict( diachp.rf, dataX )

ypred_oob == ypred
```

These should both be out-of-bag predictions, but ypred and ypred_oob are actually very different.

> table(ypred_oob , data$quality)
ypred_oob    0    1
        0 1324  346
        1  493 2837

> table(ypred , data$quality)
ypred    0    1
    0 1817    0
    1    0 3183

What I find even more disturbing is the 100% accuracy for ypred. Would you agree that this is rather unexpected?

regards
Witek
--
Witold Eryk Wolski
predict(diachp.rf, dataX) returns the in-sample predictions, not the OOB predictions. The response variable "quality" is only used during model fit, not during prediction.

Since in-sample predictions of random forests are typically grossly overfitted by construction, extremely high accuracies are not unexpected.

From: Witold E Wolski
Sent: Saturday, 12 January 2019 18:56
To: r-help at r-project.org
Subject: [R] randomForest out of bag prediction

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
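To make the in-sample versus OOB distinction concrete, here is a minimal sketch on simulated data. The `toy` data frame, its columns, and the seed are made up for illustration and are not the poster's data. It relies on two facts about the randomForest package: `predict()` without `newdata` returns the stored OOB predictions (the `predicted` component of the fit), and each tree is trained on a bootstrap sample that contains only about 63% of the distinct rows, so with `newdata` most trees voting on a training row have already seen it.

```r
library(randomForest)

set.seed(1)
n <- 500
# Hypothetical toy data: two informative predictors plus label noise
x1 <- rnorm(n)
x2 <- rnorm(n)
quality <- factor(ifelse(x1 + x2 + rnorm(n) > 0, 1, 0))
toy <- data.frame(x1, x2, quality)

rf <- randomForest(quality ~ ., data = toy, ntree = 50)

pred_oob <- predict(rf)                 # no newdata: OOB predictions
pred_in  <- predict(rf, newdata = toy)  # newdata given: every tree votes

# predict() without newdata hands back the OOB predictions stored in the fit
all(pred_oob == rf$predicted)

# Compare accuracy of the two kinds of prediction on the training rows
acc_oob <- mean(pred_oob == toy$quality)
acc_in  <- mean(pred_in == toy$quality)
c(oob = acc_oob, in_sample = acc_in)
```

On noisy data like this the in-sample accuracy should come out clearly higher than the OOB accuracy, which is the same pattern as in the two tables of the original post. Only the OOB figure is a fair estimate of performance on unseen data.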
Off topic. But see here:
https://stats.stackexchange.com/questions/61405/random-forest-and-prediction

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
See inline.

On Sat, Jan 12, 2019 at 9:56 AM Witold E Wolski <wewolski at gmail.com> wrote:
> ypred_oob <- predict(diachp.rf)

AFAIK these are, indeed, the out-of-bag predictions.

> dataX <- data %>% select(-quality) # remove response.
> ypred <- predict( diachp.rf, dataX )

These are not out-of-bag predictions. dataX is interpreted as new data (argument newdata), and it is assumed to contain entirely new observations. Each observation in dataX is fed through all of the trees and the predictions are then pooled. There is no out-of-bag here: all of the new data observations are assumed to be independent of the training set.

> What I find even more disturbing is that 100% accuracy for ypred.
> Would you agree that this is rather unexpected?

It is expected (and not disturbing) if your training set had enough variables (or signal) to create trees that fit the training data perfectly.

HTH,
Peter
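The pooling described above can be reproduced by hand, which may help show where the two kinds of prediction diverge. The following is only a sketch under stated assumptions: the `toy` data and all variable names are hypothetical, and the manual majority vote breaks ties by taking the first class, whereas randomForest breaks them at random, so the result may differ from the package's OOB predictions on a few rows. It uses `keep.inbag = TRUE` to record which rows each tree was trained on and `predict.all = TRUE` to obtain each tree's individual prediction.

```r
library(randomForest)

set.seed(2)
n <- 300
# Hypothetical toy data with label noise
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- factor(ifelse(x1 - x2 + rnorm(n) > 0, "1", "0"))
toy <- data.frame(x1, x2, y)

# keep.inbag = TRUE stores, per tree, which rows were in its bootstrap sample
rf <- randomForest(y ~ ., data = toy, ntree = 200, keep.inbag = TRUE)

# predict.all = TRUE additionally returns every tree's individual prediction
all_pred <- predict(rf, newdata = toy, predict.all = TRUE)
votes <- all_pred$individual   # n x ntree matrix of predicted class labels
oob   <- rf$inbag == 0         # TRUE where a tree did not train on the row

# Majority vote restricted to the trees for which each row was out of bag
manual_oob <- sapply(seq_len(n), function(i) {
  tab <- table(votes[i, oob[i, ]])
  names(tab)[which.max(tab)]
})

# Share of rows where the manual vote matches the package's OOB prediction
agreement <- mean(manual_oob == as.character(rf$predicted))
agreement
```

Here `all_pred$aggregate` is the vote over all trees, that is, what `predict(rf, newdata = toy)` returns, while `manual_oob` uses only the trees that never saw the row in question. Passing the training data as newdata skips that filtering, which is why the result looks so much better.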