I've been using the R package randomForest, but there is an aspect I cannot work out the meaning of. After calling the randomForest function, the returned object contains an element called predicted, which is the prediction obtained using all the trees (at least, that's my understanding). I've checked that this prediction set has the error rate reported by err.rate.

However, if I send the training data back into the predict.randomForest function, I get a different result from the stored set of predictions. This is true for both classification and regression. The predictions obtained this way also have a much lower error rate and perform very well (suspiciously well...) on measures such as AUC.

My understanding is that the two predictions above should be the same. Since they are not, I must not be understanding something properly. Any ideas what's going on?
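P.S. For concreteness, here is a minimal sketch of the comparison I mean. It uses the built-in iris data as a stand-in for my own data, so the exact numbers depend on the data and the random seed:

# compare the stored predictions against predict() fed the training data
# (iris is just a stand-in; numbers vary with the seed)
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

mean(rf$predicted != iris$Species)                  # matches tail(rf$err.rate[, "OOB"], 1)
mean(predict(rf, newdata = iris) != iris$Species)   # far lower, suspiciously good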
Hi Matthew,

The error rate reported by randomForest is the prediction error based on out-of-bag (OOB) data. It is therefore different from the prediction error on the original data: each tree was built on a bootstrap sample (containing roughly two-thirds of the distinct observations), and the OOB error rate is likely to be higher than the resubstitution error on the original data, as you observed.

Weidong
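P.S. As a quick sanity check on that fraction (a sketch in base R; the figure is just 1 - 1/e, about 0.632):

# average share of distinct observations in a bootstrap sample of size n
n <- 1000
frac <- replicate(200, length(unique(sample(n, n, replace = TRUE))) / n)
mean(frac)    # ~ 0.632
1 - exp(-1)   # theoretical limit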
Thanks for the help. Let me explain in more detail how I think randomForest works, so that you (or others) can more easily see the error of my ways.

The function first takes a random sample of the data, of the size specified by the sampsize argument. With this it fully grows a tree, resulting in a horribly over-fitted classifier for that random subset. It then repeats this with a different sample to generate the next tree, and so on.

Now, my understanding is that after each tree is constructed, a test prediction for the *whole* training data set is made by combining the results of all trees built so far (so, e.g., for classification, the majority vote of the individual tree predictions). From this an error rate is determined (applicable to the ensemble applied to the training data) and reported in the err.rate member of the returned randomForest object. If you look at the error rate (or plot it using the default plot method), you see that it starts out very high when only one or a few over-fitted trees are contributing, but once the forest gets larger the error rate drops, since the ensemble is doing its job. It doesn't make sense to me that this error rate could apply to a subset of the data, since the subset in question changes at each step (i.e. at each tree construction).

By doing cross-validation tests, making 'training' and 'test' sets from the data I have, I find that I get error rates on the test sets comparable to the error rate obtained from the predicted element of the returned randomForest object. So that does seem to be the 'correct' error.

By my understanding, the error reported for the i-th tree is that obtained by using all trees up to and including the i-th to make an ensemble prediction. Therefore the final error reported should be the same as that obtained by running predict.randomForest on the training set, because that should return an identical result to the one used to generate the error rate for the final tree?

Sorry that is a bit long-winded, but I hope someone can point out where I'm going wrong and set me straight. Thanks!
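P.S. Here is the kind of cross-validation check I mean, as commented, minimal, reproducible code (iris stands in for my real data; numbers depend on the seed):

# hold out a test set and compare its error rate to the reported OOB error
library(randomForest)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(Species ~ ., data = train)

tail(rf$err.rate[, "OOB"], 1)                        # reported error
mean(predict(rf, newdata = test) != test$Species)    # held-out error: comparable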
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Saruman
>
> I don't see how this answered the original question of the poster.
>
> He was quite clear: the value of the predictions coming out of RF do
> not match what comes out of the predict function using the same RF
> object and the same data. Therefore, what is predict() doing that is
> different from RF? Yes, RF is making its predictions using OOB, but
> nowhere does it say what predict() is doing; indeed, it says that if
> newdata is not given, then the results are just the OOB predictions.
> But if newdata = olddata, then predict(newdata) != OOB predictions.
> So what is it then?

Let me make this as clear as I possibly can:

If predict() is called without newdata, all it can do is assume that prediction on the training set is desired, and in that case it returns the OOB prediction. If newdata is given in predict(), it assumes the data are "new" and thus makes the prediction using all trees. If you just feed the training data back in as newdata, then yes, you will get overfitted predictions. It almost never makes sense (to me, anyway) to make predictions on the training set.

> Opens another issue, which is: if newdata is close but not exactly
> olddata, then you get overfitted results?

Possibly, depending on how "close" the new data are to the training set. This applies to nearly _ALL_ methods, not just RF.

Andy
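P.S. To make the distinction concrete, a minimal sketch (iris as a stand-in; exact numbers depend on the seed):

# predict() without newdata returns the stored OOB predictions;
# predict() with the training data as newdata runs it through all trees
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris)

all(predict(rf) == rf$predicted)                     # TRUE: OOB either way
mean(predict(rf) != iris$Species)                    # honest OOB error
mean(predict(rf, newdata = iris) != iris$Species)    # near zero: overfitted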