I've been using the R package randomForest, but there is an aspect I cannot work out the meaning of. After calling the randomForest function, the returned object contains an element called predicted, which is the prediction obtained using all the trees (at least, that's my understanding). I've checked that this prediction set has the error rate reported by err.rate.

However, if I send the training data back into the predict.randomForest function, I get a different result from the stored set of predictions. This is true for both classification and regression. The predictions obtained this way also have a much lower error rate and perform very well (suspiciously well...) on measures such as AUC.

My understanding is that the two predictions above should be the same. Since they are not, I must not be understanding something properly. Any ideas what's going on?
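P.S. For concreteness, here is a minimal sketch of the comparison I mean. It uses the built-in iris data as a stand-in for my own data, so the exact numbers depend on the data and the random seed:

# compare the stored predictions against predict() fed the training data
# (iris is just a stand-in; numbers vary with the seed)
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

mean(rf$predicted != iris$Species)                  # matches tail(rf$err.rate[, "OOB"], 1)
mean(predict(rf, newdata = iris) != iris$Species)   # far lower, suspiciously good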
Hi Matthew,

The error rate reported by randomForest is the prediction error based on out-of-bag (OOB) data. It is therefore different from the prediction error on the original data: each tree was built on a bootstrap sample (containing roughly two-thirds of the distinct observations), and the OOB error rate is likely to be higher than the resubstitution error on the original data, as you observed.

Weidong
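P.S. As a quick sanity check on that fraction (a sketch in base R; the figure is just 1 - 1/e, about 0.632):

# average share of distinct observations in a bootstrap sample of size n
n <- 1000
frac <- replicate(200, length(unique(sample(n, n, replace = TRUE))) / n)
mean(frac)    # ~ 0.632
1 - exp(-1)   # theoretical limit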
Thanks for the help. Let me explain in more detail how I think randomForest works, so that you (or others) can more easily see the error of my ways.

The function first takes a random sample of the data, of the size specified by the sampsize argument. With this it fully grows a tree, resulting in a horribly over-fitted classifier for that random subset. It then repeats this with a different sample to generate the next tree, and so on.

Now, my understanding is that after each tree is constructed, a test prediction for the *whole* training data set is made by combining the results of all trees built so far (so, e.g., for classification, the majority vote of the individual tree predictions). From this an error rate is determined (applicable to the ensemble applied to the training data) and reported in the err.rate member of the returned randomForest object. If you look at the error rate (or plot it using the default plot method), you see that it starts out very high when only one or a few over-fitted trees are contributing, but once the forest gets larger the error rate drops, since the ensemble is doing its job. It doesn't make sense to me that this error rate could apply to a subset of the data, since the subset in question changes at each step (i.e. at each tree construction).

By doing cross-validation tests, making 'training' and 'test' sets from the data I have, I find that I get error rates on the test sets comparable to the error rate obtained from the predicted element of the returned randomForest object. So that does seem to be the 'correct' error.

By my understanding, the error reported for the i-th tree is that obtained by using all trees up to and including the i-th to make an ensemble prediction. Therefore the final error reported should be the same as that obtained by running predict.randomForest on the training set, because that should return an identical result to the one used to generate the error rate for the final tree?

Sorry that is a bit long-winded, but I hope someone can point out where I'm going wrong and set me straight. Thanks!
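P.S. Here is the kind of cross-validation check I mean, as commented, minimal, reproducible code (iris stands in for my real data; numbers depend on the seed):

# hold out a test set and compare its error rate to the reported OOB error
library(randomForest)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(Species ~ ., data = train)

tail(rf$err.rate[, "OOB"], 1)                        # reported error
mean(predict(rf, newdata = test) != test$Species)    # held-out error: comparable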
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Saruman
>
> I don't see how this answered the original question of the poster.
>
> He was quite clear: the value of the predictions coming out of RF do
> not match what comes out of the predict function using the same RF
> object and the same data. Therefore, what is predict() doing that is
> different from RF? Yes, RF is making its predictions using OOB, but
> nowhere does it say what predict() is doing; indeed, it says that if
> newdata is not given, then the results are just the OOB predictions.
> But if newdata = olddata, then predict(newdata) != OOB predictions.
> So what is it then?

Let me make this as clear as I possibly can:

If predict() is called without newdata, all it can do is assume that prediction on the training set is desired, and in that case it returns the OOB prediction. If newdata is given in predict(), it assumes the data are "new" and thus makes the prediction using all trees. If you just feed the training data back in as newdata, then yes, you will get overfitted predictions. It almost never makes sense (to me, anyway) to make predictions on the training set.

> Opens another issue, which is: if newdata is close but not exactly
> olddata, then you get overfitted results?

Possibly, depending on how "close" the new data are to the training set. This applies to nearly _ALL_ methods, not just RF.

Andy
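P.S. To make the distinction concrete, a minimal sketch (iris as a stand-in; exact numbers depend on the seed):

# predict() without newdata returns the stored OOB predictions;
# predict() with the training data as newdata runs it through all trees
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris)

all(predict(rf) == rf$predicted)                     # TRUE: OOB either way
mean(predict(rf) != iris$Species)                    # honest OOB error
mean(predict(rf, newdata = iris) != iris$Species)    # near zero: overfitted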