Hi,

I have done an analysis using 'rpart' to construct a classification tree. I want to retain the output in tree form so that it is easily interpretable. However, I also want to compare the accuracy of the tree to a random forest, to estimate how much predictive ability is lost by using one simple tree. My understanding is that the error automatically displayed by the two functions is calculated differently, so it would be incorrect to use those figures as a comparison. Instead I have produced a table for both analyses comparing the observed and predicted response, e.g.

table(data$dependent, predict(model, type = "class"))

I am looking for confirmation that (a) it is incorrect to compare the error estimates reported by the two techniques, and (b) comparing the misclassification rates is an appropriate way to compare them.

Thanks

Amy

Amelia Koch
University of Tasmania
School of Geography and Environmental Studies
Private Bag 78 Hobart
Tasmania, Australia 7001
Ph: +61 3 6226 7454
ajkoch@utas.edu.au
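A minimal sketch of the confusion-table comparison described above, assuming a data frame 'mydata' with a factor response 'dependent' (all object names here are illustrative placeholders, not the actual data):

library(rpart)
library(randomForest)

fit.rp <- rpart(dependent ~ ., data = mydata, method = "class")
fit.rf <- randomForest(dependent ~ ., data = mydata)

## Observed vs. predicted tables. Note the asymmetry: predict() on a
## randomForest object with no newdata returns out-of-bag predictions,
## while predict() on an rpart object returns in-sample predictions,
## so a held-out test set (as suggested below) is a fairer basis.
tab.rp <- table(mydata$dependent, predict(fit.rp, type = "class"))
tab.rf <- table(mydata$dependent, predict(fit.rf))

## Misclassification rate = proportion off the diagonal
miscl <- function(tab) 1 - sum(diag(tab)) / sum(tab)
miscl(tab.rp)
miscl(tab.rf)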
Amy,

If I were you, I would check the misclassification rates in both the training set and the testing set for the two models.

--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
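A rough sketch of that train/test comparison (the 70/30 split, 'mydata', and 'dependent' are illustrative assumptions):

library(rpart)
library(randomForest)

set.seed(42)
idx   <- sample(nrow(mydata), floor(0.7 * nrow(mydata)))
train <- mydata[idx, ]
test  <- mydata[-idx, ]

fit.rp <- rpart(dependent ~ ., data = train, method = "class")
fit.rf <- randomForest(dependent ~ ., data = train)

## Misclassification rates on the held-out test set
mean(predict(fit.rp, test, type = "class") != test$dependent)
mean(predict(fit.rf, test) != test$dependent)

## Misclassification rates on the training set, for comparison
mean(predict(fit.rp, type = "class") != train$dependent)
mean(predict(fit.rf, train) != train$dependent)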
Darin A. England
2007-Jan-30 19:32 UTC
[R] comparing random forests and classification trees
Amy,

I have also had this issue with randomForest: you lose the ability to explain the classifier in a simple way to non-specialists (everyone can understand a single decision tree). As far as comparing the accuracy of the two, I think you are correct to compare them via the observed-vs-predicted tables. randomForest reports this as the confusion matrix, and it also reports the out-of-bag error, which I think is what you are referring to. I would not compare the randomForest out-of-bag error with the rpart relative error (or the cross-validated error, if you are doing cross-validation). So, for what it's worth, I think you are correct.

Also, do you know about ctree in the "party" package? If you want to retain the explanatory power of a single tree and still have a nice, accurate classifier, I have found ctree to work quite well.

HTH,
Darin
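For anyone wanting to try ctree as suggested above, a minimal sketch along the same lines (the formula and the data name 'mydata' are again placeholders):

library(party)

fit.ct <- ctree(dependent ~ ., data = mydata)
plot(fit.ct)   # the fitted tree stays interpretable

## Same observed-vs-predicted comparison as for rpart
table(mydata$dependent, predict(fit.ct))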
Greetings tree and forest coders,

I'm interested in comparing random forests and regression tree / bagging tree models. I'd like to propose a basis for doing this, get feedback, and document it here; I kept it in this thread since that makes sense.

In this case I think it's appropriate to compare the R^2 values as one basic measure. I'm actually going to compare mean error (ME), mean absolute error (MAE) and root mean squared error (RMSE) as well. This means that I need estimates from each approach so that I can form residuals. **As I see it, the important details are in how to set up the models so that I have comparable estimates, particularly in how the trees/forests are trained and evaluated.**

For regression/bagging trees, the typical approach for my application is 100 runs of 10-fold CV. In each run, all the values are estimated in an out-of-bag sense: each fold is estimated while it is withheld from fitting, so the fit is not inflated. The estimates are then averaged over the 100 runs at each point to get an average simulation, and this is used to calculate the residuals and the measures mentioned above. Somewhat more specifically, the steps are: (1) fit a model; (2) prune it via inspection; (3) loop 100 times on xpred.rpart(model, xval = 10, cp = <the cp at the bottom of the cptable from the pruned fit>) to generate the 100 runs (bagging is thus performed while holding the cp criterion fixed?); (4) average these pointwise; (5) calculate the desired statistics/quantities for comparison to other models.

For random forests, I would want to fit the model in a similar way, i.e. 100 runs of 10-fold CV. I think the 10-fold part is clear; the 100 runs, maybe less so. To get 10-fold OOB estimates, I set replace = FALSE, sampsize = 0.9 * nrow(x). Then I get a randomForest whose $predicted component is the average OOB estimate over all trees for which each point was OOB. As I understand it, each tree is grown on its own independent 90% subsample rather than a rotating 10-fold partition, so the number of runs is really more like the number of trees constructed. If I wanted to be really thorough, I could fit 100 random forests, get the $predicted for each, and average these pointwise. But that seems like overkill; isn't that the lesson of plot.randomForest, that as the number of trees goes up the error converges to some limit (from what I've seen)?

Thus, my primary concern is the amount of data used for training and cross-validating the model in an out-of-bag sense: can I meaningfully compare 10-fold OOB estimates from xpred.rpart to a random forest fit using 90% of the data as sampsize? Of secondary concern is the number of bagging trees versus the number of trees in the random forest. As long as the average estimation error is nearing some limit with the number of bagging trees I'm using, I think this is all that matters. So this is more of a methodological difference to be retained, similar to the differences in pruning under bagging and random forests, though I should probably specify similar node sizes for each.

Am I overlooking anything of grave consequence? Any and all thoughts are welcome. If you are aware of any comparisons of rpart and randomForest in the literature for any field (for regression) of which I am ignorant, I would appreciate the tip. I have read over "Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction" by Prasad, Iverson, and Liaw.
I may have missed it, but I did not see any discussion of maintaining consistency in the way the models were trained, though it is a very nice paper overall and contains many interesting approaches and points.

Thanks in advance,
James
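One possible translation of the set-up described above into code (regression setting; 'mydata', the response 'y', and the formula are placeholders, and the chosen cp is simply taken from the last row of the cptable rather than a pruned-by-inspection fit):

library(rpart)
library(randomForest)

fit.rp <- rpart(y ~ ., data = mydata, method = "anova")
## cp at the bottom of the cptable from the fit
cp.min <- fit.rp$cptable[nrow(fit.rp$cptable), "CP"]

## 100 runs of 10-fold cross-validated predictions, averaged pointwise
set.seed(1)
pred.mat <- replicate(100, xpred.rpart(fit.rp, xval = 10, cp = cp.min)[, 1])
pred.rp  <- rowMeans(pred.mat)

## Random forest in which each tree is grown on a 90% subsample drawn
## without replacement, so each point is OOB for roughly 10% of the trees
fit.rf  <- randomForest(y ~ ., data = mydata, replace = FALSE,
                        sampsize = floor(0.9 * nrow(mydata)))
pred.rf <- fit.rf$predicted   # averaged OOB predictions

## ME, MAE, RMSE and R^2 from the residuals
errstats <- function(obs, pred) {
  r <- obs - pred
  c(ME   = mean(r),
    MAE  = mean(abs(r)),
    RMSE = sqrt(mean(r^2)),
    R2   = 1 - sum(r^2) / sum((obs - mean(obs))^2))
}
errstats(mydata$y, pred.rp)
errstats(mydata$y, pred.rf)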