thr3ads.net - R help - [R] rpart vs. randomForest [Apr 2003]

chumpmonkey@hushmail.com

2003-Apr-12 21:41 UTC

[R] rpart vs. randomForest

Greetings. I'm trying to determine whether to use rpart or randomForest
for a classification tree. Has anybody tested efficacy formally? I've
run both, and the confusion matrix for rf beats rpart. I've looked at
the rf help page and am unable to figure out how to extract the tree.
But more than that, I'm looking for a more comprehensive user's guide
for randomForest, including the benefits of using it with MDS. Can anybody
suggest a general guide? I've been finding a lot of broken links and
cs-type web pages rather than an end-user's guide. Also, people's
experience with adjusting the mtry param would be useful. Breiman says
that it isn't too sensitive, but I'm curious if anybody has had a
different experience with it. Thanks in advance, and apologies if this
is too general.



Martin Maechler

2003-Apr-14 09:32 UTC

If you really read Breiman, or alternatively, remember English,
you'll know that a forest has many trees...

Regards,
Martin Maechler <maechler at stat.math.ethz.ch>
http://stat.ethz.ch/~maechler/

Andy Bunn

2003-Apr-14 17:24 UTC

I think you are misunderstanding what randomForest does. It is not an
optimizer that spits the "best" tree back at you. It grows a forest of
trees (as many as you tell it to; 500 is the default). I would stick
to rpart if you are having trouble wrapping your head around
randomForest. Tree models are being used in many fields now, and you
should be able to find an applied guide in your field with a little
effort.

Good luck, Andy

Liaw, Andy

2003-Apr-14 17:37 UTC

One of these days I promise to write a package vignette...

As Martin said, RF uses many trees (500 by default).  The "forest" component
of the randomForest object contains all the trees, but not in an easily
readable form (because I don't see much use in "looking" at the trees except
for debugging purposes).  If you really want to see what a tree looks like,
grow just one tree and look at the "forest" component.  Here is some
explanation:

For each tree:
o  "nrnodes" is the maximum number of nodes a tree can have.

o  "ndbigtree" is a vector of length ntree containing the total number of
nodes in each tree.

o  "nodestatus" is an nrnodes by ntree matrix of indicators: -1 if the node
is terminal.

o  "treemap" is a 3-D array, containing a two-column matrix for each tree.
The first column indicates which node is the "left descendant" and the
second column the "right descendant".  Both are 0 if the node is terminal.

o  "bestvar" is an nrnodes by ntree matrix that indicates, for each node,
which variable is used to split that node.  0 for terminal nodes.

o  "xbestsplit" has the same layout as "bestvar", except it tells
where to split.
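
To make the layout above concrete, here is a minimal sketch in plain Python
(not the randomForest API) of how a single tree stored this way could be
walked for one observation.  All array values are made up.  The "nodepred"
list of terminal-node labels, the padding at index 0 to keep node numbers
1-based, the value 1 for non-terminal nodes, and the rule that values at or
below the split point go left are assumptions for illustration; the message
does not state them.

```python
# Hypothetical single tree laid out as described above.
# Node numbers are 1-based; index 0 of every list is unused padding.
nodestatus = [None, 1, -1, 1, -1, -1]          # -1 marks a terminal node
treemap = [None, (2, 3), (0, 0), (4, 5), (0, 0), (0, 0)]  # (left, right)
bestvar = [None, 1, 0, 2, 0, 0]                # 1-based variable, 0 if terminal
xbestsplit = [None, 2.5, 0.0, 1.0, 0.0, 0.0]   # split point for that variable
nodepred = [None, None, "A", None, "B", "C"]   # assumed: label at terminal nodes


def predict_one(x):
    """Drop one observation down the tree; return (terminal node, label)."""
    node = 1  # the root
    while nodestatus[node] != -1:
        left, right = treemap[node]
        value = x[bestvar[node] - 1]  # convert the 1-based variable index
        # Assumption: values <= the split point go to the left descendant.
        node = left if value <= xbestsplit[node] else right
    return node, nodepred[node]


print(predict_one([1.0, 0.5]))  # (2, 'A')
print(predict_one([3.0, 0.5]))  # (4, 'B')
print(predict_one([3.0, 2.0]))  # (5, 'C')
```

With ntree trees, each of these lists would become one column of the
corresponding matrix, and the forest's answer would be the majority vote
over the per-tree labels.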


One thing people should keep in mind about the "predicted" component of the
randomForest object (and the confusion matrix for the training data), as
well as "predict(rf.object)" without giving the newdata for prediction:
That prediction is based on Out-of-Bag samples, so is *NOT* the same as
usual prediction on training data.  It is closer to the out-of-sample
prediction as in, e.g., cross-validation.
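
The difference between the two kinds of prediction can be sketched in a few
lines of plain Python.  This is a toy, not what randomForest does
internally: a 1-nearest-neighbour rule stands in for each tree (an
assumption chosen to keep the code short), and the data, the seed, and the
deliberately mislabelled point are made up.  The only part that mirrors the
description above is the voting rule: for the out-of-bag prediction, an
observation only receives votes from learners whose bootstrap sample did
not contain it.

```python
import random
from collections import Counter

# Toy 1-D data: class "a" below 10, class "b" from 10 up,
# except one deliberately mislabelled point at x = 4.
xs = list(range(20))
ys = ["a" if x < 10 else "b" for x in xs]
ys[4] = "b"

n, ntree = len(xs), 200
rng = random.Random(0)
# One bootstrap sample (indices drawn with replacement) per learner.
bags = [[rng.randrange(n) for _ in range(n)] for _ in range(ntree)]


def nearest_label(x, bag):
    """Stand-in for a tree: label of the closest in-bag observation."""
    j = min(bag, key=lambda i: abs(xs[i] - x))
    return ys[j]


def vote(i, oob_only):
    """Majority vote for observation i, optionally using only OOB learners."""
    votes = Counter()
    for bag in bags:
        if oob_only and i in bag:
            continue  # this learner was trained on observation i
        votes[nearest_label(xs[i], bag)] += 1
    return votes.most_common(1)[0][0] if votes else None


def accuracy(pred):
    return sum(p == y for p, y in zip(pred, ys)) / n


all_votes = [vote(i, oob_only=False) for i in range(n)]
oob_votes = [vote(i, oob_only=True) for i in range(n)]

print("every learner votes:", accuracy(all_votes))
print("OOB learners only:  ", accuracy(oob_votes))
print("mislabelled point, OOB vote:", oob_votes[4], "recorded label:", ys[4])
```

Learners that have seen the mislabelled point reproduce its label, so
letting every learner vote looks better than it should; the out-of-bag vote
for that point comes only from its neighbours and disagrees with the
recorded label.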

AFAIK there is only empirical and anecdotal evidence on the sensitivity of
performance to the value of mtry.  I can say that in my own experience,
fiddling with mtry will give at best only marginal improvement.  One easy way
to answer the question for your situation is to try it yourself and see.
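
For anyone wanting to "try it and see", a small sketch of which mtry values
might be worth comparing, again in plain Python rather than R.  It assumes
the classification default is floor(sqrt(p)) for p predictors, which is my
understanding of current versions of the package and is not stated in this
thread; the half/default/double grid and all function names are
hypothetical choices for illustration.

```python
import math
import random


def default_mtry(p):
    """Assumed classification default: floor(sqrt(p)), at least 1."""
    return max(1, math.isqrt(p))


def candidate_mtry(p):
    """Half, default, and double the default, clipped to the range 1..p."""
    d = default_mtry(p)
    return sorted({min(p, max(1, m)) for m in (d // 2, d, 2 * d)})


def variables_for_split(p, mtry, rng):
    """The mtry variables (0-based) offered as candidates at one split."""
    return sorted(rng.sample(range(p), mtry))


p = 16
print(default_mtry(p))     # 4
print(candidate_mtry(p))   # [2, 4, 8]
print(variables_for_split(p, default_mtry(p), random.Random(1)))
```

Each candidate would then be used to grow a forest, and the out-of-bag
error rates compared; a different subset of variables is drawn at every
split, so mtry controls how much the trees differ from one another.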

With MDS on the proximity matrix, you probably need to be a bit careful in
its interpretation.  The proximity matrix of the training data is computed on
the *entire* training data, rather than just the out-of-bag portion.  Thus
the MDS plot will quite often show the different classes looking more
"separable" than they really are.  (We are thinking about a fix.  Breiman
pointed out that the difficulty is that if the proximity matrix is
calculated only on the out-of-bag data, then 1-proximity is no longer
positive definite).
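
As a rough illustration of what goes into that MDS plot, the sketch below
computes a proximity matrix in plain Python from a table of terminal-node
memberships.  It assumes the usual definition (the proximity of two
observations is the fraction of trees in which they end up in the same
terminal node), which the message does not spell out, and the node numbers
are made up.  Every observation is counted in every tree, which corresponds
to the "entire training data" computation described above.

```python
def proximity_matrix(terminal_nodes):
    """terminal_nodes[t][i] is the terminal node observation i reaches in tree t."""
    ntree = len(terminal_nodes)
    n = len(terminal_nodes[0])
    counts = [[0] * n for _ in range(n)]
    for nodes in terminal_nodes:
        for i in range(n):
            for j in range(n):
                if nodes[i] == nodes[j]:
                    counts[i][j] += 1  # same terminal node in this tree
    return [[c / ntree for c in row] for row in counts]


# Hypothetical forest of 3 trees and 4 observations.
terminal_nodes = [
    [2, 2, 3, 3],
    [4, 4, 4, 5],
    [2, 3, 3, 3],
]

prox = proximity_matrix(terminal_nodes)
# 1 - proximity is what would be handed to an MDS routine as a dissimilarity.
dissim = [[1.0 - v for v in row] for row in prox]

for row in prox:
    print([round(v, 3) for v in row])
```

Observations 0 and 3 never share a terminal node here, so their
dissimilarity is 1; each observation always shares a node with itself, so
the diagonal of the dissimilarity matrix is 0.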

HTH,
Andy

------------------------------------------------------------------------------

Wiener, Matthew

2003-Apr-15 14:00 UTC

I can echo that in data I've worked with (separate from the data Andy Liaw
has worked with), fiddling with mtry doesn't make a whole lot of difference.
To the extent it makes any difference at all, the default value tends to be
near the optimum.

Matt Wiener


------------------------------------------------------------------------------

Liaw, Andy

2003-Apr-15 14:26 UTC

I just saw in the preliminary program for JSM '03 that there will be (at
least) 5 talks on random forest (one from our group), two of which will
address the issue of tuning mtry, judging from the abstracts.

If I may do a bit of advertising: I was asked to organize a roundtable
luncheon at the JSM on multiple trees.  I'd welcome anyone interested in
this area to come.

Cheers,
Andy