According to Dr. Breiman, RF should be a more accurate method than a
single tree. However, in my case the performance of each method seems
to depend on the proportion of the outcome variable. My data set is a
typical classification problem (predict bad guys). When I ran both of
them with different proportions of the outcome variable (there's a
criterion to measure the degree of bad behavior), I got very strange
results.

1. proportion of 1 to 0 = 1:4
   err.rate of CART = 25.2%
   err.rate of RF   = 25.6%

2. 1:9
   err.rate of CART = 28.5%
   err.rate of RF   = 21.2%

3. 1:33
   err.rate of CART = 28.2%
   err.rate of RF   = 12.1%

4. 1:99
   err.rate of CART = 25.1%
   err.rate of RF   =  7.3%

In 3 & 4, RF looks superior to CART. But I'm afraid RF just votes for
"0" to reduce the error rate. Any suggestions? Thank you.

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe" (in the "body", not the subject !)
To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
At 12:51 PM 9/25/2002, Andrew Baek wrote:
>According to Dr. Breiman, RF should be a more accurate method than a
>single tree. [...]
>
>In 3 & 4, RF looks superior to CART. But I'm afraid RF just votes for
>"0" to reduce the error rate. Any suggestions?

Where are you getting CART results in R? CART is a trademark of
Salford Systems and is not implemented AFAIK in R (or S-Plus).
If either method were just guessing 0 to reduce the error rate,
shouldn't they achieve a 1/34 ~ 3% or 1/100 = 1% error rate in the
last two examples? And, for that matter, 20% and 10% in the first two?
It doesn't look like that's what's going on.

One suggestion if making sure you find the 1's is more important than
having a low overall error rate: in rpart, you can specify a loss
matrix to say that certain kinds of errors are more important than
others. In a random forest, you can use different voting thresholds
for "1-ness" and "0-ness" to bias things -- that is, instead of just
taking a majority vote, you might require (for example) 85% of the
trees to agree for something to be declared in class 0.

It's hard to say much more without knowing anything about your data.
But in my experience random forests have substantially outperformed
single trees in many problems (and I haven't yet encountered one in
which a single tree outperformed a random forest).

Hope this helps,

Matthew Wiener
RY84-202
Applied Computer Science & Mathematics Dept.
Merck Research Labs
126 E. Lincoln Ave.
Rahway, NJ 07065
732-594-5303

-----Original Message-----
From: Andrew Baek [mailto:andrew at stat.ucla.edu]
Sent: Wednesday, September 25, 2002 3:52 PM
To: r-help at stat.math.ethz.ch
Subject: [R] CART vs. Random Forest

[...]

------------------------------------------------------------------------------
Notice: This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA)
that may be confidential, proprietary copyrighted and/or legally
privileged, and is intended solely for the use of the individual or
entity named on this message. If you are not the intended recipient,
and have received this message in error, please immediately return
this by e-mail and then delete it.
=============================================================================
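The rpart loss-matrix suggestion above can be sketched as follows. The simulated data, the 5:1 cost ratio, and all variable names are illustrative only:

```r
## Sketch of a loss matrix in rpart: make a false negative (missing a
## "1") five times as costly as a false positive.  Toy data.
library(rpart)

set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
## rare "bad" class "1"
d$y <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(n) > 1.8, 1, 0))

## Rows = true class, columns = predicted class, in the order of
## levels(d$y) = c("0", "1").  Diagonal must be zero.
## L[2, 1] = 5: predicting "0" when the truth is "1" costs 5.
L <- matrix(c(0, 5,
              1, 0), nrow = 2)

fit.loss <- rpart(y ~ x1 + x2, data = d, parms = list(loss = L))
fit.0    <- rpart(y ~ x1 + x2, data = d)   # default 0/1 loss

## Compare confusion matrices: the loss matrix should trade false
## positives for fewer false negatives.
table(true = d$y, pred = predict(fit.loss, type = "class"))
table(true = d$y, pred = predict(fit.0,    type = "class"))
```

Note that rpart reads the loss matrix with rows as true classes and columns as predicted classes, in the order of the factor levels.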
We haven't implemented different voting thresholds in the package
itself, but when you predict you can get out votes or probabilities
rather than classes if you want. The argument type to
predict.randomForest is "class" by default, but can also be "vote" or
"prob". You can use the training set to figure out what a good
threshold is, and then check your results on a test set. Then you just
use that threshold later.

I suppose we could implement a threshold that could be supplied to
predict, but then we'd have to work something out for multi-class
problems -- several different cutpoints, I guess. It's not a priority
for Andy or me right now. I actually like to take a look at the ROC
curve anyway, to decide what tradeoffs are worthwhile.

I'd compare the results by looking at the error rates -- if you can
make the (possibly weighted) error rate lower with one method or the
other, that's the method that wins.

Regards,

Matt

-----Original Message-----
From: Andrew Baek [mailto:andrew at stat.ucla.edu]
Sent: Thursday, September 26, 2002 3:33 PM
To: Wiener, Matthew
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] CART vs. Random Forest

> One suggestion if making sure you find the 1's is more important
> than having a low overall error rate: in rpart, you can specify a
> loss matrix to say that certain kinds of errors are more important
> than others. In a random forest, you can use different voting
> thresholds for "1-ness" and "0-ness" to bias things -- that is,
> instead of just taking majority vote, you might require (for
> example) 85% of the trees to agree for something to be declared in
> class 0.

If I use a loss matrix in "rpart" and a different threshold in "RF",
how can I compare the two packages? Well, Andy Liaw told me "classwt"
in RF does not help much. But when I modified the priors in rpart, I
got totally new results, so I thought the same idea should apply to
RF. Also, I'd appreciate it if you could tell me how to change the
voting threshold in RF; I couldn't find it in the manual. Thank you.

Andrew
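The type = "prob" approach described above might look like the following sketch; the simulated data, the 70/30 split, and the 0.15 cutoff are illustrative, not recommendations:

```r
## Predict class probabilities with a random forest, then apply a
## non-majority cutoff: calling "1" whenever more than 15% of trees
## vote "1" is the same as requiring 85% agreement to declare "0".
library(randomForest)

set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(n) > 1.8, 1, 0))
train <- d[1:700, ]
test  <- d[701:1000, ]

fit <- randomForest(y ~ x1 + x2, data = train)

## one probability column per class; take the "1" column
p1   <- predict(fit, newdata = test, type = "prob")[, "1"]
pred <- factor(ifelse(p1 > 0.15, "1", "0"), levels = levels(d$y))

table(true = test$y, pred = pred)
```

In practice you would tune the cutoff on the training set (or out-of-bag predictions) and only then apply it to the test set, as described in the message above.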
I wouldn't bother modifying classwt -- that doesn't seem to have much
effect (as Breiman has mentioned to Andy Liaw).

I use the following function, "biased.predict". (Not specific to
random forests, which is why it's not in the package.)

biased.predict <- function(object, newdata, thresh,
                           which.test = "Bad", if.high = "Bad",
                           if.low = "Good", pred.type = "prob") {
  ## predicted class probabilities, one column per class
  probs <- predict(object, newdata = newdata, type = pred.type)
  levels <- dimnames(probs)[[2]]
  ## declare if.high when the probability of the which.test class
  ## exceeds thresh, if.low otherwise
  ans <- apply(probs, 1, function(x) {
    ifelse(x[which.test] > thresh, if.high, if.low)
  })
  factor(ans, levels = levels)
}

You can get the errors of different types -- the confusion matrix --
from table(data.frame(true = true.vals, pred = pred.vals)), and then
multiply this by a weight matrix to get a weighted error score. You
can run biased.predict for a number of different threshold values and
check the weighted error scores, choosing the threshold that gives you
the lowest. (Though running this over and over is inefficient --
better to predict the probabilities once and then apply multiple
cutoffs.) Or you can choose your threshold by saying that one type of
error must be no larger than a certain value (which is what I've
usually done, precisely to limit false negatives, as you want to).
Once you've chosen a threshold, you can use biased.predict for new
data.

I hope I'm making sense, and that this helps.

Matt

-----Original Message-----
From: Andrew Baek [mailto:andrew at stat.ucla.edu]
Sent: Thursday, September 26, 2002 4:36 PM
To: Wiener, Matthew
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] CART vs. Random Forest

Of course, CART and RF are different methods. But at the least, I have
to account for the fact that a false negative is more serious than a
false positive in my problem. For this purpose, I used "prior" in
rpart and "classwt" in RF. Then, should I modify the priors and the
cut-off point at the same time?

Andrew

On Thu, 26 Sep 2002, Wiener, Matthew wrote:

> We haven't implemented different voting thresholds in the package
> itself, but when you predict you can get out votes or probabilities
> rather than classes if you want. [...]
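The "predict the probabilities once, then try multiple cutoffs" search described above can be sketched in base R. Here truth, prob1, and the weight matrix W are all made up for illustration; prob1 stands in for one column of predict(..., type = "prob"):

```r
## Choose a cutoff by minimizing a weighted error: confusion matrix
## times weight matrix, summed.  All data here is simulated.
set.seed(1)
n     <- 500
truth <- factor(sample(c("Good", "Bad"), n, replace = TRUE,
                       prob = c(0.9, 0.1)))
## fake probabilities of "Bad", higher on average for true "Bad"
prob1 <- pmin(pmax(ifelse(truth == "Bad", 0.4, 0.1) +
                   rnorm(n, 0, 0.15), 0), 1)

## Rows = true, columns = predicted, in level order c("Bad", "Good").
## A false negative (true Bad predicted Good) costs 5, a false
## positive costs 1 -- an illustrative 5:1 ratio.
W <- matrix(c(0, 1,
              5, 0), nrow = 2,
            dimnames = list(true = c("Bad", "Good"),
                            pred = c("Bad", "Good")))

weighted.err <- function(th) {
  pred <- factor(ifelse(prob1 > th, "Bad", "Good"),
                 levels = c("Bad", "Good"))
  cm <- table(true = truth, pred = pred)   # 2x2 confusion matrix
  sum(cm * W) / n                          # weighted error score
}

cutoffs <- seq(0.05, 0.95, by = 0.05)
errs    <- sapply(cutoffs, weighted.err)
best    <- cutoffs[which.min(errs)]
```

The probabilities are computed once and only the (cheap) thresholding is repeated, which is the efficiency point made in the message above.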
> I suppose we could implement a threshold that could be supplied to
> predict, but then we'd have to work something out for multi-class
> problems -- several different cutpoints, I guess. It's not a
> priority for Andy or me right now.

Leo has been working on this for his Version 4. Thus I see no reason
for me to spend time on it now 8-). From the alpha code that he sent
me, he has three different ways of thresholding. They all work after
the trees are grown, so the OOB estimates do not reflect the
threshold. I don't see a good way to deal with that so far.

Andy