thr3ads.net - R announce - random forests for R [Apr 2002]

Liaw, Andy

2002-Apr-02 15:22 UTC

random forests for R

Hi all,

There is now a package available on CRAN that provides an R interface to Leo
Breiman's random forest classifier.

Basically, random forest does the following:

1.  Select ntree, the number of trees to grow, and mtry, a number no larger
than the number of variables.
2.  For i = 1 to ntree, repeat steps 3 to 5:
3.  Draw a bootstrap sample from the data.  Call the observations not in the
bootstrap sample the "out-of-bag" data.
4.  Grow a "random" tree, where at each node the best split is chosen among
mtry randomly selected variables.  The tree is grown to maximum size and not
pruned back.
5.  Use the tree to predict the out-of-bag data.
6.  In the end, use the predictions on the out-of-bag data to form majority
votes.
7.  Predict test data by majority vote over the predictions from the
ensemble of trees.
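
The steps above can be sketched in a few lines of plain Python.  This is only
an illustration under stated assumptions, not the package's code or Breiman's
Fortran: the function names (grow_tree, random_forest, predict_forest), the
Gini split criterion, the midpoint thresholds, and the toy data are all
choices made here for the example, and only numeric predictors are handled.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(X, y, mtry, rng):
    """Grow an unpruned tree. A leaf is a label; a split is a 4-tuple."""
    if len(set(y)) == 1:                      # pure node: stop splitting
        return y[0]
    best = None
    for j in rng.sample(range(len(X[0])), mtry):   # step 4: mtry variables
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0               # assumed: midpoint thresholds
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:                          # sampled variables are constant
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node


def random_forest(X, y, ntree, mtry, seed=0):
    """Steps 2 to 6: grow ntree trees and collect out-of-bag votes."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(ntree):
        boot = [rng.randrange(n) for _ in range(n)]     # step 3: bootstrap
        in_bag = set(boot)
        tree = grow_tree([X[i] for i in boot], [y[i] for i in boot], mtry, rng)
        trees.append(tree)
        for i in range(n):                              # step 5: predict OOB
            if i not in in_bag:
                oob_votes[i][predict_tree(tree, X[i])] += 1
    # Step 6: majority vote per observation (None if it was never OOB).
    oob_pred = [v.most_common(1)[0][0] if v else None for v in oob_votes]
    return trees, oob_pred


def predict_forest(trees, row):
    """Step 7: majority vote over all trees for a new observation."""
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]


# Toy data (made up for this example): two well-separated groups.
X = [[0.1 * i, 0.2 * i] for i in range(10)] + \
    [[5 + 0.1 * i, 5 + 0.2 * i] for i in range(10)]
y = ["a"] * 10 + ["b"] * 10

trees, oob_pred = random_forest(X, y, ntree=25, mtry=1, seed=1)
scored = [(p, t) for p, t in zip(oob_pred, y) if p is not None]
oob_error = sum(p != t for p, t in scored) / len(scored)
print("OOB error estimate:", oob_error)
print("Prediction for [0, 0]:", predict_forest(trees, [0.0, 0.0]))
```

The out-of-bag error is computed only over observations that were left out of
at least one bootstrap sample, which is why the sketch filters out None before
scoring.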

In the tech report
http://oz.berkeley.edu/users/breiman/randomforest2001.pdf, Breiman showed
that this technique is very competitive with boosting classification trees.
In our own experience, it is competitive with nonlinear classifiers such as
artificial neural nets and support vector machines.  Two of the significant
advantages of random forests over other methods (IMHO) are: a) there is only
one parameter (mtry) to adjust, and the result is usually not sensitive to
it; and b) the built-in cross-validation via the use of out-of-bag data gives
a quite accurate estimate of test set error, and offers quite effective
protection against overfitting.
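
As a rough illustration of why the out-of-bag data can act as a built-in
hold-out set: a bootstrap sample of size n misses any given observation with
probability (1 - 1/n)^n, which is close to exp(-1), about 37%, for large n.
So each tree is tested on roughly a third of the data it never saw.  The
Python sketch below checks this numerically; the function name oob_fraction
is made up for the example.

```python
import math
import random


def oob_fraction(n, seed=0):
    """Fraction of n observations left out of one bootstrap sample."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}   # indices drawn at least once
    return 1.0 - len(in_bag) / n


n = 10000
theoretical = (1.0 - 1.0 / n) ** n      # probability an observation is OOB
simulated = oob_fraction(n)

print("theoretical:", round(theoretical, 4))
print("limit exp(-1):", round(math.exp(-1), 4))
print("simulated:", round(simulated, 4))
```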

The code is based on version 3.1 of the original Fortran code written by
Breiman and Cutler (http://www.stat.berkeley.edu/users/breiman/).  The User
Guide for the Fortran code on Breiman's web site explains some of the
facilities provided in the code (such as assessing variable importance, and
proximity measures).  Some facilities provided in the original Fortran code
have been taken out:  transforming data to principal components, and
multidimensional scaling of the "proximity" matrix.  These can easily be
done in R before and after calls to the random forest functions.  Random
numbers are generated by R's RNG, rather than the one supplied in the
original Fortran code.

I'd like to thank Profs. B. D. Ripley, J. Lindsey, and others on R-help who
answered many of my questions when I was working on this package.  The
formula interface and part of the code in the predict method are outright
"stolen" from svm() in the e1071 package and nnet() in the VR bundle.

Questions/comments/bugs/patches welcome!

Regards,
Andy
Andy I. Liaw, PhD
Biometrics Research          Phone: (732) 594-0820
Merck & Co., Inc.              Fax: (732) 594-1565
P.O. Box 2000, RY70-38            Rahway, NJ 07065
mailto:andy_liaw at merck.com



=============================================================================
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-announce mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-announce-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Warnes, Gregory R

2002-Apr-03 11:55 UTC

[R] RE: random forests for R

Hi Andy,

I'm glad to see that someone has put up an R package of Leo's code.  I made
an R package using his first release of the code, but never had/took the
time to push it through the publication review process here so that I could
distribute it.  I'm glad you have.

-Greg

Bernd Huwe

2002-Apr-05 08:56 UTC

[R] AW: random forests for R

Super. I like the algorithm very much, and it does not seem all that
difficult to implement either. It is somewhat reminiscent of genetic
algorithms.

Regards,
Bernd Huwe
