Hi all,

I am trying out the randomForest function for classification. I know its performance is great. I am currently using the default options, but it has many of them.

How can I tune the options to improve its performance further? Which options are most commonly used?

Thanks a lot!

M
When I plot the randomForest object, it shows a graph with three lines: green, red and black. What do these three lines mean?
As ?plot.randomForest says, it plots error rates. In addition to the overall error rate, it also plots the error rate for each class.

As to the options in randomForest, read about them in the help page and in the reference linked from the help page.

Andy

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Yes, I do know. That's why I pointed you to the reference linked from the help page.

BTW, there's also an R News article describing the initial version of the package. Have you perused that?

Andy

-----Original Message-----
From: Michael [mailto:comtech.usa@gmail.com]
Sent: Tuesday, March 07, 2006 9:27 PM
To: Liaw, Andy
Cc: R-help@stat.math.ethz.ch
Subject: Re: [R] how to use the randomForest and rpart function?

It did not have a legend showing which color is for class 1, which color is for class 2, etc.

I've read the R help page. It lists a lot of options, but it does not say which ones are the key parameters that people use most for improving performance.

Do you know?
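For readers following along, here is a minimal sketch of how the curves in the plot can be matched to a legend. It assumes a two-class problem (a subset of the built-in iris data, chosen only for illustration), an arbitrary seed, and 500 trees; none of these come from the thread. It relies on the err.rate component of the fitted object, whose columns are the overall OOB error followed by one column per class.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, for reproducibility only
# Illustrative two-class data: iris without the "setosa" class
dat <- droplevels(subset(iris, Species != "setosa"))
rf <- randomForest(Species ~ ., data = dat, ntree = 500)

# err.rate has one row per tree; columns are "OOB" plus one per class
err_cols <- colnames(rf$err.rate)
print(err_cols)

# plot() draws one curve per column of err.rate; passing col and lty
# explicitly keeps the legend in step with the curves
k <- length(err_cols)
plot(rf, col = seq_len(k), lty = seq_len(k),
     main = "Error rate vs. number of trees")
legend("topright", legend = err_cols, col = seq_len(k), lty = seq_len(k))
```

With the default palette, colors 1, 2 and 3 are black, red and green, which would explain the three lines described above: the first is the overall OOB error and the other two are the per-class errors.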
Wow, I didn't know that. That's great! He's the man!

On 3/8/06, Carlos Ortega <coforfe@gmail.com> wrote:
> Hello Michael,
>
> Just a few words about your phrase "Do you know?"...
>
> Andy Liaw is the creator and maintainer of the randomForest package. He ported Breiman's original library to R.
>
> Regards,
> Carlos.
Thanks a lot Joe! I will take a further look at the article...

On 3/8/06, Joseph Retzer <joe_retzer@yahoo.com> wrote:
> Hi Michael,
>
> I've looked into this a bit, and the only parameter that seems to be suggested (by Breiman and a Salford Systems white paper) as one which may have an impact on the RF model is the one that sets the number of candidate split variables (mtry) for each tree split. The default for a categorical response is the square root of the total number of attributes, and (total number of attributes)/3 for regression.
>
> Take a look at the tuneRF function in randomForest, which takes the default and searches above and below it to see if the OOB error rate can be improved by changing mtry. Based on my very limited experimentation with the program, the default value seems to be tough to improve on.
>
> Best of luck & take care,
> Joe Retzer
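The tuneRF search Joe describes might be run roughly as follows. This is only a sketch: the iris data, the seed, and the values of ntreeTry, stepFactor and improve are illustrative assumptions, not recommendations from the thread.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, for reproducibility only
x <- iris[, 1:4]
y <- iris$Species

# Default mtry for classification: floor(sqrt(number of predictors))
default_mtry <- floor(sqrt(ncol(x)))

# Search below and above the default; each candidate forest is grown
# with ntreeTry trees and judged by its OOB error
res <- tuneRF(x, y, mtryStart = default_mtry, ntreeTry = 200,
              stepFactor = 2, improve = 0.01,
              trace = FALSE, plot = FALSE)
print(res)  # one row per mtry tried: first column mtry, second OOB error

# mtry value with the lowest OOB error among those tried
best_mtry <- res[which.min(res[, 2]), 1]
print(best_mtry)
```

The differences between candidate values are often within the noise of the OOB estimate, so it is worth rerunning with a different seed before concluding that a non-default mtry is really better.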
Michael -

I recall reading something Breiman wrote that said, essentially, "don't skimp on the number of trees - they are cheap to build and it makes for a better model." Also, look at your error rates (using plot), and make sure you run enough trees that the error settles down. You'll likely be building 1000 or so trees.

Tim

> Hi Andy,
>
> Does randomForest have cross-validation built in to decide the best number of trees, or do I have to find the best number manually?
>
> Thanks a lot!
>
> Michael.
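One way to check numerically that the error has settled down, rather than judging it from the plot alone, is to look at the cumulative OOB error stored in the fitted object. The sketch below assumes the iris data, an arbitrary seed, and arbitrary checkpoints; the 200-tree window for the stability check is likewise just an illustrative choice.

```r
library(randomForest)

set.seed(7)  # arbitrary seed, for reproducibility only
rf <- randomForest(Species ~ ., data = iris, ntree = 1000)

# Row i of err.rate is the OOB error of the forest made of the first i trees
oob <- rf$err.rate[, "OOB"]
checkpoints <- c(50, 100, 250, 500, 1000)
print(round(oob[checkpoints], 4))

# A crude stability check: spread of the OOB error over the last 200 trees
tail_range <- diff(range(tail(oob, 200)))
print(tail_range)
```

If the spread over the last few hundred trees is still large compared with the error itself, growing more trees is the cheap fix.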
Thanks a lot Andy,

Do I need to center and scale the data before passing it to rpart and randomForest? I know that for LDA and QDA it does not matter, and that for ridge regression it does.

Thanks a lot!

Michael.

On 3/8/06, Liaw, Andy <andy_liaw@merck.com> wrote:
> The algorithm has something slicker than cross-validation. That's the whole OOB (out-of-bag) business mentioned in the R News article. The number of trees isn't really a parameter, as it doesn't hurt to have `too many trees' (other than wasting computing resources). Some people routinely run more than 10,000 trees just to make sure.
>
> Sometimes mtry does matter (though that's more of an exception than the rule). I can find pathological cases where mtry=1 is best, or mtry=number of covariates (bagging) is best, but given real data, one almost never has any idea.
>
> Andy
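As a rough illustration of the OOB idea Andy mentions, the sketch below reads the OOB error off a fitted forest and then compares the two extremes of mtry with the default. The iris data, the seed and ntree = 500 are assumptions made for the example only.

```r
library(randomForest)

set.seed(123)  # arbitrary seed, for reproducibility only
x <- iris[, 1:4]
y <- iris$Species
p <- ncol(x)

rf <- randomForest(x, y, ntree = 500)

# predict() with no new data returns the out-of-bag predictions:
# each case is voted on only by trees that did not see it in training
oob_pred <- predict(rf)
oob_err <- mean(oob_pred != y, na.rm = TRUE)
print(oob_err)
print(rf$err.rate[rf$ntree, "OOB"])  # should be the same quantity

# OOB error at the two extremes of mtry and at the default
for (m in c(1, floor(sqrt(p)), p)) {  # m = p corresponds to bagging
  fit <- randomForest(x, y, ntree = 500, mtry = m)
  cat("mtry =", m, " OOB error =", fit$err.rate[fit$ntree, "OOB"], "\n")
}
```

Because every case is scored only by trees grown without it, the OOB error plays the role that a held-out fold plays in cross-validation, without any extra model fitting.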
No. Tree-based methods are invariant to monotone transformations of the predictors, because they only use ranks. Rotation can matter, though.

Andy

-----Original Message-----
From: Michael [mailto:comtech.usa@gmail.com]
Sent: Friday, March 10, 2006 8:11 PM
To: Liaw, Andy
Cc: R-help@stat.math.ethz.ch
Subject: Re: [R] how to use the randomForest and rpart function?

Do I need to have centering and scaling before sending data into rpart and randomForest?
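The invariance Andy describes can be checked with a small experiment. The sketch below uses rpart because a single tree is deterministic, which makes the comparison clean; the iris data and the particular transformations (centering and scaling, plus a log of one column) are illustrative assumptions. Any strictly increasing map should behave the same way.

```r
library(rpart)

# Same data twice: raw, and with monotonically transformed predictors
raw <- iris
transformed <- iris
transformed[, 1:4] <- scale(iris[, 1:4])            # center and scale
transformed$Petal.Length <- log(iris$Petal.Length)  # another increasing map

fit_raw <- rpart(Species ~ ., data = raw)
fit_trans <- rpart(Species ~ ., data = transformed)

# Fitted classes on the training data
pred_raw <- predict(fit_raw, type = "class")
pred_trans <- predict(fit_trans, type = "class")

# The split thresholds differ between the two trees, but the way the
# cases are partitioned, and hence the fitted classes, should agree
print(table(pred_raw, pred_trans))
print(all(pred_raw == pred_trans))
```

A rotation, such as replacing the predictors with their principal components, mixes the variables and changes their ordering, so the same check would not be expected to hold in that case.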