Dear R users,

I've just released a new version of randomForest (available on CRAN now). This version contains quite a number of new features and bug fixes compared to versions prior to 4.0-x (and a few more since 4.0-1). For those not familiar with randomForest, it's an ensemble classification/regression tool. Please see http://www.math.usu.edu/~adele/forests/ for more detailed information, as well as the Fortran code. Comments/questions/bug reports/patches much appreciated!

A few notes about the new version:

o There is a new tuneRF() function for searching for the optimal mtry, following Breiman's suggestion. PLEASE use it to see if the result can be improved!

o A new variable importance measure replaces the one based on margin. This new measure is the same as in Breiman's V5, and the analogous measure is also implemented for regression. It is designed to be more robust against data where predictor variables have very different numbers of possible splits (i.e., unique values/categories); the previous measure tends to make variables with more possible splits look more important.

o For classification, the new measure is also computed on a per-class basis.

o There is a new `sampsize' option for down-sampling larger classes. E.g., if in a two-class problem there are 950 class 1s and 50 class 2s, using sampsize=c(50, 50) will usually give a more `balanced' classifier.

o There is a new importance() function for extracting the importance measure.

o The predict() method has an option to return predictions by the component trees.

o There is a new getTree() function for looking at one of the trees in the forest.

o For dealing with missing values in the predictor variables, there are na.roughfix() and rfImpute(), which correspond to the `missquick' and `missright' options in Breiman's V4/V5 code. Both work for classification as well as regression.
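A quick sketch of how tuneRF(), `sampsize', and importance() fit together. The data set (iris) and all parameter values here are just illustrative choices, not recommendations:

```r
library(randomForest)
set.seed(17)

x <- iris[, 1:4]
y <- iris$Species

## Search for the mtry value minimizing OOB error, starting from 2 and
## doubling/halving (stepFactor = 2) until no improvement is seen:
tuned <- tuneRF(x, y, mtryStart = 2, ntreeTry = 100,
                stepFactor = 2, improve = 0.02,
                trace = FALSE, plot = FALSE)
tuned    # a matrix of mtry values and their OOB error rates

## Draw 25 cases per class for each tree (stratified by the response)
## and compute the new permutation-based importance measure:
rf <- randomForest(x, y, ntree = 200, sampsize = c(25, 25, 25),
                   importance = TRUE)

## One column per class, plus the overall measures:
importance(rf)
```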
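And a sketch of the per-tree predictions, getTree(), and the two missing-value helpers, again on iris with arbitrary parameter choices:

```r
library(randomForest)
set.seed(42)

rf <- randomForest(Species ~ ., data = iris, ntree = 50)

## predict.all = TRUE returns a list: $aggregate is the usual voted
## prediction, $individual is a matrix with one column per tree:
p <- predict(rf, iris[1:5, ], predict.all = TRUE)
dim(p$individual)    # 5 cases by 50 trees

## Inspect the structure of the 3rd tree (labelVar = TRUE gives
## readable variable names instead of column indices):
tr <- getTree(rf, k = 3, labelVar = TRUE)

## Missing-value handling: na.roughfix() fills NAs with column
## medians/modes (`missquick'); rfImpute() refines the imputations
## iteratively using forest proximities (`missright'):
ir.na <- iris
ir.na[c(2, 7), 3] <- NA
quick <- na.roughfix(ir.na)
good  <- rfImpute(Species ~ ., data = ir.na)
```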
o There is an experimental bias reduction step in regression (the corr.bias argument in randomForest) that could be very effective for some data (but has essentially no effect for others).

Some notes about differences between the package and Breiman's Fortran code:

o Breiman uses the class weights to cast weighted votes. This is not done in the R version. However, one can use the threshold argument to randomForest to get a similar (but not exactly the same) effect.

o In Breiman's V4/V5 code, the Gini-based importance is weighted by the out-of-bag data. This has not been implemented in the R version.

o Breiman's V4/V5 code can handle categorical predictors with more than 32 categories. This has not been implemented in the R version.

o Breiman's classification code uses mtry differently from the R version: the mtry variables are sampled *with replacement* at each node. The R version samples without replacement, so that if mtry is set to the number of predictors, one gets the same result as bagging. Breiman's regression code *does* sample the variables without replacement.

o In the R version, ties are broken at random when finding the best variables or when making predictions. In Breiman's code, the first one found wins.

o The `prototypes' Breiman described have not been implemented. There are situations where they can be misleading, so I have chosen not to implement them.

o The `interaction detection' feature in Breiman's V5 has not been implemented (but is fairly high on my to-do list).

Best,
Andy

Andy Liaw, PhD
Biometrics Research
PO Box 2000, RY33-300
Merck Research Labs
Rahway, NJ 07065
mailto:andy_liaw at merck.com
732-594-0820

_______________________________________________
R-packages mailing list
R-packages at stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/r-packages
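P.S. A minimal sketch of the experimental corr.bias option and of the mtry/bagging equivalence mentioned above, on synthetic regression data (the data and parameter values are arbitrary, for illustration only; how much corr.bias helps is data-dependent):

```r
library(randomForest)
set.seed(101)

n <- 300
x <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
y <- 5 * x$x1 + 2 * x$x2^2 + rnorm(n, sd = 0.3)

## Fit with and without the experimental bias correction:
rf.plain <- randomForest(x, y, ntree = 200)
rf.bc    <- randomForest(x, y, ntree = 200, corr.bias = TRUE)

## Compare OOB mean squared error after all 200 trees:
c(plain = tail(rf.plain$mse, 1), corrected = tail(rf.bc$mse, 1))

## Because the R version samples candidate variables without
## replacement, setting mtry to the number of predictors means every
## split considers all variables, i.e. bagging:
bag <- randomForest(x, y, ntree = 200, mtry = ncol(x))
```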