Dimitri Liakhovitski
2010-May-05 17:51 UTC
[R] randomForest: predictor importance (for regressions)
I have a question about predictor importances in randomForest.

Once I've run randomForest and got my object, I get the importances:
rfresult$importance
I also get the "standard errors" of the permutation-based importance measure:
rfresult$importanceSD

I have 2 questions:

1. Because I am dealing with regression, I am getting an importance object (rfresult$importance) with two columns, labeled "%IncMSE" (the first column) and "IncNodePurity" (the second column). I assume it's the first one that is the mean decrease in accuracy due to permutation. Am I correct? I am confused because ?randomForest says: "For Regression, the first column is the mean decrease in accuracy and the second the mean decrease in MSE." But it is the first column, not the second, that has "MSE" in its header.

2. According to this thread (http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg94873.html), the overall importance measure is mean(d[i]) / se(d[i]), where se(d[i]) is sd(d[i])/sqrt(ntree) (the "standard error"). So, to get at the importance of predictors (and I want to use the permutation-based importance), should I just take the first column of rfresult$importance, or should I first divide rfresult$importance by rfresult$importanceSD to get something analogous to z-scores and use those?

Thank you very much!

--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski@ninah.com
See reply inline below.

Andy

From: Dimitri Liakhovitski
> I have a question about predictor importances in randomForest.
>
> Once I've run randomForest and got my object, I get their importances:
> rfresult$importance
> I also get the "standard errors" of the permutation-based importance
> measure: rfresult$importanceSD
>
> I have 2 questions:
>
> 1. Because I am dealing with regressions, I am getting an importance object
> (rfresult$importance) with two columns, labeled "%IncMSE" (the first column)
> and "IncNodePurity" (the second column). I assume it's the first one that is
> the mean decrease in accuracy due to permutation. Am I correct or am I
> wrong? I am confused because ?randomForest says: "For Regression, the first
> column is the mean decrease in accuracy and the second the mean decrease in
> MSE." - but it is the first column, not the second that has "MSE" in its
> header.

In regression trees, node impurity is measured by MSE; therefore the second measure, which averages the cumulative reduction in node impurity due to splits by a variable over all trees, is labelled "mean decrease in MSE".

> 2. According to this thread (
> http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg94873.html), The
> overall importance measure is mean(d[i]) / se(d[i]), where se(d[i]) is
> sd(d[i])/sqrt(ntree) (the "standard error").
> So, in order to get at the importance of predictors (and I want to use the
> permutation-based importance) - should I just take the first column of
> rfresult$importance or should I first divide rfresult$importance by
> rfresult$importanceSD - to get something analogous to z-scores and use
> those?

See the "scale" argument in ?importance. The recommended way of extracting components of an object in R is to use the extractor functions when they exist.

> Thank you very much!
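As a rough illustration of the reply above, here is a minimal sketch of pulling the regression importances through the importance() extractor rather than through rfresult$importance directly. The data set (mtcars), the seed, ntree = 500, and the object names rf, raw, scaled, by_hand and purity are assumptions made for the example, not part of the original thread. It assumes the forest is fitted with importance = TRUE so that the permutation-based measure is available.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# importance = TRUE asks randomForest to compute the permutation-based measure
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

# type = 1: permutation-based measure ("%IncMSE" for regression)
# scale = FALSE returns the raw values as stored in rf$importance
raw <- importance(rf, type = 1, scale = FALSE)

# scale = TRUE (the default) divides the raw values by their "standard errors"
scaled <- importance(rf, type = 1, scale = TRUE)

# The same scaling done by hand from the stored components, for comparison
by_hand <- rf$importance[, "%IncMSE"] / rf$importanceSD

# type = 2: node-impurity measure ("IncNodePurity"); it has no scaled version
purity <- importance(rf, type = 2)

print(cbind(raw = raw[, 1], scaled = scaled[, 1], by_hand = by_hand))
print(purity)
```

In this sketch the scaled and by_hand columns should agree, which suggests that importance(rf, type = 1) with the default scale = TRUE already gives the z-score-like quantity asked about in question 2, while scale = FALSE gives the raw first column.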
Possibly Parallel Threads
- Which column in randomForest importances (for regression) is MSE and which IncNodePurity
- question regarding "varImpPlot" results vs. model$importance data on package "RandomForest"
- Error on random forest variable importance estimates
- interpret the importance output?
- Question on: Random Forest Variable Importance for Regression Problems