Pierre Dubath
2010-Aug-06 14:17 UTC
[R] Error on random forest variable importance estimates
Hello,

I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated.

I believe that I understand how randomForest works and how the variable importances are evaluated (through variable permutations). Here are my questions.

1) Variable importance error? Is there any way to estimate the error on the "MeanDecreaseAccuracy"? In other words, I would like to know how significant the "MeanDecreaseAccuracy" differences are (and display horizontal error bars in the varImpPlot output). I have noticed that even with a relatively large number of trees, I have variation in the importance values from one run to the next. Could this serve as a measure of the errors/uncertainties?

2) How to deal with variable correlation? So far, I am iterating: selecting the most important variable first, removing all other variables that have a high correlation with it (say higher than 80%), taking the second most important variable left, removing variables highly correlated with either of the first two, and so on (also using some astronomical insight as to which variables are the most important!). A rough sketch of this loop is in the P.S. below. Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms.)

3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variables used (second sketch in the P.S.). As this number increases, the error first decreases sharply, but relatively soon it reaches a plateau. I assume that the point of inflexion can be used to derive the minimum number of variables to be used. Is that a sensible approach? Is there any other suggestion? A measure of the error on "err.rate" would also really help here. Is there any idea how to estimate this? From the variation between runs, or with the help of "importanceSD" somehow?

Thanks very much in advance for any help.

Pierre Dubath
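P.S. A rough R sketch of the filtering loop from point 2, where X is my predictor data frame (numeric columns) and y the class factor; the 0.8 cutoff is just the value I happen to be using:

library(randomForest)

## Fit once on all predictors to get an initial importance ranking.
## X is the predictor data frame, y the class factor (placeholders).
rf0 <- randomForest(X, y, importance = TRUE)
imp <- importance(rf0, type = 1)                 # MeanDecreaseAccuracy
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

## Greedy filter: walk down the ranking and keep a variable only if its
## absolute correlation with everything already kept stays below 0.8.
cmat <- abs(cor(X))
keep <- character(0)
for (v in ranked) {
  if (length(keep) == 0 || all(cmat[v, keep] < 0.8)) {
    keep <- c(keep, v)
  }
}
keep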
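And the forward-addition curve from point 3, reusing the keep vector from the sketch above (ntree = 500 is illustrative):

## Refit with the k most important kept variables and record the final
## out-of-bag error rate each time, to see where the curve flattens out.
oob <- sapply(seq_along(keep), function(k) {
  rf <- randomForest(X[, keep[1:k], drop = FALSE], y, ntree = 500)
  rf$err.rate[rf$ntree, "OOB"]
})
plot(seq_along(keep), oob, type = "b",
     xlab = "number of variables", ylab = "OOB error rate")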
From: Pierre Dubath
> Hello,
>
> I am using the R randomForest package to classify variable stars. I have
> a training set of 1755 stars described by (too) many variables. Some of
> these variables are highly correlated.
>
> I believe that I understand how randomForest works and how the variable
> importances are evaluated (through variable permutations). Here are my
> questions.
>
> 1) Variable importance error? Is there any way to estimate the error on
> the "MeanDecreaseAccuracy"? In other words, I would like to know how
> significant the "MeanDecreaseAccuracy" differences are (and display
> horizontal error bars in the varImpPlot output).

If you really want to do it, one possibility is to do a permutation test: permute your response, say, 1000 or 2000 times, run RF on each of these permuted responses, and use the importance measures as samples from the null distribution.

> I have noticed that even with a relatively large number of trees, I have
> variation in the importance values from one run to the next. Could this
> serve as a measure of the errors/uncertainties?

Yes.

> 2) How to deal with variable correlation? So far, I am iterating:
> selecting the most important variable first, removing all other variables
> that have a high correlation with it (say higher than 80%), taking the
> second most important variable left, removing variables highly correlated
> with either of the first two, and so on (also using some astronomical
> insight as to which variables are the most important!).
>
> Is there a better way to deal with correlation in randomForest? (I
> suppose that using many correlated variables should not be a problem for
> randomForest, but it is for my understanding of the data and for other
> algorithms.)

That depends a lot on what you're trying to do. RF can tolerate problematic data, but that doesn't mean it will magically give you good answers. Trying to draw conclusions about effects when there are highly correlated (and, worse, important) variables is a tricky business.

> 3) How many variables should eventually be used? I have made successive
> runs, adding one variable at a time from the most to the least important
> (not-too-correlated) variables. I then plot the error rate (err.rate) as
> a function of the number of variables used. As this number increases, the
> error first decreases sharply, but relatively soon it reaches a plateau.
> I assume that the point of inflexion can be used to derive the minimum
> number of variables to be used. Is that a sensible approach? Is there any
> other suggestion? A measure of the error on "err.rate" would also really
> help here. Is there any idea how to estimate this? From the variation
> between runs, or with the help of "importanceSD" somehow?

One approach is described in the following paper (in the Proceedings of MCS 2004): http://www.springerlink.com/content/9n61mquugf9tungl/

Best,
Andy
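A minimal R sketch of the permutation test suggested above, using the same placeholder names X (predictor data frame) and y (class factor); the permutation count and forest size are illustrative:

library(randomForest)

## Null distribution for MeanDecreaseAccuracy: refit on permuted
## responses and collect the importances (1000 permutations as
## suggested; reduce nperm/ntree for a quick check).
set.seed(1)
nperm <- 1000
null_imp <- replicate(nperm, {
  rf <- randomForest(X, sample(y), importance = TRUE, ntree = 500)
  importance(rf, type = 1)[, 1]
})

## Observed importance from the unpermuted fit, and a per-variable
## permutation p-value: the fraction of null importances at least as
## large as the observed one.
obs <- importance(randomForest(X, y, importance = TRUE, ntree = 500),
                  type = 1)[, 1]
pval <- rowMeans(null_imp >= obs)
sort(pval)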
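And a sketch of the run-to-run spread confirmed above as a usable uncertainty measure: repeat the fit on the real response and take per-variable standard deviations as candidate error bars.

## Refit a handful of forests and summarise each variable's importance
## by its mean and standard deviation across runs.
runs <- replicate(20, {
  rf <- randomForest(X, y, importance = TRUE, ntree = 500)
  importance(rf, type = 1)[, 1]
})
imp_mean <- rowMeans(runs)
imp_sd <- apply(runs, 1, sd)   # candidate horizontal error bars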