Andreas Béguin
2010-May-24 11:45 UTC
[R] Random Forest for Ecological Prediction under presence of Spatial Autocorrelation
Dear R-help list members,

I have a statistical question regarding the Random Forest function (RF) as applied to ecological prediction of species presences and absences.

RF seems to perform very well for prediction of species ranges or prevalences. However, the problem with my dataset is a high degree of spatial autocorrelation and therefore a low effective sample size compared to the full number of gridpoints (0.5 degree grid extending over all land areas north of 55 deg. south, ~60000 grid points). My variables are highly correlated in the x and y directions. When using the entire dataset in the RF function, the misclassification rate is implausibly low, suggesting overfitting. The noisy marginal probability plots (see attached example) seem to support this idea.

My question is: Is there a way to make the decision trees in RF more generalizable without modelling the spatial autocorrelation explicitly? Here are four ways of doing this that I have thought about:

1. Spatially clustering observations into training and test datasets and averaging the predicted class probability values to approximate "real" certainty.
   - This could be done at country level or in a chessboard-like pattern.
2. Requiring a higher minimal nodesize to prevent the creation of overfitted, maximal trees.
   - Which value of "nodesize" might be appropriate?
3. Reducing the number of variables in the model by taking just one out of each group of correlated variables (say, for example, only winter temperature instead of temperatures from all seasons).
   - This variable selection would be based on the Variable Importance plots. I was considering using the Gini measure ranking instead of the accuracy ranking to produce simpler, more "biological" trees. Please comment on this.
4. Requiring RF to choose only a certain number of "TRUE" and "FALSE" ("presence"-"absence") observations using the "sampsize" option, thereby increasing the distance between the gridpoints chosen to build the model so as to reduce correlation between observations.

Which of these pathways would you suggest pursuing? Certainly some of you have faced and tackled the problem of spatial autocorrelation in ecological prediction. I am aware of the works of Araujo et al. (2005) and Koenig (1999); any further suggested reading (especially examples of how spatial autocorrelation can be dealt with in practice) would be highly welcome.

Kind regards,

Andreas Beguin
##########################################
Division of Epidemiology and Global Health
Department of Public Health and Clinical Medicine
Umea University
907 31 Umea Sweden
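The chessboard-like split mentioned under option 1 can be sketched in a few lines. This is only an illustration in Python, not code from the thread: the function names, the 10-degree block size, and the small example grid are all assumptions. In practice the block size would need to be chosen larger than the range of the spatial autocorrelation, which the post does not state.

```python
import math
from collections import defaultdict


def chessboard_fold(lon, lat, block_deg=10.0):
    """Return 0 or 1: the colour of the chessboard square containing (lon, lat).

    block_deg is the side length of a square in degrees (an assumed value).
    """
    col = math.floor(lon / block_deg)
    row = math.floor(lat / block_deg)
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative longitudes/latitudes are handled correctly.
    return (col + row) % 2


def split_by_fold(points, block_deg=10.0):
    """Group (lon, lat) points by the colour of their chessboard square."""
    folds = defaultdict(list)
    for lon, lat in points:
        folds[chessboard_fold(lon, lat, block_deg)].append((lon, lat))
    return folds


# Hypothetical 0.5-degree grid over a small window (not the poster's data).
grid = [(x * 0.5, y * 0.5) for x in range(-40, 40) for y in range(-20, 20)]

folds = split_by_fold(grid, block_deg=10.0)
train, test = folds[0], folds[1]
print(len(train), len(test))
```

Fitting on one colour and evaluating on the other, then swapping the roles, gives an error estimate in which test points are not immediate neighbours of training points, except along block edges.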
Gabor Grothendieck
2010-May-24 14:10 UTC
[R] Random Forest for Ecological Prediction under presence of Spatial Autocorrelation
You could also try the Boruta package for variable selection.
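As a simpler complement to importance-based selection, the idea in option 3 of keeping one variable per group of correlated variables can be sketched as a greedy filter on pairwise Pearson correlation. This is not the Boruta algorithm, and the function names, the 0.9 threshold, and the toy values below are all assumptions made for illustration. The sketch is in Python using only the standard library.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def prune_correlated(columns, threshold=0.9):
    """Greedily keep variables that are not strongly correlated with any kept one.

    columns maps variable name -> list of values. Dict order sets the priority,
    so variables listed first win ties (e.g. order them by importance).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy predictors, not the poster's data.
columns = {
    "temp_winter": [-5.0, -2.0, 0.0, 3.0, 8.0, 12.0],
    "temp_spring": [1.0, 4.0, 6.5, 9.0, 14.0, 18.5],
    "precip_annual": [900.0, 300.0, 750.0, 400.0, 820.0, 350.0],
}

kept = prune_correlated(columns, threshold=0.9)
print(kept)
```

Here the spring temperature is dropped because it tracks the winter temperature almost linearly, while precipitation is kept. Which member of a correlated group survives depends entirely on the order of the input, so that ordering is the place to encode a preference such as a variable importance ranking.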