thr3ads.net - R help - [R] Random Forest for Ecological Prediction under presence of Spatial Autocorrelation [May 2010]


Andreas Béguin

2010-May-24 11:45 UTC

[R] Random Forest for Ecological Prediction under presence of Spatial Autocorrelation

Dear R-help list members,

I have a statistical question regarding the Random Forest function (RF) as
applied to ecological prediction of species presences and absences.

RF seems to perform very well for predicting species ranges or
prevalences. However, my dataset has a high degree of spatial
autocorrelation, and therefore a low effective sample size compared with
the full number of grid points (0.5 degree grid covering all land
areas north of 55 deg. south, ~60000 grid points). My variables are highly
correlated in both the x and y directions. When I use the entire dataset
in the RF function, the misclassification rate is unbelievably low,
suggesting overfitting. The noisy marginal probability plots (see attached
example) seem to support this idea. My question is: is there a way to make
the decision trees in RF more generalizable without modelling the spatial
autocorrelation explicitly? Here are four approaches I have thought
about:
1. Spatially clustering observations into training and test datasets and
averaging the predicted class probabilities to approximate the "real"
certainty. This could be done at country level or in a chessboard-like
pattern.
2. Requiring a higher minimum node size to prevent the creation of
overfitted, maximal trees. Which value of "nodesize" might be
appropriate?
3. Reducing the number of variables in the model by taking just one
out of each group of correlated variables (for example, only winter
temperature instead of temperatures from all seasons). This variable
selection would be based on the Variable Importance plots. I was considering
using the Gini measure ranking instead of the accuracy ranking to produce
simpler, more "biological" trees; please comment on this.
4. Requiring RF to choose only a certain number of "TRUE" and "FALSE"
("presence"-"absence") observations using the "sampsize" option, thereby
increasing the distance between the grid points chosen to build the model so
as to reduce correlation between observations.
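
To make options 1, 2 and 4 concrete, here is a minimal sketch of how they might be combined with the randomForest package. Everything about the data is an assumption for illustration: the data frame `grid`, its column names, the 20-degree block size, nodesize = 20 and the per-class sample of 100 are all made up and would need tuning on the real grid. Note also that "sampsize" draws points at random for each tree, so it thins the sample but does not by itself enforce a minimum distance between the chosen grid points.

```r
library(randomForest)

set.seed(1)
n <- 2000

# Synthetic stand-in for the real grid (hypothetical column names).
grid <- data.frame(lon = runif(n, -180, 180), lat = runif(n, -55, 80))
grid$temp_winter <- 30 * cos(grid$lat * pi / 180) - 10 + rnorm(n)
grid$temp_summer <- grid$temp_winter + 15 + rnorm(n)
grid$precip      <- 100 + 50 * sin(grid$lon * pi / 90) + rnorm(n, sd = 10)
grid$presence    <- factor(grid$temp_winter + rnorm(n, sd = 3) > 10,
                           levels = c(FALSE, TRUE))

# Option 1: chessboard-like split. Square cells of `size` degrees are
# assigned alternately to fold 0 and fold 1.
checker_fold <- function(lon, lat, size) {
  (floor(lon / size) + floor(lat / size)) %% 2
}
grid$fold <- checker_fold(grid$lon, grid$lat, size = 20)

predictors <- c("temp_winter", "temp_summer", "precip")

fit_and_score <- function(train, test) {
  # Option 4: draw the same (small) number of absences and presences per tree.
  per_class <- min(100, min(table(train$presence)))
  rf <- randomForest(
    x = train[, predictors], y = train$presence,
    ntree    = 200,
    nodesize = 20,                       # option 2: larger terminal nodes
    strata   = train$presence,
    sampsize = c(per_class, per_class)
  )
  # Misclassification rate on the spatially separated test cells
  mean(predict(rf, test[, predictors]) != test$presence)
}

# Train on one colour of the chessboard, test on the other, then swap.
err <- sapply(0:1, function(k) {
  fit_and_score(grid[grid$fold != k, ], grid[grid$fold == k, ])
})
print(err)
```

Comparing `err` with the out-of-bag error of a forest fitted to all points gives a rough idea of how much of the apparent accuracy comes from spatially close neighbours ending up on both sides of the split.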

Which of these pathways would you suggest pursuing? Certainly some of you
have faced and tackled the problem of spatial autocorrelation in ecological
prediction. I am aware of the works of Araujo et al. (2005) and Koenig
(1999); any further suggested reading (especially examples of how spatial
autocorrelation can be dealt with in practice) would be highly welcome.

Kind regards,

Andreas Beguin
##########################################
Division of Epidemiology and Global Health
Department of Public Health and Clinical Medicine
Umea University
907 31 Umea Sweden

Gabor Grothendieck

2010-May-24 14:10 UTC


You could also try the Boruta package for variable selection.
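
A minimal sketch of what that might look like, on made-up data (the data frame `dat` and its column names are hypothetical, and maxRuns = 50 is an arbitrary choice). One caveat: Boruta is designed to find all relevant variables, so it will typically keep several members of a correlated group rather than pick one of them.

```r
library(Boruta)

set.seed(1)
n <- 300

# Synthetic example: two strongly correlated temperature variables
# plus two pure-noise predictors.
dat <- data.frame(
  temp_winter = rnorm(n),
  noise_a     = rnorm(n),
  noise_b     = rnorm(n)
)
dat$temp_summer <- dat$temp_winter + rnorm(n, sd = 0.1)
dat$presence    <- factor(dat$temp_winter + rnorm(n, sd = 0.5) > 0)

# Boruta compares each predictor's importance with that of shuffled
# "shadow" copies and labels it Confirmed, Tentative or Rejected.
bor <- Boruta(presence ~ ., data = dat, maxRuns = 50)
print(bor)

# Names of the predictors labelled Confirmed
selected <- getSelectedAttributes(bor, withTentative = FALSE)
print(selected)
```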

