Clément Calenge
2011-Feb-15 14:55 UTC
[R] [slightly OT] predict.randomForest and type="prob"
Dear all,
I would like to use the function randomForest to predict the probability
of relocation failure of a GPS collar as a function of several
environmental variables x (both factor and numeric: slope, vegetation,
etc.) on a given area. The response variable y is thus success
(0)/failure (1) of the relocation, and the sampling unit is the pixel of
a raster map. My aim is to build a map predicting the probability that a
relocation will fail, P(y=1|x), at each point. I am tempted to use the
function predict.randomForest() to estimate this probability (with
type="prob").
If I understand correctly, this function returns the proportion of trees
in the random forest voting in favour of the success or failure of the
relocation. In the appendix of the paper cited as reference on the help
page of the function randomForest() (Breiman, 2001. Random Forests),
Breiman notes that these proportions of votes can be interpreted as the
probability, calculated over all trees, that a tree, given the variables
x and the training set, would classify a relocation as a success or as a
failure (using Breiman's notation, P_\Theta( h(\Theta, x) = failure )).
I have found several threads on R-help related to
predict.randomForest(..., type="prob") that confirm this interpretation
of these probabilities (e.g.,
http://r.789695.n4.nabble.com/quot-prob-quot-in-predict-randomForest-td887278.html,
http://r.789695.n4.nabble.com/Random-Forest-AUC-td3006649.html).
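To make this interpretation concrete, here is a minimal sketch on simulated data. The variable names (slope, vegetation), the logistic model used to generate failures and the sample size are assumptions made up for illustration, not my real data. It compares the output of type="prob" with the share of individual trees voting "failure", obtained with predict.all=TRUE.

```r
library(randomForest)

set.seed(1)

# Simulated pixels: one numeric and one factor covariate (illustrative only)
n <- 500
slope <- runif(n, 0, 45)
vegetation <- factor(sample(c("open", "shrub", "forest"), n, replace = TRUE),
                     levels = c("open", "shrub", "forest"))

# Assumed "true" failure probability, used only to generate the response
p_fail <- plogis(-2 + 0.05 * slope + 1.0 * (vegetation == "forest"))
y <- factor(rbinom(n, 1, p_fail), levels = c(0, 1),
            labels = c("success", "failure"))
dat <- data.frame(y, slope, vegetation)

rf <- randomForest(y ~ slope + vegetation, data = dat, ntree = 500)

newdat <- dat[1:10, c("slope", "vegetation")]

# Class "probabilities" as returned by predict.randomForest
prob <- predict(rf, newdat, type = "prob")

# Predictions of every single tree, then the share voting "failure"
all_trees <- predict(rf, newdat, predict.all = TRUE)
votes_fail <- rowMeans(all_trees$individual == "failure")

# Compare the two quantities for the ten pixels
print(cbind(prob_failure = prob[, "failure"], vote_share = votes_fail))
print(all.equal(unname(prob[, "failure"]), unname(votes_fail)))
```

If the interpretation above is right, the two columns should coincide, which only says something about the trees and nothing yet about P(failure | environment).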
However, I would like to know under which conditions (assumptions about
the process, parameters of the randomForests, etc.) it is correct to use
this proportion of votes as an estimate of the “true” probability
P(failure | environment) characterizing the relocation process. I
searched the web and the literature, but I did not find any reference
describing how these two probabilities are connected, although Breiman
(2002; Manual On Setting Up, Using, And Understanding Random Forests
V3.1) just noted that the proportion of votes “should not be interpreted
as the underlying distributional probabilities”.
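One empirical way to look at the gap between the two quantities is a simulation in which the underlying probability is known. The sketch below is only that: the generating model, the covariate names and the bin width are assumptions chosen for illustration. It bins the vote proportions on an independent simulated data set and compares, within each bin, the mean vote proportion with the mean simulated probability and the observed failure frequency.

```r
library(randomForest)

set.seed(42)

# Hypothetical generating process with a known failure probability
simulate <- function(n) {
  slope <- runif(n, 0, 45)
  vegetation <- factor(sample(c("open", "shrub", "forest"), n, replace = TRUE),
                       levels = c("open", "shrub", "forest"))
  p_fail <- plogis(-2 + 0.05 * slope + (vegetation == "forest"))
  y <- factor(ifelse(rbinom(n, 1, p_fail) == 1, "failure", "success"),
              levels = c("success", "failure"))
  data.frame(y, slope, vegetation, p_fail)
}

train <- simulate(1000)
test  <- simulate(1000)

rf <- randomForest(y ~ slope + vegetation, data = train, ntree = 500)

# Proportion of trees voting "failure" for each independent test pixel
vote_fail <- predict(rf, test, type = "prob")[, "failure"]

# Group pixels by vote proportion and summarise each group
bins <- cut(vote_fail, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
calib <- data.frame(
  mean_vote   = tapply(vote_fail, bins, mean),            # what the forest reports
  mean_true_p = tapply(test$p_fail, bins, mean),          # simulated P(failure | x)
  obs_failure = tapply(test$y == "failure", bins, mean),  # observed frequency
  n           = as.vector(table(bins))
)
print(calib)
```

Such a table shows whether the votes track the simulated probability in one particular setting, but it does not tell me under which general assumptions this holds, which is the point of my question.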
Could you point me toward some references about this problem, or give me
ideas of the assumptions under which this approximation would be correct?
Thanks for any hint!
Best regards,
Clément Calenge
> version
_
platform i686-pc-linux-gnu
arch i686
os linux-gnu
system i686, linux-gnu
status Under development (unstable)
major 2
minor 13.0
year 2011
month 02
day 06
svn rev 54234
language R
version.string R version 2.13.0 Under development (unstable) (2011-02-06
r54234)
--
Clément CALENGE
Cellule d'appui à l'analyse de données
Direction des Etudes et de la Recherche
Office national de la chasse et de la faune sauvage
Saint Benoist - 78610 Auffargis
tel. (33) 01.30.46.54.14