Clément Calenge
2011-Feb-15 14:55 UTC
[R] [slightly OT] predict.randomForest and type="prob"
Dear all,

I would like to use the function randomForest() to predict the probability of relocation failure of a GPS collar as a function of several environmental variables x (both factor and numeric: slope, vegetation, etc.) on a given area. The response variable y is thus success (0) / failure (1) of the relocation, and the sampling unit is the pixel of a raster map. My aim is to build a map predicting the probability that a relocation will fail, P(y=1|x), at each point.

I am tempted to use the function predict.randomForest() to estimate this probability (with type="prob"). If I understand correctly, this function returns the proportion of trees in the random forest voting in favour of the success or failure of the relocation. In the appendix of the paper cited as reference on the help page of the function randomForest() (Breiman, 2001. Random Forests), Breiman notes that these proportions of votes can be interpreted as the probability, calculated over all trees, that a tree, given the variables x and the training set, would classify a relocation as a success or as a failure (using Breiman's notation, P_\Theta( h(\Theta, x) = failure )). I have found several threads on R-help related to predict.randomForest(..., type="prob") that confirm this interpretation of these probabilities (e.g., http://r.789695.n4.nabble.com/quot-prob-quot-in-predict-randomForest-td887278.html, http://r.789695.n4.nabble.com/Random-Forest-AUC-td3006649.html).

However, I would like to know under which conditions (assumptions about the process, parameters of the random forest, etc.) it is correct to use this proportion of votes as an estimate of the "true" probability P(failure | environment) characterizing the relocation process.
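In symbols, the two probabilities being compared can be written side by side. This is only a sketch under assumed notation: B (the number of trees, i.e. ntree), \Theta_1, ..., \Theta_B (the randomizations of the individual trees, taken as i.i.d. given the training set), \hat{p}_B and p are symbols introduced here for illustration; only h and \Theta come from Breiman's notation. The last display is a hypothetical extreme case, not a statement about how real forests behave. It needs amsmath for \xrightarrow.

```latex
% Vote fraction returned with type="prob", and its large-forest limit
% (law of large numbers over the tree randomizations, training set held fixed).
\[
\hat{p}_B(x) \;=\; \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\{h(\Theta_b, x) = \text{failure}\}
\;\xrightarrow[B \to \infty]{\text{a.s.}}\;
P_\Theta\bigl(h(\Theta, x) = \text{failure}\bigr)
\]
% Quantity one would like to map.
\[
p(x) \;=\; P(y = \text{failure} \mid x)
\]
% Hypothetical extreme case: every tree predicts the most likely class,
% i.e. h(\Theta, x) = failure if and only if p(x) > 1/2. Then
\[
P_\Theta\bigl(h(\Theta, x) = \text{failure}\bigr) \;=\; \mathbf{1}\{p(x) > 1/2\} \;\in\; \{0, 1\},
\]
% which differs from p(x) whenever 0 < p(x) < 1.
```

The extreme case suggests that accurate individual trees do not by themselves make the vote fraction a good estimate of p(x); whether the two are close depends on how the trees are grown.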
I searched the web and the literature, but I did not find any reference describing how these two probabilities are connected. Breiman (2002; Manual On Setting Up, Using, And Understanding Random Forests V3.1) only notes that the proportion of votes "should not be interpreted as the underlying distributional probabilities". Could you point me toward some references about this problem, or give me ideas of the assumptions under which this approximation would be correct?

Thanks for any hint!

Best regards,

Clément Calenge

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status         Under development (unstable)
major          2
minor          13.0
year           2011
month          02
day            06
svn rev        54234
language       R
version.string R version 2.13.0 Under development (unstable) (2011-02-06 r54234)

--
Clément CALENGE
Cellule d'appui à l'analyse de données
Direction des Etudes et de la Recherche
Office national de la chasse et de la faune sauvage
Saint Benoist - 78610 Auffargis
tel. (33) 01.30.46.54.14
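One way to probe the question empirically is to simulate relocations from a known P(failure | x) and compare it with the vote fractions. The following is a minimal sketch, not the poster's analysis: the covariates slope and veg, the logistic form of p_true, the sample size and ntree = 500 are all illustrative assumptions. It uses the out-of-bag vote fractions stored in the votes component of the fitted randomForest object.

```r
# Sketch: compare random-forest vote fractions with a known P(failure | x).
# The simulated data and the form of p_true are assumptions for illustration.
library(randomForest)

set.seed(1)
n <- 2000
slope <- runif(n, 0, 45)                                          # hypothetical numeric covariate
veg   <- factor(sample(c("open", "forest"), n, replace = TRUE))   # hypothetical factor covariate

# Assumed "true" failure probability (logistic in slope and vegetation)
p_true <- plogis(-3 + 0.08 * slope + 1.2 * (veg == "forest"))
y <- factor(ifelse(runif(n) < p_true, "failure", "success"),
            levels = c("success", "failure"))
dat <- data.frame(y, slope, veg)

rf <- randomForest(y ~ slope + veg, data = dat, ntree = 500)

# Out-of-bag vote fractions for the "failure" class
# (rf$votes holds fractions because norm.votes = TRUE by default)
p_oob <- rf$votes[, "failure"]

# Crude calibration table: within bins of the vote fraction, compare the mean
# vote fraction with the observed failure rate and the mean true probability
bins <- cut(p_oob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
calib <- data.frame(
  mean_vote = as.vector(tapply(p_oob, bins, mean)),
  obs_rate  = as.vector(tapply(y == "failure", bins, mean)),
  mean_true = as.vector(tapply(p_true, bins, mean)),
  n         = as.vector(table(bins)),
  row.names = levels(bins)
)
print(calib)
```

If the vote fractions were well calibrated in this simulation, mean_vote, obs_rate and mean_true would be close in each row; empty bins show up as NA. Rerunning with other settings (for example a larger nodesize) gives a rough feel for how the forest parameters affect the comparison, without settling the theoretical question.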