Christoph Muller
2007-Sep-05 16:34 UTC
[R] ecological meaning of randomForest vegetation classification?
Hi, everyone,

I haven't found anything similar in the forum, so here's my problem (I'm no expert in R or statistics):

I have a data set of 59,000 cases with 9 variables each (fractional coverage of 9 different plant types, such as deciduous broad-leaved temperate trees or evergreen tropical trees), which was generated by a vegetation model. In order to evaluate the quality of the vegetation model's output, I want to compare it to a land-cover data set which has 23 different land-cover types (such as needle-leaved evergreen forest, dense broad-leaved forest, barren, etc.).

A statistician advised me to use the randomForest package in R. Using a subset to generate the random forest, I get a very good prediction for the rest. However, I need to evaluate how meaningful this classification is in an ecological sense (boreal trees should not play a role in the definition of tropical land-cover types, for example); otherwise I cannot judge the quality of the vegetation model's output.

Unfortunately, randomForest gives me about 15,000 splits, of which about 5,000 are end branches (rough guess), so it's very hard and time-consuming to check every branch of one of the final trees for its ecological meaning.

Is there any utility to summarize the characteristics of each of the 23 prediction classes? Such as "land-cover class 1 has less than 5% of plant types 1-5, 20-50% of plant type 7 and at least 30% of plant type 8". Or is there a more suitable method to classify my data?

Thanks a lot in advance!

Christoph
Liaw, Andy
2007-Sep-05 17:49 UTC
[R] ecological meaning of randomForest vegetation classification? [Broadcast]
Hi Christoph,

I'm not exactly sure what you're looking for, but I'll take a stab anyway.

The trees in a random forest are not designed to be interpreted the way one would interpret an "ordinary" tree. There are several things you may try to see if they help you any.

One is the distribution of votes. It looks like you are classifying each data point into one of many possible classes. RF will give you the fraction of trees in the forest that classified the observation as a particular class (and the class with the highest fraction of votes is the "predicted class").

Another is the partial dependence plot: you can use plot(importance(rf.object)) to see which variables are the most important, and then use partialPlot() to examine their marginal effects.

These offer some clue of what the RF black box is doing, and hopefully will make some sense to you.

Best,
Andy
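The suggestions in the reply above (vote fractions, variable importance, partial dependence) might look roughly like the following in practice. This is only a minimal sketch on synthetic stand-in data, not the poster's vegetation data: the column names, class labels, sample sizes and `ntree` value are all made up for illustration. It assumes the randomForest package is installed and that the forest is grown with `importance = TRUE`, which is what makes the per-class importance columns available.

```r
# Sketch only: synthetic stand-in data with hypothetical names,
# not the vegetation-model output discussed in the thread.
library(randomForest)

set.seed(42)
n <- 600

# Hypothetical fractional cover of three plant types
veg <- data.frame(
  boreal_tree   = runif(n),
  tropical_tree = runif(n),
  grass         = runif(n)
)

# Hypothetical land-cover label: the class of the dominant plant type
class_names <- c("boreal_forest", "tropical_forest", "grassland")
landcover <- factor(class_names[max.col(as.matrix(veg))])

# Grow the forest on a subset, keep the rest for checking
train <- sample(n, 400)
rf <- randomForest(x = veg[train, ], y = landcover[train],
                   ntree = 200, importance = TRUE)

# 1. Distribution of votes: for each held-out case, the fraction of
#    trees that voted for each class (each row sums to 1)
votes <- predict(rf, veg[-train, ], type = "vote", norm.votes = TRUE)
print(head(round(votes, 2)))

# 2. Variable importance: with importance = TRUE the matrix has one
#    column per class, so one can check which plant types matter
#    for which land-cover class
imp <- importance(rf)
print(round(imp[, levels(landcover)], 2))

# 3. Partial dependence of one class on one plant type
#    (marginal effect of tropical_tree on the tropical_forest class)
pd <- partialPlot(rf, veg[train, ], "tropical_tree",
                  which.class = "tropical_forest")
```

Reading the per-class columns of `imp` is one way to spot ecologically implausible relationships, such as a boreal plant type ranking as important for a tropical land-cover class. `varImpPlot(rf)` from the same package gives a ready-made plot of the overall importance measures.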