lara harrup (IAH-P)
2009-Jul-20 17:11 UTC
[R] randomForest - what is a 'good' pseudo r-squared?
Hi all,

I have been using the randomForest package to model insect species abundance in different habitats and to identify the key variables (landscape, climate, etc.) that determine abundance. This has all worked fine, and I get nice variable importance plots. Many thanks to everyone on this help forum who has given tips and advice along the way.

However, the percentage of variance explained (pseudo R-squared) reported when I call print(model) is quite low: depending on the species being modelled, it ranges from a maximum of 23.69 right down to -2.08.

I believe a negative value represents a model that performs no better, or worse, than random, and obviously the larger the R^2, the better the predictive ability. But over what range does this R^2 operate?

It is not unexpected that some of these models have poor predictive accuracy, as part of the larger project around this work is to argue that finer-resolution remotely sensed satellite imagery is needed to derive the climate variables etc. used to predict species abundance.

My question is probably a bit like "how long is a piece of string", but if anyone could offer some guidance on what constitutes a good / very good / bad / very bad R-squared value for random forest, it would be most appreciated. Are there any other accuracy measures that can be used with random forest in addition to the pseudo R^2? This work will be presented to an entomology/ecology audience, for whom machine learning is a bit outside their (and my) statistics comfort zone.

Many thanks in advance,

Lara
lara.harrup@bbsrc.ac.uk
Generally speaking, a pseudo R^2 of 70% indicates a rather good model (though this obviously depends on the kind of data you have at hand). Because it is a "pseudo" R^2 rather than a "real" one, its range is not limited to [0, 100%], but it's hard for me to imagine anyone getting >100%.

You may want to check the distribution of the response (or residuals) to see whether a transformation is appropriate. Tree-based methods (of which random forest is one) can be sensitive to heteroscedasticity.

Best,
Andy
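To make the range question concrete, here is a minimal sketch in base R. It assumes the "% Var explained" figure is 100 * (1 - MSE / Var(y)), with the MSE taken from out-of-bag predictions and the variance computed with divisor n; that is my understanding of what randomForest reports for regression, not something confirmed in this thread. Under that assumption the value is 0 when the forest does no better than always predicting the mean, negative when it does worse, and cannot exceed 100% because the MSE is never negative. The vectors `y` and `pred` and the helper names `pseudo_r2`, `rmse` and `mae` are made up for illustration; RMSE and MAE are shown as additional accuracy measures in the units of the response.

```r
# Hypothetical observed abundances and out-of-bag predictions (made-up numbers).
# With a fitted regression forest `model`, predict(model) called without
# newdata is expected to return the out-of-bag predictions.
y    <- c(2, 5, 1, 8, 4, 7, 3, 6)
pred <- c(3, 4, 2, 6, 4, 6, 4, 5)

# Pseudo R^2 under the assumed definition: 1 - MSE / Var(y), divisor n.
pseudo_r2 <- function(obs, fit) {
  mse <- mean((obs - fit)^2)
  # Denominator is the MSE of always predicting mean(obs)
  1 - mse / mean((obs - mean(obs))^2)
}

# Two other accuracy measures, both in the units of the response.
rmse <- function(obs, fit) sqrt(mean((obs - fit)^2))
mae  <- function(obs, fit) mean(abs(obs - fit))

r2_model <- pseudo_r2(y, pred)                    # reasonable predictions
r2_mean  <- pseudo_r2(y, rep(mean(y), length(y))) # always predict the mean: 0
r2_bad   <- pseudo_r2(y, 9 - y)                   # worse than the mean: negative

print(c(model = r2_model, mean_only = r2_mean, bad = r2_bad))
print(c(rmse = rmse(y, pred), mae = mae(y, pred)))
```

Read this way, a value such as -2.08% says the out-of-bag predictions for that species were slightly worse than simply predicting the mean abundance, while 23.69% says the forest removed roughly a quarter of the variance around the mean.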