Simon Gillings
2011-Feb-10 16:45 UTC
[R] Prediction accuracy from Bagging with continuous data
I am using bagging to perform Bagged Regression Trees on count data (bird abundance in Britain and Ireland, in relation to climate and land cover variables). Predictions from the final model are visually believable but I would really like a diagnostic equivalent to classification success that can be used to decide if a model is adequate. Whereas with classification data an error rate is returned, with continuous data only the root mean squared error is returned. The RMSE is helpful for comparing different models for the same species and deciding which is best, but as far as I can tell it offers no absolute measure of how good that best model is.

At present I am using the final model to make predictions for the original dataset and then computing a correlation coefficient between observed and predicted values, but I expect this is probably biased high due to non-independence. Ideally I think I need the correlation coefficient between the predictions and observed values for the out-of-bag sample for each of the n trees produced, but I don't see this produced anywhere.

Does anyone know of a means of getting a useful unbiased diagnostic for assessing overall fit?

thanks

Simon

Dr Simon Gillings
Senior Research Ecologist - Land Use
British Trust for Ornithology
If you do use correlation, you should think about doing it on the log or sqrt scale.

The train() function in the caret package can estimate performance using resampling. There are examples in ?train that show how to define custom performance measures (I think it shows how to do this with MAD estimates of error).

Max
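A rough sketch of how the caret suggestion might look in practice, not a definitive recipe. The data are simulated stand-ins for the bird counts, the column names (temp, wood, count) and the function name sqrtCorSummary are made up for illustration, and method = "treebag" (bagged CART, which needs the ipred package installed) is assumed to be close enough to the bagging already in use. The custom summary function returns the observed-versus-predicted correlation on the square-root scale, computed only on held-out folds.

```r
library(caret)

set.seed(1)

# Simulated stand-in for the real data: Poisson counts driven by two predictors
n <- 200
dat <- data.frame(temp = runif(n, 0, 20), wood = runif(n))
dat$count <- rpois(n, lambda = exp(0.1 * dat$temp + 1.5 * dat$wood))

# Custom performance measure (hypothetical name): correlation on the sqrt scale.
# caret calls this with a data frame holding columns 'obs' and 'pred'
# for the observations that were held out of the fit.
sqrtCorSummary <- function(data, lev = NULL, model = NULL) {
  out <- cor(sqrt(data$obs), sqrt(pmax(data$pred, 0)))
  names(out) <- "sqrtCor"
  out
}

# 10-fold cross-validation, scored with the custom measure
ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = sqrtCorSummary)

fit <- train(count ~ temp + wood, data = dat,
             method = "treebag",            # bagged regression trees
             metric = "sqrtCor", maximize = TRUE,
             trControl = ctrl)

# Resampled estimate of the correlation and its spread across folds
print(fit$results)
```

Because each fold is predicted by a model that never saw it, the reported sqrtCor should not carry the optimism of a correlation computed on the training data. Swapping sqrt() for log1p() would give the log-scale version while still coping with zero counts.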
Peter Langfelder
2011-Feb-11 00:49 UTC
[R] Prediction accuracy from Bagging with continuous data
Not sure this suggestion is going to help you, but you could switch to the Random Forest ensemble of regression trees (package randomForest). The Random Forest predictor automatically calculates predicted values for the out-of-bag samples and hence will give you a source to calculate an unbiased estimate of accuracy.

Peter
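A minimal sketch of the out-of-bag idea with randomForest, under stated assumptions: the data are simulated, the column names (temp, rain, wood, count) are invented for illustration, and mtry is set to the number of predictors so that every split considers all variables, which makes the forest equivalent to plain bagging of regression trees. For a fitted regression forest, the predicted component holds the out-of-bag predictions, so each observation is predicted only by trees that did not see it.

```r
library(randomForest)

set.seed(42)

# Simulated stand-in for the real data: counts depend on temp and wood, not rain
n <- 300
dat <- data.frame(temp = runif(n, 0, 20),
                  rain = runif(n, 500, 2000),
                  wood = runif(n))
dat$count <- rpois(n, lambda = exp(0.1 * dat$temp + 1.5 * dat$wood))

predictors <- c("temp", "rain", "wood")

# mtry = number of predictors: all variables are candidates at every split,
# i.e. bagged regression trees rather than a random forest proper
fit <- randomForest(x = dat[, predictors], y = dat$count,
                    ntree = 500, mtry = length(predictors))

# Out-of-bag predictions: one value per observation
oob_pred <- fit$predicted

# Predictions for the training data from the full ensemble (optimistic)
insample_pred <- predict(fit, newdata = dat[, predictors])

# Observed vs predicted correlation on the sqrt scale, both ways
r_oob      <- cor(sqrt(dat$count), sqrt(oob_pred))
r_insample <- cor(sqrt(dat$count), sqrt(insample_pred))

# Out-of-bag proportion of variance explained (1 - SSE / SST)
oob_rsq <- 1 - sum((dat$count - oob_pred)^2) /
               sum((dat$count - mean(dat$count))^2)

print(c(r_oob = r_oob, r_insample = r_insample, oob_rsq = oob_rsq))
```

Comparing r_oob with r_insample gives a feel for how much the correlation computed on the original dataset is inflated. The out-of-bag figure is the one to treat as a diagnostic of overall fit.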