Joshua Knowles
2007-Dec-19 11:43 UTC
[R] randomForest() for regression produces offset predictions
Hi all,

I have observed that when using the randomForest package to do regression, the predicted values of the dependent variable given by a trained forest are not centred and have the wrong slope when plotted against the true values.

This means that the R^2 values obtained by squaring the Pearson correlation are better than those obtained by computing the coefficient of determination directly. The R^2 value obtained by squaring the Pearson correlation can, however, be exactly reproduced by the coefficient of determination if the predicted values are first linearly transformed (using lm() to find the required intercept and slope).

Does anyone know why randomForest behaves in this way, producing offset predictions? Does anyone know a fix for the problem?

(By the way, the effect is present even if the original dependent variable values are first transformed to have zero mean and unit variance.)

As an example, here is some simple R code that uses the built-in swiss dataset to show the effect I am observing.

Thanks for any help.

--
#### EXAMPLE OF RANDOM FOREST REGRESSION

library(randomForest)
data(swiss)
swiss

# Build the random forest to predict Infant Mortality
rf.rf <- randomForest(Infant.Mortality ~ ., data = swiss)

# And predict the training set again
pred <- c(predict(rf.rf, swiss))
actual <- swiss$Infant.Mortality

# Plotting predicted against actual values shows the effect
# (uncomment to see this)
# plot(pred, actual)
# abline(0, 1)

# Calculate R^2 as the Pearson correlation squared
R2one <- cor(pred, actual)^2

# Calculate R^2 as the fraction of variance explained
residOpt <- actual - pred
residnone <- actual - mean(actual)
R2two <- 1 - var(residOpt, na.rm = TRUE) / var(residnone, na.rm = TRUE)

# Now fit a line through the predicted and true values and
# use this to normalise the predictions before calculating R^2
fit <- lm(actual ~ pred)
coef(fit)
pred2 <- pred * coef(fit)[2] + coef(fit)[1]
residOpt <- actual - pred2
R2three <- 1 - var(residOpt, na.rm = TRUE) / var(residnone, na.rm = TRUE)

cat("Pearson squared = ", R2one, "\n")
cat("Coeff of determination = ", R2two, "\n")
cat("Coeff of determination after linear fitting = ", R2three, "\n")

## END

--
Joshua Knowles .. j.knowles at manchester.ac.uk
BBSRC David Phillips Fellow
School of Computer Science
The University of Manchester
http://dbkgroup.org/knowles/
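P.S. For comparison, the forest's out-of-bag predictions give an assessment that does not re-use the training rows. A minimal sketch, assuming the documented randomForest behaviour that predict() with no newdata returns the out-of-bag predictions, and that the $rsq component holds the running pseudo R-squared:

# Out-of-bag predictions: each case is predicted only by trees
# that were not trained on it (same values as rf.rf$predicted)
oob.pred <- predict(rf.rf)

# Coefficient of determination based on out-of-bag predictions
R2oob <- 1 - sum((actual - oob.pred)^2) / sum((actual - mean(actual))^2)
cat("OOB coeff of determination = ", R2oob, "\n")

# randomForest reports essentially the same quantity itself:
# $rsq is the pseudo R-squared, 1 - mse/Var(y), as trees are added
tail(rf.rf$rsq, 1)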
Patrick Burns
2007-Dec-20 19:58 UTC
[R] randomForest() for regression produces offset predictions
What I see is the predictions being less extreme than the actual values: predictions for large actual values are smaller than the actual, and predictions for small actual values are larger than the actual. That makes sense to me. The object is to maximize out-of-sample predictive power, not in-sample predictive power.

Or am I missing something in what you are saying?

Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")
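P.S. A quick way to see this attenuation numerically, using the pred and actual objects from the example above. A sketch only; the exact numbers vary from run to run because the forest is random, but the direction of the comparisons is the point:

# Shrinkage towards the mean shows up as a slope below 1 when the
# predictions are regressed on the actual values
coef(lm(pred ~ actual))

# ...and as the predictions having a narrower spread than the actuals
sd(pred)
sd(actual)

# For a reproducible comparison, set a seed before training:
# set.seed(1); rf.rf <- randomForest(Infant.Mortality ~ ., data = swiss)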