Hello,
I have a conceptual doubt regarding the R² values reported by lm() and by
postResample() (from the caret library).
I've got a multiple linear regression model (let's say mlr) with an R²
value of 67.52%.
Then I use this model to make predictions with the predict() function,
using the same data as input; that is, I use the generated model to predict
the values associated with the data it was fitted on.
Next, if I apply postResample() to the observed and predicted data, why do
I get an R² value of 33%? I mean, wasn't it supposed to be at least 67%,
as in the original model, since both use the same data as input?
Here is the code (the data is at the end of the email):
# read the input data
input <- read.table("input.csv", header = TRUE)
# multiple linear regression (the -1 drops the intercept)
mlr <- lm(input$TOTAL ~ -1 + input$A + input$B + input$C + input$D)
# observe the model
summary(mlr)
Call:
lm(formula = input$TOTAL ~ -1 + input$A + input$B + input$C + input$D)
Residuals:
    Min      1Q  Median      3Q     Max
-25.753  -7.455   2.396  12.615  55.316

Coefficients:
        Estimate Std. Error t value Pr(>|t|)
input$A  10.5985     3.9782   2.664   0.0121 *
input$B   0.3471    17.7731   0.020   0.9845
input$C   0.9468     1.9442   0.487   0.6297
input$D  12.1056     4.7262   2.561   0.0155 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.08 on 31 degrees of freedom
Multiple R-Squared: 0.6752, Adjusted R-squared: 0.6333
F-statistic: 16.11 on 4 and 31 DF, p-value: 3.090e-07
# as we noticed, an R-squared value of 67.52%
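# A minimal sketch of where that 67.52% comes from, assuming mlr from the
# code above: because the -1 in the formula drops the intercept, summary.lm
# takes the model sum of squares about zero rather than about mean(TOTAL).
mss <- sum(fitted(mlr)^2)      # sums of squares about 0 (no intercept)
rss <- sum(residuals(mlr)^2)
mss / (mss + rss)              # 0.6752, matching Multiple R-Squared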
# next, let's predict the results with the same input data
prediction <- predict(mlr, input)
# now let's evaluate the predictions, observing the R² and RMSE values
# that postResample returns
library(caret)
postResample(input$TOTAL, prediction)
      RMSE  Rsquared
16.0718506 0.3300378
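# Both numbers can be reproduced from their definitions; a minimal sketch,
# assuming prediction and input from above:
sqrt(mean((prediction - input$TOTAL)^2))   # 16.0718506 (RMSE)
cor(input$TOTAL, prediction)^2             # 0.3300378 (Rsquared)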
So here comes my doubt: why do I get a value of 67.52% for R² when
creating the model (that is, the model explains 67.52% of the data), yet
when I use this same model on the same input data, postResample returns a
very different value for R²?
Best regards,
Giovane
# input.csv file used as input
"A" "B" "C" "D" "TOTAL"
1 0 1 0 3.8
1 0 1 0 21.67
1 0 0 0 2.92
2 0 6 0 42.84
0 0 0 0 5.28
2 0 0 3 44.86
1 0 0 0 8.22
1 0 0 0 28.24
1 0 3 0 29.69
1 0 0 1 78.02
3 0 7 0 51.29
2 0 0 0 37.55
2 0 2 0 10.82
1 0 3 0 17.67
0 0 0 0 6.62
2 1 3 1 36.49
0 0 0 0 37.52
1 0 2 0 5.26
1 0 2 0 7.32
1 0 0 0 2.2
2 0 6 0 39.24
0 0 0 0 2.83
2 0 0 3 50.93
1 0 0 0 4.15
1 0 0 0 29.72
1 0 3 0 4.26
1 0 0 1 25.1
3 0 7 0 12.67
2 0 0 0 7.99
2 0 2 0 17.55
1 0 3 0 3.66
0 0 0 0 7.22
0 0 0 0 3.82
0 0 0 0 28.05
3 0 7 0 34.67
On Dec 11, 2007 3:35 PM, Giovane <gufrgs at gmail.com> wrote:
>
> So here comes my doubt: why do I get a value of 67.52% for R² when
> creating the model (that is, the model explains 67.52% of the data), yet
> when I use this same model on the same input data, postResample returns
> a very different value for R²?

Let's get in the WayBack machine and return to 4 days ago when I said:

> As has been previously noted on this list, there are a number of
> formulas for R-squared. This function uses the square of the
> correlation between the observed and predicted. The next version of
> caret will offer a choice of formulas.

For your data:

> cor(prediction, input$TOTAL)^2
[1] 0.3300378

For R-squared, summary.lm uses

    ans$r.squared <- mss/(mss + rss)
    ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

and for your data rdf = 31, df.int = 0 and n = 35. In other words, the
R-squared estimate from summary.lm adjusts for the degrees of freedom and
postResample does not.

Why doesn't it use the df? In ?postResample you would see:

"Note that many models have more predictors (or parameters) than data
points, so the typical mean squared error denominator (n - p) does not
apply. Root mean squared error is calculated using
sqrt(mean((pred - obs)^2)). Also, R-squared is calculated as the square of
the correlation between the observed and predicted outcomes."

Since caret is useful for comparing different types of models, we use a
biased estimate of the root MSE, since we would like to directly compare
the RMSE from different models (say, a linear regression and a support
vector machine). Many of these models do not have an explicit number of
parameters, so we use

    mse <- mean((pred - obs)^2)

Max
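A footnote on why the two numbers diverge so sharply here: for a
least-squares fit that does include an intercept, the mss/(mss + rss) value
and the squared correlation coincide, so refitting with an intercept would
make lm() and postResample agree. A minimal sketch, assuming the input data
frame from the thread (mlr2 and pred2 are illustrative names, not from the
original posts):

# with an intercept, the two R-squared definitions give the same value
mlr2  <- lm(TOTAL ~ A + B + C + D, data = input)
pred2 <- predict(mlr2, input)
summary(mlr2)$r.squared      # mss/(mss + rss) for the intercept model ...
cor(pred2, input$TOTAL)^2    # ... equals the squared correlation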