Hello,
I have a conceptual doubt regarding the R² values reported by lm() and by
postResample() (from the caret library).
I've got a multiple linear regression model (let's say mlr) with an R²
value of 67.52%.
Then I use this model to make predictions with the predict() function,
using the same data as input; that is, I use the generated model to predict
the values associated with the data it was fitted on.
Next, if I apply postResample() to the observed and predicted data, why do
I get an R² value of 33%? I mean, wasn't it supposed to be at least 67%,
as in the original model, since both use the same data as input?
Here is the code (the data is at the end of the email):
# read the input data
input <- read.table("input.csv", header = TRUE)
# multiple linear regression (the -1 drops the intercept)
mlr <- lm(input$TOTAL ~ -1 + input$A + input$B + input$C + input$D)
# observe the model
summary(mlr)
Call:
lm(formula = input$TOTAL ~ -1 + input$A + input$B + input$C + input$D)
Residuals:
    Min      1Q  Median      3Q     Max
-25.753  -7.455   2.396  12.615  55.316

Coefficients:
        Estimate Std. Error t value Pr(>|t|)
input$A  10.5985     3.9782   2.664   0.0121 *
input$B   0.3471    17.7731   0.020   0.9845
input$C   0.9468     1.9442   0.487   0.6297
input$D  12.1056     4.7262   2.561   0.0155 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.08 on 31 degrees of freedom
Multiple R-Squared: 0.6752, Adjusted R-squared: 0.6333
F-statistic: 16.11 on 4 and 31 DF, p-value: 3.090e-07
# as we noticed, an R-squared value of 67.52%
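# A minimal sketch of where that 67.52% comes from, assuming mlr from the
# code above: because the -1 in the formula drops the intercept, summary.lm
# takes the model sum of squares about zero rather than about mean(TOTAL).
mss <- sum(fitted(mlr)^2)      # sums of squares about 0 (no intercept)
rss <- sum(residuals(mlr)^2)
mss / (mss + rss)              # 0.6752, matching Multiple R-Squared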
# next, let's predict the results with the same input data
prediction <- predict(mlr, input)
# now let's evaluate the predictions, observing the R² and RMSE values
# that postResample returns
library(caret)
postResample(input$TOTAL, prediction)
      RMSE  Rsquared
16.0718506 0.3300378
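# Both numbers can be reproduced from their definitions; a minimal sketch,
# assuming prediction and input from above:
sqrt(mean((prediction - input$TOTAL)^2))   # 16.0718506 (RMSE)
cor(input$TOTAL, prediction)^2             # 0.3300378 (Rsquared)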
So here comes my doubt: why do I get a value of 67.52% for R² when
creating the model (that is, the model explains 67.52% of the data), yet
when I use this same model on the same input data, postResample returns a
very different value for R²?
Best regards,
Giovane
# input.csv file used as input
"A" "B" "C" "D" "TOTAL"
1 0 1 0 3.8
1 0 1 0 21.67
1 0 0 0 2.92
2 0 6 0 42.84
0 0 0 0 5.28
2 0 0 3 44.86
1 0 0 0 8.22
1 0 0 0 28.24
1 0 3 0 29.69
1 0 0 1 78.02
3 0 7 0 51.29
2 0 0 0 37.55
2 0 2 0 10.82
1 0 3 0 17.67
0 0 0 0 6.62
2 1 3 1 36.49
0 0 0 0 37.52
1 0 2 0 5.26
1 0 2 0 7.32
1 0 0 0 2.2
2 0 6 0 39.24
0 0 0 0 2.83
2 0 0 3 50.93
1 0 0 0 4.15
1 0 0 0 29.72
1 0 3 0 4.26
1 0 0 1 25.1
3 0 7 0 12.67
2 0 0 0 7.99
2 0 2 0 17.55
1 0 3 0 3.66
0 0 0 0 7.22
0 0 0 0 3.82
0 0 0 0 28.05
3 0 7 0 34.67
On Dec 11, 2007 3:35 PM, Giovane <gufrgs at gmail.com> wrote:
>
> So here comes my doubt: why do I get a value of 67.52% for R² when
> creating the model (that is, the model explains 67.52% of the data), yet
> when I use this same model on the same input data, postResample returns
> a very different value for R²?

Let's get in the WayBack machine and return to 4 days ago when I said:

> As has been previously noted on this list, there are a number of
> formulas for R-squared. This function uses the square of the
> correlation between the observed and predicted. The next version of
> caret will offer a choice of formulas.

For your data:

> cor(prediction, input$TOTAL)^2
[1] 0.3300378

For R-squared, summary.lm uses

    ans$r.squared <- mss/(mss + rss)
    ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

and for your data rdf = 31, df.int = 0 and n = 35. In other words, the
R-squared estimate from summary.lm adjusts for the degrees of freedom and
postResample does not.

Why doesn't it use the df? In ?postResample you would see:

"Note that many models have more predictors (or parameters) than data
points, so the typical mean squared error denominator (n - p) does not
apply. Root mean squared error is calculated using
sqrt(mean((pred - obs)^2)). Also, R-squared is calculated as the square of
the correlation between the observed and predicted outcomes."

Since caret is useful for comparing different types of models, we use a
biased estimate of the root MSE, since we would like to directly compare
the RMSE from different models (say, a linear regression and a support
vector machine). Many of these models do not have an explicit number of
parameters, so we use

    mse <- mean((pred - obs)^2)

Max
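A footnote on why the two numbers diverge so sharply here: for a
least-squares fit that does include an intercept, the mss/(mss + rss) value
and the squared correlation coincide, so refitting with an intercept would
make lm() and postResample agree. A minimal sketch, assuming the input data
frame from the thread (mlr2 and pred2 are illustrative names, not from the
original posts):

# with an intercept, the two R-squared definitions give the same value
mlr2  <- lm(TOTAL ~ A + B + C + D, data = input)
pred2 <- predict(mlr2, input)
summary(mlr2)$r.squared      # mss/(mss + rss) for the intercept model ...
cor(pred2, input$TOTAL)^2    # ... equals the squared correlation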