John Hunter
2009-Jun-28 05:32 UTC
[R] multiple regression w/ no intercept; strange results
I am writing some software to do multiple regression and am using R to
benchmark the results. The results are squaring up nicely for the
"with-intercept" case but not for the "no-intercept" case, and I am not
sure what R is doing to get the statistics for the zero-intercept case.
For example, I would expect the Multiple R-squared to equal the square
of the correlation between the actual values "y" and the fitted values
"yprime". For the with-intercept case it does, but not for the
no-intercept case. My sample file and R session output are below:
> dataset = read.table("/Users/jdhunter/tmp/sample1.csv", header=TRUE, sep=",")
The "with-intercept" fit: the "Multiple R-Squared" is equal
to the
cor(yprime, y)**2:
> fit <- lm( y~x1+x2, data=dataset)
> summary(fit)
Call:
lm(formula = y ~ x1 + x2, data = dataset)
Residuals:
    Min      1Q  Median      3Q     Max 
-1.8026 -0.4651  0.1778  0.5241  1.0222 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -4.10358    1.26103  -3.254  0.00467 **
x1           0.08641    0.03144   2.748  0.01372 * 
x2           0.08760    0.04548   1.926  0.07100 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.7589 on 17 degrees of freedom
Multiple R-squared: 0.6709,  Adjusted R-squared: 0.6322 
F-statistic: 17.33 on 2 and 17 DF,  p-value: 7.888e-05
> yp = fitted.values(fit)
> cor(yp, dataset$y)**2
[1] 0.6709279
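As a cross-check, the with-intercept R-squared can be reproduced by
hand; this is a minimal sketch using the fit and dataset objects above,
with the textbook centered total sum of squares:

## with an intercept: R^2 = 1 - RSS/TSS, with TSS taken about mean(y)
rss <- sum(residuals(fit)^2)
tss <- sum((dataset$y - mean(dataset$y))^2)
1 - rss / tss    # should print 0.6709..., matching summary(fit)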
The "no-intercept" fit: the "Multiple R-Squared" is not
equal to the
cor(yprime, y)**2:
> fitno <- lm( y~0+x1+x2, data=dataset)
> summary(fitno)
Call:
lm(formula = y ~ 0 + x1 + x2, data = dataset)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.69640 -0.58134  0.03650  0.53673  1.33358 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1  0.03655    0.03399   1.075    0.296
x2  0.04358    0.05376   0.811    0.428

Residual standard error: 0.9395 on 18 degrees of freedom
Multiple R-squared: 0.9341,  Adjusted R-squared: 0.9267 
F-statistic: 127.5 on 2 and 18 DF,  p-value: 2.352e-11
> ypno = fitted.values(fitno)
> cor(ypno, dataset$y)**2
[1] 0.6701336
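This discrepancy matches the formula documented in ?summary.lm: when
the model has no intercept, R computes R^2 against the uncentered total
sum of squares (taken about zero rather than about mean(y)). A quick
check with the objects above:

## no intercept: summary.lm reports R^2 = 1 - RSS / sum(y^2)
rss_no <- sum(residuals(fitno)^2)
1 - rss_no / sum(dataset$y^2)    # should reproduce the 0.9341 above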
If anyone has some suggestions about how R is computing these summary
stats for the no-intercept case, or references to literature or docs,
that would be helpful. It seems odd to me that dropping the intercept
would cause the R^2 and F stats to rise so dramatically, and the p
value to consequently drop so much. In my implementation I get the
same beta1 and beta2, and the R2 I compute as
variance_regression / variance_total
agrees with cor(ypno, dataset$y)**2 but not with the value R reports
in the summary; my F and p values are similarly off for the
no-intercept case.
Thanks,
JDH
R version 2.9.1 (2009-06-26)
home:~/tmp> uname -a
Darwin Macintosh-7.local 9.6.0 Darwin Kernel Version 9.6.0: Mon Nov 24
17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386 i386
Dieter Menne
2009-Jun-28 08:38 UTC
[R] multiple regression w/ no intercept; strange results
> It seems odd to me that dropping the intercept would cause the R^2
> and F stats to rise so dramatically, and the p value to consequently
> drop so much.

Removing the intercept can harm your sanity. See

http://markmail.org/message/q67jf7uaig7d4tkm

for an example.

Dieter
John Hunter
2009-Jun-29 13:39 UTC
[R] multiple regression w/ no intercept; strange results
On Sun, Jun 28, 2009 at 3:38 AM, Dieter Menne
<dieter.menne at menne-biomed.de> wrote:

> Removing the intercept can harm your sanity. See
>
> http://markmail.org/message/q67jf7uaig7d4tkm
>
> for an example.

I read the paper and the example, so thanks for sending those along.
The paper makes some good arguments from a modeling perspective for
keeping the intercept -- the most convincing to me is that you would
like the model to be robust to a location and scale transformation.

But my question was more numerical: in particular, the R^2 of the
model should equal the square of the correlation between the fitted
values and the actual values. It does with the intercept and does not
without it, as my code example shows. Am I correct in assuming these
should always be the same, and if they are not, does that reflect a
bug in R or perhaps a numerical instability?

You also wrote in your post "There are reasons why the standard
textbooks...". I read the reasons Venables addressed in his
"Exegeses", but none of them addresses my particular concern. Can you
elaborate on these or provide additional links?

Thanks!
JDH
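A short sketch of why the identity can fail, using the fitno and
dataset objects from the first message: the proof that R^2 equals
cor(fitted, y)^2 relies on the residuals summing to zero, which an OLS
fit guarantees only when the model contains an intercept. The two
quantities can be compared directly:

## R^2 == cor(fitted, y)^2 holds for OLS with an intercept because the
## residuals then sum to zero; without an intercept they generally do
## not, and the two quantities diverge
ypno <- fitted(fitno)
sum(residuals(fitno))                    # typically nonzero for fitno
cor(ypno, dataset$y)^2                   # squared correlation
1 - sum(residuals(fitno)^2) / sum(dataset$y^2)  # value summary.lm reports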