Etienne Toffin
2007-Oct-18 11:02 UTC
[R] R-squared value for linear regression passing through origin using lm()
Hi, I have a small technical question about the calculation of R-squared by lm(). In a study with experimental values, it seems more logical to force the regression line to pass through the origin with lm(y ~ x + 0). However, the R-squared values are higher in this case than when I compute the linear regression with lm(y ~ x). This is surprising to me: is this result normal? Is there a problem with the R-squared value calculated in this case? Etienne Toffin
Duncan Murdoch
2007-Oct-18 11:18 UTC
[R] R-squared value for linear regression passing through origin using lm()
On 18/10/2007 7:02 AM, Etienne Toffin wrote:
> Hi, I have a small technical question about the calculation of R-squared
> by lm(). In a study with experimental values, it seems more logical to
> force the regression line to pass through the origin with lm(y ~ x + 0).
> However, the R-squared values are higher in this case than when I compute
> the linear regression with lm(y ~ x). This is surprising to me: is this
> result normal? Is there a problem with the R-squared value calculated in
> this case?

The definition is different in that case. It's the proportion of variation about 0, instead of the proportion of variation about the mean.

Duncan Murdoch
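To see the two definitions side by side, here is a minimal R sketch with invented data (not from the original post). For the no-intercept fit, summary.lm() measures variation about zero rather than about the mean, so its R-squared usually comes out higher even though the fit itself is no better:

set.seed(42)                                       # made-up illustration data
x <- 1:10
y <- 10 + 0.1 * x + rnorm(10)                      # weak slope, mean of y far from zero

fit1 <- lm(y ~ x)                                  # with intercept
fit0 <- lm(y ~ x + 0)                              # forced through the origin

summary(fit1)$r.squared                            # proportion of variation about the mean
summary(fit0)$r.squared                            # proportion of variation about zero; typically much higher here

# The same values recomputed by hand from the two definitions:
1 - sum(residuals(fit1)^2) / sum((y - mean(y))^2)
1 - sum(residuals(fit0)^2) / sum(y^2)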
S Ellison
2007-Oct-18 13:00 UTC
[R] R-squared value for linear regression passing through origin using lm()
> I think there is reason to be surprised, I am, too. ...
> What am I missing?

Read the formula and ?summary.lm more closely. The denominator, Sum((y[i] - y*)^2), is very large if the mean value of y is substantially nonzero and y* is set to 0, as the calculation implies for a forced zero intercept. In effect, the calculation gives the fraction of the sum of squared deviations from the mean for the case with an intercept, but the fraction of the sum of squared y ('about' zero) for the no-intercept case. This is surprising if you automatically assume that a better R^2 means a better fit. I guess that explains why statisticians tell you not to use R^2 as a goodness-of-fit indicator.

>>> Ralf Goertz <R_Goertz at web.de> 18/10/2007 13:11:55 >>>
>> r.squared: R^2, the 'fraction of variance explained by the model',
>>
>>     R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2),
>>
>> where y* is the mean of y[i] if there is an intercept and
>> zero otherwise.

Ralf
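A tiny numeric sketch of that point (invented numbers, not from the thread): when the mean of y is far from zero, the 'about zero' denominator dwarfs the 'about the mean' denominator, so 1 - RSS/denominator is pushed towards 1 for the forced-zero-intercept fit:

y <- c(98, 101, 103, 99, 104)     # hypothetical responses with a large mean
sum((y - mean(y))^2)              # denominator with an intercept: 26
sum((y - 0)^2)                    # denominator with the intercept forced to zero: 51031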
Ralf Goertz
2007-Oct-19 07:51 UTC
[R] R-squared value for linear regression passing through origin using lm()
Berwin A Turlach, Thursday, 18 October 2007:

> G'day all,
>
> I must admit that I have not read the previous e-mails in this thread,
> but why should that stop me from commenting? ;-)

Your comments are very welcome.

> On Thu, 18 Oct 2007 16:17:38 +0200
> Ralf Goertz <R_Goertz at web.de> wrote:
>
> > But in that case the numerator is very large, too, isn't it?
>
> Not necessarily.
>
> > I don't want to argue, though.
>
> Good, you might lose the argument. :)

Yes, I admit I lost. :-(

> > But so far, I have not managed to create a dataset where R^2 is
> > larger for the model with forced zero intercept (although I have not
> > tried very hard). It would be very convincing to see one (Etienne?)
>
> Indeed, you haven't tried hard. It is not difficult. Here are my
> canonical commands to convince people why regression through the
> origin is evil; the pictures should illustrate what is going on:
>
> [example snipped]

Thanks to Thomas Lumley there is another convincing example. But I still have a problem with it:

> x <- c(2,3,4); y <- c(2,3,3)
> 1 - 2*var(residuals(lm(y ~ x + 1))) / sum((y - mean(y))^2)
[1] 0.75

That's okay, but neither

> 1 - 3*var(residuals(lm(y ~ x + 0))) / sum((y - 0)^2)
[1] 0.97076

nor

> 1 - 2*var(residuals(lm(y ~ x + 0))) / sum((y - 0)^2)
[1] 0.9805066

gives the result of summary(lm(y ~ x + 0)), which is 0.9796.

> > IIRC, I have not been told so. Perhaps my teachers were not as good
> > as they should have been. So what is R^2 good for, if not to indicate
> > goodness of fit?
>
> I am wondering about that too sometimes. :) I was always wondering
> that R^2 was described to me by my lecturers as the square of the
> correlation between the x and the y variate. But on the other hand,
> they pretended that x was fixed and selected by the experimenter (or
> should be regarded as such). If x is fixed and y is random, then it
> does not make sense to me to speak about a correlation between x and y
> (at least not on the population level).

I see the point. But I was raised with that description, too, and it's hard to drop the idea.

> My best guess at the moment is that R^2 was adopted by users of
> statistics before it was properly understood; and by the time it was
> properly understood, it was too much entrenched to abandon it. Try not
> to teach it these days and see what your "client faculties" will tell
> you.

In order to save the role of R^2 as a goodness-of-fit indicator in zero-intercept models one could use the same formula as in models with a constant. I mean, if R^2 is the proportion of variance explained by the model, we should use the a priori variance of y[i]:

> 1 - var(residuals(lm(y ~ x + 0))) / var(y)
[1] 0.3567182

But I assume that this has probably been discussed at length somewhere more appropriate than r-help.

Thanks, Ralf
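One way to reconcile those numbers (a sketch of my own, not from the thread): for the zero-intercept fit the residuals do not sum to zero, so a multiple of var(residuals) is not the residual sum of squares; using the raw sum of squared residuals reproduces the value reported by summary():

x <- c(2, 3, 4); y <- c(2, 3, 3)
fit0 <- lm(y ~ x + 0)
mean(residuals(fit0))                     # about 0.080, not zero, which is why var() misleads here
1 - sum(residuals(fit0)^2) / sum(y^2)     # about 0.9796, matching summary(fit0)$r.squared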
S Ellison
2007-Oct-19 11:31 UTC
[R] R-squared value for linear regression passing through origin using lm()
>> I guess that explains why statisticians tell you not to use
>> R^2 as a goodness-of-fit indicator.

> IIRC, I have not been told so. Perhaps my teachers were not as good as
> they should have been.

I couldn't possibly comment ;-)

> So what is R^2 good for, if not to indicate goodness of fit?

Broadly speaking, a low R^2 is an indicator of poor fit for a linear model. The problem with it is that a relatively high R^2 can be achieved in a variety of pathological cases as well as in healthy cases with good fit.

The most common example in my field is simple linear regression for instrument calibration. If the independent variable values are well chosen, so that they are distributed more or less evenly through the calibration range, and an intercept is included, a very high R^2 (0.999 and above) is a fairly reliable indication of a good fit and a usable calibration, and a poor value (0.9 or below) usually indicates a problem. Pathological cases include poorly distributed data (two distinct small clouds of observations give a high R^2; a sketch follows below) and, as you have found, eliminating the intercept, especially when it is large.

The other criticisms of R^2, or of the related Pearson correlation R, tend to revolve around the fact that low values of R or R^2 imply a lack of a _linear_ relationship, but that does not necessarily mean there is no relationship. Personally, I don't often see that as a problem with decent graphics - but it certainly was a problem on old instruments that simply printed the intercept, gradient, residual sd and R^2 value on a slip of thermal paper as the only indication of fit, and it can also be a problem in multivariate cases when inspection is not so simple.

So, as we teach it, it has a use, but like a lot of other indicators it's something you use with caution and not in isolation.

Steve E
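A short sketch of the 'two clouds' pathology mentioned above (simulated data, purely illustrative): two tight clusters far apart give a high R^2 even though y is unrelated to x within either cluster:

set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))   # two well-separated clusters of x values
y <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))   # y unrelated to x within each cluster
summary(lm(y ~ x))$r.squared                        # typically around 0.9 or higher
plot(x, y); abline(lm(y ~ x))                       # a plot makes the problem obvious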