Etienne Toffin
2007-Oct-18 11:02 UTC
[R] R-squared value for linear regression passing through origin using lm()
Hi, I have a small technical question about the calculation of R-squared by lm(). In a study with experimental values, it seems more logical to force the regression line to pass through the origin with lm(y ~ x + 0). However, the R-squared values are higher in this case than when I compute the linear regression with lm(y ~ x). This is surprising to me: is this result normal? Is there a problem with the R-squared value calculated in this case? Etienne Toffin
Duncan Murdoch
2007-Oct-18 11:18 UTC
[R] R-squared value for linear regression passing through origin using lm()
On 18/10/2007 7:02 AM, Etienne Toffin wrote:
> Hi, I have a small technical question about the calculation of R-squared
> by lm(). In a study with experimental values, it seems more logical to
> force the regression line to pass through the origin with lm(y ~ x + 0).
> However, the R-squared values are higher in this case than when I compute
> the linear regression with lm(y ~ x). This is surprising to me: is this
> result normal? Is there a problem with the R-squared value calculated in
> this case?

The definition is different in that case. It's the proportion of variation about 0, instead of the proportion of variation about the mean.

Duncan Murdoch
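To see the two definitions side by side, here is a minimal R sketch with invented data (not from the original post). For the no-intercept fit, summary.lm() measures variation about zero rather than about the mean, so its R-squared usually comes out higher even though the fit itself is no better:

set.seed(42)                                       # made-up illustration data
x <- 1:10
y <- 10 + 0.1 * x + rnorm(10)                      # weak slope, mean of y far from zero

fit1 <- lm(y ~ x)                                  # with intercept
fit0 <- lm(y ~ x + 0)                              # forced through the origin

summary(fit1)$r.squared                            # proportion of variation about the mean
summary(fit0)$r.squared                            # proportion of variation about zero; typically much higher here

# The same values recomputed by hand from the two definitions:
1 - sum(residuals(fit1)^2) / sum((y - mean(y))^2)
1 - sum(residuals(fit0)^2) / sum(y^2)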
S Ellison
2007-Oct-18 13:00 UTC
[R] R-squared value for linear regression passing through origin using lm()
> I think there is reason to be surprised, I am, too. ...
> What am I missing?

Read the formula and ?summary.lm more closely. The denominator, Sum((y[i] - y*)^2), is very large if the mean value of y is substantially nonzero and y* is set to 0, as the calculation implies for a forced zero intercept. In effect, the calculation gives the fraction of the sum of squared deviations from the mean for the case with an intercept, but the fraction of the sum of squared y ('about' zero) for the no-intercept case. This is surprising if you automatically assume that a better R^2 means a better fit. I guess that explains why statisticians tell you not to use R^2 as a goodness-of-fit indicator.

>>> Ralf Goertz <R_Goertz at web.de> 18/10/2007 13:11:55 >>>
>> r.squared: R^2, the 'fraction of variance explained by the model',
>>
>>     R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2),
>>
>> where y* is the mean of y[i] if there is an intercept and
>> zero otherwise.

Ralf
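A tiny numeric sketch of that point (invented numbers, not from the thread): when the mean of y is far from zero, the 'about zero' denominator dwarfs the 'about the mean' denominator, so 1 - RSS/denominator is pushed towards 1 for the forced-zero-intercept fit:

y <- c(98, 101, 103, 99, 104)     # hypothetical responses with a large mean
sum((y - mean(y))^2)              # denominator with an intercept: 26
sum((y - 0)^2)                    # denominator with the intercept forced to zero: 51031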
Ralf Goertz
2007-Oct-19 07:51 UTC
[R] R-squared value for linear regression passing through origin using lm()
Berwin A Turlach, Thursday, 18 October 2007:

> G'day all,
>
> I must admit that I have not read the previous e-mails in this thread,
> but why should that stop me from commenting? ;-)

Your comments are very welcome.

> On Thu, 18 Oct 2007 16:17:38 +0200
> Ralf Goertz <R_Goertz at web.de> wrote:
>
> > But in that case the numerator is very large, too, isn't it?
>
> Not necessarily.
>
> > I don't want to argue, though.
>
> Good, you might lose the argument. :)

Yes, I admit I lost. :-(

> > But so far, I have not managed to create a dataset where R^2 is
> > larger for the model with forced zero intercept (although I have not
> > tried very hard). It would be very convincing to see one (Etienne?)
>
> Indeed, you haven't tried hard. It is not difficult. Here are my
> canonical commands to convince people why regression through the
> origin is evil; the pictures should illustrate what is going on:
>
> [example snipped]

Thanks to Thomas Lumley there is another convincing example. But I still have a problem with it:

> x <- c(2,3,4); y <- c(2,3,3)
> 1 - 2*var(residuals(lm(y ~ x + 1))) / sum((y - mean(y))^2)
[1] 0.75

That's okay, but neither

> 1 - 3*var(residuals(lm(y ~ x + 0))) / sum((y - 0)^2)
[1] 0.97076

nor

> 1 - 2*var(residuals(lm(y ~ x + 0))) / sum((y - 0)^2)
[1] 0.9805066

gives the result of summary(lm(y ~ x + 0)), which is 0.9796.

> > IIRC, I have not been told so. Perhaps my teachers were not as good
> > as they should have been. So what is R^2 good for, if not to indicate
> > goodness of fit?
>
> I am wondering about that too sometimes. :) I was always wondering
> that R^2 was described to me by my lecturers as the square of the
> correlation between the x and the y variate. But on the other hand,
> they pretended that x was fixed and selected by the experimenter (or
> should be regarded as such). If x is fixed and y is random, then it
> does not make sense to me to speak about a correlation between x and y
> (at least not on the population level).

I see the point. But I was raised with that description, too, and it's hard to drop the idea.

> My best guess at the moment is that R^2 was adopted by users of
> statistics before it was properly understood; and by the time it was
> properly understood, it was too much entrenched to abandon it. Try not
> to teach it these days and see what your "client faculties" will tell
> you.

In order to save the role of R^2 as a goodness-of-fit indicator in zero-intercept models one could use the same formula as in models with a constant. I mean, if R^2 is the proportion of variance explained by the model, we should use the a priori variance of y[i]:

> 1 - var(residuals(lm(y ~ x + 0))) / var(y)
[1] 0.3567182

But I assume that this has probably been discussed at length somewhere more appropriate than r-help.

Thanks, Ralf
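One way to reconcile those numbers (a sketch of my own, not from the thread): for the zero-intercept fit the residuals do not sum to zero, so a multiple of var(residuals) is not the residual sum of squares; using the raw sum of squared residuals reproduces the value reported by summary():

x <- c(2, 3, 4); y <- c(2, 3, 3)
fit0 <- lm(y ~ x + 0)
mean(residuals(fit0))                     # about 0.080, not zero, which is why var() misleads here
1 - sum(residuals(fit0)^2) / sum(y^2)     # about 0.9796, matching summary(fit0)$r.squared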
S Ellison
2007-Oct-19 11:31 UTC
[R] R-squared value for linear regression passing through origin using lm()
>> I guess that explains why statisticians tell you not to use
>> R^2 as a goodness-of-fit indicator.

> IIRC, I have not been told so. Perhaps my teachers were not as good as
> they should have been.

I couldn't possibly comment ;-)

> So what is R^2 good for, if not to indicate goodness of fit?

Broadly speaking, a low R^2 is an indicator of poor fit for a linear model. The problem with it is that a relatively high R^2 can be achieved in a variety of pathological cases as well as in healthy cases with good fit.

The most common example in my field is simple linear regression for instrument calibration. If the independent variable values are well chosen, so that they are distributed more or less evenly through the calibration range, and an intercept is included, a very high R^2 (0.999 and above) is a fairly reliable indication of a good fit and a usable calibration, and a poor value (0.9 or below) usually indicates a problem. Pathological cases include poorly distributed data (two distinct small clouds of observations give a high R^2; a sketch follows below) and, as you have found, eliminating the intercept, especially when it is large.

The other criticisms of R^2, or of the related Pearson correlation R, tend to revolve around the fact that low values of R or R^2 imply a lack of a _linear_ relationship, but that does not necessarily mean there is no relationship. Personally, I don't often see that as a problem with decent graphics - but it certainly was a problem on old instruments that simply printed the intercept, gradient, residual sd and R^2 value on a slip of thermal paper as the only indication of fit, and it can also be a problem in multivariate cases when inspection is not so simple.

So, as we teach it, it has a use, but like a lot of other indicators it's something you use with caution and not in isolation.

Steve E
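A short sketch of the 'two clouds' pathology mentioned above (simulated data, purely illustrative): two tight clusters far apart give a high R^2 even though y is unrelated to x within either cluster:

set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))   # two well-separated clusters of x values
y <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))   # y unrelated to x within each cluster
summary(lm(y ~ x))$r.squared                        # typically around 0.9 or higher
plot(x, y); abline(lm(y ~ x))                       # a plot makes the problem obvious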