Hi,

I am not a statistics expert, so I have this question. A linear model
gives me the following summary:

Call:
lm(formula = N ~ N_alt)

Residuals:
    Min      1Q  Median      3Q     Max
-110.30  -35.80  -22.77   38.07  122.76

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.5177   229.0764   0.059   0.9535
N_alt         0.2832     0.1501   1.886   0.0739 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.77 on 20 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared: 0.151,     Adjusted R-squared: 0.1086
F-statistic: 3.558 on 1 and 20 DF,  p-value: 0.07386

The regression is not very good (high p-value, low R-squared). The Pr
value for the intercept seems to indicate that it is zero with a very
high probability (95.35%). So I repeat the regression forcing the
intercept to zero:

Call:
lm(formula = N ~ N_alt - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-110.11  -36.35  -22.13   38.59  123.23

Coefficients:
      Estimate Std. Error t value Pr(>|t|)
N_alt 0.292046   0.007742   37.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.41 on 21 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared: 0.9855,    Adjusted R-squared: 0.9848
F-statistic: 1423 on 1 and 21 DF,  p-value: < 2.2e-16

1. Is my interpretation correct?
2. Is it possible that just by forcing the intercept to become zero, a
   bad regression becomes an extremely good one?
3. Why doesn't lm suggest a value of zero (or near zero) by itself if
   the regression is so much better with it?

Please excuse my ignorance.

Jan Rheinländer
On Fri, 18 Feb 2011, Jan wrote:

> Hi,
>
> I am not a statistics expert, so I have this question. A linear model
> gives me the following summary:
>
> Call:
> lm(formula = N ~ N_alt)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -110.30  -35.80  -22.77   38.07  122.76
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  13.5177   229.0764   0.059   0.9535
> N_alt         0.2832     0.1501   1.886   0.0739 .
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 56.77 on 20 degrees of freedom
>   (16 observations deleted due to missingness)
> Multiple R-squared: 0.151,     Adjusted R-squared: 0.1086
> F-statistic: 3.558 on 1 and 20 DF,  p-value: 0.07386
>
> The regression is not very good (high p-value, low R-squared).

Yes.

> The Pr value for the intercept seems to indicate that it is zero with a
> very high probability (95.35%).

Not quite. Consult your statistics textbook for the correct
interpretation of p-values. Under the null hypothesis of a true
intercept of zero, it is very likely to observe an estimated intercept
at least as far from zero as 13.52, given its standard error of about
229.

> So I repeat the regression forcing the intercept to zero:

Do you have a good interpretation for that?

> Call:
> lm(formula = N ~ N_alt - 1)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -110.11  -36.35  -22.13   38.59  123.23
>
> Coefficients:
>       Estimate Std. Error t value Pr(>|t|)
> N_alt 0.292046   0.007742   37.72   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 55.41 on 21 degrees of freedom
>   (16 observations deleted due to missingness)
> Multiple R-squared: 0.9855,    Adjusted R-squared: 0.9848
> F-statistic: 1423 on 1 and 21 DF,  p-value: < 2.2e-16
>
> 1. Is my interpretation correct?
> 2. Is it possible that just by forcing the intercept to become zero, a
>    bad regression becomes an extremely good one?
> 3. Why doesn't lm suggest a value of zero (or near zero) by itself if
>    the regression is so much better with it?

The model without an intercept needs to be interpreted differently. Its
F-statistic and p-value compare the regression with intercept zero and
slope 0.292 against a model with both intercept zero and slope zero. If
I had to guess, I would say this is not a very meaningful comparison for
your data. The same is true for the R-squared (see also ?summary.lm for
its definition in the case without an intercept).

hth,
Z
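To see concretely which two models each F-test compares, the statistics
in both summaries can be reproduced by explicit model comparisons. A
minimal sketch, assuming the data sit in a (hypothetical) data frame d
with columns N and N_alt (names taken from the posted output), reduced
to complete cases so that every fit uses the same rows:

d <- na.omit(d[, c("N", "N_alt")])

m_null  <- lm(N ~ 1, data = d)          # intercept only: predicts mean(N)
m_full  <- lm(N ~ N_alt, data = d)      # intercept + slope
anova(m_null, m_full)                   # should reproduce F = 3.558 on 1 and 20 DF

m_empty <- lm(N ~ 0, data = d)          # no terms at all: always predicts 0
m_slope <- lm(N ~ N_alt - 1, data = d)  # slope only, intercept forced to zero
anova(m_empty, m_slope)                 # should reproduce F = 1423 on 1 and 21 DF

The second comparison is against a model that predicts 0 for every
observation, which is why its F-statistic (and R-squared) can look
spectacular even when the fit itself is mediocre.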
Hi:

On Fri, Feb 18, 2011 at 2:49 AM, Jan <jrheinlaender@gmx.de> wrote:

> Hi,
>
> I am not a statistics expert, so I have this question. A linear model
> gives me the following summary:
>
> Call:
> lm(formula = N ~ N_alt)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -110.30  -35.80  -22.77   38.07  122.76
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  13.5177   229.0764   0.059   0.9535
> N_alt         0.2832     0.1501   1.886   0.0739 .
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 56.77 on 20 degrees of freedom
>   (16 observations deleted due to missingness)
> Multiple R-squared: 0.151,     Adjusted R-squared: 0.1086
> F-statistic: 3.558 on 1 and 20 DF,  p-value: 0.07386
>
> The regression is not very good (high p-value, low R-squared).
> The Pr value for the intercept seems to indicate that it is zero with a
> very high probability (95.35%). So I repeat the regression forcing the
> intercept to zero:

That's not the interpretation of a p-value. What it means is: *given
that the null hypothesis beta0 = 0 is true*, the probability of
observing a value of the t-statistic *more extreme (in absolute value)
than the observed value of 0.059* is about 0.9535. The presumption that
H_0 is true for the purpose of the test allows one to derive a
'reference distribution' (in this case, the t-distribution with the
error degrees of freedom) against which one can compare the observed
value of the t-statistic. The second emphasized phrase gives the context
in which the p-value is correctly interpreted relative to that reference
distribution when H_0 is true.

You're evidently trying to interpret the p-value as the probability that
the null hypothesis is true. No. Given the magnitude of the p-value, you
can conclude, however, that there is not enough sample evidence to
contradict the null hypothesis beta0 = 0.

> Call:
> lm(formula = N ~ N_alt - 1)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -110.11  -36.35  -22.13   38.59  123.23
>
> Coefficients:
>       Estimate Std. Error t value Pr(>|t|)
> N_alt 0.292046   0.007742   37.72   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 55.41 on 21 degrees of freedom
>   (16 observations deleted due to missingness)
> Multiple R-squared: 0.9855,    Adjusted R-squared: 0.9848
> F-statistic: 1423 on 1 and 21 DF,  p-value: < 2.2e-16
>
> 1. Is my interpretation correct?
> 2. Is it possible that just by forcing the intercept to become zero, a
>    bad regression becomes an extremely good one?

No.

> 3. Why doesn't lm suggest a value of zero (or near zero) by itself if
>    the regression is so much better with it?

Because computer programs don't read minds. You may want a zero
intercept; someone else may not. And your perception that the
'regression is so much better' with a zero intercept is in error. If you
plotted your data, you would realize that whether you fit the 'best'
least squares model or one with a zero intercept, the fit is not going
to be very good, and you would have deduced that the 0.985 R^2 returned
from the no-intercept model is an illusion. It is mathematically
correct, however, given the linear model theory behind it and the
definition of R^2 as the ratio of the model sum of squares (SS) to the
total SS.

If you want to have more fun, sum the residuals from the zero-intercept
fit, and then ask yourself why they don't add to zero. You need to
educate yourself on the difference between regression with and without
intercepts.
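That residual-sum check is quick to do. A minimal sketch, assuming N and
N_alt are the vectors from the original post (missing values are dropped
by lm's default na.action):

fit  <- lm(N ~ N_alt)      # with intercept
fit0 <- lm(N ~ N_alt - 1)  # intercept forced to zero

sum(residuals(fit))        # essentially zero: the fitted intercept absorbs any constant offset
sum(residuals(fit0))       # generally nonzero: nothing constrains these residuals to balance out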
In particular, the R^2 in the with-intercept model uses mean corrections
before computing sums of squares; in the no-intercept model, mean
corrections are not applied. Since R^2 is a ratio of sums of squares,
this distinction matters. (If my use of 'mean correction' is confusing:
Y is not mean-corrected, but Y - Ybar is. Ditto for X.)

Try this:

plot(N_alt, N, pch = 16)
abline(coef(lm(N ~ N_alt)))
abline(c(0, coef(lm(N ~ N_alt + 0))), lty = 'dashed')

Do the data cluster tightly around the dashed line?

HTH,
Dennis

PS: A Google search on 'linear regression zero intercept' might be
beneficial. Here are a couple of hits from such a search:
http://www.bios.unc.edu/~truong/b663/pdf/noint.pdf
http://tltc.ttu.edu/cs/colleges__schools/rawls_college_of_business/f/42/p/288/470.aspx
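The mean-correction point can also be seen directly in numbers. A sketch
of how the two R-squared values in the summaries arise (definitions as
described in ?summary.lm), again assuming N and N_alt are the posted
vectors:

fit  <- lm(N ~ N_alt)                   # with intercept
fit0 <- lm(N ~ N_alt - 1)               # without intercept (same complete cases)

y <- model.response(model.frame(fit))   # the response values actually used

1 - sum(residuals(fit)^2)  / sum((y - mean(y))^2)  # mean-corrected total SS:   ~0.151
1 - sum(residuals(fit0)^2) / sum(y^2)              # uncorrected total SS:      ~0.9855

The denominator in the second line is much larger than the residual sum
of squares whenever the observations sit far from zero, which is how a
poor fit can still produce an R-squared near 1.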
No, this is a cute problem, though: the definition of R^2 changes
without the intercept, because the "empty" model used for calculating
the total sums of squares is always predicting 0 (so the total sums of
squares are sums of squares of the observations themselves, without
centering around the sample mean).

Your interpretation of the p-value for the intercept in the first model
is also backwards: 0.9535 is extremely weak evidence against the
hypothesis that the intercept is 0. That is, the intercept might be near
zero, but it could also be something very different. With a standard
error of 229, your 95% confidence interval for the intercept (if you
trusted it based on other things) would have a margin of error of well
over 400. If you told me that an intercept of, say, 350 or 400 were
consistent with your knowledge of the problem, I wouldn't blink.

This is a very small data set: if you sent an R command such as

x <- c(x1, x2, ..., xn)
y <- c(y1, y2, ..., yn)

you might even get some more interesting feedback. One of the many good
intro stats textbooks might also be helpful as you get up to speed.

Jay

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
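The margin-of-error remark can be checked directly. A minimal sketch,
assuming N and N_alt are the posted vectors, so that lm(N ~ N_alt)
reproduces the first summary (20 residual degrees of freedom):

fit <- lm(N ~ N_alt)
qt(0.975, df = 20) * 229.0764    # about 478: the "well over 400" margin of error
confint(fit)["(Intercept)", ]    # the corresponding 95% interval for the intercept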