Please forgive a straight stats question, and the informal notation. Let us say we wish to perform a linear regression:

y = b0 + b1*x + b2*z

There are two ways this can be done: the usual way, as a single regression,

fit1 <- lm(y ~ x + z)

or by doing two regressions. In the first regression we could have y as the dependent variable and x as the independent variable:

fit2 <- lm(y ~ x)

The second regression would be one in which the residuals from the first regression are the dependent variable and z is the independent variable:

fit3 <- lm(fit2$residuals ~ z)

I would think the two methods would give the same p value and the same beta coefficient for z. They don't. Can someone help me understand why the two methods do not give the same results? Additionally, could someone tell me when one method might be better than the other, i.e. what question does the first method answer, and what question does the second method answer? I have searched a number of textbooks and have not found this question addressed.

Thanks,
John

John Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
Baltimore VA Medical Center GRECC and
University of Maryland School of Medicine Claude Pepper OAIC
University of Maryland School of Medicine, Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street, GRECC (BT/18/GR)
Baltimore, MD 21201-1524
410-605-7119
NOTE NEW EMAIL ADDRESS: jsorkin@grecc.umaryland.edu
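A minimal sketch that reproduces the discrepancy John describes, on simulated data. The data-generating values mirror Jari Oksanen's example later in the thread; the seed is an arbitrary added assumption for reproducibility:

set.seed(42)
x <- runif(100)
z <- x + rnorm(100, sd = 0.4)   # x and z correlated, i.e. not orthogonal
y <- 3 + x + z + rnorm(100, sd = 0.3)

fit1 <- lm(y ~ x + z)           # method 1: single regression
fit2 <- lm(y ~ x)               # method 2, step 1
fit3 <- lm(fit2$residuals ~ z)  # method 2, step 2

coef(fit1)["z"]                 # differs from ...
coef(fit3)["z"]                 # ... this, because z is correlated with x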
Bill.Venables@csiro.au
2005-Apr-06 03:25 UTC
[R] two methods for regression, two different results
This is possible if x and z are orthogonal, but in general it doesn't work, as you have noted. (If it did work, it would almost amount to a way of inverting general square matrices by working one row at a time, no going back...)

It is possible to fit a bivariate regression using simple linear regression techniques iteratively like this, but it is a bit more involved than your two-step process (a runnable sketch of these steps follows below):

1. Regress y on x and take the residuals:
   ryx <- resid(lm(y ~ x))

2. Regress z on x and take the residuals:
   rzx <- resid(lm(z ~ x))

3. Regress ryx on rzx:
   fitz <- lm(ryx ~ rzx)

4. This gives you the estimate of the coefficient on z (what you call b2):
   b2 <- coef(fitz)[2]

5. Regress y - b2*z on x:
   fitx <- lm(I(y - b2*z) ~ x)

This last step gets you the estimates of b0 and b1.

None of this works with significances, though, because in going about it this way you have essentially disguised the degrees of freedom involved. So you can get the right estimates, but the standard errors, t-statistics and residual variances are all somewhat inaccurate (though usually not by much). If x and z are orthogonal, the (curious-looking) step 2 is not needed.

This kind of idea lies behind some algorithms (e.g. Stevens' algorithm) for fitting very large regressions essentially by iterative processes, to avoid constructing a huge model matrix.

Bill Venables

-----Original Message-----
From: John Sorkin
Sent: Wednesday, 6 April 2005 12:55 PM
To: r-help at stat.math.ethz.ch
Subject: [R] two methods for regression, two different results

[original question snipped; quoted in full above]
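A self-contained sketch of the five steps above, on simulated data. The data-generating model mirrors Jari Oksanen's example below; the seed is an added assumption for reproducibility:

set.seed(1)
x <- runif(100)
z <- x + rnorm(100, sd = 0.4)    # x and z deliberately correlated
y <- 3 + x + z + rnorm(100, sd = 0.3)

ryx <- resid(lm(y ~ x))          # step 1: residuals of y on x
rzx <- resid(lm(z ~ x))          # step 2: residuals of z on x
fitz <- lm(ryx ~ rzx)            # step 3: regress ryx on rzx
b2 <- coef(fitz)[2]              # step 4: estimate of the z coefficient
fitx <- lm(I(y - b2 * z) ~ x)    # step 5: recovers b0 and b1

coef(lm(y ~ x + z))              # single-regression estimates ...
c(coef(fitx), b2 = unname(b2))   # ... match the iterative estimates

The point estimates agree; as Bill notes, the standard errors and p-values reported by fitz and fitx do not, because the degrees of freedom are disguised.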
On Tue, 2005-04-05 at 22:54 -0400, John Sorkin wrote:
> [original question snipped; quoted in full above]

John,

Bill Venables already told you why they don't give the same results: x and z are not orthogonal. Here is a simpler way of getting the same result as he suggested for the coefficient of z (but only for z):

> x <- runif(100)
> z <- x + rnorm(100, sd=0.4)
> y <- 3 + x + z + rnorm(100, sd=0.3)
> mod <- lm(y ~ x + z)
> mod2 <- lm(residuals(lm(y ~ x)) ~ x + z)
> summary(mod)

Call:
lm(formula = y ~ x + z)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.96436    0.06070   48.836  < 2e-16 ***
x            0.96272    0.11576    8.317 5.67e-13 ***
z            1.08922    0.06711   16.229  < 2e-16 ***
---
Residual standard error: 0.2978 on 97 degrees of freedom

> summary(mod2)

Call:
lm(formula = residuals(lm(y ~ x)) ~ x + z)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15731    0.06070   -2.592   0.0110 *
x           -0.84459    0.11576   -7.296 8.13e-11 ***
z            1.08922    0.06711   16.229  < 2e-16 ***
---
Residual standard error: 0.2978 on 97 degrees of freedom

You can omit x from the outer lm only if x and z are orthogonal, even though you already removed the effect of x; in the orthogonal case the coefficient for x would be 0. The residuals are equal in these two models:

> range(residuals(mod) - residuals(mod2))
[1] -2.797242e-17  5.551115e-17

But, of course, the fitted values are not equal, since you fit mod2 to the residuals after removing the effect of x...

cheers, jari oksanen
--
Jari Oksanen <jarioksa at sun3.oulu.fi>
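A sketch of the orthogonal case mentioned above. Here z is made exactly orthogonal to x by construction (by residualizing noise on x); the variable names, seed, and data-generating values are illustrative assumptions:

set.seed(2)
x <- runif(100)
z <- resid(lm(rnorm(100) ~ x))        # z exactly orthogonal to x by construction
y <- 3 + x + z + rnorm(100, sd = 0.3)

coef(lm(y ~ x + z))["z"]              # full regression
coef(lm(resid(lm(y ~ x)) ~ z))["z"]   # John's two-step method: same estimate

The two estimates now agree exactly; the p-values still differ slightly, because the two-step fit has one extra residual degree of freedom, which is Bill's point about disguised degrees of freedom.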