Hi,

the standard errors of the coefficients in two regressions that I computed
by hand and using lm() differ by about 1%. Can somebody help me to identify
the source of this difference? The coefficient estimates are the same, but
the standard errors differ.

####Simulate data

happiness=0
income=0
gender=rep(c(0,1,1,0),25)
for(i in 1:100){
happiness[i]=1000+i+rnorm(1,0,40)
income[i]=2*i+rnorm(1,0,40)
}

####Run lm()

reg=lm(happiness~income+factor(gender))
summary(reg)

####Compute coefficient estimates "by hand"

x=cbind(income,gender)
y=happiness

z=as.matrix(cbind(rep(1,100),x))
beta=solve(t(z)%*%z)%*%t(z)%*%y

####Compare estimates

cbind(reg$coef,beta)    ##fine so far, they both look the same

reg$coef[1]-beta[1]
reg$coef[2]-beta[2]
reg$coef[3]-beta[3]     ##differences are too small to cause a 1% difference

####Check predicted values

estimates=c(beta[2],beta[3])

predicted=estimates%*%t(x)
predicted=as.vector(t(as.double(predicted+beta[1])))

cbind(reg$fitted,predicted)        ##inspect fitted values
as.vector(reg$fitted-predicted)    ##differences are marginal

####Compute errors

e=NA
e2=NA
for(i in 1:length(happiness)){
e[i]=y[i]-predicted[i]      ##for "hand-computed" regression
e2[i]=y[i]-reg$fitted[i]    ##for lm() regression
}

####Compute standard error of the coefficients

sqrt(abs(var(e)*solve(t(z)%*%z)))     ##for "hand-computed" regression
sqrt(abs(var(e2)*solve(t(z)%*%z)))    ##for lm() regression using e2 from above

##Both are the same

####Compare to lm() standard errors of the coefficients again

summary(reg)

The diagonal elements of the variance/covariance matrices should be the
standard errors of the coefficients. Both are identical when computed by
hand. However, they differ from the standard errors reported in
summary(reg). The difference of 1% seems non-marginal. Should I have
multiplied var(e)*solve(t(z)%*%z) by n and divided by n-1? But even if I
do that, I still observe a difference. Can anybody help me figure out
what the source of this difference is?

Cheers,
Daniel

-------------------------
cuncta stricte discussurus
Daniel Malter wrote:
> [original message trimmed; see above]
>
> Should I have multiplied var(e)*solve(t(z)%*%z) by n and divided by n-1?
> But even if I do this, I still observe a difference.

The residual degrees of freedom in a regression analysis are n minus the
number of estimated parameters, three in your case. The variance var(e)
knows nothing about this and divides by n-1 where it should have divided
by n-3, so..... (see the sketch below).

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
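[A minimal sketch of Peter's fix, reusing Daniel's z and e from above
(n = 100, p = 3); the only change is the divisor of the residual variance:]

n <- length(e)                        ## 100 observations
p <- ncol(z)                          ## 3 estimated parameters (intercept, income, gender)
s2 <- sum(e^2)/(n - p)                ## residual variance with divisor n-3, not n-1
sqrt(diag(s2 * solve(t(z) %*% z)))    ## should now match summary(reg)'s standard errors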
Please check your statistical methods lecture notes. var(e) has divisor
n-1, and that is not an unbiased estimator of the residual variance when
'e' are residuals. From summary.lm (and you are allowed to read the code):

    rdf <- n - p
    if (is.na(z$df.residual) || rdf != z$df.residual)
        warning("residual degrees of freedom in object suggest this is not an \"lm\" fit")
    p1 <- 1:p
    r <- z$residuals
    f <- z$fitted.values
    w <- z$weights
    if (is.null(w)) {
        mss <- if (attr(z$terms, "intercept"))
            sum((f - mean(f))^2)
        else sum(f^2)
        rss <- sum(r^2)
    }
    else {
        mss <- if (attr(z$terms, "intercept")) {
            m <- sum(w * f/sum(w))
            sum(w * (f - m)^2)
        }
        else sum(w * f^2)
        rss <- sum(w * r^2)
        r <- sqrt(w) * r
    }
    resvar <- rss/rdf

The correct divisor is n-p. Since p=3 in your example, that explains a 2%
difference in variances and hence a 1% difference in SEs.

On Tue, 26 Feb 2008, Daniel Malter wrote:
> [original message trimmed; see above]

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
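[A quick check of the summary.lm excerpt above against summary(reg)'s own
numbers, reusing Daniel's reg and z; resvar is computed as rss/rdf exactly
as in the quoted source:]

r <- reg$residuals
rdf <- reg$df.residual                        ## n - p = 97 here
resvar <- sum(r^2)/rdf                        ## rss/rdf, as in summary.lm
se <- sqrt(diag(resvar * solve(t(z) %*% z)))
cbind(se, summary(reg)$coefficients[, "Std. Error"])   ## the two columns agree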
Try multiplying var(e) by n-1 and dividing by n-3 (n-3 = 97 is the residual
degrees of freedom for the error sum of squares in your example). Then it
all matches up nicely; see the sketch below.

Best,

--
Wolfgang Viechtbauer
 Department of Methodology and Statistics
 University of Maastricht, The Netherlands
 http://www.wvbauer.com/

----Original Message----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Daniel Malter
Sent: Tuesday, February 26, 2008 08:25
To: 'r-help'
Subject: [R] OLS standard errors

> [original message trimmed; see above]
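[Concretely, Wolfgang's rescaling converts var(e)'s n-1 divisor into the
n-3 divisor that lm() uses; a sketch with Daniel's e and z (n = 100):]

n <- length(e)
sqrt(diag(var(e) * (n - 1)/(n - 3) * solve(t(z) %*% z)))
## var(e)*(n-1) recovers sum(e^2), since residuals from a regression with
## an intercept have mean ~0; dividing by n-3 then reproduces the standard
## errors reported by summary(reg)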