Samuel Le
2011-Aug-01 13:27 UTC
[R] formula used by R to compute the t-values in a linear regression
Hello,

I was wondering if someone knows the formula used by the function lm to compute the t-values.

I am trying to implement a linear regression myself. Assuming that I have K variables, and N observations, the formula I am using is:

For the k-th variable,

    t-value = b_k/sigma_k

With b_k is the coefficient for the k-th variable, and sigma_k = (t(x) x)^(-1)_kk is its standard deviation.

I find

    sigma_k = sigma * n/(n*Sum x_{k,i}^2 - (sum x_{k,i}^2))

With sigma: the estimated standard deviation of the residuals,

    Sigma = sqrt(1/(N-K-1) * Sum epsilon_i^2)

With:
N: number of observations
K: number of variables

This formula comes from my old course of econometrics.

For some reason it doesn't match the t-value produced by R (I am off by about 1%). I can match the other results produced by R (coefficients of the regression, r squared, etc.).

I would be grateful if someone could provide some clarifications.

Samuel
David Winsemius
2011-Aug-01 13:44 UTC
On Aug 1, 2011, at 9:27 AM, Samuel Le wrote:

> Hello,
>
> I was wondering if someone knows the formula used by the function lm
> to compute the t-values.
>
> I am trying to implement a linear regression myself. Assuming that I
> have K variables, and N observations, the formula I am using is:
>
> For the k-th variable, t-value= b_k/sigma_k
>
> With b_k is the coefficient for the k-th variable, and sigma_k
> =(t(x) x )^(-1) _kk is its standard deviation.
>
> I find sigma_k = sigma * n/(n*Sum x_{k,i}^2 -(sum x_{k,i}^2))
>
> With sigma: the estimated standard deviation of the residuals,
>
> Sigma = sqrt(1/(N-K-1)*Sum epsilon_i^2)
>
> With:
>
> N: number of observations
>
> K: number of variables
>
> This formula comes from my old course of econometrics.
>
> For some reason it doesn't match the t-value produced by R (I am off
> by about 1%). I can match the other results produced by R
> (coefficients of the regression, r squared, etc.).

Usually such a small difference results from using different degrees of freedom. Have you reduced the df's appropriately after considering the number of other estimated parameters?

Just quoting code from your econometrics reference is not enough to answer the question. We would need to see code, as the message at the end of every posting states.

> I would be grateful if someone could provide some clarifications.
>
> Samuel
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
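One way to check the degrees-of-freedom point is to compare N - K - 1 with what lm itself reports. The following is only a sketch on simulated data: the seed, sizes, coefficients, and variable names are assumptions made for illustration and do not come from the original post.

```r
# Simulated data; sizes and coefficients are illustrative assumptions.
set.seed(1)
N <- 50
K <- 3
X <- matrix(rnorm(N * K), N, K)
y <- drop(1 + X %*% c(0.5, -1, 2) + rnorm(N))
fit <- lm(y ~ X)

# Residual degrees of freedom used by lm: N - K - 1 when an intercept is fitted.
rdf <- df.residual(fit)
rss <- sum(residuals(fit)^2)
sigma_hat <- sqrt(rss / rdf)

print(c(rdf = rdf, expected = N - K - 1))
print(all.equal(sigma_hat, summary(fit)$sigma))

# Factor by which every t-value would be inflated if N - K were used
# by mistake in place of N - K - 1.
wrong_df_ratio <- sqrt((N - K) / (N - K - 1))
print(wrong_df_ratio)
```

With these sizes, being off by one in the divisor shifts every t-value by the same factor, on the order of 1%, which is the kind of discrepancy described in the question.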
peter dalgaard
2011-Aug-01 13:45 UTC
On Aug 1, 2011, at 15:27 , Samuel Le wrote:

> Hello,
>
> I was wondering if someone knows the formula used by the function lm to compute the t-values.
>
> I am trying to implement a linear regression myself. Assuming that I have K variables, and N observations, the formula I am using is:
>
> For the k-th variable, t-value= b_k/sigma_k
>
> With b_k is the coefficient for the k-th variable, and sigma_k =(t(x) x )^(-1) _kk is its standard deviation.
>
> I find sigma_k = sigma * n/(n*Sum x_{k,i}^2 -(sum x_{k,i}^2))
>
> With sigma: the estimated standard deviation of the residuals,
>
> Sigma = sqrt(1/(N-K-1)*Sum epsilon_i^2)
>
> With:
>
> N: number of observations
>
> K: number of variables
>
> This formula comes from my old course of econometrics.
>
> For some reason it doesn't match the t-value produced by R (I am off by about 1%). I can match the other results produced by R (coefficients of the regression, r squared, etc.).
>
> I would be grateful if someone could provide some clarifications.

AFAICT, your formula only holds for K=1. Otherwise, the formula for sigma_k involves matrix inversion.

Also, even for K=1, beware that textbook formulas like SSDx = SSx - (Sx)^2/n involve subtraction of nearly equal quantities and easily lose multiple digits of precision, so software tends to use rather more careful algorithms.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
"Døden skal tape!" --- Nordahl Grieg
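For K > 1, the matrix form of the standard error is se_k = sqrt(sigma^2 * [(X'X)^(-1)]_kk), where X includes the intercept column. The sketch below applies it to simulated data and compares the result with lm; the data, seed, and variable names are assumptions for illustration. It uses an explicit solve() for clarity, which is not how lm computes it internally. The last lines show the cancellation problem with the textbook SSD formula on a deliberately awkward, made-up vector.

```r
# Simulated data with K = 2 regressors; all values are illustrative assumptions.
set.seed(42)
N <- 40
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 2 + 0.7 * x1 - 1.3 * x2 + rnorm(N)

X <- cbind(1, x1, x2)                    # design matrix with intercept column
XtX_inv <- solve(crossprod(X))           # (X'X)^{-1}
b <- drop(XtX_inv %*% crossprod(X, y))   # least-squares coefficients
res <- drop(y - X %*% b)
sigma2 <- sum(res^2) / (N - ncol(X))     # divisor is N - K - 1, as ncol(X) = K + 1
se <- sqrt(sigma2 * diag(XtX_inv))       # standard errors from the diagonal
tval <- b / se

fit <- lm(y ~ x1 + x2)
print(all.equal(unname(tval), unname(coef(summary(fit))[, "t value"])))

# Cancellation in the textbook formula SSDx = SSx - (Sx)^2/n:
x <- 1e8 + c(1, 2, 3)
ssd_textbook <- sum(x^2) - sum(x)^2 / length(x)  # subtracts two numbers near 3e16
ssd_centered <- sum((x - mean(x))^2)             # centre first, then square
print(c(textbook = ssd_textbook, centered = ssd_centered))
```

The centered version returns the exact value 2 here, while the textbook version depends on rounding of numbers of order 1e16.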
S Ellison
2011-Aug-01 14:15 UTC
> -----Original Message-----
> [mailto:r-help-bounces at r-project.org] On Behalf Of Samuel Le
> Subject: [R] formula used by R to compute the t-values in a linear regression
>
> I was wondering if someone knows the formula used by the
> function lm to compute the t-values.

Typing summary.lm at the prompt, I found the standard error and t calculation (around lines 58-62 of the resulting listing):

    resvar <- rss/rdf
    R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
    se <- sqrt(diag(R) * resvar)
    est <- z$coefficients[Qr$pivot[p1]]
    tval <- est/se

You can also find (rather further up) that the degrees of freedom df used are taken directly from the linear model $df (z$df in the function). Others noted that incorrect df often cause problems; you can check that you are using the correct df by inspecting the lm summary.

The standard errors are apparently (as is usual for a least squares problem, I think) taken from the diagonal of the inverse of the hessian, multiplied by the residual variance. Unfortunately I could not get at the hessian calculation quite as easily (it looks like it uses a function that's not exported from stats), so that's left as an exercise in browsing source code ...

S Ellison
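The quoted lines can be run outside summary.lm by pulling the same pieces out of a fitted lm object. This is a sketch on simulated data (seed, sizes, and variable names are assumptions), with rdf, rss, and p1 rebuilt from the object's df.residual, residuals, and rank components. Since X = QR implies X'X = R'R, chol2inv applied to the triangular factor stored in the fit's QR decomposition gives (X'X)^(-1), which the final check compares against an explicit inverse.

```r
# Simulated data; all values are illustrative assumptions.
set.seed(7)
N <- 30
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 - 0.5 * x2 + rnorm(N)
z <- lm(y ~ x1 + x2)

Qr <- z$qr                  # QR decomposition stored in the fitted model
p1 <- seq_len(z$rank)       # indices of the estimable coefficients
rdf <- z$df.residual        # residual degrees of freedom
rss <- sum(z$residuals^2)   # residual sum of squares

# The lines quoted from the summary.lm listing:
resvar <- rss / rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- z$coefficients[Qr$pivot[p1]]
tval <- est / se

print(all.equal(unname(tval), unname(coef(summary(z))[, "t value"])))

# chol2inv of the triangular factor agrees with the explicit inverse of X'X.
X <- model.matrix(z)
print(all.equal(R, solve(crossprod(X)), check.attributes = FALSE))
```

Working from the QR factor avoids forming and inverting X'X directly, which ties in with the earlier remark that software prefers more careful algorithms than the textbook formulas.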