The solve function in r-patched has been changed so that it applies a
tolerance when using the Lapack routines to calculate the inverse of a
matrix or to solve a system of linear equations.  A tolerance has always
been used with the Linpack routines but not with the Lapack routines in
versions 1.7.x and 1.8.0.  (You can use the optional argument tol = 0 to
override this check for computational singularity, but you probably want
to ask yourself why before doing so.  In a perfect world you would be
required to pass a qualifier exam before being allowed to do that. :-)

We found that making such a change with a not-unreasonable default value
for the tolerance caused several packages to fail R CMD check.  As a
result we have changed to a very generous default tolerance and, by
default, declare the matrix to be computationally singular if the
estimated reciprocal condition number is less than .Machine$double.eps.
In future versions of R we may make that tolerance tighter.

A quick glance at the R CMD check failures resulting from the earlier
change shows that in most cases the calculation involved was something
like

    solve(t(X) %*% X)

This is not good numerical linear algebra.  Admittedly we all learn that,
for example, least squares estimates can be written as (X'X)^{-1} X'y,
and it seems reasonable to code this calculation as

    betahat <- solve(t(X) %*% X) %*% t(X) %*% y

but you shouldn't do that.  In terms of numerical stability, and also in
terms of efficiency, that is a remarkably bad way of determining least
squares estimates.

To begin with, you don't invert a matrix just to solve a system of
equations.  If A is an n by n matrix and y is an n vector, then

    solve(A)

is the equivalent of performing

    solve(A, y)

n times.  Thus solve(A) %*% y is more than n times as expensive as
solve(A, y).  If you want A^{-1}y then write it as solve(A, y).  (In a
perfect world you would be required to fill out a form, in triplicate,
explaining in detail exactly why there is no suitable alternative to
calculating the inverse before being allowed to use solve(A).)

So, how should you calculate least squares estimates?  We recommend

    betahat <- qr.coef(qr(X), y)

If you want other results from the least squares calculation, such as
fitted values or residuals, you may want to save qr(X) so you can reuse
it:

    qrX <- qr(X)
    betahat <- qr.coef(qrX, y)
    res <- qr.resid(qrX, y)
    ...

There are alternatives, but solve(t(X) %*% X) %*% t(X) %*% y is never a
good one.  Seber and Lee discuss such calculations at length in chapter
11 of their "Linear Regression Analysis (2nd ed)" (Wiley, 2003).

Some other comments:

 - the condition number of X'X is the square of the condition number of
   X, which is why it is a good idea to avoid working with X'X whenever
   possible

 - on those rare occasions when you do want (X'X)^{-1}, say to form a
   variance-covariance matrix, you should evaluate it as
   chol2inv(qr.R(qr(X)))

 - if, despite all of the above, you feel you must calculate X'X, it is
   best done as crossprod(X) and not as t(X) %*% X.  Similarly, X'y is
   best calculated as crossprod(X, y).  In general you should use
   crossprod in place of "t(anything) %*% anythingelse"
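[A minimal sketch, with made-up data and not from the original post,
illustrating two of the points above: kappa(crossprod(X)) is roughly the
square of kappa(X), and qr.coef(qr(X), y) reproduces the textbook formula
without ever forming X'X.]

    set.seed(1)
    X <- cbind(1, matrix(rnorm(100 * 3), ncol = 3))   # arbitrary 100 x 4 design
    y <- X %*% c(2, -1, 0.5, 3) + rnorm(100)

    kappa(X, exact = TRUE)              # condition number of X
    kappa(crossprod(X), exact = TRUE)   # roughly the square of the above

    b1 <- solve(t(X) %*% X) %*% t(X) %*% y   # the discouraged formula
    b2 <- qr.coef(qr(X), y)                  # the recommended approach
    all.equal(as.vector(b1), as.vector(b2))  # agree here; they can differ when X is ill-conditioned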
Douglas Bates <bates at cs.wisc.edu> writes:

...

> To begin with, you don't invert a matrix just to solve a system of
> equations.  If A is an n by n matrix and y is an n vector, then
>
>     solve(A)
>
> is the equivalent of performing
>
>     solve(A, y)
>
> n times.  Thus solve(A) %*% y is more than n times as expensive as
> solve(A, y).  If you want A^{-1}y then write it as solve(A, y).

I overstated the comparison between solve(A, y) and solve(A) %*% y.  For
both solve(A) and solve(A, y), the first step is to calculate an LU
decomposition of A, and the second step is to solve the linear system(s)
of equations using this decomposition.  It is only the second step that
takes n times as long for solve(A) as it does for solve(A, y) (when y is
a vector).  The first step, which will account for the majority of the
time spent executing solve(A, y), is the same as in solve(A).
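[A rough timing sketch, with made-up sizes and not from the original
message, of the correction above: the O(n^3) factorization dominates, so
the two calls are closer in cost than "n times" suggests, though
solve(A, y) remains the better choice.]

    set.seed(2)
    n <- 500
    A <- matrix(rnorm(n * n), n, n)
    y <- rnorm(n)

    system.time(for (i in 1:50) solve(A, y))     # factor A, solve for one right-hand side
    system.time(for (i in 1:50) solve(A) %*% y)  # factor A, form the full inverse, then multiply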
Dear R help,

Thanks to Professor Bates for his information about how to calculate
least squares estimates [not using solve] in ``the right way''.  This is
very useful indeed; I am clearly one of the package maintainers who is
not using solve in a proper way at the moment.

However, the calculations in my code look more like GLS than LS.

    ## GLS could in principle be implemented like this:
    betahat <- solve(t(X) %*% solve(Omega) %*% X) %*% t(X) %*% solve(Omega) %*% y
    ## where Omega is a strictly p.d. symmetric matrix

Does someone have a recommendation on how to do this in ``the right
way''?  My first attempt (trying to imitate the LS solution recommended
by Prof. Bates) is:

    temp <- backsolve(chol(Omega), cbind(X, y))
    betahat <- qr.coef(qr(temp[, 1:ncol(X)]), temp[, ncol(X) + 1])

Thank you in advance for any help.

Cheers

Ole

-- 
Ole F. Christensen
Center for Bioinformatik
Datalogisk Institut
Aarhus Universitet
Ny Munkegade, Bygning 540
8000 Aarhus C
Denmark
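[A minimal sketch, not from the thread, of the whitening idea in the
question above, assuming Omega = t(R) %*% R with R <- chol(Omega).  The
helper name gls_qr is made up for illustration.  One detail to watch is
that the whitening step should use the transpose of the Cholesky factor,
so that crossprod of the whitened design reproduces X' Omega^{-1} X.]

    gls_qr <- function(X, y, Omega) {
        R  <- chol(Omega)                          # Omega = t(R) %*% R, R upper triangular
        Xw <- backsolve(R, X, transpose = TRUE)    # R^{-T} X  (whitened design)
        yw <- backsolve(R, y, transpose = TRUE)    # R^{-T} y  (whitened response)
        qr.coef(qr(Xw), yw)                        # ordinary LS on the whitened data
    }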
Hi all,

Thanks Douglas for bringing up numerical stability and OLS regression
coefficients.  I'm often worried about the speed of regression runs, as I
sometimes run large simulations or resamplings.  Inspired, and prone to
run many regressions, I did a little simulation as follows.  XP results
are on a 2.4GHz Pentium 4 in Windows from the binaries (without BLAS?)
and Linux results are on a 1.5GHz Pentium M laptop with a Pentium Atlas
BLAS.

Setting up the data: x is rnormals with a vector of ones prepended...

> x <- cbind(rep(1,100), matrix(rnorm(1000), nc=10))
> beta0 <- runif(11)
> y <- x %*% beta0 + rnorm(100)

Estimating Slope Coefficients:

> system.time(for (i in 1:1000) lm(y ~ x - 1))
XP:    [1] 5.91 0.00 5.90   NA   NA
Linux: [1] 5.27 0.01 5.28 0.00 0.00

> system.time(for (i in 1:1000) solve(t(x) %*% x) %*% t(x) %*% y)
XP:    [1] 0.64 0.01 0.65   NA   NA
Linux: [1] 0.51 0.01 0.57 0.00 0.00

> system.time(for (i in 1:1000) qr.coef(qr(x), y))
XP:    [1] 0.75 0.00 0.75   NA   NA
Linux: [1] 0.76 0.00 0.77 0.00 0.00

> system.time(for (i in 1:1000) solve(crossprod(x), crossprod(x, y)))
XP:    [1] 0.45 0.00 0.53   NA   NA
Linux: [1] 0.35 0.00 0.36 0.00 0.00

On both platforms, BLAS or not, the solve(crossprod()) method works the
fastest, with the naive solve(t(x) %*% x) second-fastest.

Calculating (t(x) %*% x)^-1:

> system.time(for (i in 1:1000) chol2inv(qr.R(qr(x))))
XP:    [1] 0.44 0.00 0.44   NA   NA
Linux: [1] 0.40 0.00 0.40 0.00 0.00

> system.time(for (i in 1:1000) solve(crossprod(x)))
XP:    [1] 0.58 0.01 0.59   NA   NA
Linux: [1] 0.34 0.00 0.34 0.00 0.00

Here the chol2inv() method was faster in Windows but slower in Linux,
which I'm guessing is due to differential uses of BLAS functions.

None of these address the problem of numerical accuracy, which was the
thrust of Doug's comments.  Has anyone done a quick simulation to
investigate the stability of the solutions?  Is it small coefficients or
near-collinearity (or both) that one has to worry about?  Are the
solve(crossprod()) methods obviously unstable?  They surely do work
quickly!

Thanks again, R-universe.  I've needed no other statistical software for
the last 2 years.  I hope to get around to contributing a package or two
soon.

Elliot.
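[A rough follow-up sketch, my own rather than from the post, probing the
stability question with a deliberately near-collinear design: the third
column is the second plus tiny noise, so kappa(x) is large and
kappa(crossprod(x)) is roughly its square.  This is a quick probe, not a
careful study.]

    set.seed(3)
    x1 <- rnorm(100)
    x  <- cbind(1, x1, x1 + 1e-7 * rnorm(100))   # third column nearly equal to the second
    beta0 <- c(1, 2, 3)
    y  <- drop(x %*% beta0 + rnorm(100))

    kappa(x, exact = TRUE)             # large: x is nearly rank-deficient
    kappa(crossprod(x), exact = TRUE)  # roughly the square of the above

    b_qr     <- qr.coef(qr(x), y)                           # QR on x itself
    b_normal <- drop(solve(crossprod(x), crossprod(x, y)))  # normal equations
    cbind(b_qr, b_normal)   # the two routes can disagree noticeably on this ill-conditioned problem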