Paul Johnson
2012-Dec-05 23:58 UTC
[Rd] Understanding svd usage and its necessity in generalized inverse calculation
Dear R-devel:

I could use some advice about matrix calculations and steps that might make for faster computation of generalized inverses. It appears that in some projects there is a bottleneck at the use of svd in the calculation of generalized inverses.

Here's some Rprof output I need to understand.

> summaryRprof("Amelia.out")
$by.self
                   self.time self.pct total.time total.pct
"La.svd"              150.34    27.66     164.82     30.32
"emfred"               40.90     7.52     535.62     98.53
"<Anonymous>"          32.92     6.06     122.16     22.47
"amsweep"              26.70     4.91     475.94     87.55
"%*%"                  26.06     4.79      26.06      4.79
"as.matrix"            23.32     4.29      37.34      6.87
"structure"            22.30     4.10      30.68      5.64
"t"                    18.64     3.43      25.34      4.66
"matrix"               14.66     2.70      15.02      2.76
"deparse"              14.52     2.67      33.58      6.18
"strsplit"              9.90     1.82       9.90      1.82
"match"                 9.80     1.80      22.94      4.22
"conditionMessage"      8.92     1.64      14.50      2.67
"svd"                   8.70     1.60     185.88     34.19
"mpinv"                 8.40     1.55     221.56     40.76
"c"                     8.16     1.50       8.16      1.50
".deparseOpts"          7.92     1.46      12.42      2.28
"$"                     6.94     1.28       6.94      1.28
"t.default"             6.70     1.23       6.70      1.23
"diag"                  5.66     1.04       8.96      1.65
"cbind"                 5.48     1.01       5.48      1.01
"rbind"                 5.36     0.99      10.84      1.99
"am.inv"                4.88     0.90      10.76      1.98
".Call"                 4.70     0.86       4.70      0.86
[snip]

$by.total
                  total.time total.pct self.time self.pct
"amelia.default"      543.58     99.99      0.04     0.01
"amelia"              543.58     99.99      0.00     0.00
"emarch"              536.94     98.77      0.18     0.03
"emfred"              535.62     98.53     40.90     7.52
"amsweep"             475.94     87.55     26.70     4.91
"mpinv"               221.56     40.76      8.40     1.55
"svd"                 185.88     34.19      8.70     1.60
"La.svd"              164.82     30.32    150.34    27.66
"try"                 161.52     29.71      0.50     0.09
"tryCatch"            161.04     29.62      3.38     0.62
"tryCatchList"        157.06     28.89      1.06     0.19
"tryCatchOne"         156.00     28.70      3.20     0.59
"<Anonymous>"         122.16     22.47     32.92     6.06
"as.matrix"            37.34      6.87     23.32     4.29
"deparse"              33.58      6.18     14.52     2.67
"structure"            30.68      5.64     22.30     4.10
"%*%"                  26.06      4.79     26.06     4.79
"t"                    25.34      4.66     18.64     3.43
"match"                22.94      4.22      9.80     1.80
"%in%"                 21.74      4.00      2.44     0.45
"simpleError"          16.72      3.08      0.62     0.11
"matrix"               15.02      2.76     14.66     2.70
"conditionMessage"     14.50      2.67      8.92     1.64
"mode"                 14.44      2.66      1.76     0.32
"doTryCatch"           13.94      2.56      2.72     0.50
".deparseOpts"         12.42      2.28      7.92     1.46

I *think* this means that a bottleneck here is svd, which is being called by this function that calculates generalized inverses:

## Moore-Penrose inverse (aka generalized inverse)
## X: symmetric matrix
## tol: singular values below this are treated as zero
mpinv <- function(X, tol = sqrt(.Machine$double.eps)) {
    s <- svd(X)
    e <- s$d
    e[e > tol] <- 1/e[e > tol]
    s$v %*% diag(e, nrow = length(e)) %*% t(s$u)
}

That is from the Amelia package, which we like to use very much.

Basically, I wonder if I should use a customized generalized inverse or svd calculator to make this faster.

Why bother, you ask? We have many people who do multiple imputation for missing data, and the psychologists are persuaded that they now ought to collect 50 or 100 imputations (gulp). The users want to include many, many variables in these models, and so we see iterations in the 100s for each imputation. That makes for some super-long running jobs, and I've been profiling the multiple imputation algorithms in various packages to see what I can do to make them faster.

Question: Is the use of svd really necessary for a generalized inverse? Can I employ some faster alternative? The literature on this is pretty specialized. I mean to say, "will one of you math professors please tell me what is safe or unsafe?" The Cholesky decomposition is fastest for a well-conditioned matrix, but it is not suitable for all matrices. The svd is the gold standard for accuracy, but it is the slowest.

In the past, I've chased speedups like this and sometimes found a happy answer, but just as often I wished I had asked an expert before wandering off on a fool's errand. There's plenty of literature about it. See, for example, this article:

Smoktunowicz, A., & Wróbel, I. (2012). Numerical aspects of computing the Moore-Penrose inverse of full column rank matrices. BIT Numerical Mathematics, 52(2), 503-524. doi:10.1007/s10543-011-0362-0

which seems to say that a method proposed by Byers and Xu (2008) is about as good as, but faster than, the svd. It appears to me that work on this traces back to a speedup obtained by Courrieu, whose method takes about one-half as much time as the svd (a rough R sketch of it appears after the references below):

P. Courrieu. Fast Computation of Moore-Penrose Inverse Matrices. Neural Information Processing - Letters and Reviews, 8:25-29, 2005.

Katsikis, Vasilios N. and Dimitrios Pappas. 2008. Fast computing of the Moore-Penrose Inverse Matrix. Electronic Journal of Linear Algebra 17: 637-650. http://celc.cii.fc.ul.pt/iic/ela/ela-articles/articles/vol17_pp337-350.pdf

Toutounian, F., & Ataei, A. (2009). A new method for computing Moore-Penrose inverse matrices. J. Comput. Appl. Math., 228(1), 412-417. doi:10.1016/j.cam.2008.10.008
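Here is that sketch: a direct R translation of the geninv pseudocode in Courrieu (2005). It is my own untested rendering (the 1e-9 tolerance factor is the paper's choice), so treat it as illustrative rather than as production code:

geninv <- function(G) {
    m <- nrow(G); n <- ncol(G)
    transpose <- m < n
    A <- if (transpose) tcrossprod(G) else crossprod(G)  # G G' or G'G
    n <- nrow(A)
    dA <- diag(A)
    tol <- min(dA[dA > 0]) * 1e-9
    ## Full-rank Cholesky factorization A = L L', with L of size n x r,
    ## where r = rank(A); numerically dependent columns are dropped.
    L <- matrix(0, n, n)
    r <- 0L
    for (k in seq_len(n)) {
        r <- r + 1L
        rows <- k:n
        L[rows, r] <- A[rows, k]
        if (r > 1L) {
            prev <- seq_len(r - 1L)
            L[rows, r] <- L[rows, r] - L[rows, prev, drop = FALSE] %*% L[k, prev]
        }
        if (L[k, r] > tol) {
            L[k, r] <- sqrt(L[k, r])
            if (k < n) L[(k + 1L):n, r] <- L[(k + 1L):n, r] / L[k, r]
        } else {
            r <- r - 1L   # numerically zero pivot: drop this column
        }
    }
    L <- L[, seq_len(r), drop = FALSE]
    M <- solve(crossprod(L))   # (L'L)^{-1} is only r x r, so this is cheap
    if (transpose) t(G) %*% L %*% M %*% M %*% t(L)
    else           L %*% M %*% M %*% t(L) %*% t(G)
}

A sanity check along the lines of all.equal(geninv(X), MASS::ginv(X)) on a few test matrices would be prudent before trusting it.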
I see the MASS package has a ginv function; it uses the svd, just as mpinv in Amelia does.

Maybe I should ignore this, or leave it up to some BLAS implementation?

pj

--
Paul E. Johnson
Professor, Political Science        Assoc. Director
1541 Lilac Lane, Room 504           Center for Research Methods
University of Kansas                University of Kansas
http://pj.freefaculty.org           http://quant.ku.edu
Thomas Lumley
2013-Jan-01 23:01 UTC
[Rd] Understanding svd usage and its necessity in generalized inverse calculation
On Thu, Dec 6, 2012 at 12:58 PM, Paul Johnson <pauljohn32@gmail.com> wrote:

> Dear R-devel:
>
> I could use some advice about matrix calculations and steps that might
> make for faster computation of generalized inverses. It appears that in
> some projects there is a bottleneck at the use of svd in the calculation
> of generalized inverses.
>
> Here's some Rprof output I need to understand.
>
> > summaryRprof("Amelia.out")
> $by.self
>                    self.time self.pct total.time total.pct
> "La.svd"              150.34    27.66     164.82     30.32

[snip]

> I *think* this means that a bottleneck here is svd, which is being

A bottleneck to a limited extent -- it's still only 30% of computation time, so speeding it up can't save more than that.

> called by this function that calculates generalized inverses:
>
> ## Moore-Penrose inverse (aka generalized inverse)
> ## X: symmetric matrix
> ## tol: singular values below this are treated as zero
> mpinv <- function(X, tol = sqrt(.Machine$double.eps)) {
>     s <- svd(X)
>     e <- s$d
>     e[e > tol] <- 1/e[e > tol]
>     s$v %*% diag(e, nrow = length(e)) %*% t(s$u)
> }
>
> That is from the Amelia package, which we like to use very much.
>
> Basically, I wonder if I should use a customized generalized inverse
> or svd calculator to make this faster.

If your matrix is produced as a covariance matrix, so that it really is symmetric positive semidefinite up to rounding error, my understanding is that the Cholesky approach is stable in the sense that it will reliably produce either an accurate inverse or a zero-to-within-tolerance pivot and an error message.

Now, if most of the matrices you are trying to invert are actually invertible (as I would hope), it may be quicker to use the Cholesky approach with a fallback to the SVD for semidefinite matrices. That is, something like

tryCatch(chol2inv(chol(xx)), error = function(e) ginv(xx))

Most of the time you will get the Cholesky branch, which is much faster (about five-fold for 10x10 matrices on my system). On my system, with a 10x10 matrix, the overhead of the tryCatch() is much smaller than the time taken by either set of linear algebra, so there is a net gain as long as even a reasonable minority of the matrices are actually invertible.

You can probably do slightly better by replacing the chol2inv() with backsolve(): solving just the systems of linear equations you need is usually preferable to constructing a matrix inverse.

Note that this approach will give wrong answers without warning if the matrices are not symmetric.

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
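For concreteness, here is the suggestion above assembled into runnable form. This is a sketch, not Thomas's exact code: the helper names robust_inverse and solve_via_chol are invented here, MASS supplies the SVD-based ginv fallback, and the 10x10 test matrix simply mirrors his timing example.

library(MASS)

## Fast path via Cholesky; fall back to the SVD-based pseudoinverse
## for matrices that are only positive semidefinite (chol() errors there).
robust_inverse <- function(xx) {
    tryCatch(chol2inv(chol(xx)),
             error = function(e) ginv(xx))
}

## If the inverse is only needed to solve xx %*% b = y, skip forming it:
## xx = R'R with R upper triangular, so solve R'z = y, then R b = z.
solve_via_chol <- function(xx, y) {
    R <- chol(xx)
    backsolve(R, backsolve(R, y, transpose = TRUE))
}

## Quick check on an arbitrary 10x10 positive definite matrix:
set.seed(1)
Z  <- matrix(rnorm(100), 10, 10)
xx <- crossprod(Z)   # symmetric positive definite
y  <- rnorm(10)
all.equal(drop(robust_inverse(xx) %*% y), solve_via_chol(xx, y))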