Dear R experts:

I just ran a simple test which told me that computing the OLS coefficients by hand is about 3-10 times as fast as using the built-in lm() function (code included below). Most of the time I do not care, because I like the convenience, and I presume some of the time goes into saving a lot of stuff that I may or may not need. But when I want to study the properties of an estimator whose input contains a regression, I do care about speed.

What is the recommended fastest way to get regression coefficients in R? (Is Gentleman's weighted-least-squares algorithm implemented in a low-level C form somewhere? That one was always lightning fast for me.)

regards,

/ivo


bybuiltin <- function(y, x) coef(lm(y ~ x - 1))

byhand <- function(y, x) {
  xy  <- t(x) %*% y
  xxi <- solve(t(x) %*% x)
  b   <- as.vector(xxi %*% xy)
  ## I will need these later, too:
  ## res       <- y - as.vector(x %*% b)
  ## soa[i]    <- b[2]
  ## sigmas[i] <- sd(res)
  b
}

MC <- 500
N  <- 10000

set.seed(0)
x <- matrix(rnorm(N*MC), nrow = N, ncol = MC)
y <- matrix(rnorm(N*MC), nrow = N, ncol = MC)

ptm <- proc.time()
for (mc in 1:MC) byhand(y[, mc], x[, mc])
cat("By hand took ", proc.time() - ptm, "\n")

ptm <- proc.time()
for (mc in 1:MC) bybuiltin(y[, mc], x[, mc])
cat("By built-in took ", proc.time() - ptm, "\n")
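A quick sketch to add to the comparison (not in the original post; it assumes lsfit()'s standard interface and reuses MC, x and y from the code above): base R's lsfit() skips lm()'s formula and model-frame overhead, so it is an easy drop-in to time against the two functions above.

bylsfit <- function(y, x) lsfit(x, y, intercept = FALSE)$coefficients

ptm <- proc.time()
for (mc in 1:MC) bylsfit(y[, mc], x[, mc])   # same loop as the benchmark above
cat("By lsfit took ", proc.time() - ptm, "\n")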
Check the following options:

ols1 <- function(y, x) {
  coef(lm(y ~ x - 1))
}

ols2 <- function(y, x) {
  xy  <- t(x) %*% y
  xxi <- solve(t(x) %*% x)
  b   <- as.vector(xxi %*% xy)
  b
}

ols3 <- function(y, x) {
  XtX <- crossprod(x)
  Xty <- crossprod(x, y)
  solve(XtX, Xty)
}

ols4 <- function(y, x) {
  lm.fit(x, y)$coefficients
}

# check timings
MC <- 500
N <- 10000
set.seed(0)
x <- matrix(rnorm(N*MC), N, MC)
y <- matrix(rnorm(N*MC), N, MC)

invisible({gc(); gc(); gc()})
system.time(for (mc in 1:MC) ols1(y[, mc], x[, mc]))

invisible({gc(); gc(); gc()})
system.time(for (mc in 1:MC) ols2(y[, mc], x[, mc]))

invisible({gc(); gc(); gc()})
system.time(for (mc in 1:MC) ols3(y[, mc], x[, mc]))

invisible({gc(); gc(); gc()})
system.time(for (mc in 1:MC) ols4(y[, mc], x[, mc, drop = FALSE]))

I hope it helps.

Best,
Dimitris

ivo welch wrote:
> What is the recommended fastest way to get regression coefficients in
> R? [...]

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center
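One further option beyond those above (not in Dimitris's message; the variable names are my own): for the single-regressor, no-intercept design in the original simulation, the OLS slope reduces to b = sum(x*y) / sum(x^2) column by column, so the whole Monte Carlo can be vectorized with no per-replication call at all.

MC <- 500
N  <- 10000
set.seed(0)
x <- matrix(rnorm(N*MC), N, MC)
y <- matrix(rnorm(N*MC), N, MC)

system.time({
  b.all   <- colSums(x * y) / colSums(x^2)   # one slope per replication
  res.all <- y - sweep(x, 2, b.all, `*`)     # residual matrix, if needed later
  sigmas  <- apply(res.all, 2, sd)           # residual standard deviations
})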
Ivo,

I would be careful about situations where x may be numerically ill-conditioned. Your code will do quite badly in such cases. Here is an extreme illustration:

set.seed(123)
x <- matrix(rnorm(20), 10, 2)
x <- cbind(x, x[, 1])   # x is clearly rank-deficient
y <- c(x %*% c(1, 2, 3))

> byhand(y, x)
Error in solve.default(t(x) %*% x) :
  Lapack routine dgesv: system is exactly singular

A better approach would be to use the QR decomposition explicitly in your byhand() function:

byhand.qr <- function(y, x, tol = 1e-07) {
  qr.x <- qr(x, tol = tol, LAPACK = TRUE)
  b <- qr.coef(qr.x, y)
  ## I will need these later, too:
  ## res       <- qr.resid(qr.x, y)
  ## soa[i]    <- b[2]
  ## sigmas[i] <- sd(res)
  b
}

> ans <- byhand.qr(y, x)
> ans
[1] 1.795725 2.000000 2.204275

You can verify that this is a "valid" solution:

> y - c(x %*% ans)
 [1]  9.436896e-16  2.498002e-16 -8.881784e-16  0.000000e+00 -4.440892e-16
 [6]  1.776357e-15  4.440892e-16  0.000000e+00  4.440892e-16 -4.440892e-16

However, it is not the unique solution, since there are infinitely many solutions here. We can get "the" solution with minimum length by using the Moore-Penrose inverse:

> require(MASS)
> ans.min <- c(ginv(x) %*% y)
> ans.min
[1] 2 2 2

You can verify that ans.min has a smaller L2-norm than ans, and that it is unique.

Hope this helps,
Ravi.

____________________________________________________________________
Ravi Varadhan, Ph.D.
Assistant Professor, Division of Geriatric Medicine and Gerontology
School of Medicine, Johns Hopkins University
email: rvaradhan at jhmi.edu

----- Original Message -----
From: ivo welch <ivowel at gmail.com>
Date: Wednesday, March 25, 2009 4:31 pm
Subject: [R] very fast OLS regression?
To: r-help <r-help at stat.math.ethz.ch>

> What is the recommended fastest way to get regression coefficients in
> R? [...]
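To make Ravi's smaller-norm point above concrete, here is a short check (not in his message; it assumes the MASS package is installed and reuses his rank-deficient example): both solutions reproduce y to machine precision, but the Moore-Penrose solution has the smaller L2-norm.

library(MASS)

set.seed(123)
x <- matrix(rnorm(20), 10, 2)
x <- cbind(x, x[, 1])                        # rank-deficient design, as above
y <- c(x %*% c(1, 2, 3))

ans     <- qr.coef(qr(x, LAPACK = TRUE), y)  # one particular solution
ans.min <- c(ginv(x) %*% y)                  # minimum-norm solution

max(abs(y - c(x %*% ans)))      # ~ 1e-15: fits y
max(abs(y - c(x %*% ans.min)))  # ~ 1e-15: also fits y

sqrt(sum(ans^2))      # approx. 3.48
sqrt(sum(ans.min^2))  # approx. 3.46, the smallest possible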
On Wed, 2009-03-25 at 16:28 -0400, ivo welch wrote:
> What is the recommended fastest way to get regression coefficients in
> R? [...]

No one has yet mentioned Doug Bates' article in R News on this topic, which compares timings of various methods for least-squares computations:

Douglas Bates. Least squares calculations in R. R News, 4(1):17-20, June 2004.

A Cholesky decomposition solution proved fastest in base R code, with an even faster version developed using sparse matrices and the Matrix package. You can find Doug's article here:

http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf

HTH

G

--
Dr. Gavin Simpson                   [t] +44 (0)20 7679 0522
ECRC, UCL Geography,                [f] +44 (0)20 7679 0565
Pearson Building,                   [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London                [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                       [w] http://www.freshwaters.org.uk
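For completeness, a rough sketch of a Cholesky-based fit in plain base R (the details here are my own and may differ from what Bates does in the article; it illustrates factoring X'X rather than inverting it), timed on the simulation from the original post:

ols.chol <- function(y, x) {
  XtX <- crossprod(x)        # t(x) %*% x
  Xty <- crossprod(x, y)     # t(x) %*% y
  R   <- chol(XtX)           # upper-triangular factor: t(R) %*% R = XtX
  backsolve(R, forwardsolve(t(R), Xty))
}

MC <- 500
N  <- 10000
set.seed(0)
x <- matrix(rnorm(N*MC), N, MC)
y <- matrix(rnorm(N*MC), N, MC)
system.time(for (mc in 1:MC) ols.chol(y[, mc], x[, mc]))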