cypark at princeton.edu
2011-Jul-29 15:30 UTC
[R] finding a faster way to run lm on rows of predictor matrix
Hi, everyone. I need to run lm with the same response vector but with varying predictor vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors) After looking through the R archive, I found roughly 3 methods that has been suggested. Unfortunately, I need to run this task multiple times(~ 5,000 times) and would like to find a faster way than the existing methods. All three methods I have bellow run on my machine timed with system.time 13~20 seconds. The opposite direction of 6,000 response vectors and 1 predictor vectors, that is supported with lm runs really fast ~0.5 seconds. They are pretty much performing the same number of lm fits, so I was wondering if there was a faster way, before I try to code this up in c++. thanks!! ## sample data ### regress.y = rnorm(150) predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) ## method 1 ## data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))) user system elapsed 15.076 0.048 15.128 ## method 2 ## data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), nrow=nrow(predictors), ncol=ncol(predictors) ) for( i in 1:nrow(predictors)){ pred = as.vector(predictors[i,]) data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals } user system elapsed 13.841 0.012 13.854 ## method 3 ## library(nlme) all.data <- data.frame( y=rep(regress.y, nrow(predictors)), x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) ) all.fits <- lmList( y ~ x | g, data=all.data) data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), ncol=ncol(predictors)) user system elapsed 36.407 0.484 36.892 ## the opposite direction, supported by lm ### lm(t(predictors) ~ -1 + regress.y)$residuals user system elapsed 0.500 0.120 0.613
Joshua Wiley
2011-Jul-29 16:36 UTC
[R] finding a faster way to run lm on rows of predictor matrix
Hi, If you are doing repeated fits in this specialized case, you can benefit by skipping some of the fancy overhead and going straight to lm.fit. ## what you had data.residuals <- t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))) ## passing the matrices directly to lm.fit (no need to augment the X matrix because you do not want the intercept anyway) data.residuals2 <- t(sapply(1:6000, function(i) lm.fit(x t(predictors[i, , drop = FALSE]), y = regress.y)$residuals)) On my little laptop:> system.time(data.residuals <- t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))))user system elapsed 42.84 0.11 45.01> system.time(data.residuals2 <- t(sapply(1:6000, function(i) lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals)))user system elapsed 3.76 0.00 3.82 If those timing ratios hold on your system, you should be looking at around 1.2 seconds per run. Which suggests it will take around 2 hours to complete 5,000 runs. Note that this is the sort of task that can be readily parallelized, so if you are on a multicore machine, you could take advantage of that without too much trouble. Cheers, Josh On Fri, Jul 29, 2011 at 8:30 AM, <cypark at princeton.edu> wrote:> Hi, everyone. > I need to run lm with the same response vector but with varying predictor vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors) > After looking through the R archive, I found roughly 3 methods that has been suggested. > Unfortunately, I need to run this task multiple times(~ 5,000 times) and would like to find a faster way than the existing methods. > All three methods I have bellow run on my machine timed with system.time 13~20 seconds. > > The opposite direction of 6,000 response vectors and 1 predictor vectors, that is supported with lm runs really fast ~0.5 seconds. > They are pretty much performing the same number of lm fits, so I was wondering if there was a faster way, before I try to code this up in c++. > > thanks!! > > ## sample data ### > regress.y = rnorm(150) > predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) > > ## method 1 ## > data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))) > > user ?system elapsed > ?15.076 ? 0.048 ?15.128 > > ## method 2 ## > data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), nrow=nrow(predictors), ncol=ncol(predictors) ) > > for( i in 1:nrow(predictors)){ > ? ?pred = as.vector(predictors[i,]) > ? ?data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals > } > > ?user ?system elapsed > ?13.841 ? 0.012 ?13.854 > > ## method 3 ## > library(nlme) > > all.data <- data.frame( y=rep(regress.y, nrow(predictors)), x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) ) > all.fits <- lmList( y ~ x | g, data=all.data) > data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), ncol=ncol(predictors)) > > user ?system elapsed > ?36.407 ? 0.484 ?36.892 > > > ## the opposite direction, supported by lm ### > lm(t(predictors) ~ -1 + regress.y)$residuals > > ?user ?system elapsed > ?0.500 ? 0.120 ? 0.613 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Joshua Wiley Ph.D. Student, Health Psychology University of California, Los Angeles https://joshuawiley.com/
"Dénes TÓTH"
2011-Jul-29 17:12 UTC
[R] finding a faster way to run lm on rows of predictor matrix
Hi, you can solve the task by simple matrix algebra. Note that I find it really inconvenient to have a matrix with variables in rows and cases in columns, so I transposed your predictors matrix. # your data regress.y = rnorm(150) predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) tpreds <- t(predictors) # compute coefficients coefs <- apply(tpreds,2,function(x) solve(crossprod(x),crossprod(x,regress.y))) # compute residuals resids <- regress.y - sweep(tpreds,2,coefs,"*") (Note that resids shall be transposed back if you really insist on your original matrix format.) HTH, Denes> Hi, everyone. > I need to run lm with the same response vector but with varying predictor > vectors. (i.e. 1 response vector on each individual 6,000 predictor > vectors) > After looking through the R archive, I found roughly 3 methods that has > been suggested. > Unfortunately, I need to run this task multiple times(~ 5,000 times) and > would like to find a faster way than the existing methods. > All three methods I have bellow run on my machine timed with system.time > 13~20 seconds. > > The opposite direction of 6,000 response vectors and 1 predictor vectors, > that is supported with lm runs really fast ~0.5 seconds. > They are pretty much performing the same number of lm fits, so I was > wondering if there was a faster way, before I try to code this up in c++. > > thanks!! > > ## sample data ### > regress.y = rnorm(150) > predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) > > ## method 1 ## > data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + > as.vector(x))$residuals))) > > user system elapsed > 15.076 0.048 15.128 > > ## method 2 ## > data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), > nrow=nrow(predictors), ncol=ncol(predictors) ) > > for( i in 1:nrow(predictors)){ > pred = as.vector(predictors[i,]) > data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals > } > > user system elapsed > 13.841 0.012 13.854 > > ## method 3 ## > library(nlme) > > all.data <- data.frame( y=rep(regress.y, nrow(predictors)), > x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) ) > all.fits <- lmList( y ~ x | g, data=all.data) > data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), > ncol=ncol(predictors)) > > user system elapsed > 36.407 0.484 36.892 > > > ## the opposite direction, supported by lm ### > lm(t(predictors) ~ -1 + regress.y)$residuals > > user system elapsed > 0.500 0.120 0.613 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Christopher Park
2011-Jul-29 17:58 UTC
[R] finding a faster way to run lm on rows of predictor matrix
Thank you so much for the help! I got it down to 1sec per run. -Chris On Fri, Jul 29, 2011 at 1:12 PM, "D?nes T?TH" <tdenes at cogpsyphy.hu> wrote:> > Hi, > > you can solve the task by simple matrix algebra. > Note that I find it really inconvenient to have a matrix with variables in > rows and cases in columns, so I transposed your predictors matrix. > > # your data > regress.y = rnorm(150) > predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) > tpreds <- t(predictors) > > # compute coefficients > coefs <- apply(tpreds,2,function(x) > solve(crossprod(x),crossprod(x,regress.y))) > > # compute residuals > resids <- regress.y - sweep(tpreds,2,coefs,"*") > > (Note that resids shall be transposed back if you really insist on your > original matrix format.) > > HTH, > ?Denes > > >> Hi, everyone. >> I need to run lm with the same response vector but with varying predictor >> vectors. (i.e. 1 response vector on each individual 6,000 predictor >> vectors) >> After looking through the R archive, I found roughly 3 methods that has >> been suggested. >> Unfortunately, I need to run this task multiple times(~ 5,000 times) and >> would like to find a faster way than the existing methods. >> All three methods I have bellow run on my machine timed with system.time >> 13~20 seconds. >> >> The opposite direction of 6,000 response vectors and 1 predictor vectors, >> that is supported with lm runs really fast ~0.5 seconds. >> They are pretty much performing the same number of lm fits, so I was >> wondering if there was a faster way, before I try to code this up in c++. >> >> thanks!! >> >> ## sample data ### >> regress.y = rnorm(150) >> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) >> >> ## method 1 ## >> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + >> as.vector(x))$residuals))) >> >> user ?system elapsed >> ?15.076 ? 0.048 ?15.128 >> >> ## method 2 ## >> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), >> nrow=nrow(predictors), ncol=ncol(predictors) ) >> >> for( i in 1:nrow(predictors)){ >> ? ? pred = as.vector(predictors[i,]) >> ? ? data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals >> } >> >> ?user ?system elapsed >> ?13.841 ? 0.012 ?13.854 >> >> ## method 3 ## >> library(nlme) >> >> all.data <- data.frame( y=rep(regress.y, nrow(predictors)), >> x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) ) >> all.fits <- lmList( y ~ x | g, data=all.data) >> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), >> ncol=ncol(predictors)) >> >> user ?system elapsed >> ?36.407 ? 0.484 ?36.892 >> >> >> ## the opposite direction, supported by lm ### >> lm(t(predictors) ~ -1 + regress.y)$residuals >> >> ?user ?system elapsed >> ?0.500 ? 0.120 ? 0.613 >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > >
Dennis Murphy
2011-Jul-30 00:37 UTC
[R] finding a faster way to run lm on rows of predictor matrix
Hi: The reason why you can fit N linear models to the same predictor quickly is because the model matrix X is fixed and you can store the multiple Y's as a matrix. Consequently, you can run the N linear regressions in one shot using least squares, and lm() implements this efficiently by allowing the response to be a matrix. Your intention is to keep Y fixed and change the model matrix each time. With 6000 potential predictors, this means 6000 separate regression analyses, or to put it another way, 6000 iterations of the least squares algorithm. It's unrealistic to expect the same timing performance because the computing tasks are entirely different in nature. As Josh showed, you can improve timing performance by stripping down the computing task to its bare essentials. It might also be worthwhile to peruse the article by Douglas Bates on least squares calculations in R (start on p. 17) for some useful tips: http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf And for God's sake, don't run regression through the origin by default, even if it's 'scientifically justifiable'. If you don't know why, you need to do some reading. The topic has been hashed over in this list several times - check the archives or do a Google search. HTH, Dennis On Fri, Jul 29, 2011 at 8:30 AM, <cypark at princeton.edu> wrote:> Hi, everyone. > I need to run lm with the same response vector but with varying predictor vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors) > After looking through the R archive, I found roughly 3 methods that has been suggested. > Unfortunately, I need to run this task multiple times(~ 5,000 times) and would like to find a faster way than the existing methods. > All three methods I have bellow run on my machine timed with system.time 13~20 seconds. > > The opposite direction of 6,000 response vectors and 1 predictor vectors, that is supported with lm runs really fast ~0.5 seconds. > They are pretty much performing the same number of lm fits, so I was wondering if there was a faster way, before I try to code this up in c++. > > thanks!! > > ## sample data ### > regress.y = rnorm(150) > predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000) > > ## method 1 ## > data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))) > > user ?system elapsed > ?15.076 ? 0.048 ?15.128 > > ## method 2 ## > data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), nrow=nrow(predictors), ncol=ncol(predictors) ) > > for( i in 1:nrow(predictors)){ > ? ?pred = as.vector(predictors[i,]) > ? ?data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals > } > > ?user ?system elapsed > ?13.841 ? 0.012 ?13.854 > > ## method 3 ## > library(nlme) > > all.data <- data.frame( y=rep(regress.y, nrow(predictors)), x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) ) > all.fits <- lmList( y ~ x | g, data=all.data) > data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), ncol=ncol(predictors)) > > user ?system elapsed > ?36.407 ? 0.484 ?36.892 > > > ## the opposite direction, supported by lm ### > lm(t(predictors) ~ -1 + regress.y)$residuals > > ?user ?system elapsed > ?0.500 ? 0.120 ? 0.613 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >