thr3ads.net - R help - [R] finding a faster way to run lm on rows of predictor matrix [Jul 2011]

If this information is useful, please help other people find it:
Share via:

cypark at princeton.edu

2011-Jul-29 15:30 UTC

[R] finding a faster way to run lm on rows of predictor matrix

Hi, everyone.
I need to run lm with the same response vector but with varying predictor
vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors)
After looking through the R archive, I found roughly 3 methods that has been
suggested.
Unfortunately, I need to run this task multiple times(~ 5,000 times) and would
like to find a faster way than the existing methods.
All three methods I have bellow run on my machine timed with system.time 13~20
seconds.

The opposite direction of 6,000 response vectors and 1 predictor vectors, that
is supported with lm runs really fast ~0.5 seconds.
They are pretty much performing the same number of lm fits, so I was wondering
if there was a faster way, before I try to code this up in c++.

thanks!!

## sample data ###
regress.y = rnorm(150)
predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)

## method 1 ##
data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 +
as.vector(x))$residuals)))

user  system elapsed 
 15.076   0.048  15.128

## method 2 ##
data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)),
nrow=nrow(predictors), ncol=ncol(predictors) )

for( i in 1:nrow(predictors)){
    pred = as.vector(predictors[i,])
    data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
}

 user  system elapsed 
 13.841   0.012  13.854

## method 3 ##
library(nlme)

all.data <- data.frame( y=rep(regress.y, nrow(predictors)),
x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
all.fits <- lmList( y ~ x | g, data=all.data)
data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors),
ncol=ncol(predictors))

user  system elapsed 
 36.407   0.484  36.892 


## the opposite direction, supported by lm ### 
lm(t(predictors) ~ -1 + regress.y)$residuals

 user  system elapsed 
 0.500   0.120   0.613

Joshua Wiley

2011-Jul-29 16:36 UTC

head link

[R] finding a faster way to run lm on rows of predictor matrix

Hi,

If you are doing repeated fits in this specialized case, you can
benefit by skipping some of the fancy overhead and going straight to
lm.fit.

## what you had
data.residuals <- t(apply(predictors, 1, function(x)( lm(regress.y ~
-1 + as.vector(x))$residuals)))

## passing the matrices directly to lm.fit (no need to augment the X
matrix because you do not want the intercept anyway)
data.residuals2 <- t(sapply(1:6000, function(i) lm.fit(x t(predictors[i, ,
drop = FALSE]), y = regress.y)$residuals))

On my little laptop:
> system.time(data.residuals <- t(apply(predictors, 1, function(x)(
lm(regress.y ~ -1 + as.vector(x))$residuals))))   user  system elapsed
  42.84    0.11   45.01> system.time(data.residuals2 <- t(sapply(1:6000, function(i) lm.fit(x =
t(predictors[i, , drop = FALSE]), y = regress.y)$residuals)))   user  system elapsed
   3.76    0.00    3.82

If those timing ratios hold on your system, you should be looking at
around 1.2 seconds per run.  Which suggests it will take around 2
hours to complete 5,000 runs.  Note that this is the sort of task that
can be readily parallelized, so if you are on a multicore machine, you
could take advantage of that without too much trouble.

Cheers,

Josh

On Fri, Jul 29, 2011 at 8:30 AM,  <cypark at princeton.edu>
wrote:> Hi, everyone.
> I need to run lm with the same response vector but with varying predictor
vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors)
> After looking through the R archive, I found roughly 3 methods that has
been suggested.
> Unfortunately, I need to run this task multiple times(~ 5,000 times) and
would like to find a faster way than the existing methods.
> All three methods I have bellow run on my machine timed with system.time
13~20 seconds.
>
> The opposite direction of 6,000 response vectors and 1 predictor vectors,
that is supported with lm runs really fast ~0.5 seconds.
> They are pretty much performing the same number of lm fits, so I was
wondering if there was a faster way, before I try to code this up in c++.
>
> thanks!!
>
> ## sample data ###
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>
> ## method 1 ##
> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 +
as.vector(x))$residuals)))
>
> user ?system elapsed
> ?15.076 ? 0.048 ?15.128
>
> ## method 2 ##
> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)),
nrow=nrow(predictors), ncol=ncol(predictors) )
>
> for( i in 1:nrow(predictors)){
> ? ?pred = as.vector(predictors[i,])
> ? ?data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
> }
>
> ?user ?system elapsed
> ?13.841 ? 0.012 ?13.854
>
> ## method 3 ##
> library(nlme)
>
> all.data <- data.frame( y=rep(regress.y, nrow(predictors)),
x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
> all.fits <- lmList( y ~ x | g, data=all.data)
> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors),
ncol=ncol(predictors))
>
> user ?system elapsed
> ?36.407 ? 0.484 ?36.892
>
>
> ## the opposite direction, supported by lm ###
> lm(t(predictors) ~ -1 + regress.y)$residuals
>
> ?user ?system elapsed
> ?0.500 ? 0.120 ? 0.613
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
https://joshuawiley.com/

"Dénes TÓTH"

2011-Jul-29 17:12 UTC

head link

[R] finding a faster way to run lm on rows of predictor matrix

Hi,

you can solve the task by simple matrix algebra.
Note that I find it really inconvenient to have a matrix with variables in
rows and cases in columns, so I transposed your predictors matrix.

# your data
regress.y = rnorm(150)
predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
tpreds <- t(predictors)

# compute coefficients
coefs <- apply(tpreds,2,function(x)
solve(crossprod(x),crossprod(x,regress.y)))

# compute residuals
resids <- regress.y - sweep(tpreds,2,coefs,"*")

(Note that resids shall be transposed back if you really insist on your
original matrix format.)

HTH,
 Denes

> Hi, everyone.
> I need to run lm with the same response vector but with varying predictor
> vectors. (i.e. 1 response vector on each individual 6,000 predictor
> vectors)
> After looking through the R archive, I found roughly 3 methods that has
> been suggested.
> Unfortunately, I need to run this task multiple times(~ 5,000 times) and
> would like to find a faster way than the existing methods.
> All three methods I have bellow run on my machine timed with system.time
> 13~20 seconds.
>
> The opposite direction of 6,000 response vectors and 1 predictor vectors,
> that is supported with lm runs really fast ~0.5 seconds.
> They are pretty much performing the same number of lm fits, so I was
> wondering if there was a faster way, before I try to code this up in c++.
>
> thanks!!
>
> ## sample data ###
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>
> ## method 1 ##
> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 +
> as.vector(x))$residuals)))
>
> user  system elapsed
>  15.076   0.048  15.128
>
> ## method 2 ##
> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)),
> nrow=nrow(predictors), ncol=ncol(predictors) )
>
> for( i in 1:nrow(predictors)){
>     pred = as.vector(predictors[i,])
>     data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
> }
>
>  user  system elapsed
>  13.841   0.012  13.854
>
> ## method 3 ##
> library(nlme)
>
> all.data <- data.frame( y=rep(regress.y, nrow(predictors)),
> x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
> all.fits <- lmList( y ~ x | g, data=all.data)
> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors),
> ncol=ncol(predictors))
>
> user  system elapsed
>  36.407   0.484  36.892
>
>
> ## the opposite direction, supported by lm ###
> lm(t(predictors) ~ -1 + regress.y)$residuals
>
>  user  system elapsed
>  0.500   0.120   0.613
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Christopher Park

2011-Jul-29 17:58 UTC

head link

[R] finding a faster way to run lm on rows of predictor matrix

Thank you so much for the help!
I got it down to 1sec per run.

-Chris

On Fri, Jul 29, 2011 at 1:12 PM, "D?nes T?TH" <tdenes at
cogpsyphy.hu> wrote:>
> Hi,
>
> you can solve the task by simple matrix algebra.
> Note that I find it really inconvenient to have a matrix with variables in
> rows and cases in columns, so I transposed your predictors matrix.
>
> # your data
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
> tpreds <- t(predictors)
>
> # compute coefficients
> coefs <- apply(tpreds,2,function(x)
> solve(crossprod(x),crossprod(x,regress.y)))
>
> # compute residuals
> resids <- regress.y - sweep(tpreds,2,coefs,"*")
>
> (Note that resids shall be transposed back if you really insist on your
> original matrix format.)
>
> HTH,
> ?Denes
>
>
>> Hi, everyone.
>> I need to run lm with the same response vector but with varying
predictor
>> vectors. (i.e. 1 response vector on each individual 6,000 predictor
>> vectors)
>> After looking through the R archive, I found roughly 3 methods that has
>> been suggested.
>> Unfortunately, I need to run this task multiple times(~ 5,000 times)
and
>> would like to find a faster way than the existing methods.
>> All three methods I have bellow run on my machine timed with
system.time
>> 13~20 seconds.
>>
>> The opposite direction of 6,000 response vectors and 1 predictor
vectors,
>> that is supported with lm runs really fast ~0.5 seconds.
>> They are pretty much performing the same number of lm fits, so I was
>> wondering if there was a faster way, before I try to code this up in
c++.
>>
>> thanks!!
>>
>> ## sample data ###
>> regress.y = rnorm(150)
>> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>>
>> ## method 1 ##
>> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1
+
>> as.vector(x))$residuals)))
>>
>> user ?system elapsed
>> ?15.076 ? 0.048 ?15.128
>>
>> ## method 2 ##
>> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)),
>> nrow=nrow(predictors), ncol=ncol(predictors) )
>>
>> for( i in 1:nrow(predictors)){
>> ? ? pred = as.vector(predictors[i,])
>> ? ? data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
>> }
>>
>> ?user ?system elapsed
>> ?13.841 ? 0.012 ?13.854
>>
>> ## method 3 ##
>> library(nlme)
>>
>> all.data <- data.frame( y=rep(regress.y, nrow(predictors)),
>> x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
>> all.fits <- lmList( y ~ x | g, data=all.data)
>> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors),
>> ncol=ncol(predictors))
>>
>> user ?system elapsed
>> ?36.407 ? 0.484 ?36.892
>>
>>
>> ## the opposite direction, supported by lm ###
>> lm(t(predictors) ~ -1 + regress.y)$residuals
>>
>> ?user ?system elapsed
>> ?0.500 ? 0.120 ? 0.613
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>

Dennis Murphy

2011-Jul-30 00:37 UTC

head link

[R] finding a faster way to run lm on rows of predictor matrix

Hi:

The reason why you can fit N linear models to the same predictor
quickly is because the model matrix X is fixed and you can store the
multiple Y's as a matrix. Consequently, you can run the N linear
regressions in one shot using least squares, and lm() implements this
efficiently by allowing the response to be a matrix.

Your intention is to keep Y fixed and change the model matrix each
time. With 6000 potential predictors, this means 6000 separate
regression analyses, or to put it another way, 6000 iterations of the
least squares algorithm. It's unrealistic to expect the same timing
performance because the computing tasks are entirely different in
nature. As Josh showed, you can improve timing performance by
stripping down the computing task to its bare essentials. It might
also be worthwhile to peruse the article by Douglas Bates on least
squares calculations in R (start on p. 17) for some useful tips:
http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf

And for God's sake, don't run regression through the origin by
default, even if it's 'scientifically justifiable'. If you don't
know
why, you need to do some reading. The topic has been hashed over in
this list several times - check the archives or do a Google search.

HTH,
Dennis

On Fri, Jul 29, 2011 at 8:30 AM,  <cypark at princeton.edu>
wrote:> Hi, everyone.
> I need to run lm with the same response vector but with varying predictor
vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors)
> After looking through the R archive, I found roughly 3 methods that has
been suggested.
> Unfortunately, I need to run this task multiple times(~ 5,000 times) and
would like to find a faster way than the existing methods.
> All three methods I have bellow run on my machine timed with system.time
13~20 seconds.
>
> The opposite direction of 6,000 response vectors and 1 predictor vectors,
that is supported with lm runs really fast ~0.5 seconds.
> They are pretty much performing the same number of lm fits, so I was
wondering if there was a faster way, before I try to code this up in c++.
>
> thanks!!
>
> ## sample data ###
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>
> ## method 1 ##
> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 +
as.vector(x))$residuals)))
>
> user ?system elapsed
> ?15.076 ? 0.048 ?15.128
>
> ## method 2 ##
> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)),
nrow=nrow(predictors), ncol=ncol(predictors) )
>
> for( i in 1:nrow(predictors)){
> ? ?pred = as.vector(predictors[i,])
> ? ?data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
> }
>
> ?user ?system elapsed
> ?13.841 ? 0.012 ?13.854
>
> ## method 3 ##
> library(nlme)
>
> all.data <- data.frame( y=rep(regress.y, nrow(predictors)),
x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
> all.fits <- lmList( y ~ x | g, data=all.data)
> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors),
ncol=ncol(predictors))
>
> user ?system elapsed
> ?36.407 ? 0.484 ?36.892
>
>
> ## the opposite direction, supported by lm ###
> lm(t(predictors) ~ -1 + regress.y)$residuals
>
> ?user ?system elapsed
> ?0.500 ? 0.120 ? 0.613
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Seemingly Similar Threads

Search for more apparently analagous threads

R help - Jul 2011 - finding a faster way to run lm on rows of predictor matrix

[R] finding a faster way to run lm on rows of predictor matrix

[R] finding a faster way to run lm on rows of predictor matrix

[R] finding a faster way to run lm on rows of predictor matrix

[R] finding a faster way to run lm on rows of predictor matrix

[R] finding a faster way to run lm on rows of predictor matrix

Seemingly Similar Threads